Introduction
Context engineering is about choosing and structuring what goes into the context window so LLMs answer better at lower cost.
In 2025, modern retrieval for AI is not "doing RAG". The winning stack blends careful ingestion, generous hybrid recall, strong re‑ranking and disciplined context assembly to resist "context rot". The aim: give the LLM only what matters, exactly when it matters.
Context
Modern AI search differs from classic search in its tools, workloads, and consumers: the consumer is often an LLM, not a human. Chroma's work highlights two practical pillars: understanding context rot, and measuring progress with small golden sets and generative benchmarking.
"Don’t ship 'RAG.' Ship retrieval. Name the primitives (dense, lexical, filters, re‑rank, assembly, eval loop)."
Jeff Huber, CEO / Chroma
Five plays for effective retrieval
These practices cut errors and waste in context selection.
- Name your primitives: dense, lexical/regex, filters, re‑rank, assembly, eval loop
- Win first stage with hybrid recall (≈200–300 candidates)
- Always re‑rank before assembling context
- Respect context rot: tight, structured contexts beat jumbo windows
- Build a small golden set; wire it into CI and dashboards
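The first play, naming your primitives, can be made concrete by representing the pipeline as an explicit configuration rather than a single opaque "RAG" call. The sketch below is hypothetical (the class name and defaults are illustrative, not from any library):

```python
from dataclasses import dataclass, field

@dataclass
class RetrievalPipeline:
    """Hypothetical sketch: name every retrieval primitive explicitly."""
    dense: bool = True                            # vector similarity recall
    lexical: bool = True                          # BM25 / regex recall
    filters: dict = field(default_factory=dict)   # metadata pre-filters
    candidate_pool: int = 250                     # first-stage hybrid recall (~200-300)
    rerank_top_k: int = 30                        # candidates kept after re-ranking
    token_cap: int = 8000                         # hard cap at assembly time

    def describe(self) -> list[str]:
        # Enumerate the active stages, in pipeline order.
        stages = []
        if self.dense:
            stages.append("dense")
        if self.lexical:
            stages.append("lexical")
        if self.filters:
            stages.append("filters")
        stages += ["re-rank", "assembly", "eval-loop"]
        return stages
```

Making the stages explicit like this also makes them individually tunable and testable, which is what the eval loop below depends on.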
Operational pipeline
Ingest
Transform and enrich once; index for fast queries later.
- Parsing and domain‑aware chunking (headings, code, tables)
- Enrichment: titles, anchors, symbols, metadata
- Optional LLM chunk summaries (NL glosses for code/API)
- Dense embeddings plus optional sparse signals
- Write to DB (text, vectors, metadata)
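A minimal sketch of domain-aware chunking, splitting a markdown-like document on headings so each section becomes one chunk with its heading kept as metadata (the function name and chunk schema are illustrative assumptions):

```python
import re

def chunk_by_headings(doc: str, max_chars: int = 800) -> list[dict]:
    """Split a markdown-like document on headings; each section becomes
    one chunk, with its heading recorded as enrichment metadata."""
    chunks, current_title, buf = [], "intro", []
    for line in doc.splitlines():
        m = re.match(r"^#+\s+(.*)", line)
        if m:
            if buf:  # flush the previous section
                chunks.append({"title": current_title,
                               "text": "\n".join(buf)[:max_chars]})
            current_title, buf = m.group(1), []
        else:
            buf.append(line)
    if buf:  # flush the final section
        chunks.append({"title": current_title,
                       "text": "\n".join(buf)[:max_chars]})
    return chunks
```

In a real pipeline, code and tables would get their own splitters, and each chunk would then be embedded and written to the database alongside its metadata.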
Query
Blend signals, then prune and order precisely.
- First‑stage hybrid: vectors + lexical/regex + metadata filters
- Candidate pool: ~200–300
- Re‑rank (LLM or cross‑encoder) → top ~20–40
- Context assembly: instructions first, dedupe/merge, diversify sources, hard token cap
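The query path above can be sketched end to end. This is a toy stand-in under loud assumptions: word overlap substitutes for dense similarity, substring match for the lexical layer, and the re-rank step simply reuses first-stage scores where a cross-encoder or LLM would run:

```python
def hybrid_retrieve(query: str, chunks: list[dict],
                    pool: int = 250, top_k: int = 30,
                    token_cap: int = 500) -> list[str]:
    """Toy hybrid retrieval: first-stage recall, re-rank, then assembly
    with dedupe and a hard token cap."""
    q_words = set(query.lower().split())
    scored = []
    for c in chunks:
        words = set(c["text"].lower().split())
        lexical = query.lower() in c["text"].lower()       # lexical layer
        dense = len(q_words & words) / (len(q_words) or 1)  # dense proxy
        if lexical or dense > 0:
            scored.append((dense + (1.0 if lexical else 0.0), c))
    candidates = [c for _, c in sorted(scored, key=lambda s: -s[0])[:pool]]
    # Re-rank stage: a cross-encoder or LLM would score candidates here.
    reranked = candidates[:top_k]
    # Assembly: dedupe, then enforce a hard token cap.
    seen, context, used = set(), [], 0
    for c in reranked:
        if c["text"] in seen:
            continue
        n = len(c["text"].split())
        if used + n > token_cap:
            break
        seen.add(c["text"])
        context.append(c["text"])
        used += n
    return context
```

Note that the hard cap is enforced last, after re-ranking, so the most relevant material is what survives the budget.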
Outer loop
Measure continuously, control cost, compact memory.
- Cache/cost guardrails
- Generative benchmarking on small golden sets
- Error analysis → re‑chunk, retune filters, revise the re‑rank prompt
- Memory/compaction: summarize traces into retrievable facts
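Cache and cost guardrails from the outer loop can be sketched with the standard library. The budget structure and function below are hypothetical; the point is that caching identical queries keeps repeated traffic from re-triggering retrieval and generation, while a hard counter trips before spend runs away:

```python
from functools import lru_cache

# Hypothetical cost guardrail: a hard ceiling on uncached pipeline runs.
COST_BUDGET = {"calls": 0, "max_calls": 100}

@lru_cache(maxsize=1024)
def cached_answer(query: str) -> str:
    """Cache identical queries; the budget counter only grows on misses."""
    if COST_BUDGET["calls"] >= COST_BUDGET["max_calls"]:
        raise RuntimeError("cost guardrail tripped")
    COST_BUDGET["calls"] += 1
    return f"answer:{query}"  # stand-in for the real retrieve-and-generate pipeline
```

As the article stresses later, this helps cost and latency but does nothing for context quality; the eval loop still has to catch selection errors.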
The challenge: context rot
As tokens grow, attention and reasoning can degrade. Huge windows don’t imply effective use; compact, structured contexts with strict caps tend to win.
"LLM performance is not invariant to token count: with more tokens, models attend less and reason less effectively."
Jeff Huber, CEO / Chroma
Solution: applied context engineering
Favor generous hybrid recall, then robust re‑ranking before context assembly. Order matters: system instructions, dedupe, source diversity, hard token caps. Caching helps cost/latency but doesn’t fix context quality.
- Re‑rank with an LLM or a lightweight re‑ranker; LLMs are flexible via prompts
- LLMs can scan 200–300 candidates, enabling smart brute‑force
- Weigh tail latency of parallel re‑ranks against quality gains
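A sketch of the parallel re-rank pattern described above, where each scoring call stands in for one LLM or cross-encoder judgment (the default word-overlap scorer is an illustrative placeholder):

```python
from concurrent.futures import ThreadPoolExecutor

def rerank(query: str, candidates: list[str],
           top_k: int = 20, score_fn=None) -> list[str]:
    """Score candidates in parallel and keep the top_k.
    Each score_fn call stands in for one LLM/cross-encoder judgment."""
    score_fn = score_fn or (lambda q, c: len(set(q.split()) & set(c.split())))
    with ThreadPoolExecutor(max_workers=8) as pool:
        scores = list(pool.map(lambda c: score_fn(query, c), candidates))
    ranked = sorted(zip(scores, candidates), key=lambda p: -p[0])
    return [c for _, c in ranked[:top_k]]
```

With real LLM calls, the wall-clock time of this step is the slowest call in the batch, which is exactly the tail-latency trade-off the bullet above warns about.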
Code: indexing, regex and embeddings
In code search, indexing trades write‑time work for fast queries—vital on large or versioned repos. Regex remains powerful; code embeddings can add a further 5–15% when queries are semantic rather than literal.
- Native, indexed regex is a strong first layer
- Embeddings help when the querier doesn’t know the code terms
- Index forking enables fast per‑commit/branch versions with quick re‑index
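The regex-first layering can be sketched as follows; `semantic_fallback` is a hypothetical hook where an embedding search would plug in when the querier doesn't know the code terms:

```python
import re

def code_search(pattern: str, files: dict[str, str],
                semantic_fallback=None) -> list[tuple]:
    """Regex-first code search: return regex hits as (path, line_no, line),
    falling back to a (hypothetical) embedding search only on zero hits."""
    hits = [(path, i + 1, line)
            for path, text in files.items()
            for i, line in enumerate(text.splitlines())
            if re.search(pattern, line)]
    if hits or semantic_fallback is None:
        return hits
    return semantic_fallback(pattern)
```

In production the regex layer would run against a prebuilt index (e.g. trigram-based) rather than scanning file text, which is where the write-time investment pays off.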
Memory and compaction
Memory is the payoff of context engineering: compact, retrievable facts from interactions improve future answers.
Offline compaction (merge/split/rewrites, new metadata) and interaction summaries keep memory useful and cheap. Signals that improve retrieval also inform what to remember.
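A toy sketch of offline compaction, assuming memory is a flat list of extracted fact strings (real systems would merge semantically, not just by normalized text):

```python
def compact_memory(facts: list[str], max_facts: int = 100) -> list[str]:
    """Offline compaction sketch: dedupe facts by normalized text, count
    reinforcements, and keep the most frequently reinforced ones."""
    counts: dict[str, int] = {}
    for f in facts:
        key = f.strip().lower()  # naive normalization; stands in for semantic merge
        counts[key] = counts.get(key, 0) + 1
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [fact for fact, _ in ranked[:max_facts]]
```

Facts that keep recurring across interactions float to the top, which mirrors the idea that retrieval signals also inform what is worth remembering.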
Evaluation: golden sets and generative benchmarking
A small, high‑quality golden set beats guesswork. If you have chunks but no queries, generate coherent queries with an LLM and use query→chunk pairs to measure models, filters and prompts.
- Bring tests into CI and dashboards
- Balance quality with cost, latency and API reliability
- One evening of labeling often unlocks months of progress
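Once a golden set of query→chunk pairs exists, the core metric is simple. A minimal recall@k harness, suitable for wiring into CI (names are illustrative; `retrieve` is whatever pipeline variant is under test):

```python
def recall_at_k(golden: list[tuple[str, str]], retrieve, k: int = 10) -> float:
    """Fraction of golden (query, expected_chunk_id) pairs for which the
    expected chunk appears in the top-k results of `retrieve(query)`."""
    hits = sum(1 for query, expected in golden
               if expected in retrieve(query)[:k])
    return hits / len(golden)
```

Running this for each candidate configuration (embedding model, filter set, re-rank prompt) turns retrieval tuning into a comparison of numbers rather than anecdotes.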
Conclusion
Winning LLM retrieval is disciplined context work: hybrid recall, re‑rank before assembly, strict token caps and continuous eval loops. Context engineering turns fragile demos into resilient systems that don’t rot as context grows.
FAQ
Short, practical answers on AI search and LLMs.
- What is context engineering for LLMs?
  It’s choosing and structuring the context window per generation step, with a continuous evaluation loop.
- Why does context rot matter in AI search?
  Bigger windows can reduce attention and reasoning; compact, curated contexts perform better.
- What’s a good first‑stage recall strategy?
  Hybrid: vector + lexical/regex + metadata filters to gather ~200–300 candidates.
- Should I always re‑rank before assembling context?
  Yes. Re‑ranking improves precision and reduces noise before applying token caps.
- How do I apply context engineering to code?
  Use indexed regex as the base, add embeddings for semantic queries, and fork indexes for versions.
- How do I measure retrieval improvements?
  Build a golden set and run generative benchmarking to compare models, filters and prompts in CI.
- Does caching fix context issues?
  It helps cost and latency, but not the core problem of context selection quality.
- How big should a golden set be?
  A few hundred well‑labeled examples are often enough to drive clear engineering choices.