Context engineering: 5 plays to beat RAG in 2025

Article Highlights:
  • Name primitives: dense, lexical, filters, re‑rank, assembly
  • Hybrid recall with 200–300 candidates leverages LLM reading
  • Re‑rank before context assembly to cut noise
  • Tight, structured contexts mitigate context rot
  • Assembly order: instructions, dedupe, diversify, hard token cap
  • Regex is strong for code search; embeddings add semantics
  • Index forking for branches/commits with fast re‑index
  • Small golden sets and generative benchmarking in CI
  • Cache and cost guardrails without losing quality
  • Offline compaction turns traces into useful memory
Introduction

Context engineering is about choosing and structuring what goes into the context window so LLMs answer better with lower cost.

In 2025, modern retrieval for AI is not "doing RAG". The winning stack blends careful ingestion, generous hybrid recall, strong re‑ranking and disciplined context assembly to resist "context rot". The aim: give the LLM only what matters, exactly when it matters.

Context

Modern AI search differs from classic search in its tools, workloads, and consumers — the consumer is often an LLM, not a human. Chroma's work highlights two practical pillars: understand context rot, and measure progress with small golden sets and generative benchmarking.

"Don’t ship ‘RAG.’ Ship retrieval. Name the primitives (dense, lexical, filters, re‑rank, assembly, eval loop)."

Jeff Huber, CEO / Chroma

Five plays for effective retrieval

These practices cut errors and waste in context selection.

  • Name your primitives: dense, lexical/regex, filters, re‑rank, assembly, eval loop
  • Win first stage with hybrid recall (≈200–300 candidates)
  • Always re‑rank before assembling context
  • Respect context rot: tight, structured contexts beat jumbo windows
  • Build a small golden set; wire it into CI and dashboards

Operational pipeline

Ingest

Transform and enrich once; index for fast queries later.

  • Parse and domain‑aware chunking (headings, code, tables)
  • Enrichment: titles, anchors, symbols, metadata
  • Optional LLM chunk summaries (NL glosses for code/API)
  • Dense embeddings plus optional sparse signals
  • Write to DB (text, vectors, metadata)
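The parse-and-enrich step above can be sketched in a few lines. This is a minimal illustration, not a production parser: it splits a document on markdown-style headings (one of the domain-aware boundaries the list mentions) and enriches each chunk with a title and URL anchor; real pipelines would also handle code blocks and tables, and the `chunk_by_headings` name is ours.

```python
import re

def chunk_by_headings(doc: str) -> list[dict]:
    """Split a document on markdown-style headings, one chunk per
    section, then enrich each chunk with title and anchor metadata."""
    chunks, current = [], {"title": "intro", "lines": []}
    for line in doc.splitlines():
        m = re.match(r"^#+\s+(.*)", line)
        if m:
            if current["lines"]:
                chunks.append(current)
            current = {"title": m.group(1), "lines": []}
        else:
            current["lines"].append(line)
    if current["lines"]:
        chunks.append(current)
    # Enrichment pass: derive a URL-style anchor from each title.
    return [
        {
            "title": c["title"],
            "anchor": re.sub(r"\W+", "-", c["title"].lower()).strip("-"),
            "text": "\n".join(c["lines"]).strip(),
        }
        for c in chunks
    ]

doc = "# Setup\npip install x\n# Usage\ncall run()"
for c in chunk_by_headings(doc):
    print(c["anchor"], "->", c["text"])
```

Embedding each chunk's `text` (plus an optional LLM-written summary) and writing text, vectors, and this metadata to the database completes the ingest stage.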

Query

Blend signals, then prune and order precisely.

  • First‑stage hybrid: vectors + lexical/regex + metadata filters
  • Candidate pool: ~100–300
  • Re‑rank (LLM or cross‑encoder) → top ~20–40
  • Context assembly: instructions first, dedupe/merge, diversify sources, hard token cap
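One common way to blend the dense and lexical signals in that first stage is reciprocal-rank fusion. The sketch below is an assumption about implementation detail (the article doesn't prescribe a fusion method); it merges ranked ID lists from two retrievers into a single hybrid candidate pool for the re-ranker:

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal-rank fusion: merge ranked ID lists from dense and
    lexical retrievers into one hybrid candidate ordering."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            # A document is rewarded for appearing high in any list.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["d3", "d1", "d7"]      # vector-search order
lexical = ["d1", "d9", "d3"]    # lexical/regex order
print(rrf([dense, lexical]))    # hybrid pool to send to the re-ranker
```

Documents found by both retrievers (like `d1` here) float to the top, which is exactly the behavior you want before trimming the ~100–300 candidate pool.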

Outer loop

Measure continuously, control cost, compact memory.

  • Cache/cost guardrails
  • Generative benchmarking on small golden sets
  • Error analysis → re‑chunk, retune filters, re‑rank prompt
  • Memory/compaction: summarize traces into retrievable facts

The challenge: context rot

As tokens grow, attention and reasoning can degrade. Huge windows don’t imply effective use; compact, structured contexts with strict caps tend to win.

"LLM performance is not invariant to token count: with more tokens, models attend less and reason less effectively."

Jeff Huber, CEO / Chroma

Solution: applied context engineering

Favor generous hybrid recall, then robust re‑ranking before context assembly. Order matters: system instructions, dedupe, source diversity, hard token caps. Caching helps cost/latency but doesn’t fix context quality.

  • Re‑rank with an LLM or a lightweight re‑ranker; LLMs are flexible via prompts
  • LLMs can scan 200–300 candidates, enabling smart brute‑force
  • Weigh tail latency of parallel re‑ranks against quality gains
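The parallel-re-rank trade-off above can be sketched as follows. The `score` function here is a trivial stand-in for the real LLM or cross-encoder call (which is where the tail latency comes from); the structure — fan out scoring, sort, keep the top few — is the point:

```python
from concurrent.futures import ThreadPoolExecutor

def score(query: str, doc: str) -> float:
    """Stand-in relevance scorer; in practice this would be an LLM
    or cross-encoder call, so each invocation has real latency."""
    terms = set(query.lower().split())
    return sum(t in doc.lower() for t in terms) / max(len(terms), 1)

def rerank(query: str, candidates: list[str], top_k: int = 2) -> list[str]:
    # Score all candidates in parallel, then keep only the best few.
    with ThreadPoolExecutor(max_workers=8) as pool:
        scores = list(pool.map(lambda d: score(query, d), candidates))
    ranked = sorted(zip(candidates, scores), key=lambda p: -p[1])
    return [doc for doc, _ in ranked[:top_k]]

docs = ["retry logic for http", "token caps in prompts", "http client retry"]
print(rerank("http retry", docs))
```

With a real model behind `score`, the slowest of the parallel calls sets your latency floor, which is the tail-latency cost the bullet warns about.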

Code: indexing, regex and embeddings

In code search, indexing trades write‑time work for fast queries—vital on large or versioned repos. Regex remains powerful; code embeddings can add 5–15% when queries are semantic.

  • Native, indexed regex is a strong first layer
  • Embeddings help when the querier doesn’t know the code terms
  • Index forking enables fast per‑commit/branch versions with quick re‑index
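The "indexed regex as a first layer" idea typically means trading write-time work for query speed with a trigram index: at ingest, map every trigram to the files containing it; at query time, intersect postings for a literal fragment of the pattern and only regex-scan that small candidate set. A minimal sketch (our naming, simplified to whole-file granularity):

```python
import re
from collections import defaultdict

def trigrams(s: str) -> set[str]:
    return {s[i:i + 3] for i in range(len(s) - 2)}

def build_index(files: dict[str, str]) -> dict[str, set[str]]:
    """Write-time work: map every trigram to the files containing it,
    so queries only regex-scan a small candidate set."""
    index = defaultdict(set)
    for name, text in files.items():
        for t in trigrams(text):
            index[t].add(name)
    return index

def search(index, files, literal: str, pattern: str) -> list[str]:
    # Intersect trigram postings of the literal, then confirm with regex.
    cands = set(files)
    for t in trigrams(literal):
        cands &= index.get(t, set())
    return sorted(n for n in cands if re.search(pattern, files[n]))

files = {"a.py": "def parse_json(x):", "b.py": "def render(x):"}
idx = build_index(files)
print(search(idx, files, "parse", r"parse_\w+"))
```

Index forking then amounts to copying (or copy-on-write sharing) this structure per branch or commit and re-indexing only changed files.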

Memory and compaction

Memory is the payoff of context engineering: compact, retrievable facts from interactions improve future answers.

Offline compaction (merge/split/rewrites, new metadata) and interaction summaries keep memory useful and cheap. Signals that improve retrieval also inform what to remember.
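As a minimal sketch of the merge side of offline compaction (the fact-extraction step itself, which in practice an LLM would do, is omitted): duplicate facts pulled from interaction traces are merged, keeping a count and the latest timestamp so retrieval can prefer fresh, frequently-seen memories. The trace schema here is an assumption for illustration.

```python
def compact(traces: list[dict]) -> list[dict]:
    """Merge duplicate facts from interaction traces: keep one record
    per fact with an occurrence count and the most recent timestamp."""
    merged: dict[str, dict] = {}
    for t in sorted(traces, key=lambda t: t["ts"]):
        key = t["fact"].lower()  # case-insensitive dedupe key
        if key in merged:
            merged[key]["seen"] += 1
            merged[key]["ts"] = t["ts"]
        else:
            merged[key] = {"fact": t["fact"], "ts": t["ts"], "seen": 1}
    return sorted(merged.values(), key=lambda f: -f["seen"])

traces = [
    {"fact": "User prefers Python", "ts": 1},
    {"fact": "user prefers python", "ts": 3},
    {"fact": "Repo uses pytest", "ts": 2},
]
print(compact(traces))
```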

Evaluation: golden sets and generative benchmarking

A small, high‑quality golden set beats guesswork. If you have chunks but no queries, generate coherent queries with an LLM and use query→chunk pairs to measure models, filters and prompts.

  • Bring tests into CI and dashboards
  • Balance quality with cost, latency and API reliability
  • One evening of labeling often unlocks months of progress
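Once you have query→chunk pairs (hand-labeled or LLM-generated), the core metric is simple. A minimal recall@k over a golden set — the shape of the check you would wire into CI; the function name and data layout are ours:

```python
def recall_at_k(golden: list[tuple[str, str]],
                results: dict[str, list[str]], k: int = 5) -> float:
    """Fraction of golden (query, expected_chunk) pairs whose expected
    chunk appears in the retriever's top-k results for that query."""
    hits = sum(chunk in results.get(q, [])[:k] for q, chunk in golden)
    return hits / len(golden)

golden = [("how to retry http", "c1"), ("set token cap", "c2")]
results = {"how to retry http": ["c1", "c9"],
           "set token cap": ["c7", "c3"]}
print(recall_at_k(golden, results, k=2))
```

Run the same golden set against each candidate change (new embedding model, filter tweak, re-rank prompt) and let the numbers, not intuition, pick the winner.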

Conclusion

Winning LLM retrieval is disciplined context work: hybrid recall, re‑rank before assembly, strict token caps and continuous eval loops. Context engineering turns fragile demos into resilient systems that don’t rot as context grows.

FAQ

Short, practical answers on AI search and LLMs.

  • What is context engineering for LLMs?
    It’s choosing and structuring the context window per generation step, with a continuous evaluation loop.
  • Why does context rot matter in AI search?
    Bigger windows can reduce attention and reasoning; compact, curated contexts perform better.
  • What’s a good first‑stage recall strategy?
    Hybrid: vector + lexical/regex + metadata filters to gather ~200–300 candidates.
  • Should I always re‑rank before assembling context?
    Yes. Re‑ranking improves precision and reduces noise before applying token caps.
  • How to apply context engineering to code?
    Use indexed regex as the base, add embeddings for semantic queries, and fork indexes for versions.
  • How do I measure retrieval improvements?
    Build a golden set and run generative benchmarking to compare models, filters and prompts in CI.
  • Does caching fix context issues?
    It helps cost and latency, but not the core problem of context selection quality.
  • How big should a golden set be?
    A few hundred well‑labeled examples are often enough to drive clear engineering choices.