Introduction
Context engineering is about choosing and structuring what goes into the context window so LLMs answer better at lower cost.
In 2025, modern retrieval for AI is not "doing RAG". The winning stack blends careful ingestion, generous hybrid recall, strong re‑ranking and disciplined context assembly to resist "context rot". The aim: give the LLM only what matters, exactly when it matters.
Context
Modern AI search differs from classic search in its tools, workloads, and consumers: the consumer is often an LLM, not a human. Chroma's work highlights two practical pillars: understanding context rot, and measuring progress with small golden sets and generative benchmarking.
"Don’t ship 'RAG.' Ship retrieval. Name the primitives (dense, lexical, filters, re‑rank, assembly, eval loop)."
Jeff Huber, CEO / Chroma
Five plays for effective retrieval
These practices cut errors and waste in context selection.
- Name your primitives: dense, lexical/regex, filters, re‑rank, assembly, eval loop
- Win first stage with hybrid recall (≈200–300 candidates)
- Always re‑rank before assembling context
- Respect context rot: tight, structured contexts beat jumbo windows
- Build a small golden set; wire it into CI and dashboards
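The first play, naming your primitives, can be made concrete by representing the pipeline as an explicit configuration rather than a single opaque "RAG" call. The sketch below is hypothetical (the class name and defaults are illustrative, not from any library):

```python
from dataclasses import dataclass, field

@dataclass
class RetrievalPipeline:
    """Hypothetical sketch: name every retrieval primitive explicitly."""
    dense: bool = True                            # vector similarity recall
    lexical: bool = True                          # BM25 / regex recall
    filters: dict = field(default_factory=dict)   # metadata pre-filters
    candidate_pool: int = 250                     # first-stage hybrid recall (~200-300)
    rerank_top_k: int = 30                        # candidates kept after re-ranking
    token_cap: int = 8000                         # hard cap at assembly time

    def describe(self) -> list[str]:
        # Enumerate the active stages, in pipeline order.
        stages = []
        if self.dense:
            stages.append("dense")
        if self.lexical:
            stages.append("lexical")
        if self.filters:
            stages.append("filters")
        stages += ["re-rank", "assembly", "eval-loop"]
        return stages
```

Making the stages explicit like this also makes them individually tunable and testable, which is what the eval loop below depends on.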
Operational pipeline
Ingest
Transform and enrich once; index for fast queries later.
- Parsing and domain‑aware chunking (headings, code, tables)
- Enrichment: titles, anchors, symbols, metadata
- Optional LLM chunk summaries (NL glosses for code/API)
- Dense embeddings plus optional sparse signals
- Write to DB (text, vectors, metadata)
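A minimal sketch of domain-aware chunking, splitting a markdown-like document on headings so each section becomes one chunk with its heading kept as metadata (the function name and chunk schema are illustrative assumptions):

```python
import re

def chunk_by_headings(doc: str, max_chars: int = 800) -> list[dict]:
    """Split a markdown-like document on headings; each section becomes
    one chunk, with its heading recorded as enrichment metadata."""
    chunks, current_title, buf = [], "intro", []
    for line in doc.splitlines():
        m = re.match(r"^#+\s+(.*)", line)
        if m:
            if buf:  # flush the previous section
                chunks.append({"title": current_title,
                               "text": "\n".join(buf)[:max_chars]})
            current_title, buf = m.group(1), []
        else:
            buf.append(line)
    if buf:  # flush the final section
        chunks.append({"title": current_title,
                       "text": "\n".join(buf)[:max_chars]})
    return chunks
```

In a real pipeline, code and tables would get their own splitters, and each chunk would then be embedded and written to the database alongside its metadata.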
Query
Blend signals, then prune and order precisely.
- First‑stage hybrid: vectors + lexical/regex + metadata filters
- Candidate pool: ~200–300
- Re‑rank (LLM or cross‑encoder) → top ~20–40
- Context assembly: instructions first, dedupe/merge, diversify sources, hard token cap
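The query path above can be sketched end to end. This is a toy stand-in under loud assumptions: word overlap substitutes for dense similarity, substring match for the lexical layer, and the re-rank step simply reuses first-stage scores where a cross-encoder or LLM would run:

```python
def hybrid_retrieve(query: str, chunks: list[dict],
                    pool: int = 250, top_k: int = 30,
                    token_cap: int = 500) -> list[str]:
    """Toy hybrid retrieval: first-stage recall, re-rank, then assembly
    with dedupe and a hard token cap."""
    q_words = set(query.lower().split())
    scored = []
    for c in chunks:
        words = set(c["text"].lower().split())
        lexical = query.lower() in c["text"].lower()       # lexical layer
        dense = len(q_words & words) / (len(q_words) or 1)  # dense proxy
        if lexical or dense > 0:
            scored.append((dense + (1.0 if lexical else 0.0), c))
    candidates = [c for _, c in sorted(scored, key=lambda s: -s[0])[:pool]]
    # Re-rank stage: a cross-encoder or LLM would score candidates here.
    reranked = candidates[:top_k]
    # Assembly: dedupe, then enforce a hard token cap.
    seen, context, used = set(), [], 0
    for c in reranked:
        if c["text"] in seen:
            continue
        n = len(c["text"].split())
        if used + n > token_cap:
            break
        seen.add(c["text"])
        context.append(c["text"])
        used += n
    return context
```

Note that the hard cap is enforced last, after re-ranking, so the most relevant material is what survives the budget.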
Outer loop
Measure continuously, control cost, compact memory.
- Cache/cost guardrails
- Generative benchmarking on small golden sets
- Error analysis → re‑chunk, retune filters, revise the re‑rank prompt
- Memory/compaction: summarize traces into retrievable facts
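Cache and cost guardrails from the outer loop can be sketched with the standard library. The budget structure and function below are hypothetical; the point is that caching identical queries keeps repeated traffic from re-triggering retrieval and generation, while a hard counter trips before spend runs away:

```python
from functools import lru_cache

# Hypothetical cost guardrail: a hard ceiling on uncached pipeline runs.
COST_BUDGET = {"calls": 0, "max_calls": 100}

@lru_cache(maxsize=1024)
def cached_answer(query: str) -> str:
    """Cache identical queries; the budget counter only grows on misses."""
    if COST_BUDGET["calls"] >= COST_BUDGET["max_calls"]:
        raise RuntimeError("cost guardrail tripped")
    COST_BUDGET["calls"] += 1
    return f"answer:{query}"  # stand-in for the real retrieve-and-generate pipeline
```

As the article stresses later, this helps cost and latency but does nothing for context quality; the eval loop still has to catch selection errors.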
The challenge: context rot
As tokens grow, attention and reasoning can degrade. Huge windows don’t imply effective use; compact, structured contexts with strict caps tend to win.
"LLM performance is not invariant to token count: with more tokens, models attend less and reason less effectively."
Jeff Huber, CEO / Chroma
Solution: applied context engineering
Favor generous hybrid recall, then robust re‑ranking before context assembly. Order matters: system instructions, dedupe, source diversity, hard token caps. Caching helps cost/latency but doesn’t fix context quality.
- Re‑rank with an LLM or a lightweight re‑ranker; LLMs are flexible via prompts
- LLMs can scan 200–300 candidates, enabling smart brute‑force
- Weigh tail latency of parallel re‑ranks against quality gains
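A sketch of the parallel re-rank pattern described above, where each scoring call stands in for one LLM or cross-encoder judgment (the default word-overlap scorer is an illustrative placeholder):

```python
from concurrent.futures import ThreadPoolExecutor

def rerank(query: str, candidates: list[str],
           top_k: int = 20, score_fn=None) -> list[str]:
    """Score candidates in parallel and keep the top_k.
    Each score_fn call stands in for one LLM/cross-encoder judgment."""
    score_fn = score_fn or (lambda q, c: len(set(q.split()) & set(c.split())))
    with ThreadPoolExecutor(max_workers=8) as pool:
        scores = list(pool.map(lambda c: score_fn(query, c), candidates))
    ranked = sorted(zip(scores, candidates), key=lambda p: -p[0])
    return [c for _, c in ranked[:top_k]]
```

With real LLM calls, the wall-clock time of this step is the slowest call in the batch, which is exactly the tail-latency trade-off the bullet above warns about.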
Code: indexing, regex and embeddings
In code search, indexing trades write‑time work for fast queries—vital on large or versioned repos. Regex remains powerful; code embeddings can add a further 5–15% when queries are semantic rather than literal.
- Native, indexed regex is a strong first layer
- Embeddings help when the querier doesn’t know the code terms
- Index forking enables fast per‑commit/branch versions with quick re‑index
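The regex-first layering can be sketched as follows; `semantic_fallback` is a hypothetical hook where an embedding search would plug in when the querier doesn't know the code terms:

```python
import re

def code_search(pattern: str, files: dict[str, str],
                semantic_fallback=None) -> list[tuple]:
    """Regex-first code search: return regex hits as (path, line_no, line),
    falling back to a (hypothetical) embedding search only on zero hits."""
    hits = [(path, i + 1, line)
            for path, text in files.items()
            for i, line in enumerate(text.splitlines())
            if re.search(pattern, line)]
    if hits or semantic_fallback is None:
        return hits
    return semantic_fallback(pattern)
```

In production the regex layer would run against a prebuilt index (e.g. trigram-based) rather than scanning file text, which is where the write-time investment pays off.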
Memory and compaction
Memory is the payoff of context engineering: compact, retrievable facts from interactions improve future answers.
Offline compaction (merge/split/rewrites, new metadata) and interaction summaries keep memory useful and cheap. Signals that improve retrieval also inform what to remember.
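A toy sketch of offline compaction, assuming memory is a flat list of extracted fact strings (real systems would merge semantically, not just by normalized text):

```python
def compact_memory(facts: list[str], max_facts: int = 100) -> list[str]:
    """Offline compaction sketch: dedupe facts by normalized text, count
    reinforcements, and keep the most frequently reinforced ones."""
    counts: dict[str, int] = {}
    for f in facts:
        key = f.strip().lower()  # naive normalization; stands in for semantic merge
        counts[key] = counts.get(key, 0) + 1
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [fact for fact, _ in ranked[:max_facts]]
```

Facts that keep recurring across interactions float to the top, which mirrors the idea that retrieval signals also inform what is worth remembering.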
Evaluation: golden sets and generative benchmarking
A small, high‑quality golden set beats guesswork. If you have chunks but no queries, generate coherent queries with an LLM and use query→chunk pairs to measure models, filters and prompts.
- Bring tests into CI and dashboards
- Balance quality with cost, latency and API reliability
- One evening of labeling often unlocks months of progress
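Once a golden set of query→chunk pairs exists, the core metric is simple. A minimal recall@k harness, suitable for wiring into CI (names are illustrative; `retrieve` is whatever pipeline variant is under test):

```python
def recall_at_k(golden: list[tuple[str, str]], retrieve, k: int = 10) -> float:
    """Fraction of golden (query, expected_chunk_id) pairs for which the
    expected chunk appears in the top-k results of `retrieve(query)`."""
    hits = sum(1 for query, expected in golden
               if expected in retrieve(query)[:k])
    return hits / len(golden)
```

Running this for each candidate configuration (embedding model, filter set, re-rank prompt) turns retrieval tuning into a comparison of numbers rather than anecdotes.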
Conclusion
Winning LLM retrieval is disciplined context work: hybrid recall, re‑rank before assembly, strict token caps and continuous eval loops. Context engineering turns fragile demos into resilient systems that don’t rot as context grows.
FAQ
Short, practical answers on AI search and LLMs.
- What is context engineering for LLMs?
  It’s choosing and structuring the context window per generation step, with a continuous evaluation loop.
- Why does context rot matter in AI search?
  Bigger windows can reduce attention and reasoning; compact, curated contexts perform better.
- What’s a good first‑stage recall strategy?
  Hybrid: vector + lexical/regex + metadata filters to gather ~200–300 candidates.
- Should I always re‑rank before assembling context?
  Yes. Re‑ranking improves precision and reduces noise before applying token caps.
- How do I apply context engineering to code?
  Use indexed regex as the base, add embeddings for semantic queries, and fork indexes for versions.
- How do I measure retrieval improvements?
  Build a golden set and run generative benchmarking to compare models, filters and prompts in CI.
- Does caching fix context issues?
  It helps cost and latency, but not the core problem of context selection quality.
- How big should a golden set be?
  A few hundred well‑labeled examples are often enough to drive clear engineering choices.