Context engineering represents the emerging discipline of designing, managing, and optimizing the information available to AI agents during task execution. While Prompt Engineering focuses on crafting effective instructions, context engineering addresses what information reaches the model and how it’s structured within the Context Window.

Context Size Paradox

Large context windows degrade performance when poorly organized. Architecture matters more than capacity.

The field emerged from a practical observation: poorly organized information, a failure pattern known as Context Rot, undermines even million-token windows.

The Five Pillars of Context Engineering

Offloading Context: Rather than cramming everything into the context window, AI agents can use file systems as extended memory. The Manus team discovered that treating the file system as “unlimited in size, persistent by nature” enabled agents to maintain structured notes, planning documents, and memories outside immediate context. This mirrors how humans use external memory aids - notebooks, documentation, reference materials - to augment limited working memory. The strategy transforms context from a scarce resource into a hierarchical system where “warm” content stays in the window while “cold” content lives externally.
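
A minimal sketch of the offloading pattern, assuming a simple file-backed memory; the helper names `offload` and `recall` are illustrative, not any particular framework’s API:

```python
from pathlib import Path

# "Cold" content lives on disk; only a short pointer stays in the context window.
MEMORY_DIR = Path("agent_memory")
MEMORY_DIR.mkdir(exist_ok=True)

def offload(name: str, content: str) -> str:
    """Write verbose content to a file, returning a compact reference for the context."""
    path = MEMORY_DIR / f"{name}.md"
    path.write_text(content)
    return f"[offloaded to {path} ({len(content)} chars); recall when needed]"

def recall(name: str) -> str:
    """Pull 'cold' content back into the 'warm' context window on demand."""
    return (MEMORY_DIR / f"{name}.md").read_text()

notes = "Finding: context quality beats context size.\n" * 500  # stand-in for verbose notes
pointer = offload("research_notes", notes)
print(pointer)  # only this one-line pointer occupies the context window
```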

Reducing Context: Strategies like summarizing conversation history, pruning irrelevant information, and compressing verbose content address context bloat. The key insight is selective retention - preserving critical information while discarding noise. Multi-Agent Research Systems demonstrate this by having agents produce focused summaries rather than dumping raw research into shared context. Effective reduction requires understanding what’s essential versus incidental, similar to how Attention mechanism weights different input elements by relevance.
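
A sketch of what selective retention might look like; `summarize` is a placeholder that truncates where a real system would call an LLM:

```python
# Keep recent turns verbatim, compress older turns into a rolling summary.
def summarize(turns: list[str]) -> str:
    joined = " ".join(turns)  # placeholder: a real system would summarize with an LLM
    return joined[:200] + ("..." if len(joined) > 200 else "")

def reduce_history(history: list[str], keep_recent: int = 4) -> list[str]:
    """Selective retention: recent turns stay intact, older noise is compressed."""
    if len(history) <= keep_recent:
        return history
    old, recent = history[:-keep_recent], history[-keep_recent:]
    return [f"[summary of {len(old)} earlier turns] {summarize(old)}"] + recent
```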

Retrieving Context: Dynamic context retrieval strategies fetch relevant information as needed rather than preloading everything. This includes semantic search with re-ranking, tool retrieval patterns that selectively load function definitions, and targeted information fetching based on task requirements. The challenge is ensuring high-quality retrieval that brings genuinely useful context without introducing distractors - better to retrieve three highly relevant documents than twenty marginally relevant ones.
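
The retrieve-then-rerank flow can be sketched as below; term overlap stands in for embedding similarity, and in practice the second pass would use a more expensive cross-encoder rather than reusing the same score:

```python
def score(query: str, doc: str) -> float:
    # Toy relevance: fraction of query terms present in the document.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / (len(q) or 1)

def retrieve(query: str, corpus: list[str], first_pass: int = 20, final_k: int = 3) -> list[str]:
    # First pass: cheap, recall-oriented search over the whole corpus.
    candidates = sorted(corpus, key=lambda d: score(query, d), reverse=True)[:first_pass]
    # Second pass: re-rank the shortlist and keep only a few documents,
    # deliberately dropping marginal matches that would act as distractors.
    return sorted(candidates, key=lambda d: score(query, d), reverse=True)[:final_k]
```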

Isolating Context: Multi-agent systems give each agent its own context window, preventing context pollution where one agent’s exploration clutters another’s workspace. This “context quarantine” approach enables parallel exploration while maintaining focus. The Orchestrator-Worker Pattern coordinates these isolated contexts, synthesizing insights without cross-contamination - similar to how operating systems use process isolation to prevent failures from spreading.
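
A toy sketch of context quarantine; `run_agent` stands in for an LLM-driven worker, and the point is that only compact summaries cross isolation boundaries:

```python
def run_agent(task: str, private_context: list[str]) -> str:
    # Placeholder worker: a real agent would explore within its own window.
    return f"summary of findings for: {task}"

def orchestrate(subtasks: list[str]) -> str:
    summaries: list[str] = []
    for task in subtasks:
        private_context = [task]  # fresh window: no sibling chatter leaks in
        summaries.append(run_agent(task, private_context))
    # The orchestrator synthesizes summaries, never the workers' raw contexts.
    return "\n".join(summaries)

print(orchestrate(["survey context compression", "survey tool retrieval"]))
```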

Caching Context: Prompt caching strategies reduce latency and cost by reusing previously processed context. KV-Cache Optimization keeps prompt prefixes stable and makes context append-only to maximize cache hit rates. Well-cached systems can be 10x faster than those constantly recomputing attention over full context, transforming context management from a purely semantic problem into a performance engineering challenge.
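
A sketch of a cache-friendly layout under the two assumptions named above: the prefix stays byte-stable and the event log is append-only:

```python
# Any in-place edit to earlier content would invalidate the KV-cache for
# everything after it, so history is only ever appended, never rewritten.
STABLE_PREFIX = [
    {"role": "system", "content": "You are a research agent."},  # never edited
]

events: list[dict] = []

def append_event(role: str, content: str) -> list[dict]:
    events.append({"role": role, "content": content})  # append-only
    return STABLE_PREFIX + events  # identical prefix every call -> high cache hit rate
```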

Why Context Engineering Matters

The discipline emerged from observing How Contexts Fail in practice. Four primary failure modes drive the need for sophisticated context engineering:

Hallucinations Compound Through Context

Errors entering context get reinforced through references. Traditional prompting cannot remove persistent misinformation.

Context Poisoning occurs when hallucinations or errors enter context and spiral through subsequent references. An agent might hallucinate a “goals” section, then repeatedly consult this fictional content. The solution requires treating context as a potentially hostile environment requiring validation.

Context Distraction happens when accumulated context overwhelms the model’s training signal. The model starts mimicking patterns from verbose context rather than applying its trained capabilities. Smaller models hit this distraction ceiling around 32k tokens, making context management critical for sustained performance. This connects to the fundamental constraint that language models have finite attention - what gets lost in accumulated history can’t contribute to current reasoning.

Models Cannot Ignore Context

Unlike humans, language models incorporate everything in context into probability distributions. Irrelevance degrades performance.

Context Confusion emerges when superfluous content degrades response quality. Function-calling benchmarks show decreased performance when models face many tool definitions simultaneously. Selective tool loading based on semantic similarity addresses this by presenting only relevant tools for each task.
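
A sketch of selective tool loading; the Jaccard-style `similarity` function is a toy stand-in for embedding cosine similarity, and the tool catalog is hypothetical:

```python
TOOLS = {
    "search_web": "search the web for pages matching a query",
    "read_file": "read a file from the local disk",
    "run_sql": "execute a sql query against the analytics database",
    "send_email": "send an email to a recipient",
}

def similarity(a: str, b: str) -> float:
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

def tool_loadout(task: str, k: int = 2) -> dict[str, str]:
    # Present only the k most relevant tool definitions, not the full catalog.
    ranked = sorted(TOOLS, key=lambda name: similarity(task, TOOLS[name]), reverse=True)
    return {name: TOOLS[name] for name in ranked[:k]}

print(tool_loadout("search the web for context engineering papers"))
```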

Context Clash occurs when conflicting information creates ambiguity. This commonly happens in multi-stage interactions where early incorrect attempts influence later responses. One study observed a 39% performance drop when prompts were “sharded” across multiple messages rather than presented as coherent blocks. The clash creates a question of precedence - which information should the model trust when earlier content carries implicit weight from appearing first, but later content represents updated information?

Design Principles from Practice

The Manus team’s experience building an AI agent revealed actionable principles that work across different AI systems:

Design Around KV-Cache: Make prompt prefixes stable and context append-only to maximize cache hit rates. Mark cache breakpoints explicitly to optimize reuse patterns. The performance gains compound significantly - well-cached systems run 10x faster while costing dramatically less per request.
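
As one concrete instance, Anthropic’s Messages API lets you mark explicit cache breakpoints with `cache_control`; the sketch below assumes the `anthropic` Python SDK, an API key in the environment, and an illustrative model name:

```python
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set

# Imagine thousands of tokens of instructions and tool docs; prefixes below
# the model's minimum cacheable length won't actually be cached.
LONG_STABLE_SYSTEM_PROMPT = "You are a research agent. ..."

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # illustrative model name
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_STABLE_SYSTEM_PROMPT,       # stable prefix, never edited
            "cache_control": {"type": "ephemeral"},  # breakpoint: cache up to here
        }
    ],
    messages=[{"role": "user", "content": "Next step, please."}],  # append-only tail
)
print(response.content[0].text)
```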

Maintain Stable Tool Availability: Avoid dynamically adding or removing tools mid-conversation, which breaks cache and confuses the model. Instead, use token logit masking to constrain action selection within a stable tool space. This preserves both performance optimization and model coherence - the model sees a consistent capability landscape rather than shifting options.
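
A toy illustration of the masking idea, with hand-written logits standing in for real model output:

```python
import math

def masked_choice(logits: dict[str, float], allowed: set[str]) -> str:
    # The full tool space stays in the prompt (preserving the cache); decoding
    # is constrained by pushing disallowed actions to -inf before the argmax.
    masked = {a: (l if a in allowed else -math.inf) for a, l in logits.items()}
    return max(masked, key=masked.get)

logits = {"search_web": 2.1, "send_email": 2.6, "read_file": 0.4}
print(masked_choice(logits, allowed={"search_web", "read_file"}))  # -> search_web
```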

Manipulate Attention Strategically: Create artifacts like “todo.md” files that “recite objectives” within the model’s recent attention span. This pushes global plans into immediate awareness, preventing “lost-in-the-middle” problems where critical information buried in long context gets overlooked. Position matters - what comes last often matters as much as what appears first, since models show recency bias in their attention patterns.
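
A sketch of the recitation pattern, assuming a hypothetical `todo.md` the agent rewrites on every step:

```python
from pathlib import Path

def recite_plan(context: list[str], todo_path: str = "todo.md") -> list[str]:
    # Append the current plan at the *end* of the context, where recency bias
    # keeps it inside the model's effective attention span.
    plan = Path(todo_path).read_text()
    return context + [f"Current objectives (recited):\n{plan}"]

Path("todo.md").write_text("- [x] collect sources\n- [ ] draft summary\n- [ ] final report\n")
print(recite_plan(["...long accumulated history..."])[-1])
```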

Preserve Error Information: Leave “wrong turns in the context” so models learn from failed actions. This helps agents implicitly update their beliefs about what works. Scrubbing failures creates an artificially clean context that prevents learning from mistakes - similar to how humans benefit from understanding what doesn’t work, not just what does.
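
One way this might look inside an agent loop; `execute_and_record` is an illustrative helper, not a specific framework’s API:

```python
def execute_and_record(action, args: tuple, context: list[str]) -> list[str]:
    try:
        result = action(*args)
        context.append(f"OK {action.__name__}: {result}")
    except Exception as exc:
        # Keep the wrong turn visible: scrubbing it would hide exactly the
        # evidence the model needs to avoid repeating the mistake.
        context.append(f"FAILED {action.__name__}: {exc!r}")
    return context

context = execute_and_record(int, ("not a number",), [])
print(context)  # the failure stays in context as a learnable observation
```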

Introduce Controlled Variation: Prevent repetitive patterns by varying actions and responses. Models can fall into ruts where they mimic past behavior too closely, essentially overfitting to their own conversation history. Breaking these patterns maintains creative problem-solving rather than rigid pattern-matching.

Research-Specific Patterns

Deep Research Systems reveal context engineering at scale, achieving competitive performance through sophisticated context orchestration that extends general principles into domain-specific patterns.

The Research Brief as Context Anchor: Research Scoping Patterns generate structured research briefs that serve as stable context anchors throughout multi-agent exploration. Unlike conversation history, which accumulates noise, the brief provides a focused reference point. Agents consult the brief to maintain alignment with user intent without accumulating verbose conversation context. This mirrors how Caching Context uses stable prefixes - the brief caches efficiently while remaining constantly relevant.

Per-Agent Isolation Boundaries: Research Workflow Architecture implements context isolation at the architectural level. Each sub-agent operates within a dedicated context window, preventing cross-contamination between parallel explorations. The orchestrator maintains coordination context separately from worker contexts. This quarantine enables true parallelization - agents don't block on each other or pollute each other's reasoning spaces. Isolating Context becomes not just a strategy but foundational architecture.

Progressive Compression Through Pipeline: Research Compression Pipeline demonstrates multi-stage context reduction in practice, as sketched below. Stage 1: raw search results → filtered findings. Stage 2: verbose findings → focused summaries. Stage 3: multiple summaries → supervisor synthesis. Stage 4: synthesis → final report. Each stage serves a different purpose - compression for efficiency versus expansion for quality. The pipeline prevents any single context from accumulating overwhelming detail while maintaining information flow.
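
A schematic version of the four stages, with trivial placeholders where a real pipeline would call LLMs:

```python
def filter_results(raw: list[str]) -> list[str]:
    return [r for r in raw if "relevant" in r]        # stage 1: filter raw results

def summarize_findings(findings: list[str]) -> list[str]:
    return [f[:120] for f in findings]                # stage 2: compress findings

def synthesize(summaries: list[str]) -> str:
    return " | ".join(summaries)                      # stage 3: supervisor synthesis

def write_report(synthesis: str) -> str:
    return f"# Report\n\n{synthesis}"                 # stage 4: expand for quality

raw = ["relevant: finding A", "off-topic noise", "relevant: finding B"]
print(write_report(synthesize(summarize_findings(filter_results(raw)))))
```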

Token Usage Predicts Quality

80% of performance variance explained by token usage. Context quality correlates with task complexity.

Token Usage and Research Quality: Anthropic’s research revealed this correlation, which holds not because more tokens automatically improve quality, but because simple questions need minimal context while PhD-level research requires extensive exploration. What information reaches models, when, and how it’s structured matters as much as model capabilities. See Multi-Agent Research Systems for quantitative analysis.

Research vs Code Agent Patterns: Research agents and code agents require different context strategies. Research agents benefit from broad exploration with aggressive compression - cast wide net, then filter heavily. Code agents benefit from focused context with high fidelity - precise requirements, minimal distractions. Research tolerates some information loss through compression; code requires exactness. Heterogeneous Model Strategies adapt to these different needs through model selection.

Integration with Multi-Agent Systems

Context engineering becomes particularly critical in multi-agent architectures where agents operate simultaneously. Anthropic’s research system demonstrated that token usage alone explains 80% of performance variance - context quality directly determines research quality. This makes context engineering not just an optimization, but a fundamental determinant of system capability.

The architecture employs context isolation where each subagent explores within its own context window. The orchestrator synthesizes these isolated explorations without each agent seeing the others’ full context. This prevents exponential context growth while enabling parallel exploration - five agents with 50k token contexts remain manageable, whereas a single agent accumulating 250k tokens would suffer from severe degradation.

LangGraph Workflows provide the infrastructure for implementing these patterns, offering state management across conversation turns, conditional routing between processing stages, and tool integration with dynamic loading. The framework enables granular control over context flow between agents and stages, making the transformations explicit rather than implicit.
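
A minimal LangGraph sketch of an explicit two-stage context transformation; the node bodies are placeholders for real LLM calls:

```python
from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class ResearchState(TypedDict):
    question: str
    findings: str
    report: str

def research(state: ResearchState) -> dict:
    return {"findings": f"verbose notes on: {state['question']}"}  # placeholder

def compress(state: ResearchState) -> dict:
    return {"report": state["findings"][:200]}  # explicit reduction stage

builder = StateGraph(ResearchState)
builder.add_node("research", research)
builder.add_node("compress", compress)
builder.add_edge(START, "research")
builder.add_edge("research", "compress")
builder.add_edge("compress", END)
graph = builder.compile()

print(graph.invoke({"question": "what is context rot?", "findings": "", "report": ""}))
```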

Emerging Research and Tools

Context Engineering Strategies catalogues practical implementations of core techniques: RAG for selective information retrieval, tool loadout optimization, context quarantine for isolation, context pruning to remove noise, summarization for compression, and offloading for external storage. LangChain’s “How to Fix Your Context” repository provides working code for each pattern.

The Open Deep Research project showcases these principles in a production system, achieving competitive performance on PhD-level research tasks through sophisticated context orchestration. The system uses different LLMs for different stages - summarization, research generation, content compression, and final synthesis - optimizing context usage at each phase. This heterogeneous model strategy treats context engineering as a resource allocation problem: which model gets which context at what fidelity?
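
A sketch of stage-to-model routing; the model names are placeholders, not the project’s actual configuration:

```python
# Cheaper models handle high-volume compression; a stronger model handles
# the final synthesis, treating context engineering as resource allocation.
STAGE_MODELS = {
    "summarization": "small-fast-model",
    "research": "mid-tier-model",
    "compression": "small-fast-model",
    "final_synthesis": "frontier-model",
}

def model_for(stage: str) -> str:
    return STAGE_MODELS[stage]

print(model_for("final_synthesis"))  # -> frontier-model
```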

The Attention Economy

Context engineering fundamentally addresses an attention economy problem. Language models have finite attention - measured in tokens, computational budget, and architectural constraints. Just as humans can’t hold unlimited information in working memory, models can’t effectively process arbitrarily long contexts without degradation. This creates the central tension: complex tasks require rich context, but rich context degrades performance.

The discipline mirrors insights about managing cognitive load. We chunk information, create hierarchies, use external memory, and apply selective attention. Effective AI systems require similar strategies adapted to neural architecture constraints. The difference lies in how attention works - humans can consciously ignore irrelevant information, while models incorporate everything in context into their probability distributions. This makes context curation more critical for AI than for human cognition.

Context engineering represents a shift from viewing LLMs as stateless text predictors to treating them as cognitive systems requiring careful information diet. What enters context, when it arrives, how it’s structured, and what gets pruned determines system capabilities as much as model parameters themselves. Token usage explains 80% of performance variance in research systems not because larger contexts are better, but because context quality correlates with task complexity.

The field remains in early stages - more craft than science, more pattern collection than theory. As models scale and applications grow more sophisticated, context engineering will likely develop rigorous methodologies, standardized benchmarks, and principled design frameworks. For now, it exists as accumulated wisdom from practitioners discovering what works through experimentation and careful observation of failure modes. The patterns are emerging, but the theory remains incomplete.