
Reducing context encompasses strategies that minimize Context Window size through aggressive filtering, compression, and pruning. Unlike Offloading Context which relocates information externally, reducing strategies discard or compress content to maintain lean, focused contexts that stay within models’ effective operating range.

The motivation stems from Context Rot - the progressive degradation of model performance as context length increases. Even with multi-million token context windows, models perform best with compact, well-curated contexts. Reducing context addresses this by treating context as a precious resource requiring active management rather than unlimited storage.

Two Reduction Philosophies

Pruning: Complete removal of content deemed irrelevant or redundant. This aggressive approach permanently discards information, betting that it won’t be needed. Pruning prevents Context Distraction by eliminating material that might mislead or overwhelm.

Summarization: Compression of verbose content into concise form while preserving key information. This conservative approach maintains awareness of content existence while reducing token consumption. Summarization balances information preservation against space efficiency.

The choice between pruning and summarization depends on information characteristics. Clearly irrelevant content deserves pruning. Potentially relevant but verbose content benefits from summarization. Critical information stays verbatim.

Context Pruning Mechanics

Implementation typically employs a smaller, faster model as a pruning filter. This filter reads verbose content and extracts only information relevant to current objectives. The filtered subset becomes context for the primary model.

For conversation history, pruning might retain:

  • System prompt (establishes agent capabilities and behavior)
  • Current user query (defines immediate objective)
  • Previous exchanges that established critical context
  • Recent tool uses and their results
  • User preferences or requirements stated earlier
  • Successful solution patterns worth remembering

Everything else gets discarded:

  • Pleasantries and conversational overhead
  • Tangential discussions unrelated to current work
  • Failed attempts after successful completion
  • Verbose tool outputs after key findings extracted
  • Repeated information that doesn’t add new signal

This creates a compressed narrative containing just the essential story. The primary model sees clean context without distracting detail.
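A minimal sketch of this filter pattern follows. The complete(model, prompt) helper and the model name are placeholders for whatever chat-completion interface is in use; the retention and discard rules mirror the lists above.

```python
from typing import List, Dict

PRUNE_INSTRUCTIONS = """You are a context pruning filter.
Keep: the current objective, exchanges that established critical context,
recent tool results, stated user preferences, and successful solution patterns.
Discard: pleasantries, tangents, failed attempts after success,
verbose tool output whose findings are already extracted, and repetition.
Return only the retained content, verbatim where possible."""

def prune_history(history: List[Dict[str, str]], query: str,
                  complete, filter_model: str = "small-filter-model") -> str:
    """Use a small, fast model to reduce verbose history to essentials.

    `complete(model, prompt)` is a hypothetical chat-completion wrapper."""
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in history)
    prompt = (f"{PRUNE_INSTRUCTIONS}\n\nCurrent user query: {query}\n\n"
              f"Conversation so far:\n{transcript}")
    return complete(filter_model, prompt)

# The primary model then sees: system prompt + pruned narrative + current query,
# rather than the full verbose history.
```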

Progressive Summarization

Rather than treating all context equally, progressive summarization maintains multiple resolution levels based on recency:

  • Recent content (last 3-5 turns): Full fidelity with complete conversation detail
  • Medium-age content (6-15 turns): Sentence-level summaries preserving key points
  • Old content (16-30 turns): Paragraph-level summaries of major themes
  • Ancient content (31+ turns): High-level overview of conversation arc

This mirrors human memory’s recency bias. Recent information stays vivid; distant memories fade to essentials. The model maintains detailed access to immediate context while retaining awareness of broader history.

Implementation requires determining summarization boundaries. When does content transition from full-fidelity to sentence-level summary? Fixed turn counts work but ignore information density. Adaptive boundaries based on topic shifts or token thresholds provide better results.
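A simple sketch using fixed turn-count boundaries follows; summarize(text, level) is a hypothetical summarization call, and adaptive boundaries would replace the slicing logic.

```python
def progressive_view(turns, summarize):
    """Build a multi-resolution view of conversation history by recency.

    `turns` is a list of message strings, oldest first. `summarize(text, level)`
    is a hypothetical helper that compresses text to the requested granularity.
    Boundaries here are fixed turn counts; adaptive boundaries (topic shifts,
    token thresholds) would replace the slicing below."""
    recent = turns[-5:]          # last ~5 turns: full fidelity
    medium = turns[-15:-5]       # 6-15 turns back: sentence-level summaries
    old = turns[-30:-15]         # 16-30 turns back: paragraph-level summaries
    ancient = turns[:-30]        # 31+ turns back: high-level overview

    parts = []
    if ancient:
        parts.append("Conversation arc: " + summarize("\n".join(ancient), "overview"))
    if old:
        parts.append("Earlier themes: " + summarize("\n".join(old), "paragraphs"))
    if medium:
        parts.append("Key points: " + summarize("\n".join(medium), "sentences"))
    parts.extend(recent)         # recent turns pass through verbatim
    return "\n\n".join(parts)
```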

Summarization Challenges

The critical challenge is lossy compression. Summaries inevitably discard nuance, potentially losing information that becomes relevant later. This creates an irreversibility problem - once content is summarized, recovering lost detail requires returning to source material.

Effective summarization requires understanding what’s essential versus incidental. This is itself a challenging inference problem. The summarizer must predict which details matter for future reasoning without knowing what questions will arise.

Different information types require different summarization approaches (a dispatch sketch follows the list):

  • Factual content: Extract key facts, discard supporting detail
  • Reasoning traces: Preserve conclusions, compress intermediate steps
  • Tool results: Keep actionable findings, drop raw data
  • User preferences: Maintain exact specifications, remove explanatory context
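One way to encode these distinctions is to vary the summarization instruction by content type. The instructions and the summarize_with(instruction, text) helper below are illustrative, not a specific library's API.

```python
# Illustrative instruction per content type; summarize_with(instruction, text)
# is a hypothetical wrapper around a summarization model call.
TYPE_INSTRUCTIONS = {
    "factual":     "Extract the key facts. Drop supporting detail.",
    "reasoning":   "State the conclusions. Compress intermediate steps to one line each.",
    "tool_result": "Keep actionable findings only. Omit raw data.",
    "preference":  "Reproduce the user's specifications exactly. Remove explanations.",
}

def compress(content_type: str, text: str, summarize_with) -> str:
    instruction = TYPE_INSTRUCTIONS.get(
        content_type, "Summarize while preserving key information.")
    return summarize_with(instruction, text)
```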

Poor summarization can be worse than no summarization. Summaries that distort meaning or introduce inaccuracies create Context Poisoning risk. The compression must preserve semantic fidelity or it corrupts understanding.

Two-Stage Filtering

LangChain’s implementation demonstrates a powerful pattern: use a smaller model to filter for a larger model. The small model reads verbose inputs and extracts relevant portions. The large model processes only filtered content.

This architecture leverages cost-performance tradeoffs. Small models are fast and cheap but less capable. Large models are expensive but powerful. By having the small model handle high-volume filtering, the system optimizes resource allocation.
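A sketch of the two-stage pipeline under these assumptions: complete(model, prompt) is a hypothetical chat-completion wrapper, and the model names are placeholders for whatever small/large pair is available.

```python
def two_stage_answer(question: str, verbose_input: str, complete) -> str:
    """Stage 1: a small, cheap model extracts only passages relevant to the question.
    Stage 2: a large model reasons over the filtered subset."""
    filtered = complete(
        "small-filter-model",
        f"Extract only the passages relevant to answering:\n{question}\n\n"
        f"Input:\n{verbose_input}\n\nReturn the relevant excerpts verbatim.",
    )
    return complete(
        "large-reasoning-model",
        f"Answer the question using only this context.\n\n"
        f"Context:\n{filtered}\n\nQuestion: {question}",
    )
```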

The filter model can be highly specialized. Fine-tuned on examples of good versus bad context extraction, it learns domain-specific relevance patterns. This specialization produces better filtering than general-purpose models.

The tradeoff involves sequential processing overhead. The filter must complete before the primary model starts. This adds latency compared to direct processing. For tasks where quality matters more than speed, the tradeoff favors filtering.

Conversation Memory Management

Long-running agents require active conversation memory management to prevent Context Distraction. Strategies include:

Sliding Window: Maintain fixed-size context containing the N most recent turns. Old content falls off the beginning as new content appears at the end. Simple but loses potentially important historical context.

Importance-Based Retention: Score conversation turns for importance and retain high-scoring content regardless of recency. Requires effective importance detection but preserves critical context.

Hierarchical Summary: Maintain multiple summary levels (detailed recent, compressed medium, overview old) providing graceful degradation from high to low fidelity.

Checkpointing: Periodically create comprehensive summaries that become new conversation baselines. Subsequent context builds from checkpoint rather than original conversation start.

Each approach makes different tradeoffs around information loss, implementation complexity, and computational overhead.
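A sketch combining two of these strategies shows how they compose: a sliding window whose overflow is folded into a checkpoint summary rather than discarded. summarize(text) is a hypothetical model call.

```python
class CheckpointedWindow:
    """Sliding window over recent turns plus a running checkpoint summary.

    Turns beyond `window` are folded into the checkpoint rather than dropped,
    so the compressed baseline grows while the detailed window stays fixed.
    `summarize(text)` is a hypothetical call to a summarization model."""

    def __init__(self, summarize, window: int = 10):
        self.summarize = summarize
        self.window = window
        self.checkpoint = ""     # compressed baseline of older conversation
        self.turns = []          # full-fidelity recent turns

    def add(self, turn: str) -> None:
        self.turns.append(turn)
        if len(self.turns) > self.window:
            overflow = self.turns[: -self.window]
            self.turns = self.turns[-self.window:]
            self.checkpoint = self.summarize(
                f"Existing summary:\n{self.checkpoint}\n\n"
                f"New content to fold in:\n" + "\n".join(overflow))

    def context(self) -> str:
        parts = ([f"Summary of earlier conversation:\n{self.checkpoint}"]
                 if self.checkpoint else [])
        return "\n\n".join(parts + self.turns)
```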

Integration with Other Strategies

Reducing context creates destinations for Offloading Context. Rather than loading all external content into context, load summaries. Detailed information stays offloaded, retrieved only when summaries prove insufficient.

Retrieving Context benefits from reduction. Retrieved documents often contain verbose content. Summarizing retrievals before adding to context prevents retrieval from overwhelming the Context Window. This combination - retrieve then summarize - appears frequently in production systems.
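A retrieve-then-summarize step might look like the following sketch, where retriever.search() and summarize() stand in for whatever retrieval and summarization components a system uses.

```python
def retrieve_compressed(query: str, retriever, summarize, k: int = 5) -> str:
    """Retrieve documents, then condense each against the query before it
    enters context. `retriever.search(query, k)` and `summarize(instruction, text)`
    are hypothetical interfaces, not a specific library's API."""
    docs = retriever.search(query, k=k)
    condensed = [
        summarize(f"Summarize only the parts relevant to: {query}", doc)
        for doc in docs
    ]
    return "\n\n".join(condensed)
```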

Caching Context works better with reduced contexts. Smaller contexts mean more content fits in cache. Stable summarized content caches well, while verbose original content creates cache churn.

Isolating Context reduces the need for aggressive reduction. When each agent maintains a small, focused context, individual context reduction becomes less critical. But even isolated contexts benefit from pruning irrelevant content.

Preventing Context Failures

Reduction directly addresses Context Distraction by preventing contexts from growing to distraction-inducing sizes. Keeping context lean ensures models can focus on current tasks rather than getting lost in vast histories.

It helps with Context Confusion by removing irrelevant content that would create interference. Only information relevant to current objectives enters context, minimizing decision space and reducing semantic noise.

For Context Clash, reduction can remove outdated information that would contradict current understanding. Pruning stale content prevents confusion about which information represents current truth.

Context Poisoning risk decreases when pruning removes potentially hallucinated content. However, aggressive pruning might remove evidence needed to detect poisoning. The balance requires careful consideration.

Performance Characteristics

Token Reduction: Primary benefit is dramatic token reduction. Conversations that would consume 100k tokens might compress to 10k through aggressive reduction. This directly improves cost, latency, and performance.

Quality Risk: Lossy compression risks information loss that degrades output quality. The challenge is reducing tokens while preserving essential signal. Poor reduction trades token savings for capability loss.

Computational Overhead: Two-stage filtering and progressive summarization require additional computation. This overhead is usually small compared to processing verbose contexts with large models.

Latency Tradeoff: Filtering adds sequential processing steps. For latency-sensitive applications, this overhead matters. For quality-sensitive applications, the delay is worthwhile.

Implementation Frameworks

LangGraph Workflows enables sophisticated reduction through several mechanisms, sketched in code after this list:

  • State transformation nodes that implement pruning
  • Conditional routing based on context size
  • Integration of summarization models in processing graphs
  • Checkpointing that supports conversation snapshots
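A minimal sketch of this shape, assuming LangGraph's StateGraph API; the turn threshold, node bodies, and summarize_model helper are illustrative only.

```python
from typing import TypedDict, List
from langgraph.graph import StateGraph, START, END

class AgentState(TypedDict):
    messages: List[str]
    summary: str

MAX_TURNS = 20  # illustrative threshold; a token count could replace it

def summarize_model(existing: str, old_turns: List[str]) -> str:
    """Hypothetical summarization call; swap in a real model invocation."""
    prefix = existing + " | " if existing else ""
    return prefix + f"{len(old_turns)} earlier turns condensed"

def needs_reduction(state: AgentState) -> str:
    """Conditional routing based on context size."""
    return "summarize" if len(state["messages"]) > MAX_TURNS else "agent"

def summarize_node(state: AgentState) -> dict:
    """State transformation node: fold older turns into the running summary."""
    old, recent = state["messages"][:-10], state["messages"][-10:]
    return {"messages": recent, "summary": summarize_model(state["summary"], old)}

def agent_node(state: AgentState) -> dict:
    """Main model call over the reduced context (omitted in this sketch)."""
    return {}

builder = StateGraph(AgentState)
builder.add_node("summarize", summarize_node)
builder.add_node("agent", agent_node)
builder.add_conditional_edges(START, needs_reduction)  # route on context size
builder.add_edge("summarize", "agent")
builder.add_edge("agent", END)
graph = builder.compile()
```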

The framework makes reduction patterns explicit and testable. Developers can see exactly where pruning occurs, what summarization strategy applies, and how context evolves through processing stages.

Open Deep Research uses reduction extensively. Research agents produce summaries of findings rather than raw search results. The orchestrator receives compressed insights, not verbose research history. This multi-stage compression keeps any single context manageable.

Design Principles

Effective reduction follows several principles:

Preserve Signal: Aggressively discard noise but protect signal. The goal is removing distractors while maintaining information needed for reasoning.

Graceful Degradation: Compression should degrade gracefully from high fidelity to low fidelity based on age or importance. Avoid sharp transitions where critical information suddenly disappears.

Maintain Coherence: Reduced context must remain coherent. Summaries that create gaps or logical jumps harm understanding more than helping token reduction.

Context-Aware Filtering: What counts as “relevant” depends on current objectives. Reduction strategies should adapt based on task requirements rather than applying fixed rules.

Reversibility When Possible: Maintain original content externally even when summarized in active context. This enables recovering detail if summaries prove insufficient.
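One lightweight way to keep reduction reversible is to store the original externally and let a reference travel with the summary. The dict below stands in for whatever external store is used, and summarize() is a hypothetical model call.

```python
import uuid

class ReversibleSummaries:
    """Keep originals externally so summarized context can be expanded on demand."""

    def __init__(self, summarize):
        self.summarize = summarize   # hypothetical summarization call
        self.originals = {}          # stands in for external storage

    def compress(self, text: str) -> str:
        ref = str(uuid.uuid4())[:8]
        self.originals[ref] = text
        return f"{self.summarize(text)} [full text: {ref}]"

    def expand(self, ref: str) -> str:
        return self.originals[ref]
```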

The Fundamental Tradeoff

Reduction embodies the central Context Engineering tension: completeness versus focus. Complete contexts provide all information but risk Context Rot through size. Focused contexts maintain performance but risk missing important content through aggressive filtering.

This tradeoff has no universal solution. The optimal balance depends on:

  • Task complexity (complex tasks need more context)
  • Context window size (larger windows tolerate more verbose contexts)
  • Model capabilities (better models handle longer contexts)
  • Performance requirements (quality-sensitive work favors completeness)
  • Cost constraints (token-limited scenarios favor aggressive reduction)

Effective systems dynamically adjust reduction aggressiveness based on these factors rather than applying fixed strategies.
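A toy heuristic for that adjustment might weigh these factors directly; the thresholds and return values below are illustrative only.

```python
def reduction_level(task_complexity: float, context_budget: int,
                    current_tokens: int, quality_sensitive: bool) -> str:
    """Pick a reduction strategy from illustrative factors.

    task_complexity in [0, 1]; context_budget and current_tokens in tokens.
    Returns one of: 'none', 'summarize', 'prune'."""
    utilization = current_tokens / max(context_budget, 1)
    if utilization < 0.5 and (quality_sensitive or task_complexity > 0.7):
        return "none"        # plenty of headroom and completeness matters
    if utilization < 0.85:
        return "summarize"   # compress verbose spans, keep awareness
    return "prune"           # near the budget: discard aggressively
```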