The context window defines the maximum amount of text (measured in tokens) that a language model can process in a single interaction. It includes both the input prompt and the model’s generated response. Think of it as the model’s “working memory” - everything outside this window is effectively invisible to the model during generation.
How It Works
When you interact with a language model:
- Your input prompt consumes some portion of the context window
- The model’s response consumes additional space
- In multi-turn conversations, the entire conversation history must fit within the window
- Once the window fills, older content must be truncated or summarized (a minimal budgeting sketch follows this list)
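A minimal budgeting sketch, assuming an illustrative 8,000-token window and a 1,000-token reservation for the reply (all numbers are arbitrary):

```python
# Illustrative budget for one interaction; all numbers are arbitrary.
CONTEXT_WINDOW = 8_000      # total tokens the model can attend to
MAX_RESPONSE = 1_000        # tokens reserved for the model's reply

def remaining_budget(prompt_tokens: int, history_tokens: int) -> int:
    """Tokens still available before older content must be truncated or
    summarized. A negative result means the window is already overfull."""
    return CONTEXT_WINDOW - MAX_RESPONSE - prompt_tokens - history_tokens

print(remaining_budget(prompt_tokens=1_200, history_tokens=5_500))  # 300
print(remaining_budget(prompt_tokens=1_200, history_tokens=6_000))  # -200
```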
Modern models have varying context window sizes:
- Early GPT models: ~4,000 tokens (roughly 3,000 words)
- Current production models: 32,000 to 200,000+ tokens
- Specialized long-context models: 1,000,000+ tokens
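Token counts depend on the tokenizer, not on word counts alone. A short sketch using the tiktoken package shows the relationship for one encoding (the sample sentence is arbitrary):

```python
import tiktoken

# cl100k_base is the encoding used by several recent OpenAI models;
# other model families ship their own tokenizers with different counts.
encoding = tiktoken.get_encoding("cl100k_base")

text = "The context window includes both the prompt and the response."
tokens = encoding.encode(text)

print(f"{len(text.split())} words -> {len(tokens)} tokens")
```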
Why It Matters
The context window fundamentally constrains what tasks a model can perform. Analyzing a 500-page document requires one of the following:
- A large enough context window to fit the entire document
- Chunking strategies that process sections separately (sketched after this list)
- Multi-agent approaches where different agents handle different portions
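A minimal sketch of the chunking option, using word counts as a stand-in for tokens and an arbitrary chunk size and overlap:

```python
def chunk_document(text: str, chunk_size: int = 2_000, overlap: int = 200) -> list[str]:
    """Split a long document into overlapping word-based chunks.

    Word counts stand in for token counts; the overlap keeps sentences that
    straddle a boundary visible in both neighboring chunks.
    """
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
    return chunks

# Each chunk is then summarized or analyzed in its own model call,
# and the per-chunk results are merged in a final pass.
```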
In Multi-Agent Research Systems, giving each agent its own context window enables parallel processing. Rather than a single agent sequentially processing multiple research directions (which would quickly exhaust its context), specialized agents can explore different topics simultaneously, each within its own context boundary.
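A sketch of that parallelism, with a placeholder research_agent standing in for a real model call and asyncio providing the concurrency:

```python
import asyncio

async def research_agent(topic: str) -> str:
    """Placeholder for an agent with its own context window.

    In a real system this would build a fresh prompt for `topic`
    and call a model; here it only simulates the work.
    """
    await asyncio.sleep(0.1)          # stands in for the model call
    return f"findings on {topic}"

async def run_research(topics: list[str]) -> list[str]:
    # Each agent explores one topic inside its own context boundary,
    # so no single context has to hold every research direction.
    return await asyncio.gather(*(research_agent(t) for t in topics))

results = asyncio.run(run_research(["market size", "competitors", "regulation"]))
print(results)
```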
Context Management Strategies
The field of Context Engineering has emerged to address the challenges of effectively utilizing context windows. While having large context capacity is valuable, Context Rot demonstrates that performance degrades as context grows, even when staying within theoretical limits.
Summarization: Compress earlier conversation turns into concise summaries. This preserves key information while freeing space for new content. Part of broader Context Engineering Strategies that manage information flow.
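A minimal sketch of that compression, assuming a hypothetical summarize() helper that would wrap a model call in a real system:

```python
def summarize(messages: list[str]) -> str:
    """Hypothetical helper: in practice this would ask a model to
    compress the given turns into a short summary."""
    return f"{len(messages)} earlier turns condensed"

def compact_history(messages: list[str], keep_recent: int = 4) -> list[str]:
    # Older turns collapse into one summary message; the most recent
    # turns stay verbatim so the model keeps fine-grained detail.
    if len(messages) <= keep_recent:
        return messages
    older, recent = messages[:-keep_recent], messages[-keep_recent:]
    return [f"(context summary) {summarize(older)}"] + recent
```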
Sliding Windows: Maintain the most recent N tokens while discarding older content. Simple to implement, but it can discard important earlier context.
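A sliding window over the message history might look like this sketch (word counts again stand in for tokens):

```python
def sliding_window(messages: list[str], max_tokens: int = 4_000) -> list[str]:
    """Keep only the most recent messages whose combined size fits the budget."""
    kept, used = [], 0
    for message in reversed(messages):    # walk backwards from the newest turn
        cost = len(message.split())       # crude token estimate
        if used + cost > max_tokens:
            break                         # everything older is dropped
        kept.append(message)
        used += cost
    return list(reversed(kept))           # restore chronological order
```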
Selective Retention: Identify and preserve critical information (like system prompts, key facts, user preferences) while discarding less important content. How Contexts Fail reveals why this curation matters - models struggle with irrelevant information.
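One way to sketch selective retention, assuming each message carries an invented priority tag and a crude token estimate:

```python
from dataclasses import dataclass

@dataclass
class Message:
    text: str
    priority: int  # invented scale: 0 = disposable, 2 = must keep (system prompt, key facts)

def retain(messages: list[Message], budget: int) -> list[Message]:
    """Keep high-priority messages first, then fill any remaining space
    with lower-priority content; whatever does not fit is discarded."""
    kept_ids, used = set(), 0
    for msg in sorted(messages, key=lambda m: -m.priority):
        cost = len(msg.text.split())          # crude token estimate
        if used + cost <= budget:
            kept_ids.add(id(msg))
            used += cost
    # preserve the original conversational order of the surviving messages
    return [m for m in messages if id(m) in kept_ids]

history = [Message("You are a careful analyst.", priority=2),
           Message("Chit-chat about the weather.", priority=0),
           Message("User prefers bullet-point answers.", priority=2)]
print(retain(history, budget=12))   # keeps only the two high-priority messages
```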
External Memory: Store information outside the context window in databases or vector stores, retrieving relevant pieces as needed. This underpins retrieval-augmented generation (RAG) approaches and context offloading patterns.
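A toy external-memory sketch using character-count vectors and cosine similarity; a real system would use an embedding model and a vector database:

```python
import math

# Toy vector store: (text, embedding) pairs kept outside the context window.
memory: list[tuple[str, list[float]]] = []

def embed(text: str) -> list[float]:
    """Stand-in embedding: counts letter frequencies instead of calling a model."""
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

def remember(text: str) -> None:
    memory.append((text, embed(text)))

def recall(query: str, k: int = 2) -> list[str]:
    """Retrieve the k most similar stored snippets to pull back into context."""
    q = embed(query)
    ranked = sorted(memory, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

remember("The user prefers concise answers.")
remember("Project deadline is the end of Q3.")
print(recall("when is the deadline?"))
```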
Agent Specialization: In multi-agent systems, give each agent a focused domain requiring less context rather than a single agent trying to maintain broad context about everything. This context isolation prevents the cross-contamination described in How Contexts Fail.
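A sketch of that isolation, with hypothetical agent names and a private message list per agent so one agent's context never leaks into another's:

```python
class Agent:
    """Each agent keeps its own message history, i.e. its own context."""

    def __init__(self, name: str, system_prompt: str) -> None:
        self.name = name
        self.messages = [("system", system_prompt)]   # private context

    def ask(self, user_input: str) -> str:
        self.messages.append(("user", user_input))
        # A real implementation would send self.messages to a model here;
        # the reply is faked so the sketch stays self-contained.
        reply = f"[{self.name}] answer based on {len(self.messages)} messages"
        self.messages.append(("assistant", reply))
        return reply

# Two agents with narrow, non-overlapping domains and isolated contexts.
legal = Agent("legal", "You answer only legal questions.")
finance = Agent("finance", "You answer only financial questions.")
print(legal.ask("Is the contract clause enforceable?"))
print(finance.ask("What is our Q2 burn rate?"))
```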
Architectural Implications
Context windows create fundamental tradeoffs in system design:
Larger Windows: Enable handling complex, lengthy tasks in single interactions but consume more compute and increase latency. The model must process more tokens with each generation.
Smaller Windows with Orchestration: Require more sophisticated coordination but enable parallel processing and can be more cost-effective. Orchestrator-worker patterns manage this complexity.
Hybrid Approaches: Use large contexts for synthesis and coordination while employing smaller contexts for focused subtasks.
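An illustrative sketch of the hybrid pattern, with invented budgets and placeholder worker calls:

```python
# Illustrative budgets only: the coordinator synthesizes in a large window,
# while each worker handles one focused subtask in a small window.
COORDINATOR_BUDGET = 100_000   # tokens
WORKER_BUDGET = 8_000          # tokens

def worker(subtask: str) -> str:
    """Stand-in for a model call whose prompt is kept within WORKER_BUDGET."""
    return f"result for '{subtask}'"

def coordinate(task: str, subtasks: list[str]) -> str:
    # Only the workers' compact results (not their full working contexts)
    # flow back into the coordinator's larger context for synthesis.
    results = [worker(s) for s in subtasks]
    return f"Synthesis for '{task}': " + "; ".join(results)

print(coordinate("market analysis", ["sizing", "competitors", "pricing"]))
```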
Cognitive Parallel
The concept mirrors human working memory limitations. We can’t hold unlimited information in active attention - we chunk, summarize, and offload to external memory (notes, documents, tools). Effective AI systems similarly combine limited active context with sophisticated memory management strategies.