The context window defines the maximum amount of text (measured in tokens) that a language model can process in a single interaction. It includes both the input prompt and the model’s generated response. Think of it as the model’s “working memory” - everything outside this window is effectively invisible to the model during generation.
How It Works
When you interact with a language model:
- Your input prompt consumes some portion of the context window
- The model’s response consumes additional space
- In multi-turn conversations, the entire conversation history must fit within the window
- Once the window fills, older content must be truncated or summarized (a minimal budgeting sketch follows this list)
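A minimal budgeting sketch, assuming an illustrative 8,000-token window and a 1,000-token reservation for the reply (all numbers are arbitrary):

```python
# Illustrative budget for one interaction; all numbers are arbitrary.
CONTEXT_WINDOW = 8_000      # total tokens the model can attend to
MAX_RESPONSE = 1_000        # tokens reserved for the model's reply

def remaining_budget(prompt_tokens: int, history_tokens: int) -> int:
    """Tokens still available before older content must be truncated or
    summarized. A negative result means the window is already overfull."""
    return CONTEXT_WINDOW - MAX_RESPONSE - prompt_tokens - history_tokens

print(remaining_budget(prompt_tokens=1_200, history_tokens=5_500))  # 300
print(remaining_budget(prompt_tokens=1_200, history_tokens=6_000))  # -200
```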
Modern models have varying context window sizes:
- Early GPT models: ~4,000 tokens (roughly 3,000 words)
- Current production models: 32,000 to 200,000+ tokens
- Specialized long-context models: 1,000,000+ tokens
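Token counts depend on the tokenizer, not on word counts alone. A short sketch using the tiktoken package shows the relationship for one encoding (the sample sentence is arbitrary):

```python
import tiktoken

# cl100k_base is the encoding used by several recent OpenAI models;
# other model families ship their own tokenizers with different counts.
encoding = tiktoken.get_encoding("cl100k_base")

text = "The context window includes both the prompt and the response."
tokens = encoding.encode(text)

print(f"{len(text.split())} words -> {len(tokens)} tokens")
```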
Why It Matters
The context window fundamentally constrains what tasks a model can perform. Analyzing a 500-page document requires one of the following:
- A large enough context window to fit the entire document
- Chunking strategies that process sections separately (sketched after this list)
- Multi-agent approaches where different agents handle different portions
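A minimal sketch of the chunking option, using word counts as a stand-in for tokens and an arbitrary chunk size and overlap:

```python
def chunk_document(text: str, chunk_size: int = 2_000, overlap: int = 200) -> list[str]:
    """Split a long document into overlapping word-based chunks.

    Word counts stand in for token counts; the overlap keeps sentences that
    straddle a boundary visible in both neighboring chunks.
    """
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
    return chunks

# Each chunk is then summarized or analyzed in its own model call,
# and the per-chunk results are merged in a final pass.
```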
In Multi-Agent Research Systems, giving each agent its own context window enables parallel processing. Rather than a single agent sequentially processing multiple research directions (which would quickly exhaust its context), specialized agents can explore different topics simultaneously, each within its own context boundary.
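A sketch of that parallelism, with a placeholder research_agent standing in for a real model call and asyncio providing the concurrency:

```python
import asyncio

async def research_agent(topic: str) -> str:
    """Placeholder for an agent with its own context window.

    In a real system this would build a fresh prompt for `topic`
    and call a model; here it only simulates the work.
    """
    await asyncio.sleep(0.1)          # stands in for the model call
    return f"findings on {topic}"

async def run_research(topics: list[str]) -> list[str]:
    # Each agent explores one topic inside its own context boundary,
    # so no single context has to hold every research direction.
    return await asyncio.gather(*(research_agent(t) for t in topics))

results = asyncio.run(run_research(["market size", "competitors", "regulation"]))
print(results)
```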
Context Management Strategies
The field of Context Engineering has emerged to address the challenges of effectively utilizing context windows. While having large context capacity is valuable, Context Rot demonstrates that performance degrades as context grows, even when staying within theoretical limits.
Summarization: Compress earlier conversation turns into concise summaries. This preserves key information while freeing space for new content. Part of broader Context Engineering Strategies that manage information flow.
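A minimal sketch of that compression, assuming a hypothetical summarize() helper that would wrap a model call in a real system:

```python
def summarize(messages: list[str]) -> str:
    """Hypothetical helper: in practice this would ask a model to
    compress the given turns into a short summary."""
    return f"{len(messages)} earlier turns condensed"

def compact_history(messages: list[str], keep_recent: int = 4) -> list[str]:
    # Older turns collapse into one summary message; the most recent
    # turns stay verbatim so the model keeps fine-grained detail.
    if len(messages) <= keep_recent:
        return messages
    older, recent = messages[:-keep_recent], messages[-keep_recent:]
    return [f"(context summary) {summarize(older)}"] + recent
```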
Sliding Windows: Maintain the most recent N tokens while discarding older content. Simple to implement, but it can discard important earlier context.
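A sliding window over the message history might look like this sketch (word counts again stand in for tokens):

```python
def sliding_window(messages: list[str], max_tokens: int = 4_000) -> list[str]:
    """Keep only the most recent messages whose combined size fits the budget."""
    kept, used = [], 0
    for message in reversed(messages):    # walk backwards from the newest turn
        cost = len(message.split())       # crude token estimate
        if used + cost > max_tokens:
            break                         # everything older is dropped
        kept.append(message)
        used += cost
    return list(reversed(kept))           # restore chronological order
```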
Selective Retention: Identify and preserve critical information (like system prompts, key facts, user preferences) while discarding less important content. How Contexts Fail reveals why this curation matters - models struggle with irrelevant information.
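One way to sketch selective retention, assuming each message carries an invented priority tag and a crude token estimate:

```python
from dataclasses import dataclass

@dataclass
class Message:
    text: str
    priority: int  # invented scale: 0 = disposable, 2 = must keep (system prompt, key facts)

def retain(messages: list[Message], budget: int) -> list[Message]:
    """Keep high-priority messages first, then fill any remaining space
    with lower-priority content; whatever does not fit is discarded."""
    kept_ids, used = set(), 0
    for msg in sorted(messages, key=lambda m: -m.priority):
        cost = len(msg.text.split())          # crude token estimate
        if used + cost <= budget:
            kept_ids.add(id(msg))
            used += cost
    # preserve the original conversational order of the surviving messages
    return [m for m in messages if id(m) in kept_ids]

history = [Message("You are a careful analyst.", priority=2),
           Message("Chit-chat about the weather.", priority=0),
           Message("User prefers bullet-point answers.", priority=2)]
print(retain(history, budget=12))   # keeps only the two high-priority messages
```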
External Memory: Store information outside the context window in databases or vector stores, retrieving relevant pieces as needed. This underpins retrieval-augmented generation (RAG) approaches and context offloading patterns.
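A toy external-memory sketch using character-count vectors and cosine similarity; a real system would use an embedding model and a vector database:

```python
import math

# Toy vector store: (text, embedding) pairs kept outside the context window.
memory: list[tuple[str, list[float]]] = []

def embed(text: str) -> list[float]:
    """Stand-in embedding: counts letter frequencies instead of calling a model."""
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

def remember(text: str) -> None:
    memory.append((text, embed(text)))

def recall(query: str, k: int = 2) -> list[str]:
    """Retrieve the k most similar stored snippets to pull back into context."""
    q = embed(query)
    ranked = sorted(memory, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

remember("The user prefers concise answers.")
remember("Project deadline is the end of Q3.")
print(recall("when is the deadline?"))
```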
Agent Specialization: In multi-agent systems, give each agent a focused domain requiring less context rather than a single agent trying to maintain broad context about everything. This context isolation prevents the cross-contamination described in How Contexts Fail.
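A sketch of that isolation, with hypothetical agent names and a private message list per agent so one agent's context never leaks into another's:

```python
class Agent:
    """Each agent keeps its own message history, i.e. its own context."""

    def __init__(self, name: str, system_prompt: str) -> None:
        self.name = name
        self.messages = [("system", system_prompt)]   # private context

    def ask(self, user_input: str) -> str:
        self.messages.append(("user", user_input))
        # A real implementation would send self.messages to a model here;
        # the reply is faked so the sketch stays self-contained.
        reply = f"[{self.name}] answer based on {len(self.messages)} messages"
        self.messages.append(("assistant", reply))
        return reply

# Two agents with narrow, non-overlapping domains and isolated contexts.
legal = Agent("legal", "You answer only legal questions.")
finance = Agent("finance", "You answer only financial questions.")
print(legal.ask("Is the contract clause enforceable?"))
print(finance.ask("What is our Q2 burn rate?"))
```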
Architectural Implications
Context windows create fundamental tradeoffs in system design:
Larger Windows: Enable handling complex, lengthy tasks in single interactions but consume more compute and increase latency. The model must process more tokens with each generation.
Smaller Windows with Orchestration: Require more sophisticated coordination but enable parallel processing and can be more cost-effective. Orchestrator-worker patterns manage this complexity.
Hybrid Approaches: Use large contexts for synthesis and coordination while employing smaller contexts for focused subtasks.
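An illustrative sketch of the hybrid pattern, with invented budgets and placeholder worker calls:

```python
# Illustrative budgets only: the coordinator synthesizes in a large window,
# while each worker handles one focused subtask in a small window.
COORDINATOR_BUDGET = 100_000   # tokens
WORKER_BUDGET = 8_000          # tokens

def worker(subtask: str) -> str:
    """Stand-in for a model call whose prompt is kept within WORKER_BUDGET."""
    return f"result for '{subtask}'"

def coordinate(task: str, subtasks: list[str]) -> str:
    # Only the workers' compact results (not their full working contexts)
    # flow back into the coordinator's larger context for synthesis.
    results = [worker(s) for s in subtasks]
    return f"Synthesis for '{task}': " + "; ".join(results)

print(coordinate("market analysis", ["sizing", "competitors", "pricing"]))
```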
Cognitive Parallel
The concept mirrors human working memory limitations. We can’t hold unlimited information in active attention - we chunk, summarize, and offload to external memory (notes, documents, tools). Effective AI systems similarly combine limited active context with sophisticated memory management strategies.