Caching context involves reusing previously computed attention states to improve performance and reduce cost when prompt prefixes remain stable across requests. This Context Engineering strategy transforms context management from a purely semantic challenge into a performance optimization problem with architectural implications throughout system design.

When prompt prefixes stay consistent - system prompts, established conversation context, reference documentation - models can cache the key-value pairs from attention computations and reuse them rather than recomputing from scratch. This creates dramatic speedups and cost reductions, but requires specific context management patterns to maximize cache hit rates.

KV-Cache Mechanics

The Attention mechanism computes key and value tensors for each token in context during forward passes. These K and V tensors enable the model to attend to different parts of the input when generating outputs. Computing these tensors is expensive - it requires a full forward pass of the model over the context.

When context changes slightly - adding new tokens to the end while keeping earlier content stable - the K and V tensors for stable content remain valid. The model can reuse cached tensors for unchanged content, computing fresh tensors only for new content. This dramatically reduces computation.

The performance gains compound significantly. A request with 10,000 cached tokens and 100 new tokens only needs fresh attention computation for the 100 new tokens rather than all 10,100 - roughly a 100x reduction in the tokens that must be processed. Well-cached systems can be 10x faster than systems that constantly recompute over the full context.

Cost reductions mirror speed improvements. Anthropic’s Claude API charges differently for cached versus uncached tokens - cached tokens are dramatically cheaper. Systems designed for caching can achieve 5-10x cost reductions compared to cache-naive implementations.
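
To make the economics concrete, here is a back-of-the-envelope sketch. The base rate and multiplier are assumptions chosen for illustration (cache reads roughly 10x cheaper than uncached input); actual rates vary by provider and model.

```python
# Illustrative cache economics; BASE_RATE and CACHE_READ_MULT are assumptions.
BASE_RATE = 3.00 / 1_000_000      # hypothetical $ per uncached input token
CACHE_READ_MULT = 0.10            # assumption: cached tokens ~10x cheaper to read
cached_tokens, new_tokens = 10_000, 100

uncached_cost = (cached_tokens + new_tokens) * BASE_RATE
warm_cache_cost = cached_tokens * BASE_RATE * CACHE_READ_MULT + new_tokens * BASE_RATE

print(f"without caching:  ${uncached_cost:.4f} per request")
print(f"with warm cache:  ${warm_cache_cost:.4f} per request")  # roughly 9x cheaper here
```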

Design Patterns for Caching

Stable Prefixes: Keep invariant content at the beginning of context where it forms consistent prefixes across requests. System prompts, coding guidelines, API documentation - anything that doesn’t change between requests should appear early.

Variable content goes after cached sections. User queries, conversation history, and task-specific information appear after the stable prefix. This enables caching the expensive invariant content while processing only the variable tail.
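
A minimal sketch of this ordering, using a generic chat-message format; the file paths and prompt text are illustrative:

```python
# Assemble a cache-friendly prompt: invariant content first, per-request content last.
SYSTEM_PROMPT = "You are a coding assistant. Follow the team style guide."
CODING_GUIDELINES = open("guidelines.md").read()     # stable across requests (illustrative path)
API_REFERENCE = open("api_reference.md").read()      # stable across requests (illustrative path)

def build_messages(conversation: list[dict], user_query: str) -> list[dict]:
    """Stable prefix first, variable tail last, so the prefix stays byte-identical."""
    return [
        {"role": "system",
         "content": f"{SYSTEM_PROMPT}\n\n{CODING_GUIDELINES}\n\n{API_REFERENCE}"},
        *conversation,                                # append-only history (next pattern)
        {"role": "user", "content": user_query},      # variable tail
    ]
```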

Append-Only Context: Design context to grow through appending rather than modifying earlier content. Modification breaks cache validity - any change to cached content invalidates the cache. Append-only structures maximize cache hit rates by ensuring earlier content stays stable.

This pattern influences conversation memory management. Rather than editing earlier messages, append new messages. Rather than updating summaries in-place, append updated versions. Immutability enables caching.
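
A small sketch of the append-only pattern, assuming the same generic message format as above:

```python
def update_summary(history: list[dict], new_summary: str) -> list[dict]:
    """Append an updated summary instead of editing an earlier message in place."""
    # Mutating an earlier message would invalidate the cache for every token
    # at and after the edited position; appending keeps the prefix stable.
    return history + [{"role": "assistant",
                       "content": f"[Updated summary] {new_summary}"}]
```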

Explicit Cache Breakpoints: Systems like Claude support cache control annotations that mark the positions up to which the prompt prefix can be cached and reused. These explicit boundaries guide the provider's caching infrastructure to optimize reuse patterns.
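
A minimal sketch using Anthropic's prompt caching, which marks breakpoints with cache_control annotations. Parameter and usage field names follow Anthropic's documentation at the time of writing; the model name and document contents are illustrative, so check current docs before relying on the exact shape.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
STABLE_REFERENCE = open("team_docs.md").read()  # guidelines, docs, examples (illustrative)

response = client.messages.create(
    model="claude-sonnet-4-20250514",   # illustrative model name
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": STABLE_REFERENCE,
            # Everything up to and including this block becomes a cacheable prefix;
            # content after the breakpoint is processed normally.
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[{"role": "user", "content": "What does the retry policy say?"}],
)
print(response.usage.cache_creation_input_tokens,   # tokens written to cache
      response.usage.cache_read_input_tokens)       # tokens served from cache
```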

LangChain’s context management emphasizes designing prompt structures with caching in mind from the start. Put high-volume, stable content (documentation, examples, guidelines) in cacheable positions. Reserve non-cacheable space for dynamic content.

Anti-Patterns That Break Caching

Dynamic Tool Loading: Adding or removing tools mid-conversation invalidates cache. The Manus team discovered this creates both performance problems and model confusion. Solution: maintain stable tool availability throughout conversations, using other techniques like logit masking to constrain tool selection rather than removing tools from context.

In-Place Updates: Modifying earlier context content breaks cache validity. Updating a summary, correcting a mistake, or refining earlier content all force cache invalidation. This makes edit-based context management expensive from a caching perspective.

Unstable Prefixes: Randomized examples, timestamps in prompts, or other sources of variability in early context prevent caching. Even a small change invalidates the cache from the first differing token onward, forcing recomputation of everything that follows.

Frequent Context Resets: Starting fresh conversations discards accumulated cache. While sometimes necessary for coherence, resets eliminate caching benefits. Balance context quality against cache preservation.

Integration with Context Strategies

Caching creates tensions with other Context Engineering Strategies:

Caching vs. Reduction: Reducing Context through pruning or summarization typically modifies context, breaking cache. But reduction might enable fitting more requests within cache limits. Tradeoff: cache stability versus context quality.

Caching vs. Retrieval: Retrieving Context dynamically changes what appears in context, limiting caching opportunities. But retrieval infrastructure itself can cache - vector stores cache embeddings, re-ranking caches evaluations. This moves caching from model inference to the retrieval layer.

Caching vs. Isolation: Isolating Context across agents prevents cache sharing between contexts. Each isolated context must build its own cache. But within isolated contexts, stable prefixes still benefit from caching.

Caching vs. Offloading: Offloading Context removes content from the Context Window, but offloaded content can’t be cached. This tradeoff favors offloading for rarely accessed content, caching for frequently accessed content.

Access Frequency Patterns

Optimal caching strategy depends on content access frequency:

Always Accessed: System prompts, core guidelines, fundamental documentation. These belong in cached stable prefixes. Maximum ROI on caching investment.

Frequently Accessed: Common code patterns, frequently referenced APIs, recent conversation context. These benefit from caching but might not justify massive cache sizes. Cache selectively.

Occasionally Accessed: Historical conversation, edge-case documentation, rarely used tools. These create poor cache hit rates and are better handled by Offloading Context or Retrieving Context on demand.

Rarely Accessed: Ancient history, exhaustive documentation, comprehensive examples. Don’t waste cache capacity. Retrieve if needed or prune entirely.

This creates a memory hierarchy resembling operating system memory management (see the sketch after this list):

  • Cache: Hot content with highest access frequency
  • Context Window: Warm content needed for current task
  • External Storage: Cold content retrieved on demand
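
A tiny placement sketch for this hierarchy; the thresholds are arbitrary illustrations of the idea, not tuned values:

```python
def placement(access_rate: float) -> str:
    """Route content by the fraction of requests that need it (illustrative thresholds)."""
    if access_rate > 0.9:
        return "cached stable prefix"                 # hot: system prompt, core guidelines
    if access_rate > 0.3:
        return "context window"                       # warm: recent conversation, common APIs
    return "external storage, retrieved on demand"    # cold: archives, edge cases
```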

Multi-Agent Caching

Multi-Agent Research Systems create complex caching scenarios. Each agent maintains isolated context, but agents might share common prefixes:

  • System prompts explaining orchestration pattern
  • Shared research guidelines
  • Common tool definitions
  • Base knowledge about research methodology

Naive isolation would cache these prefixes separately for each agent. Sophisticated infrastructure could share cache across agents for common prefixes. This requires coordinated cache management where shared prefixes are explicitly identified and reused.

Open Deep Research could benefit from shared caching of research guidelines while maintaining isolated contexts for actual research. Implementation complexity must be weighed against caching benefits.

Cost-Performance Modeling

Caching transforms context engineering cost models. Without caching:

  • Cost proportional to total tokens processed
  • Latency proportional to total context length
  • No benefit from stable content

With caching:

  • Fixed cost to build cache (first request pays full price)
  • Marginal cost for cached content dramatically lower
  • Latency driven by uncached content length
  • Strong incentive to maximize stable prefixes

This changes optimization focus. Rather than minimizing total context size, optimize for maximizing cacheable content ratio. A 20k context with 18k cached and 2k uncached can be cheaper and faster than a 10k context with no caching.

The break-even analysis considers:

  • Cache build cost (first request with new content)
  • Per-request savings from cache hits
  • Number of requests before context changes
  • Cost difference between cached and uncached tokens

For high-frequency, stable-prefix scenarios, caching provides massive ROI. For one-off queries with unstable prefixes, caching overhead might not pay off.
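
A break-even sketch under stated assumptions: the first request pays a cache-write premium, and every subsequent request with the same prefix pays a cheaper cache-read rate. The multipliers are illustrative, not provider pricing.

```python
def breakeven_requests(prefix_tokens: int,
                       write_mult: float = 1.25,          # assumed one-time cache-write premium
                       read_mult: float = 0.10) -> float:  # assumed cache-read discount
    """Requests (including the first) needed before caching beats no caching."""
    extra_first_request = prefix_tokens * (write_mult - 1.0)   # premium paid once
    per_request_saving = prefix_tokens * (1.0 - read_mult)     # saved on each cache hit
    return 1 + extra_first_request / per_request_saving

print(breakeven_requests(18_000))   # ~1.3: under these assumptions, caching pays off on the second request
```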

Prompt Engineering for Caching

Effective caching requires prompt engineering discipline:

Structure for Stability: Organize prompts with stable content first, variable content last. Resist the temptation to sprinkle variable content throughout - this breaks caching.

Version Carefully: When stable content must change, consider versioning that enables gradual migration. Don’t invalidate all user caches simultaneously with prompt updates.

Minimize Variability: Remove unnecessary sources of instability. Timestamps, randomized examples, dynamic greetings - these add variation that breaks caching without adding value.

Document Cache Boundaries: Make explicit which prompt sections are stable versus variable. This enables reasoning about cache behavior and debugging cache problems.
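
One way to make versioning and cache boundaries explicit in code; the structure and field names are illustrative:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PromptTemplate:
    version: str        # bump deliberately; any change to the prefix invalidates the cache
    stable_prefix: str  # system prompt, guidelines, reference docs - never edited in place

    def render(self, variable_tail: str) -> str:
        # Variable content is appended after the stable prefix, never interleaved into it.
        return f"[prompt-version: {self.version}]\n{self.stable_prefix}\n\n{variable_tail}"

TEMPLATE = PromptTemplate(version="v3",
                          stable_prefix="You are a support agent. <guidelines...>")
prompt = TEMPLATE.render("Customer question: how do I reset my password?")
```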

The Manus team’s experience emphasizes designing around KV-cache from the start. Retrofitting caching into cache-naive prompts requires significant restructuring. Better to build cache-friendly architectures initially.

Implementation Frameworks

LangGraph Workflows supports caching through careful state management (a minimal sketch follows the list below). The framework enables:

  • Structured prompts with stable prefixes
  • Append-only conversation histories
  • Explicit checkpointing where cache boundaries make sense
  • State isolation that supports per-agent caching
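
A minimal LangGraph sketch of append-only state, using the add_messages reducer so new messages are appended rather than overwriting earlier ones; the node body is a stub:

```python
from typing import Annotated, TypedDict
from langgraph.graph import StateGraph, START, END
from langgraph.graph.message import add_messages

class AgentState(TypedDict):
    # add_messages appends incoming messages, keeping earlier context stable
    messages: Annotated[list, add_messages]

def call_model(state: AgentState) -> dict:
    # In a real node this would call the model with a stable system prefix
    # plus state["messages"]; stubbed here to keep the sketch self-contained.
    return {"messages": [{"role": "assistant", "content": "placeholder response"}]}

builder = StateGraph(AgentState)
builder.add_node("model", call_model)
builder.add_edge(START, "model")
builder.add_edge("model", END)
graph = builder.compile()
```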

Provider APIs like Claude’s prompt caching require opt-in through cache control markers. The application explicitly annotates which context sections should be cached. This gives fine-grained control but requires understanding caching mechanics.

OpenAI’s implementation caches automatically based on heuristics. Less control but lower cognitive overhead for developers. The system infers caching opportunities from prompt structure.

Observability and Debugging

Caching creates observability challenges:

  • Cache hit rates determine actual performance
  • Cache misses might indicate prompt instability
  • Unexpected cache invalidation suggests prompt changes
  • Cost variance reflects caching effectiveness

Effective systems monitor:

  • Cache hit percentage per request
  • Time spent computing versus reusing cache
  • Cost breakdown between cached and uncached tokens
  • Correlation between prompt changes and cache invalidation

This visibility enables optimizing for caching. Identify stability problems, measure caching benefits, and justify architectural decisions based on cache performance.
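
A small monitoring helper along these lines, using the usage counters Anthropic reports when prompt caching is active (field names as of this writing; other providers expose similar counters):

```python
def cache_hit_ratio(usage) -> float:
    """Fraction of input tokens served from cache on a single request."""
    cached = getattr(usage, "cache_read_input_tokens", 0) or 0
    written = getattr(usage, "cache_creation_input_tokens", 0) or 0
    fresh = getattr(usage, "input_tokens", 0) or 0   # uncached input tokens
    total = cached + written + fresh
    return cached / total if total else 0.0

# Log this per request; a sudden drop usually means the stable prefix changed.
```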

The Fundamental Tension

Caching creates tension between flexibility and performance. Cache-optimized systems must maintain stable contexts, but task variety demands context adaptation. Resolution approaches:

Tiered Stability: Highly stable core (system prompt), moderately stable middle (conversation context), variable tail (current query). Optimize each tier differently.

Cached Templates: Define cacheable templates with slots for variable content. The template structure caches; variable content fills slots. This enables both stability and flexibility.

Strategic Invalidation: Accept cache invalidation when value justifies cost. Major prompt improvements warrant invalidating user caches. Minor tweaks might not.

Multi-Level Caching: Cache at multiple levels - model inference, retrieval infrastructure, application state. Optimize each level for its access patterns and stability characteristics.

Evolution and Future

As context windows expand to millions of tokens, caching becomes more critical, not less. Computing attention over million-token contexts is prohibitively expensive without caching. The performance gap between cached and uncached processing will widen.

Future systems might employ:

  • Semantic caching: Cache based on semantic similarity, not exact match
  • Hierarchical caching: Cache at multiple granularities (word, sentence, paragraph, document)
  • Predictive caching: Anticipate likely next contexts and pre-cache
  • Shared caching: Pool cache across users for common content

These advances will make caching more powerful and easier to leverage. But the fundamental principle persists: stable content enables efficient reuse, and Context Engineering must design for this stability.