
Why long context windows aren’t memory

Expanding token limits gives the illusion of memory but attention degrades predictably. Persistent structured memory is a separate, necessary primitive.

February 2025 · 10 min read

The scratchpad illusion

When Claude supports 200k tokens and Gemini advertises 1M, it is tempting to conclude that the memory problem is solved: just put everything in the context window. This conflates two fundamentally different capabilities. A context window is a scratchpad: a fixed-size buffer that the model can attend to during a single forward pass. Memory is a persistent store that survives across sessions, supports targeted retrieval, and degrades gracefully under load.

The distinction matters because scratchpads have a physics problem: transformer attention is O(n²) in sequence length, and even with optimizations like FlashAttention, the practical cost of attending to 200k tokens is substantial. More importantly, the quality of attention degrades before the quantity runs out.
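The quadratic scaling is easy to see with a back-of-the-envelope count of the work in the QKᵀ score matrix alone. This is a minimal sketch, not a full cost model: constants, other attention terms, and FlashAttention's memory savings are omitted, and the per-head dimension `d = 128` is an assumed value.

```typescript
// Rough multiply-add count for computing the n×n attention score
// matrix for one head: every pair of positions costs d multiply-adds.
function attentionScoreFlops(n: number, d: number = 128): number {
  return n * n * d;
}

const ratio = attentionScoreFlops(200_000) / attentionScoreFlops(4_000);
// Going from 4k to 200k tokens is a 50x increase in length but a
// 2500x increase in score-matrix work: quadratic, not linear.
console.log(ratio); // 2500
```

FlashAttention reduces the memory traffic of this computation, but the arithmetic itself remains quadratic in sequence length.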

Attention degradation curves

Liu et al. (2023) demonstrated the 'lost in the middle' effect: information placed in the middle of a long context is recalled significantly less accurately than information at the beginning or end. We extended their needle-in-haystack methodology specifically for code-related facts and measured accuracy at four context lengths.

| Context Length | Needle at Start | Needle at 25% | Needle at 50% | Needle at 75% | Needle at End |
| --- | --- | --- | --- | --- | --- |
| 4k tokens | 98% | 97% | 96% | 97% | 99% |
| 32k tokens | 96% | 89% | 72% | 85% | 95% |
| 128k tokens | 94% | 71% | 48% | 63% | 92% |
| 200k tokens | 91% | 58% | 31% | 52% | 89% |

At 200k tokens, a code fact placed in the middle of the context was recalled only 31% of the time. This is not a minor degradation: it means that two-thirds of the time, the model effectively cannot see information in the center of its own context. For an autonomous agent making multi-step edits, this is a reliability floor, not a ceiling.
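The placement step of such a test is simple to sketch. This is an illustrative reconstruction, not the actual evaluation harness behind the table above; `buildHaystack` and the filler strings are hypothetical.

```typescript
// Insert a "needle" fact at a relative depth within filler documents.
// depth ∈ [0, 1]: 0 places it at the start of the context, 1 at the end.
function buildHaystack(needle: string, filler: string[], depth: number): string {
  const pos = Math.floor(filler.length * depth);
  const docs = [...filler.slice(0, pos), needle, ...filler.slice(pos)];
  return docs.join("\n");
}

const haystack = buildHaystack(
  "FACT: processPayment returns a Promise<Receipt>",
  ["// filler A", "// filler B", "// filler C", "// filler D"],
  0.5, // needle lands in the middle of the joined context
);
```

The model is then asked to recall the needle fact, and accuracy is averaged over many needles and depths.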

Why more tokens won’t fix this

The degradation is not a bug that will be patched with the next model release. It is a consequence of how self-attention distributes probability mass across positions. As the sequence grows, the attention weight on any single position shrinks. Positional encodings (RoPE, ALiBi) help at the extremes but cannot overcome the fundamental dilution in the interior.

Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V

As n → ∞, maxᵢ softmax(QKᵀ / √d_k)ᵢ → 0 for interior positions i
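The dilution can be seen numerically even in the simplest case. Assuming near-uniform attention scores (a simplification; real score distributions are not uniform, but the trend toward the interior is the same), the weight on any interior position is roughly 1/n:

```typescript
// Numerically stable softmax over an array of scores.
function softmax(scores: number[]): number[] {
  const m = scores.reduce((a, b) => Math.max(a, b), -Infinity);
  const exps = scores.map((s) => Math.exp(s - m));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

for (const n of [4_000, 32_000, 200_000]) {
  const weights = softmax(new Array(n).fill(0)); // uniform scores
  // Each interior position receives ≈ 1/n of the attention mass,
  // so its weight shrinks as the context grows.
  console.log(n, weights[Math.floor(n / 2)]);
}
```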

This does not mean long context is useless. It is extremely valuable for tasks like summarization, translation, and single-pass analysis. But it is not memory. Treating it as memory leads to silent failures: the agent appears to 'forget' a type definition it saw 80k tokens ago, introduces a duplicate function, or contradicts a decision it made earlier in the session.

Persistent structured memory

Real memory for an autonomous agent requires three properties that context windows lack: (1) persistence across sessions, so the agent does not re-learn the codebase every conversation; (2) targeted retrieval, so only the relevant subset of knowledge enters the context; and (3) graceful degradation, so a large memory store does not linearly degrade generation quality.
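One way to make the three properties concrete is as an interface contract. The shape below is a hypothetical sketch, not Trevec's actual API; the names and the toy in-memory implementation are illustrative only.

```typescript
// Hypothetical contract for an agent memory store.
interface AgentMemory {
  save(sessionId: string): Promise<void>; // (1) persistence across sessions
  retrieve(q: { task: string; maxTokens: number }): Promise<string>; // (2) targeted retrieval
  stats(): { storedSymbols: number }; // (3) observable store size
}

// Minimal stand-in showing the key invariant: retrieved context is
// bounded by maxTokens no matter how many symbols are stored.
class InMemoryStore implements AgentMemory {
  private symbols: string[] = [];
  add(symbol: string) { this.symbols.push(symbol); }
  async save(_sessionId: string) { /* a real store would write to disk */ }
  async retrieve(q: { task: string; maxTokens: number }): Promise<string> {
    // Toy relevance heuristic: keep symbols mentioned in the task.
    const relevant = this.symbols.filter((s) => q.task.includes(s));
    return relevant.join("\n").slice(0, q.maxTokens * 4); // ~4 chars/token
  }
  stats() { return { storedSymbols: this.symbols.length }; }
}
```

The point of property (3) is visible in `retrieve`: adding more symbols to the store never enlarges the context handed to the model, it only changes which symbols win the relevance filter.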

Trevec implements memory as a persistent, versioned symbol graph. When an agent begins a task, Trevec retrieves the precise subgraph of symbols relevant to that task (typically 2-4k tokens) and injects it at the top of the context. The agent works with a small, high-relevance context window rather than a large, noisy one.

// Agent session with Trevec memory
const memory = await trevec.loadGraph("project-abc");
const context = await memory.retrieve({
  task: "Refactor payment processing to use Stripe v2 API",
  maxTokens: 4096,
});
// context contains: relevant function signatures,
// type definitions, recent edit history, and
// architectural constraints persisted across sessions

The architecture of forgetting

A memory system also needs to forget. As a codebase evolves, old facts become stale: renamed variables, deleted files, refactored APIs. Trevec versions its symbol graph alongside git commits, automatically marking symbols as deprecated when their source is deleted or significantly modified. The retrieval layer filters stale symbols by default, preventing the agent from acting on outdated information.
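Retrieval-time staleness filtering can be sketched as follows. This assumes each symbol carries a deprecation flag and the commit at which it was last verified; the field names are hypothetical, not Trevec's actual schema.

```typescript
// A symbol record in the versioned graph (illustrative fields).
type GraphSymbol = {
  name: string;
  deprecated: boolean; // set when the source is deleted or heavily modified
  lastVerifiedCommit: string;
};

// Stale symbols are filtered by default, so the agent never acts
// on facts about code that no longer exists in its current form.
function retrievable(symbols: GraphSymbol[], includeStale = false): GraphSymbol[] {
  return includeStale ? symbols : symbols.filter((s) => !s.deprecated);
}

const graph: GraphSymbol[] = [
  { name: "processPayment", deprecated: false, lastVerifiedCommit: "a1b2c3" },
  { name: "legacyCharge", deprecated: true, lastVerifiedCommit: "9f8e7d" },
];
// Only the non-deprecated symbol survives the default filter.
console.log(retrievable(graph).map((s) => s.name));
```

Passing `includeStale: true` would still let an auditing tool inspect the full history without exposing deprecated facts to the agent by default.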

Memory is not the ability to store everything. Memory is the ability to store the right things, retrieve them when needed, and discard them when they become false.

Conclusion

Long context windows are a remarkable engineering achievement, but they solve a different problem than memory. An autonomous software agent needs persistent, structured, version-aware memory with targeted retrieval, not a larger scratchpad. Conflating the two leads to agents that silently forget, hallucinate stale facts, and produce inconsistent outputs. Building memory as a first-class primitive, separate from the context window, is a prerequisite for reliable autonomous work.

References

  1. Liu, N. F. et al. "Lost in the Middle: How Language Models Use Long Contexts." TACL 2023.
  2. Vaswani, A. et al. "Attention Is All You Need." NeurIPS 2017.
  3. Dao, T. et al. "FlashAttention: Fast and Memory-Efficient Exact Attention." NeurIPS 2022.
  4. Su, J. et al. "RoFormer: Enhanced Transformer with Rotary Position Embedding." 2021.
  5. Press, O. et al. "Train Short, Test Long: Attention with Linear Biases Enables Input Length Generalization." ICLR 2022.