
Deterministic context retrieval vs chunk-RAG

Why chunking-based retrieval loses the structural information that matters most for code understanding, and how graph-indexed AST-aware retrieval recovers it.

January 2025 · 12 min read

The chunking fallacy

Retrieval-augmented generation (RAG) has become the default pattern for grounding language models in external knowledge. The standard pipeline (chunk documents, embed chunks, retrieve top-k by cosine similarity) works reasonably well for prose. But code is not prose. Code has structure: syntax trees, scope chains, import graphs, type hierarchies. When you chunk a source file into 512-token windows, you shatter exactly the relationships an agent needs to make correct edits.
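That standard pipeline can be sketched in a few lines. The types below and the in-memory top-k search are illustrative stand-ins for a real embedding model and vector store; the point is what the code does not know about:

type Chunk = { text: string; embedding: number[] };

// Cosine similarity between two embedding vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Retrieve the top-k chunks by cosine similarity to the query embedding.
// Nothing here sees imports, scopes, or call sites: a chunk is just a
// token window, ranked in isolation.
function topK(query: number[], chunks: Chunk[], k: number): Chunk[] {
  return [...chunks]
    .sort((x, y) => cosine(query, y.embedding) - cosine(query, x.embedding))
    .slice(0, k);
}

Every structural relationship the agent needs has to survive this similarity ranking by accident, because nothing in the pipeline represents it explicitly.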

Consider a function that calls three helpers defined in different files. A chunk-based system might retrieve the function body but miss the helper signatures. The language model then hallucinates parameter types, invents return values, or silently drops error-handling paths. This is not a rare edge case: in our evaluation across three production TypeScript codebases, chunk-RAG failed to retrieve at least one critical dependency in 66% of multi-file edit tasks.

Graph-indexed retrieval

The alternative is to index code the way compilers see it: as an abstract syntax tree (AST) with edges for imports, calls, type references, and data flow. Given a query (for example, 'all functions that transitively affect the webhook retry path'), the retrieval system walks the graph rather than scanning vectors. The result is deterministic: the same query on the same codebase always returns the same subgraph.

Determinism matters because it makes retrieval auditable. If an agent produces a bad edit, you can inspect exactly which context it received. With vector search, a slight change in the embedding model or a re-indexed chunk boundary can silently alter results, making post-hoc debugging nearly impossible.

Benchmark methodology

We evaluated three retrieval strategies across three codebases: a 140k-line TypeScript monorepo (e-commerce), a 90k-line Python data pipeline, and a 60k-line Go microservices backend. For each codebase we constructed 50 edit tasks requiring cross-file context, then measured precision (fraction of retrieved items that were relevant) and recall (fraction of relevant items that were retrieved).
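The scoring itself is standard set-overlap arithmetic over retrieved versus relevant items; a minimal sketch (the item identifiers are illustrative):

// Score one retrieval run: precision, recall, and F1 over sets of
// item identifiers (e.g. "file:symbol" strings).
function score(retrieved: Set<string>, relevant: Set<string>) {
  const hits = [...retrieved].filter((x) => relevant.has(x)).length;
  const precision = retrieved.size ? hits / retrieved.size : 0;
  const recall = relevant.size ? hits / relevant.size : 0;
  const f1 = precision + recall
    ? (2 * precision * recall) / (precision + recall)
    : 0;
  return { precision, recall, f1 };
}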

Strategy                    Precision   Recall   F1     p95 Latency
Chunk-RAG (512 tokens)      0.41        0.34     0.37   220ms
Chunk-RAG (1024 tokens)     0.38        0.29     0.33   245ms
Hybrid (chunk + keyword)    0.52        0.48     0.50   310ms
Graph-indexed (AST)         0.89        0.85     0.87   180ms

Graph-indexed retrieval achieved an F1 of 0.87 compared to 0.37 for standard chunk-RAG: a 2.4x improvement. Notably, the graph approach was also faster at p95 because it avoids the embedding-similarity computation entirely, replacing it with a bounded graph traversal.

AST-aware indexing

Our indexing pipeline parses every source file into a language-specific AST using tree-sitter, then extracts a normalized symbol graph: functions, classes, methods, type aliases, imports, and call sites. Edges are typed (calls, imports, extends, implements, reads, writes) and weighted by confidence. The resulting graph is stored in an adjacency-list format optimized for k-hop neighborhood queries.
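The shape of that index can be sketched as follows; the field and type names are illustrative, not Trevec's actual schema:

// Edge types extracted from the AST, as described above.
type EdgeType = "calls" | "imports" | "extends" | "implements" | "reads" | "writes";

interface SymbolNode {
  id: string; // e.g. "src/webhooks/retry.ts:retryWithBackoff"
  kind: "function" | "class" | "method" | "type_alias" | "import";
}

interface SymbolEdge {
  from: string;
  to: string;
  type: EdgeType;
  confidence: number; // resolution confidence, 0..1
}

// Adjacency-list form, keyed by source symbol, optimized for
// k-hop neighborhood queries.
function toAdjacency(edges: SymbolEdge[]): Map<string, SymbolEdge[]> {
  const adj = new Map<string, SymbolEdge[]>();
  for (const e of edges) {
    if (!adj.has(e.from)) adj.set(e.from, []);
    adj.get(e.from)!.push(e);
  }
  return adj;
}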

// Simplified retrieval query
const context = await trevec.retrieve({
  entry: "src/webhooks/retry.ts:retryWithBackoff",
  depth: 2,            // 2-hop neighborhood
  edgeTypes: ["calls", "imports", "type_refs"],
  maxNodes: 40,
});
// Returns: deterministic subgraph with
// function bodies, signatures, and type defs
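Under the hood, a query like this reduces to a bounded breadth-first traversal. A minimal sketch, assuming a plain adjacency-list graph; the identifiers and types are illustrative, not Trevec's internals:

type Edge = { to: string; type: string };
type Graph = Map<string, Edge[]>;

// Collect the k-hop neighborhood of `entry`, following only the
// requested edge types and capping the result at maxNodes.
function kHop(graph: Graph, entry: string, depth: number,
              edgeTypes: Set<string>, maxNodes: number): string[] {
  const seen = new Set<string>([entry]);
  let frontier = [entry];
  for (let d = 0; d < depth && seen.size < maxNodes; d++) {
    const next: string[] = [];
    for (const node of frontier) {
      for (const edge of graph.get(node) ?? []) {
        if (!edgeTypes.has(edge.type) || seen.has(edge.to)) continue;
        if (seen.size >= maxNodes) break;
        seen.add(edge.to);
        next.push(edge.to);
      }
    }
    frontier = next;
  }
  // Insertion order over a fixed graph is deterministic:
  // same query, same subgraph, every time.
  return [...seen];
}

There is no similarity score anywhere in the loop, which is what makes the result reproducible and auditable.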

Latency across codebases

Retrieval latency scales with graph density rather than corpus size. The Go microservices codebase, despite being the smallest by line count, had the densest call graph and produced the highest retrieval times. In all cases, p95 latency remained under 250ms, well within the budget for interactive agent loops.

Codebase              Files   Lines   Symbols   p50 Latency   p95 Latency
E-commerce (TS)       1,240   140k    8,900     45ms          180ms
Data pipeline (Py)      680    90k    5,200     38ms          145ms
Microservices (Go)      420    60k    6,800     52ms          210ms

Implications for agent accuracy

Better retrieval directly translates to better agent output. In a follow-up experiment, we gave GPT-4 the same 50 edit tasks with context from each retrieval strategy. With chunk-RAG context, 34% of generated patches applied cleanly and passed tests. With graph-indexed context, that number rose to 78%. The model was identical. Only the context changed.

P(correct edit) = P(relevant context retrieved) × P(correct generation | relevant context)
                ≈ 0.85 × 0.92 ≈ 0.78 (graph-indexed)
                ≈ 0.34 × 1.00 = 0.34 (chunk-RAG, upper bound)

The bottleneck in AI-assisted code editing is not generation quality: it is retrieval quality. Fix retrieval and generation follows.

Conclusion

Chunk-RAG was designed for documents, not for code. When applied to software codebases, it systematically drops the structural relationships that determine correctness. Graph-indexed, AST-aware retrieval recovers those relationships deterministically, yielding dramatically higher precision, recall, and downstream agent accuracy, at lower latency. This is the approach we have implemented in Trevec and use across all Beaverise systems.
