We Tried to Reduce LLM Context Usage in a Multi-Repo Codebase. The AI Used More Tokens, Not Less. Here's Why That's Correct.

# ai# programming# productivity# machinelearning

Amariah Kamau

When we set out to build Blueprint — Atlarix's structural codebase retrieval system — the hypothesis...

When we set out to build Blueprint — Atlarix's structural codebase retrieval system — the hypothesis was simple: give the AI a map of the codebase upfront, and it will need to read fewer files. Fewer files means less context. Less context means lower cost and faster responses.

We ran a controlled benchmark. The AI with Blueprint used 54% more context than the AI without it.

Here's why that's not a failure.

The problem with how most AI coding tools handle context

If you've used Cursor, Claude Code, or GitHub Copilot on a large codebase, you've hit this wall: the AI either reads too much (dumping raw files into context until you hit the limit) or reads too little (making confident wrong assumptions about files it hasn't seen).

The root cause is navigation. Without a structural map of the codebase, the AI is exploring blind — making guesses about which files matter, following import chains manually, or relying on whatever files happen to be open. In a multi-repository workspace, this gets worse fast. You might have 25 separate projects with thousands of files. The AI has no idea where it is.

The standard solutions are:

Raw file injection — dump everything relevant into context upfront. Expensive and doesn't scale past a few hundred files.
Dense vector search — embed the codebase and retrieve by semantic similarity. Loses structural relationships (call chains, import graphs, HTTP routes).
Agentic search — let the model call search/read tools and figure it out. Works, but slow and token-hungry as the model searches blindly.

We built Blueprint to try a fourth approach: give the model a symbolic structural graph before it starts exploring.

What Blueprint actually is

Blueprint is a four-layer index:

Layer 1 — Universal Ctags (symbol index)
Extracts every function, class, type, and method across 18 languages. Line-accurate positions. Cached to .atlarix/symbols.json.

Layer 2 — ast-grep (structural edges)
AST-level pattern matching for import edges, call edges, and HTTP route edges. Express app.get, Fastify fastify.post, Next.js export async function GET — all become first-class nodes in the graph.

Layer 3 — BM25 (semantic symbol ranking)
Ranks ctags symbols by concept query. "Authentication middleware" finds the right functions without requiring an exact name match.

Layer 4 — ripgrep (text fallback)
Exact string search for when you know precisely what you're looking for.

The output is a compact Markdown slice — rooms (directory-scoped clusters), beacons (individual symbols), and edges (structural relationships). Section-scoped: the agent requests one folder at a time, not the whole workspace.

The benchmark

We ran two arms of the same task on a production multi-repository workspace:

25 sections across the workspace root
3,250 tracked files total
Target section: a TypeScript CLI package, 99 files, ~9,500 lines

Task: Trace an event-driven HTTP-ingress-to-webhook-reply pipeline. Both arms had identical deliverables — narrative of the flow, key file paths, Mermaid sequence diagram.

Arm A (with Blueprint): Prescribed tool order — explore folder → get_blueprint → text search → read_file on 2-3 central files.

Arm B (without Blueprint): Same task, no get_blueprint — only explore folder → text search → read_file.

Model: Kimi K2.6 (268K context window) via OpenRouter. Same model, same provider, both arms.

The results

	With Blueprint	Without Blueprint
Blueprint slice	~6,500 tokens	0
Final billed input	63,541 tokens	41,327 tokens
Output tokens	2,671	2,534
Task completion	✅	✅

Blueprint arm used 54% more total context.

Context growth per turn:

With Blueprint: 8,661 → 13,966 → 24,771 → 25,012 → 31,717 → 54,188 → 63,541

Without Blueprint: 2,253 → 3,567 → 8,629 → 13,934 → 14,175 → 37,876 → 41,327

The Blueprint arm took six tool-call turns. The no-Blueprint arm took five.

Why this is the correct result

Here's what we found in the qualitative output comparison:

The Blueprint arm named 7 specific internal functions by exact identifier — the auth validator, mention detector, memory clamp, post-processor, card builder, and two others. It surfaced a section-specific post-processor module not explicitly requested.

The no-Blueprint arm found a client module in an eval/ subdirectory that Blueprint's section scope hadn't included. It named specific environment variables and API constants the text search found directly.

Both arms completed the task correctly. But the type of knowledge was different.

Blueprint gave the model a symbol-level map before any file was read. With that map, the model knew which files were worth reading and went deeper — more function names, more architectural detail, more thorough coverage. Without the map, the model explored more conservatively: followed fewer paths, read fewer files, stopped sooner.

The no-Blueprint arm used fewer tokens partly because it was less certain about what to look for next.

For a read-only exploration task, "explored less" isn't obviously worse. Both arms got the answer. But for write tasks — bug fixes, refactors, feature implementation — a model that stops exploring because it's navigationally lost is not saving tokens. It's missing dependencies, and those missing dependencies become production bugs.

The real finding: structural understanding and execution context are separable problems

The honest framing isn't "Blueprint reduces total context." It's that these are two different problems:

Structural understanding cost — how many tokens does it take to know where you are in the codebase?

With Blueprint: ~6,500 tokens, regardless of section complexity, in ~3 seconds.
Without Blueprint: amortised across many search/read tool calls over multiple turns.

Execution context — how many tokens accumulate as the model actually does the work?

This is determined by exploration depth — how many files the model reads, how many tool calls it makes. Blueprint increases this by making the model more confident. But it's bounded and manageable.

We address the execution context problem with a separate mechanism: post-turn tool-result summarisation. After each turn, large tool outputs in the persisted transcript are rewritten by a fast compaction model — keeping paths, symbol names, and key values, dropping JSON noise and repetition. In the benchmark runs, individual read_file results compressed from 2,500–3,500 tokens to 60–110 tokens. ~95–98% reduction per qualifying block.

Two mechanisms, two layers, two different problems.

What this means if you're building on top of LLMs

If you're building an AI coding tool, an agentic system, or anything that needs to navigate a large codebase:

Don't chase "total context reduction" as a single metric. It conflates structural overhead (knowable upfront, bounded by your retrieval design) with execution noise (determined by task complexity and model confidence).

Give the model a map before it explores. Not raw files — a structural graph. The model will use more total context because it will explore more thoroughly. That's the right trade for write tasks.

Compress history, not retrieval. Post-turn summarisation on tool outputs is more effective than trying to cram less information into the initial retrieval. The model needs the full file during the turn. Future turns don't.

The full paper

This benchmark is documented in a technical paper published on Zenodo with full methodology, exact prompts, provider-billed token counts, and an honest discussion of limitations:

Blueprint: Section-Scoped Structural Graph Retrieval and Post-Turn Compression for Agentic LLM Coding in Multi-Repository Workspaces

zenodo.org/records/20381860 · DOI: 10.5281/zenodo.20381860

Atlarix is available at atlarix.dev. The MCP server registry is open-source at github.com/AmariahAK/atlarix-mcps.

Built in Nairobi. Questions or thoughts? Drop them in the comments.