I was burning through tokens like they were free. Then I started measuring. My average coding session used ~80K tokens, and most of it was wasted context the model didn't need.
After a month of experimenting, I cut that to ~35K tokens per session with zero loss in output quality. Here are the five tricks that made the difference.
**Trick 1: File summary headers.** Instead of pasting an entire file into context, I prepend a 3-line summary:
```typescript
// FILE: src/auth/middleware.ts (147 lines)
// PURPOSE: Express middleware for JWT validation + role-based access
// EXPORTS: authMiddleware, requireRole, extractUser
```
Then I only include the specific function I need help with. The model gets enough context to understand the architecture without reading 147 lines of boilerplate.
Token savings: ~60% per file reference.
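Headers like this are easy to generate rather than write by hand. A minimal sketch — the `summaryHeader` function and its regex-based export scan are my own simplification; a robust tool would use the TypeScript compiler API instead:

```typescript
// Sketch: build a 3-line summary header for a TypeScript source file.
// ASSUMPTION: exports are simple top-level declarations like
// `export function foo` / `export const bar`. Real code should use
// the TypeScript compiler API, not a regex.
function summaryHeader(path: string, source: string, purpose: string): string {
  const lineCount = source.split("\n").length;
  const exportPattern = /export\s+(?:async\s+)?(?:function|const|class|interface)\s+(\w+)/g;
  const exports = [...source.matchAll(exportPattern)].map((m) => m[1]);
  return [
    `// FILE: ${path} (${lineCount} lines)`,
    `// PURPOSE: ${purpose}`,
    `// EXPORTS: ${exports.join(", ")}`,
  ].join("\n");
}
```

Run it over a file once, paste the header, then include only the function you actually need help with.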
**Trick 2: Interface stubs for dependencies.** When the model needs to understand how a function interacts with other modules, don't paste the full dependency — paste a stub:
```typescript
// STUB: database.ts
interface DB {
  query<T>(sql: string, params: any[]): Promise<T[]>;
  transaction(fn: (tx: Transaction) => Promise<void>): Promise<void>;
}
```
The model only needs the interface contract, not the 500-line implementation with connection pooling and retry logic.
Token savings: ~80% per dependency.
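The payoff is that new code type-checks against the stub alone. A sketch assuming the `DB` interface above — `getUserEmail` and the `users` table are hypothetical examples, not part of the original code:

```typescript
// The stub: just the interface contract, no implementation.
interface Transaction {}
interface DB {
  query<T>(sql: string, params: any[]): Promise<T[]>;
  transaction(fn: (tx: Transaction) => Promise<void>): Promise<void>;
}

// New code written against the stub. The model (and the compiler) never
// needs the 500-line implementation with pooling and retry logic.
async function getUserEmail(db: DB, userId: string): Promise<string | undefined> {
  const rows = await db.query<{ email: string }>(
    "SELECT email FROM users WHERE id = ?",
    [userId],
  );
  return rows[0]?.email;
}
```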
**Trick 3: Context resets with summaries.** For multi-turn sessions, I reset context every 3-4 turns with a summary:
```
Context reset. Here's where we are:
- We're building a rate limiter for the /api/upload endpoint
- We've decided on a sliding window algorithm with Redis
- The function signature is: rateLimit(userId: string, windowMs: number, maxRequests: number)
- Current blocker: handling Redis connection failures gracefully
Continue from here.
```
This prevents the "context sludge" problem where the model drags along 20 turns of outdated conversation.
Token savings: ~40% on sessions longer than 5 turns.
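If you track session state in a small object outside the chat, the reset message writes itself. A sketch — the `SessionState` shape and field names are my own invention:

```typescript
// Hypothetical session state, maintained outside the conversation.
interface SessionState {
  goal: string;       // what we're building
  decisions: string[]; // choices already made
  signature: string;  // current function signature
  blocker: string;    // what we're stuck on
}

// Render the state as a context-reset message.
function contextReset(s: SessionState): string {
  return [
    "Context reset. Here's where we are:",
    `- ${s.goal}`,
    ...s.decisions.map((d) => `- ${d}`),
    `- The function signature is: ${s.signature}`,
    `- Current blocker: ${s.blocker}`,
    "Continue from here.",
  ].join("\n");
}
```

Update the object as decisions land, and every reset costs a dozen lines instead of twenty turns of history.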
**Trick 4: Explicit ignore lists.** Tell the model explicitly what to ignore:
```
Focus only on the error handling logic in processPayment().
Ignore: logging, metrics, the retry wrapper, input validation.
These are tested and working — don't modify or comment on them.
```
Without this, the model will "helpfully" refactor your logging, suggest improvements to your validation, and burn tokens on things you didn't ask about.
Token savings: ~30% on modification tasks.
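You can go one step further and send only the target function in the first place. A naive sketch using brace matching — it assumes braces are balanced and that none appear inside strings or comments; a real tool would parse properly:

```typescript
// Extract a single named function's source so only it goes into context.
// ASSUMPTION: `function <name>` declaration syntax, balanced braces, and
// no stray braces inside string literals or comments.
function extractFunction(source: string, name: string): string | null {
  const start = source.indexOf(`function ${name}`);
  if (start === -1) return null;
  let i = source.indexOf("{", start);
  if (i === -1) return null;
  let depth = 0;
  for (; i < source.length; i++) {
    if (source[i] === "{") depth++;
    else if (source[i] === "}" && --depth === 0) {
      return source.slice(start, i + 1); // include the closing brace
    }
  }
  return null; // unbalanced braces
}
```

Between the ignore list and sending only the relevant function, there's simply nothing left for the model to "helpfully" refactor.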
**Trick 5: Constrained output format.** Constrain the response format upfront:
```
Return ONLY:
1. The modified function (no surrounding code)
2. A 2-line summary of what changed
3. One potential edge case to test
Do NOT include: explanations of existing code, import statements, or alternative approaches.
```
I started doing this after noticing that ~40% of a typical AI response was explanation I didn't need. The code was fine — the commentary was the waste.
Token savings: ~40% on output tokens.
Using all five together on a typical refactoring task:
| Technique | Before | After |
|---|---|---|
| File references | 12K tokens | 4K tokens |
| Dependencies | 8K tokens | 2K tokens |
| Conversation history | 25K tokens | 15K tokens |
| Unfocused responses | 15K tokens | 8K tokens |
| Verbose output | 20K tokens | 6K tokens |
| Total | 80K | 35K |
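The totals are just the column sums; as a quick sanity check:

```typescript
// Per-technique token counts from the table above, in order.
const before = [12_000, 8_000, 25_000, 15_000, 20_000];
const after = [4_000, 2_000, 15_000, 8_000, 6_000];
const sum = (xs: number[]) => xs.reduce((a, b) => a + b, 0);
// sum(before) is 80,000 and sum(after) is 35,000: a ~56% reduction overall.
```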
That's not just cheaper — it's faster. Smaller context means faster inference, fewer hallucinations, and more focused output.
Context windows are not "how much the model can read." They're a budget. Every token you spend on unnecessary context is a token not available for reasoning about your actual problem.
Treat your context window like RAM: measure it, manage it, and stop assuming more is better.
Start with trick #1 (file summary headers) — it's the easiest to adopt and has the highest payoff. Then layer in the others as they feel natural.
Your wallet and your response quality will both thank you.