Venu gopal varma BhupathirajuLLM debates usually focus on parameter counts and benchmark scores. But in production, a much simpler...
LLM debates usually focus on parameter counts and benchmark scores. But in production, a much simpler constraint dictates performance and cost: tokens.
Every prompt, system message, tool call, and history block consumes tokens. If you do not manage them, you get a system that works in testing but is too expensive to run.
Here is how tokens behave under the hood, why they cost so much, and how to reduce consumption.
The mechanics: how models process text
To do anything with text, a model has to convert it to numbers. This is tokenization.
Most modern models use Byte Pair Encoding (BPE). It merges common character pairs into subwords. Common words get their own tokens; rare words get chopped up.
The problem is that BPE is semantically blind. It merges based on character frequency, not meaning. A word like "intelligently" might split into "intelligent" and "ly," or into an odd set of characters depending on what the model was trained on.
New methods like Attention Guided BPE try to fix this by adding semantic rules to the merges, which keeps word boundaries intact and makes the vocabulary more efficient.
Once a token has an ID, the model maps it to a dense vector (an embedding) that represents its meaning. Words with similar usage, like "king" and "queen," end up near each other in vector space. Transformers adjust these embeddings during training to capture context.
Inside the model, processing happens in parallel. For each token, the transformer calculates query, key, and value vectors. The query represents what the token wants, the key represents what it offers, and the value holds the actual content. Attention scores weight how much each token should focus on every other token. Because this parallel work loses track of sequence, positional encodings are added to preserve the order of words.
Even though models process subwords, they build full words internally. Research into "detokenization" shows that early and middle layers group subwords back into whole concepts. If you pass a split word like "un" + "h" + "appiness," you can decode the full word "unhappiness" from the representation in the later layers. This suggests models hold a latent lexicon. You can exploit this to expand a vocabulary without retraining by fusing multi token representations.
The math: why the bills spike
Model APIs charge per million input and output tokens. That looks cheap, but agentic loops compound the cost.
A 10,000 token system prompt sent on every turn of a 50 turn session consumes half a million tokens before the model generates a single word of output.
When agents run in a loop, every step is a new API call. If a five step task runs 500 times, you are paying for the history over and over. One fintech startup ended up spending $4,100 a month on Q&A inference. As traffic grew, the projected bill hit $13,000.
Ten ways to cut token usage
Prompt caching
Resending the same system prompt, tool schemas, or documentation on every turn is wasteful. Caching those blocks on Anthropic or OpenAI saves about 90% on input costs.
You just have to structure the prompt so the stable blocks sit at the very beginning. In a test of 500 agent sessions, caching cut API bills by 40% to 80% and speeded up response times for the first token by up to 30%.
Model routing
You do not need Claude Opus or GPT-4 for simple jobs. A router can inspect incoming queries and send easy tasks to cheaper models.
The cost gap is huge: Claude Haiku is $1 per million input tokens, while Opus is $15.
In that fintech setup, 61% of queries were simple tasks like database lookups or text formatting. Routing those to smaller models cut overall costs by 57%.
Semantic caching
Many users ask the same questions using different words. "What is the KYC rule?" and "Explain the know-your-customer process" want the same answer.
A semantic cache embeds incoming questions and checks them against previous ones. If a new query is close enough to an old one, the system returns the cached answer instantly, bypassing the model entirely.
In the fintech system, the cache hit rate hit 34% within two weeks. That is a third of all queries answered for free.
Compacting history
Sending the entire conversation history on every turn is a bad default. By turn twelve, you are paying for turns one through eleven again.
Instead, try these approaches:
Keep only the last few turns (sliding window).
Summarize older turns using a cheap model.
Save facts in a database and only inject the specific rows you need.
Prompt compression
You can trim prompts before they reach the API. LLMLingua removes low information tokens while preserving the meaning. On fintech queries, it compressed prompts by 38% with almost no loss in quality.
For outputs, techniques like CROP penalize output length during prompt optimization. This cut output tokens by 80% in tests without hurting accuracy.
Auditing tool schemas
Tool definitions are just tokens. Descriptions are often bloated. One team audited 11 tools and found descriptions averaged 180 tokens. Shortening them saved 1,400 tokens on every turn.
Loop caps and budget guards
Runaway agent loops are the fastest way to blow a budget. An agent retrying a broken tool 14 times is worse than 200 normal runs.
Set a hard limit on loops, use exponential backoff, and show a clear error to the user when things fail.
Short-circuiting outputs
If your system parses the model's output programmatically, you do not need to wait for the full response. Stream the tokens and abort the call as soon as you have the fields you need. This saved 22% of output tokens in tests.
Self hosting simple tasks
Small tasks like classification, embedding generation, and reranking run fine on open models. You can host these yourself for a flat monthly fee instead of paying API rates.
Optimizing structured data
JSON is heavy on syntax: braces, quotes, and repeated keys. Using simpler formats for input context can shrink the token count.
The results
Here is how one team got a 73% cost reduction:
Another team dropped their bill from $4,100 to $1,560:
Where to start
If you are optimizing an LLM system from scratch:
Token optimization is not a single fix; it is a set of small adjustments. Five small wins are easier to build and ship than one massive change, and they add up quickly. Measure where the waste is and start there.