
A Two-Stage Curator That Pays for Itself
by marcosomma
I watched Claude Code feed 108,894 bytes of seq 1 20000 back into its own context window. That output contained 20,000 integers.
No errors. No signal. No insight. Just counting.
And yet the system still had to tokenize it, send it back to the model, and bill for it. This is not an edge case. It is the default failure mode of agent tooling.
Tools produce output. The output goes back to the model. You pay for it. Logs, test runs, ps listings, git history, build spam, progress bars, boilerplate, decorative separators, repeated warnings, repeated success lines, repeated everything. A depressing amount of it is useless.
A lot of agent users are quietly paying premium model rates to process terminal confetti. That is the real problem!
My first fix was the obvious one. I added a PreToolUse hook and pushed large Bash output through a cheaper model before it reached Opus.
That worked, technically. Then I noticed I was still being stupid, just in a more optimized way.
On seq 1 20000, I was paying Haiku to read 20,000 integers and tell me they were 20,000 sequential integers.
Yes, that is cheaper than letting Opus read them.
No, that is not a good design.
If a 40-line awk script can identify the pattern for free, paying any model to summarize it is already waste. So the architecture changed. Not “small model before big model.” That idea sounds clever, but by itself it is just cost reshuffling.
The real pattern is simpler and better: extract free signal first, and only pay for a model when deterministic tools run out of leverage.
That led to a two-stage curator.
Stage 1 is deterministic cleanup. It strips ANSI escape sequences, removes carriage-return junk from progress bars, collapses consecutive duplicate lines, and compresses monotonic integer runs. It is effectively free.
Stage 2 is LLM extraction, but only if stage 1 still leaves too much output. That is where tokens are spent. That means it should fire rarely, and only when stage 1 could not do enough.
That distinction matters, because once you actually measure it, a lot of tool output turns out to be compressible by embarrassingly simple logic.
Here is the benchmark across five scenarios, using the pricing assumptions from the benchmark runner: Opus input at $15 per million tokens, Haiku input at $1 per million, Haiku output at $5 per million, and a rough estimate of 4 bytes per token.
| Scenario | Raw bytes | Stage 1 | Final | LLM? | Tokens saved | Haiku cost | Net savings |
|---|---|---|---|---|---|---|---|
| seq 1 20000 | 108,894 | 37 | 103 | no | 27,197 | $0.000 | +$0.408 |
| 5,000 repeated log lines | 145,000 | 38 | 104 | no | 36,224 | $0.000 | +$0.543 |
| ANSI + progress-bar spam | 54,000 | 27 | 92 | no | 13,477 | $0.000 | +$0.202 |
| ps auxww (unique lines) | 213,230 | 213,230 | 1,008 | yes | 53,055 | $0.055 | +$0.741 |
| echo hello world | 12 | 12 | 12 | no | 0 | $0.000 | $0.000 |
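Those savings columns are easy to verify. Here is a quick back-of-envelope check of the seq row, using the same assumptions as the benchmark runner (Opus input at $15 per million tokens, roughly 4 bytes per token; byte counts taken from the table above):

```shell
# Sanity-check the seq row of the benchmark table.
raw_bytes=108894     # raw output of seq 1 20000
final_bytes=103      # curated result that actually reaches Opus

# Tokens Opus no longer has to read, at ~4 bytes per token.
tokens_saved=$(( (raw_bytes - final_bytes) / 4 ))

# Dollar value of those tokens at Opus input pricing ($15 per million).
net_savings=$(awk -v t="$tokens_saved" 'BEGIN { printf "%.3f", t * 15 / 1000000 }')

echo "tokens saved: $tokens_saved"    # 27197
echo "net savings:  \$$net_savings"   # $0.408
```

Both numbers reproduce the table's seq row, and since stage 1 handled this case, the Haiku cost is genuinely zero.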
The pattern is blunt.
Three of the four large-output cases were handled for free.
seq collapsed from 108,894 bytes to 37.
Repeated log spam dropped from 145,000 to 38.
ANSI and progress-bar noise fell from 54,000 to 27.
No LLM call was needed in any of those cases.
The only scenario that needed stage 2 was ps auxww, which is exactly what you would want. That output is genuinely varied. There is not much for awk to compress. That is the moment when paying a smaller model to extract the useful facts is justified.
Small output remained untouched, which is also correct. If the command only produced hello world, the system paid nothing and moved on.
This is the whole point.
A cheap LLM is not the first line of defense.
Deterministic cleanup is.
The LLM should be the escalation path, not the reflex.
This is the part that does the real work more often than people expect.
```awk
#!/usr/bin/awk -f
# Flush a pending run of sequential integers: collapse runs of 3+,
# print shorter runs verbatim.
function flush_int_run() {
    if (int_count >= 3) {
        printf "[%d sequential integers %s..%s]\n", int_count, int_start, int_end
    } else if (int_count > 0) {
        for (i = int_start; i <= int_end; i++) print i
    }
    int_count = 0
}

# Flush a pending duplicated line: annotate repeats with a count.
function flush_dupe() {
    if (dupe_count > 1) {
        printf "%s [×%d]\n", dupe_line, dupe_count
    } else if (dupe_count == 1) {
        print dupe_line
    }
    dupe_count = 0
}

{
    gsub(/\033\[[0-9;]*[a-zA-Z]/, "", $0)   # strip ANSI escape sequences
    gsub(/\r/, "", $0)                      # strip progress-bar carriage returns

    # Bare integer: extend the current run or start a new one.
    if ($0 ~ /^-?[0-9]+$/) {
        n = $0 + 0
        if (int_count > 0 && n == int_end + 1) {
            int_end = n
            int_count++
            next
        }
        flush_int_run()
        flush_dupe()
        int_start = n
        int_end = n
        int_count = 1
        next
    }

    flush_int_run()

    # Non-integer line: collapse consecutive duplicates.
    if (dupe_count > 0 && $0 == dupe_line) {
        dupe_count++
    } else {
        flush_dupe()
        dupe_line = $0
        dupe_count = 1
    }
}

END {
    flush_int_run()
    flush_dupe()
}
```
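To see how little machinery this takes, here is the integer-run rule trimmed down to a self-contained one-liner. This is a sketch, not the full cleaner: it skips the ANSI stripping, duplicate collapsing, and short-run handling above, which is enough for a pure seq stream:

```shell
# Trimmed sketch of the integer-run collapse, runnable on its own.
seq 1 20000 | awk '
    /^-?[0-9]+$/ {
        if (c > 0 && $0 + 0 == e + 1) { e = $0 + 0; c++; next }   # extend run
        if (c >= 3) printf "[%d sequential integers %d..%d]\n", c, s, e
        s = e = $0 + 0; c = 1; next                               # start run
    }
    END { if (c >= 3) printf "[%d sequential integers %d..%d]\n", c, s, e }'
# [20000 sequential integers 1..20000]
```

That single output line is 37 bytes including the newline, which is exactly the stage 1 figure in the benchmark table.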
There is nothing magical here.
It strips terminal paint.
It collapses repeated lines.
It compresses obvious integer sequences.
That is enough to destroy huge amounts of waste.
This is a useful reminder for AI tooling in general. A lot of expensive “reasoning” problems are not reasoning problems. They are preprocessing failures.
The wrapper runs the command, checks the raw size, and passes small output through untouched.
If the raw output is large, it runs the deterministic cleaner.
If the cleaned output is now small enough, it returns that cleaned output and stops.
Only if the cleaned output is still large does it call Haiku.
```bash
#!/usr/bin/env bash
set -o pipefail

# Size gates and model, overridable via environment.
RAW_THRESHOLD="${CLAUDE_BASH_SUMMARIZE_THRESHOLD:-8000}"
LLM_THRESHOLD="${CLAUDE_BASH_LLM_THRESHOLD:-8000}"
MODEL="${CLAUDE_BASH_SUMMARIZE_MODEL:-claude-haiku-4-5}"

cmd="$1"
SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" &>/dev/null && pwd)"
CLEAN_AWK="$SCRIPT_DIR/deterministic-clean.awk"

raw=$(mktemp); cleaned=$(mktemp)
trap 'rm -f "$raw" "$cleaned"' EXIT

# Run the wrapped command, capturing stdout and stderr together.
bash -c "$cmd" >"$raw" 2>&1
rc=$?

# Small output passes through untouched.
raw_size=$(wc -c <"$raw" | tr -d ' ')
if [ "$raw_size" -le "$RAW_THRESHOLD" ]; then
    cat "$raw"
    exit "$rc"
fi

# Stage 1: deterministic cleanup.
awk -f "$CLEAN_AWK" <"$raw" >"$cleaned"
cleaned_size=$(wc -c <"$cleaned" | tr -d ' ')
if [ "$cleaned_size" -le "$LLM_THRESHOLD" ]; then
    printf '=== CURATED %d→%d bytes (stage 1 deterministic, no LLM) ===\n' \
        "$raw_size" "$cleaned_size"
    cat "$cleaned"
    exit "$rc"
fi

# Stage 2: only now pay for LLM extraction.
summary=$(claude -p --model "$MODEL" \
    "Extract signal from this command output. KEEP: errors, warnings, stack traces, file paths with line numbers, numeric results, unique events, final status. DROP: decorative separators, boilerplate. Preserve exact error text verbatim. Be terse but faithful on key facts. Plain text only." \
    <"$cleaned" 2>/dev/null)
summary_size=${#summary}
printf '=== CURATED %d→%d→%d bytes (stage 1 + %s extraction) ===\n' \
    "$raw_size" "$cleaned_size" "$summary_size" "$MODEL"
printf '%s\n' "$summary"
exit "$rc"
```
That second threshold check is the entire point.
Without it, the “cheap model” becomes a permanent tax on output that deterministic logic had already made cheap.
With it, the LLM only gets called when the dumb tools genuinely ran out of leverage.
That is how this stops being a cute hack and starts becoming a sensible pipeline.
I wired it into Claude Code with a PreToolUse hook on Bash. The hook rewrites the original command so it runs through the wrapper first.
```json
{
  "hooks": {
    "PreToolUse": [{
      "matcher": "Bash",
      "hooks": [{
        "type": "command",
        "command": "/path/to/.claude/scripts/bash-wrap-hook.sh",
        "timeout": 5
      }]
    }]
  }
}
```
The hook script itself is tiny. It reads the incoming JSON, extracts tool_input.command, and swaps in the wrapped version. Nothing here is conceptually hard. That is precisely why it is worth doing. Too much agent engineering right now is really just people tolerating waste because it looks sophisticated once wrapped in model calls.
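That hook script could look something like the sketch below. Treat it as hypothetical: the output field names (`hookSpecificOutput.updatedInput`) and the wrapper path are assumptions, so verify them against the hook schema documented for your Claude Code version. It requires jq.

```shell
# Hypothetical sketch of bash-wrap-hook.sh: read the hook JSON on stdin,
# pull out tool_input.command, and emit an updated input that routes the
# command through the curator wrapper.
wrap_hook() {
    local input cmd
    local wrapper="$HOME/.claude/scripts/curate-bash.sh"   # hypothetical path
    input=$(cat)
    cmd=$(printf '%s' "$input" | jq -r '.tool_input.command // empty')
    [ -z "$cmd" ] && return 0    # nothing to wrap; let the call through untouched
    jq -n --arg w "$wrapper" --arg c "$cmd" '{
        hookSpecificOutput: {
            hookEventName: "PreToolUse",
            updatedInput: { command: ($w + " " + ($c | @sh)) }
        }
    }'
}

# Example: rewrite a seq call so it runs through the wrapper.
printf '%s' '{"tool_input":{"command":"seq 1 20000"}}' | wrap_hook
```

The jq `@sh` filter shell-quotes the original command so it survives being passed as a single argument to the wrapper.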
Here is the cost model, stated cleanly. Let the raw output be N tokens.
If you send it directly to Opus, the input cost is:
15N / 1,000,000
Now suppose stage 1 reduces that output to K tokens. If K is still above the threshold, stage 2 fires: Haiku reads K tokens at $1 per million, emits a summary of M tokens at $5 per million, and Opus then reads those M tokens at $15 per million.
So the escalated path costs:
(1·K + 5·M + 15·M) / 1,000,000 = (K + 20M) / 1,000,000
Break-even is therefore:
15N > K + 20M
That is the actual condition for the two-stage system. If stage 1 barely helps, then K ≈ N, and the inequality becomes:
15N > N + 20M
which simplifies to:
14N > 20M
or:
M < 0.7N
So in the worst case, where deterministic cleanup did almost nothing, the LLM stage still pays off if it compresses the remaining content by roughly 1.43x or better. That is not a demanding threshold.
And in the real pipeline, stage 1 often shrinks the input before the LLM ever sees it, which makes the economics even more comfortable.
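Plugging the ps auxww scenario into that inequality shows how comfortable the margin is in practice. The byte counts come from the benchmark table above, converted at roughly 4 bytes per token:

```shell
# Break-even check for the one scenario that actually escalated to stage 2.
awk 'BEGIN {
    N = 213230 / 4       # raw tokens (stage 1 could not shrink this case, so K = N)
    K = N                # tokens Haiku reads
    M = 1008 / 4         # summary tokens Opus finally receives

    direct    = 15 * N / 1e6         # send raw output straight to Opus
    escalated = (K + 20 * M) / 1e6   # Haiku in + Haiku out + Opus in

    printf "direct: $%.3f  escalated: $%.3f  saved: $%.3f\n",
           direct, escalated, direct - escalated
}'
# direct: $0.800  escalated: $0.058  saved: $0.741
```

The $0.741 saved matches the net savings column in the table, and it comes almost entirely from Opus never seeing the raw 213 KB.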
This is why the design works. Not because “small model then big model” is automatically clever. Because the model is only invited in after cheap tools have already done what they can.
This version already works, but there are obvious next steps.
Latency should probably be part of the gate, not just bytes. Saving a few cents is not impressive if it adds a few seconds to every interactive tool call.
Stage 1 could be extended to catch more patterns, especially timestamp-heavy logs where the message repeats but the prefix changes.
The system should probably hard-cap verbose LLM summaries, because “cheap extraction” can still become noisy if the prompt drifts.
And the current implementation buffers until the command finishes. That is fine for benchmarking, but worse for real long-running workflows. A streaming version would be much better.
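The timestamp case in particular is cheap to handle. A sketch of what that stage 1 extension could look like, not part of the current cleaner: strip the timestamp prefix before the duplicate-collapse pass, so log lines that differ only in their prefix fold together.

```shell
# Sketch: normalize away leading ISO-8601 timestamps, then collapse duplicates.
printf '%s\n' \
    '2024-05-01T10:00:01Z retrying connection' \
    '2024-05-01T10:00:02Z retrying connection' \
    '2024-05-01T10:00:03Z retrying connection' |
awk '{
    sub(/^[0-9-]+T[0-9:]+Z? */, "")          # drop the timestamp prefix
    if ($0 == prev) { n++; next }            # same message: count it
    if (n > 1) printf "%s [×%d]\n", prev, n; else if (n == 1) print prev
    prev = $0; n = 1
}
END {
    if (n > 1) printf "%s [×%d]\n", prev, n; else if (n == 1) print prev
}'
# retrying connection [×3]
```

Whether the timestamps themselves carry signal depends on the command, so a real version would probably keep the first and last timestamp of each collapsed run.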
But none of that changes the core lesson.
The interesting idea here is not “use an LLM to compress LLM inputs.” That is the shallow reading. The more useful pattern is this:
before you spend tokens to extract signal, try extracting signal for free.
A lot of the mess we hand to expensive models is not difficult. It is just noisy. And noisy is not the same as complex. That distinction matters. Because once you see it clearly, the same pattern starts showing up everywhere.
Retrieval pipelines that rerank garbage before filtering it.
Scrapers that pass repeated boilerplate into embeddings.
Log processors that ask a model to summarize progress-bar sludge.
Agent systems that burn premium tokens on output a shell one-liner could have collapsed immediately.
Cheap filter first.
Expensive model second.
Measure the break-even.
Then stop paying premium rates for repetition, boilerplate, terminal paint, and counting.
That is not an AI breakthrough.
It is just basic engineering discipline, which is exactly why so many agent stacks are currently missing it.