
Arindam Majumder
When we started scaling our MCP workflows, token usage was something we barely tracked. The system worked well, responses were accurate, and adding more tools felt like the right next step. Over time, the cost began rising in ways that did not align with how much the system was actually used.
At first, we assumed this was due to higher usage or more complex queries. The data showed something else. Even simple requests were using more tokens than expected. This led us to ask a basic question. What exactly are we sending to the LLM on every call?
A closer look made things clearer. The issue came from how the system was built. We handled context, tool definitions, and execution flow by adding extra tokens at every step.
This article explains how we found the root cause and redesigned the architecture to fix it. The changes cut our MCP token usage by nearly half and gave us better control over how the system behaves.
Once we started examining token usage, a clear pattern showed up. The LLM was receiving far more context than most requests actually needed. A large part of this came from tool definitions being sent repeatedly on every call.
Each request included the full list of tools, even when only one or two were needed. On top of that, earlier outputs and intermediate results were passed back into the model. The context kept growing, even for simple queries.
The execution flow added to the problem. The LLM would choose a tool, call it, process the result, and then repeat the same cycle if another step was needed. Each step added more tokens, and the same data often appeared many times across calls.
This setup worked at a small scale. As the number of tools increased, the cost grew quickly. More tools meant more context. More steps meant repeated processing. The system was doing extra work without adding real value. At this point, the cause was clear. Token usage came from how the system handled context and execution. The design itself was driving the overhead.
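To make the overhead concrete, here is a small back-of-the-envelope model of how context grows in this kind of loop. The numbers are illustrative only, not our production figures; the point is the shape of the growth, where every call resends all tool definitions plus the accumulated history:

```python
# Illustrative model: every LLM call in the classic loop resends all tool
# definitions plus the conversation so far. Token counts are made up for scale.
def classic_loop_tokens(num_tools, steps, tokens_per_tool_def=150,
                        tokens_per_step_output=300, base_prompt=200):
    total = 0
    history = base_prompt
    for _ in range(steps):
        total += num_tools * tokens_per_tool_def + history  # one LLM call
        history += tokens_per_step_output  # tool result appended to context
    return total

# Growing the tool count raises the cost of *every* step, not just the first.
small = classic_loop_tokens(num_tools=5, steps=4)
large = classic_loop_tokens(num_tools=50, steps=4)
print(small, large)
```

The model captures why cost scaled faster than usage: tool count and step count multiply, and history compounds on top.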
We started looking for a way to change how the system handled tool execution. The goal was simple. Reduce the amount of context sent to the LLM and avoid repeated processing across steps.
During this process, we came across Bifrost, an open source MCP gateway. It sits between the application, the model, and the tools. It brings structure to how tools are discovered and executed, so the LLM receives only what is needed on each call.
This changed how we thought about the system. Tool access became more controlled. Context stayed limited to what was required for each request. The overall flow of execution became easier to follow and reason about.
These changes directly addressed the issues we were seeing. Tool definitions were sent only when required. Repeated decision loops were reduced. The system handled execution in a more controlled and predictable way.
From here, the focus moved away from adjusting prompts and toward changing how the system runs end-to-end.
The main change came from how execution was handled inside Bifrost. Code Mode is a Bifrost feature that changes how the LLM interacts with MCP tools. Earlier, the LLM handled both planning and step-by-step tool interaction. Each step required another call, and each call carried a growing context.
Code Mode separates these responsibilities. The LLM focuses on planning. It generates executable code that defines the full workflow for a task.
Code Mode works best when multiple MCP servers are involved, workflows have several steps, or tools need to share data. For simpler setups with one or two tools, Classic MCP works well.
A mixed setup also works. Use Code Mode for heavier workflows like search or databases, and keep simple tools as direct calls.
The system exposes a minimal interface to the LLM. This includes:
- Listing the available tools
- Reading the details of a specific tool
- Understanding, when required, how each tool works
Tool definitions are accessed on demand, which keeps the initial context small.
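A minimal sketch of the difference, using a hypothetical registry class rather than Bifrost's actual API: only tool names go into the initial context, and a full definition is fetched when the model asks for it.

```python
# Hypothetical on-demand registry: the initial context carries only names;
# full definitions are fetched when the model asks for them.
class ToolRegistry:
    def __init__(self, definitions):
        self._definitions = definitions  # name -> full schema/signature

    def list_tools(self):
        # Cheap: names only, analogous in spirit to listToolFiles.
        return sorted(self._definitions)

    def read_tool(self, name):
        # Paid for only when needed, analogous in spirit to readToolFile.
        return self._definitions[name]

registry = ToolRegistry({
    "youtube.search": "search(query: str, maxResults: int) -> dict",
    "db.query": "query(sql: str) -> list[dict]",
})
print(registry.list_tools())                 # small initial context
print(registry.read_tool("youtube.search"))  # detail on demand
```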
Once the plan is generated, execution moves to a runtime environment. The code runs in a sandbox and interacts directly with tools. All intermediate steps, tool responses, and data transformations stay within this layer.
This removes the need for repeated LLM calls during execution. The workflow runs in one pass, guided by the generated code. The LLM is involved mainly at the planning stage and for producing the final response if required.
The flow becomes more structured. A request comes in, relevant tools are identified, code is generated, and execution happens in a controlled environment. The system handles state and intermediate data outside the LLM.
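The end-to-end shape can be sketched like this. It is a simplified stand-in: the planner is a stub in place of a real LLM call, and plain exec stands in for the restricted runtime that Code Mode actually uses.

```python
# Simplified flow: one planning call produces code; a sandbox-style runner
# executes it in a single pass, keeping intermediate data out of the LLM.
def plan(request):
    # Stand-in for the LLM planning call; returns executable workflow code.
    return (
        "results = tools['search'](query)\n"
        "result = {'count': len(results)}\n"
    )

def run_in_sandbox(code, tools, query):
    # Stand-in for the restricted runtime; real Code Mode uses Starlark.
    scope = {"tools": tools, "query": query}
    exec(code, {}, scope)   # intermediate state stays inside `scope`
    return scope["result"]  # only the final result leaves the runtime

tools = {"search": lambda q: [f"hit for {q}"] * 3}
code = plan("find videos")
print(run_in_sandbox(code, tools, "AI infrastructure"))  # {'count': 3}
```

The key property is that the loop over tool calls lives inside the runtime, not inside a sequence of LLM turns.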
This approach improves clarity in how tasks are executed. The generated code can be inspected, debugged, and understood directly. Each request follows a defined path, which makes behavior easier to track and reason about.
Getting started required two commands. First, start the gateway:
npx -y @maximhq/bifrost
Then launch the CLI from a separate terminal:
npx -y @maximhq/bifrost-cli
MCP servers are registered once through the API. The key flag is is_code_mode_client, which tells Bifrost to handle that server through Code Mode instead of sending its tool definitions on every request:
curl -X POST http://localhost:8080/api/mcp/client \
-H "Content-Type: application/json" \
-d '{
"name": "youtube",
"connection_type": "http",
"connection_string": "http://localhost:3001/mcp",
"tools_to_execute": ["*"],
"is_code_mode_client": true
}'
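The same registration can be scripted. Here is a sketch in Python; the endpoint and fields are taken from the curl call above, while the helper function itself is ours.

```python
# Builds the same registration body as the curl call above.
# is_code_mode_client routes the server through Code Mode instead of
# inlining its tool definitions on every request.
def code_mode_client_payload(name, url, tools=("*",)):
    return {
        "name": name,
        "connection_type": "http",
        "connection_string": url,
        "tools_to_execute": list(tools),
        "is_code_mode_client": True,
    }

payload = code_mode_client_payload("youtube", "http://localhost:3001/mcp")
# Then POST it to the gateway, e.g. with requests:
# requests.post("http://localhost:8080/api/mcp/client", json=payload)
print(payload)
```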
Once registered, the LLM discovers tools on demand using listToolFiles and readToolFile, then submits a full execution plan through executeToolCode. A workflow that previously took six LLM turns now completes in three to four.
Bifrost organizes tool definitions using two binding levels. Server-level, the default, groups all tools from a server into one .pyi file. Tool-level gives each tool its own file, which works better for servers with 30 or more tools. Set it once in config.json:
{
"mcp": {
"tool_manager_config": {
"code_mode_binding_level": "server"
}
}
}
Debugging became simpler because the generated code is the execution plan. When something went wrong, the issue was visible directly in the code rather than buried in prompt chains. This setup also made execution easier to inspect.
results = youtube.search(query="AI infrastructure", maxResults=5)
titles = [item["snippet"]["title"] for item in results["items"]]
result = {"titles": titles, "count": len(titles)}
The execution runs in a Starlark interpreter, a restricted subset of Python. A few constraints to keep in mind:
- No while loops; iteration must use bounded for loops
- No recursion
- No classes; data is handled with lists, dicts, and functions
- No arbitrary imports; the code works only with the tool bindings it is given
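One practical consequence: an unbounded while-style retry or pagination loop has to become a bounded for loop. The pattern below is valid in both Python and Starlark; the fetch_page tool name is illustrative, not a real binding.

```python
# Starlark has no `while`, so pagination is written as a bounded `for` loop
# with an early exit once a page comes back empty.
def collect_pages(fetch_page, max_pages=10):
    items = []
    for page in range(max_pages):
        batch = fetch_page(page)
        if not batch:
            break  # stop early instead of looping forever
        items.extend(batch)
    return items

# Toy fetcher: three pages of data, then empty.
pages = {0: [1, 2], 1: [3], 2: [4, 5]}
print(collect_pages(lambda p: pages.get(p, [])))  # [1, 2, 3, 4, 5]
```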
Code Mode also works with Agent Mode for automated workflows. The listToolFiles and readToolFile tools are always auto-executable since they are read-only.
The executeToolCode tool only auto-executes if every tool call within the generated code is on the approved list. If any call falls outside that list, Bifrost returns it to the user for approval before running.
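The gating logic amounts to a set check. A minimal sketch, with the call-extraction step simplified to a precomputed list of tool names:

```python
# Sketch of the approval gate: read-only discovery tools always auto-execute;
# generated code auto-executes only if every call it makes is approved.
ALWAYS_AUTO = {"listToolFiles", "readToolFile"}

def can_auto_execute(tool_calls, approved):
    # tool_calls: names the generated code would invoke (pre-extracted here).
    return all(call in ALWAYS_AUTO or call in approved for call in tool_calls)

approved = {"youtube.search"}
print(can_auto_execute(["readToolFile", "youtube.search"], approved))  # True
print(can_auto_execute(["youtube.search", "db.drop"], approved))       # False
```

When the check fails, the plan goes back to the user for approval instead of running.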
The reduction in token usage came from four specific changes:
- Tool definitions were loaded on demand instead of being sent on every call
- Step-by-step tool interaction moved out of the LLM and into generated code
- Intermediate results stayed inside the execution sandbox instead of flowing back through the context
- Context was built fresh for each request instead of growing across steps
These changes had a clear effect. Token usage dropped by nearly half. Latency dropped along with it. Execution became more predictable, since each request followed a defined path with fewer moving parts.
The broader takeaway is clear. Token cost comes from system design. Small changes in prompts or outputs help at the edges. The main overhead comes from the system's structure.
LLMs work best when they focus on planning. Managing execution through repeated loops adds cost and introduces variability. A separate execution layer keeps the flow stable and easier to understand. Context also needs careful control. It should be built for each request with only the required information. Letting it grow across steps results in unnecessary overhead and increased token usage.
Token inefficiency in MCP workflows comes from system design. Bifrost and Code Mode introduced a clear separation between planning and execution. The LLM handles planning, and the runtime handles execution. This brought immediate and measurable improvements in both cost and system behavior.
If you are working with MCP workflows at scale, Bifrost is worth exploring. The documentation provides a good starting point to set up the gateway, connect servers, and run workflows using Code Mode.