The Ghost in the CLI: Why Claude Code Kills Local Inference


#llm #ai #claudecode #localai

Megan Folsom

Claude Code supports custom API endpoints, but ghost Haiku calls, missing endpoints, and request floods crash local servers before your prompt ever runs. Here's what's happening and how to fix it.

It was a rainy Sunday. The kind of Sunday that makes you want to stay inside with Claude Code and a good book. But I knew this wasn't going to be an ordinary Sunday the minute GLM-5 showed up and sweet-talked its way into my Claude Code CLI. Claude Code, though, wasn't having it. You see, when you point Claude Code at a local base URL for local inference, you're inviting a poltergeist into your terminal.

For a weekend project, my mission was to set up a local version of GLM-5 as a coding agent on my new M3 Ultra. My reasons for deciding to run my local quantized GLM-5 in Claude Code are documented in my companion article, Finding my Frontier: Cloud free coding on GLM-5.

I thought this would be straightforward. Claude Code has an ANTHROPIC_BASE_URL env var, and llama-server has an Anthropic Messages API endpoint, so on paper it's a walk in the park. But once I had it set up, the server segfaulted immediately, before my prompt even reached the model. My technical investigation led to some very interesting findings.

Claude Code and Open Source Models

Claude Code has growing support for running open source models, and the open source community is embracing it too. Ollama lets you launch models directly in Claude Code, and some frontier-class open source models recommend it as the primary way to access them. These integrations are typically optimized for cloud-hosted versions of the models, though, not for local inference. I love the Claude Code CLI, and the idea of having some of its coolest features already baked into your open source coding setup is so very tempting. But my job today is to dampen your enthusiasm.

The Setup

  • Machine: M3 Ultra Mac Studio, 512GB unified RAM
  • Model: GLM-5 IQ2_XXS (225GB, GGUF via unsloth/GLM-5-GGUF)
  • Server: llama-server (llama.cpp) with Metal support
  • Goal: Use Claude Code with a local model instead of the Anthropic API

See my companion article, Finding my Frontier: Cloud free coding on GLM-5, for the full OpenCode setup guide and the MLX vs GGUF performance story.

The Model Works Fine on Its Own

After the crash, I ran GLM-5 directly against llama-server's Anthropic Messages API, and it handled tool calling with no problem:

curl -s 'http://localhost:8080/v1/messages' \
  -H 'Content-Type: application/json' \
  -H 'x-api-key: none' \
  -H 'anthropic-version: 2023-06-01' \
  -d '{
    "model": "test",
    "max_tokens": 50,
    "tools": [{
      "name": "get_weather",
      "description": "Get weather for a location",
      "input_schema": {
        "type": "object",
        "properties": {
          "location": {"type": "string"}
        },
        "required": ["location"]
      }
    }],
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}]
  }'

This is 164 input tokens, 50 output tokens, and a prompt reply (pun intended) in 4.7 seconds. A 744B model doing structured tool calling on consumer hardware. The model isn't the problem here.

Then I Plugged In Claude Code

ANTHROPIC_BASE_URL="http://localhost:8080" \
ANTHROPIC_API_KEY="none" \
claude --model GLM-5-UD-IQ2_XXS-00001-of-00006.gguf

Dead server. Not even a useful error message.

The Forensic Evidence

Going full Detectorist

To see what was happening under the surface, I dropped a logging proxy between Claude Code and llama-server. I needed to see the exact moment the handshake turned into a death spiral.
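
The logging proxy itself is nothing special; a minimal sketch of the idea looks like the following (the ports, the llama-server address, and the exact log format are assumptions to match the setup above, not the exact script I ran; all it has to do is record the method, path, model, and tool count before forwarding each request):

#!/usr/bin/env python3
"""Minimal logging passthrough: Claude Code -> (this proxy) -> llama-server.
A sketch of the idea, not the exact script I used."""
import json
from http.server import HTTPServer, BaseHTTPRequestHandler
from urllib.request import Request, urlopen
from urllib.error import HTTPError

TARGET = "http://127.0.0.1:8080"  # assumed llama-server address

class LoggingProxy(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        try:
            data = json.loads(body)
        except ValueError:
            data = {}
        # Log what Claude Code is actually sending before it reaches the server
        print(f"POST {self.path} | model={data.get('model', '?')} "
              f"| tools={len(data.get('tools', []))}", flush=True)
        try:
            req = Request(f"{TARGET}{self.path}", data=body, method="POST",
                          headers={"Content-Type": "application/json"})
            with urlopen(req, timeout=600) as resp:
                payload, status = resp.read(), resp.status
        except HTTPError as e:
            payload, status = (e.read() if e.fp else b""), e.code
        except Exception as e:
            # Covers the moment llama-server dies and starts refusing connections
            payload, status = str(e).encode(), 502
        print(f"  -> {status}", flush=True)
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):
        pass

# Point ANTHROPIC_BASE_URL at http://localhost:9090 to capture the traffic
HTTPServer(("127.0.0.1", 9090), LoggingProxy).serve_forever()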

The logs revealed a massacre.

[1]  POST /v1/messages | model=claude-haiku-4-5-20251001 | tools=0    → 200 OK
[2]  POST /v1/messages | model=claude-haiku-4-5-20251001 | tools=0    → 200 OK
[3]  POST /v1/messages/count_tokens | model=GLM-5...     | tools=1    → intercepted
[4]  POST /v1/messages/count_tokens | model=GLM-5...     | tools=1    → intercepted
...
[8]  POST /v1/messages | model=claude-haiku-4-5-20251001 | tools=1    → CRASH (segfault)
[9+] Everything after → Connection refused

This revealed three separate problems. Any one of them kills the server on its own.

Ghost Haiku Calls

What on earth was Haiku doing there? I checked every configuration file; I knew for sure I hadn’t invited it.

As it turns out, Claude Code is a creature of habit. It sends internal requests to claude-haiku-4-5-20251001 for housekeeping stuff (things like generating conversation titles, filtering tools, other background tasks). When you set ANTHROPIC_BASE_URL, all of those get routed to your local server.

In one session I counted 37 Haiku requests before the actual inference request even got sent. Title generation, tool checking for each of 30+ MCP tools, all hitting a server that has never even heard of Haiku.

Token Counting Preflight

But that wasn't all. Before the actual inference request, Claude Code hits /v1/messages/count_tokens with one request per tool group. This endpoint doesn't exist in llama-server, so it returns a 404 that Claude Code doesn't handle gracefully.
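
You can reproduce the missing endpoint without Claude Code in the picture. A quick stdlib probe (assuming llama-server is still listening on port 8080, as above) is enough to show it:

# Check whether the preflight endpoint Claude Code expects actually exists.
# Assumes llama-server on port 8080, as in the setup above.
import json
from urllib.request import Request, urlopen
from urllib.error import HTTPError

payload = json.dumps({
    "model": "test",
    "messages": [{"role": "user", "content": "hello"}],
}).encode()

req = Request(
    "http://localhost:8080/v1/messages/count_tokens",
    data=payload,
    headers={"Content-Type": "application/json",
             "anthropic-version": "2023-06-01"},
)
try:
    with urlopen(req, timeout=30) as resp:
        print("count_tokens exists:", resp.status)
except HTTPError as e:
    print("count_tokens missing:", e.code)  # 404 in my case, per the logs above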

Concurrent Request Flood

The gasoline on the fire is one of Claude Code's best features, which turns out to be a concurrency mismatch for poor little llama-server: Haiku calls to the ether, count_tokens calls, and a parallel request to run the inference for your prompt, all in flight at once. A single-slot llama-server can't handle concurrent requests, which results in, you guessed it, a croaked-out "se-egfault" just before the server's untimely demise (I might have watched too many British police procedurals).

The GLM-5 inference request (in this case a simple "hello"), which is actually the one I cared about, never made it to the server. It was stuck behind crashed Haiku calls and preflight requests hitting endpoints that aren't there.

Here's what that looks like:

Without Proxy: Claude Code fires Haiku, count_tokens, and GLM-5 requests in parallel at a single-slot llama-server, resulting in segfault

Exorcism by Proxy: 180 Lines of Python

Okay, I admit, this was a hacky fix. But it worked. Instead of waiting for upstream fixes, I wrote a proxy that sits between Claude Code and llama-server. It does three things: fakes all Haiku responses, intercepts count_tokens, and serializes real requests so they don't flood the server. Here's the walkthrough.

The plumbing

Standard library only. The proxy listens on port 9090 and forwards real requests to llama-server on 8080. All real inference requests go through a single-threaded queue so the server only ever sees one at a time.

#!/usr/bin/env python3
"""
Smart proxy for Claude Code -> llama-server.
Serializes requests, intercepts count_tokens, fakes Haiku calls.
"""
import json, threading, queue, time
from http.server import HTTPServer, BaseHTTPRequestHandler
from urllib.request import Request, urlopen
from urllib.error import HTTPError

TARGET = "http://127.0.0.1:8080"

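# Shared state: real requests wait in request_queue, the worker thread drains it
# one at a time, and each upstream response is parked in response_slots by req_id.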
request_queue = queue.Queue()
response_slots = {}
slot_lock = threading.Lock()
request_timestamps = {}

The worker thread

This is the single-file line to llama-server. Requests go into the queue; this thread sends them to the server one at a time and stashes each response so the original handler can pick it up.

def worker():
    while True:
        req_id, method, path, headers, body = request_queue.get()
        t_start = time.time()
        try:
            req = Request(f"{TARGET}{path}", data=body, method=method)
            for k, v in headers.items():
                req.add_header(k, v)
            resp = urlopen(req, timeout=600)
            resp_data = resp.read()
            resp_headers = dict(resp.getheaders())
            elapsed = time.time() - t_start
            print(f"[{req_id}] <- {resp.status} | {elapsed:.1f}s", flush=True)
            with slot_lock:
                response_slots[req_id] = ("ok", resp.status, resp_headers, resp_data)
        except HTTPError as e:
            error_body = e.read() if e.fp else b""
            with slot_lock:
                response_slots[req_id] = ("http_error", e.code, {}, error_body)
        except Exception as e:
            with slot_lock:
                response_slots[req_id] = ("error", 502, {}, str(e).encode())
        finally:
            request_timestamps.pop(req_id, None)
            request_queue.task_done()

threading.Thread(target=worker, daemon=True).start()
req_counter = 0
counter_lock = threading.Lock()

Faking Haiku responses

When Claude Code sends a Haiku request (title generation, tool filtering, etc.), we don't bother the model. We just send back a minimal valid Anthropic Messages API response. Claude Code gets what it needs, the model never knows it happened.

def fake_response(handler, req_id, model, text):
    """Return a minimal Anthropic Messages API response."""
    fake = {
        "id": f"msg_{req_id}", "type": "message", "role": "assistant",
        "content": [{"type": "text", "text": text}],
        "model": model, "stop_reason": "end_turn", "stop_sequence": None,
        "usage": {"input_tokens": 10, "output_tokens": 1}
    }
    body = json.dumps(fake).encode()
    handler.send_response(200)
    handler.send_header("Content-Type", "application/json")
    handler.send_header("Content-Length", str(len(body)))
    handler.end_headers()
    handler.wfile.write(body)

The main proxy handler

This is where the routing logic lives. Every POST gets inspected and sent down one of three paths:

  1. count_tokens requests get a fake estimate and never touch the server.
  2. Haiku requests get a fake response. Title generation requests get a slightly smarter fake that includes a JSON title so Claude Code's UI still works.
  3. Everything else (your actual GLM-5 inference) goes into the queue and waits for the worker thread to process it.

class SmartProxy(BaseHTTPRequestHandler):
    def do_POST(self):
        global req_counter
        with counter_lock:
            req_counter += 1
            req_id = req_counter

        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        data = json.loads(body)
        model = data.get("model", "?")
        tools = data.get("tools", [])

        # 1. Intercept count_tokens
        if "count_tokens" in self.path:
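            # Claude Code only needs a plausible number here; a rough per-tool
            # estimate keeps it happy without the endpoint existing upstream.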
            estimated = 500 * max(len(tools), 1)
            resp = json.dumps({"input_tokens": estimated}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(resp)))
            self.end_headers()
            self.wfile.write(resp)
            return

        # 2. Fake ALL Haiku calls
        if "haiku" in model.lower():
            system = data.get("system", [])
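            # Heuristic: treat requests whose system prompt mentions a "new topic"
            # as title generation, and give those a JSON-shaped fake below.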
            is_title = False
            if isinstance(system, list):
                for b in system:
                    if isinstance(b, dict) and "new topic" in b.get("text", "").lower():
                        is_title = True
            elif isinstance(system, str) and "new topic" in system.lower():
                is_title = True

            if is_title:
                fake_response(self, req_id, model,
                    '{"isNewTopic": true, "title": "GLM-5 Chat"}')
            else:
                fake_response(self, req_id, model, "OK")
            return

        # 3. Real requests: serialize through queue
        print(f"[{req_id}] {model[:30]} | {len(tools)} tools -> queued", flush=True)
        headers_dict = {}
        for h in ["Content-Type", "Authorization", "x-api-key", "anthropic-version"]:
            if self.headers.get(h):
                headers_dict[h] = self.headers[h]

        request_timestamps[req_id] = time.time()
        request_queue.put((req_id, "POST", self.path, headers_dict, body))

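        # Poll until the worker thread parks our response; crude, but it keeps
        # the handler stdlib-only without condition variables.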
        while True:
            time.sleep(0.05)
            with slot_lock:
                if req_id in response_slots:
                    result = response_slots.pop(req_id)
                    break

        status_type, code, resp_headers, resp_data = result
        self.send_response(code)
        for k, v in resp_headers.items():
            if k.lower() not in ("transfer-encoding", "content-length"):
                self.send_header(k, v)
        self.send_header("Content-Length", str(len(resp_data)))
        self.end_headers()
        self.wfile.write(resp_data)

    def log_message(self, *args):
        pass

Start it up

HTTPServer(("127.0.0.1", 9090), SmartProxy).serve_forever()

Save the whole thing as claude-proxy.py and run it with python3 claude-proxy.py. That's it.

How It Looks Now

With the proxy in place, the picture changes completely:

With Proxy: Proxy intercepts Haiku and count_tokens with instant fakes, forwards only GLM-5 inference one at a time to llama-server

Claude Code's request flow goes from 42 chaotic requests to this:

[1] haiku title gen → fake response (instant)
[2] GLM-5 | 23 tools → queued
[2] ← 200 | 17.8s
[3] haiku title gen → fake response (instant)

Performance

| Turn | TTFT (prefill) | Generation | Total | Notes |
|---|---|---|---|---|
| 1st (cold cache) | 336.6s / 24,974 tokens | 13.7s / 133 tok | 350.3s | Full prefill, tool defs + system prompt |
| 2nd (warm cache) | 0.1s / 1 token | 17.0s / 165 tok | 17.1s | Prompt cache hit |
| 3rd | 2.2s / 14 tokens | 15.6s / 151 tok | 17.8s | Near-instant prefill |
| 4th | 3.4s / 96 tokens | 10.8s / 104 tok | 14.1s | Stable |

First turn: nearly six minutes. Every turn after that: 2-3 seconds to first token.

The first turn is slower than OpenCode (350s vs 100s) because Claude Code sends ~25K tokens of tool definitions (23 tools including Playwright, Figma, and the built-in ones like Read, Write, Bash, Glob, Grep, etc.) compared to OpenCode's ~10K. But llama-server's prompt cache means you only pay that cost once. After the first turn the server sees the 25K token prefix hasn't changed and skips straight to the new tokens.
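
For a rough sense of why the cold turn is so expensive, the table's own numbers work out like this (just the figures above restated and rounded, not a new measurement):

# Figures taken from the performance table above
cold_prefill_tokens, cold_prefill_seconds = 24_974, 336.6
warm_prefill_tokens, warm_prefill_seconds = 14, 2.2   # 3rd turn

print(f"cold prefill rate: {cold_prefill_tokens / cold_prefill_seconds:.0f} tok/s")  # ~74 tok/s
print(f"warm turn re-processes only {warm_prefill_tokens} new tokens in "
      f"{warm_prefill_seconds}s; the ~25K-token prefix comes from the prompt cache")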

Usage

Three terminals:

# Terminal 1: llama-server
llama-server --model GLM-5-UD-IQ2_XXS-00001-of-00006.gguf \
  --ctx-size 65536 --parallel 1 --port 8080

# Terminal 2: proxy
python3 claude-proxy.py

# Terminal 3: Claude Code
ANTHROPIC_BASE_URL="http://localhost:9090" \
ANTHROPIC_API_KEY="none" \
claude --model GLM-5-UD-IQ2_XXS-00001-of-00006.gguf

First turn will take ~6 minutes. Be patient. After that: ~15 seconds.

Why I Still Don't Recommend It

Claude Code's ANTHROPIC_BASE_URL feature technically supports custom endpoints. But the implementation assumes a cloud-scale API server on the other end. One that can handle parallel requests, implements every endpoint in the Anthropic spec, and doesn't mind servicing dozens of lightweight Haiku calls alongside heavyweight inference.

That's fine for cloud infrastructure. It's a completely broken assumption for a single-slot local server running a 225GB model. Local model support exists on paper but crashes in practice, and the failure mode (immediate segfault, no useful error message) makes it nearly impossible to diagnose without building your own proxy.

This proxy is a workaround, not a fix. The real solution would be for coding agents to detect local endpoints and skip the background services that assume cloud-scale infrastructure. Until then, 180 lines of Python bridge the gap.

But even with the proxy working, I still wouldn't recommend this as your daily coding setup. Claude Code was purpose-built for a specialized agentic flow that works really well with Anthropic models. Giving it to your local LLM as a hand-me-down is going to end in tears and segfaults (which you now hopefully know how to fix). Coding with this setup felt janky at best. If you want to run a local model as a coding agent, OpenCode is a much better fit. I wrote about that setup here.

Your Turn

So, is this the future of development? Will cloud models always be ahead of the open source local community?

Is anyone else running Claude Code with local LLMs for production work, or do you still fall back to the cloud when the "poltergeists" start acting up?

Drop your setup and your survival stories in the comments.

Hardware: M3 Ultra Mac Studio, 512GB | Model: unsloth/GLM-5-GGUF IQ2_XXS (225GB) | Server: llama.cpp with Metal | Proxy: claude-proxy.py