
Megan Folsom

Claude Code supports custom API endpoints, but ghost Haiku calls, missing endpoints, and request floods crash local servers before your prompt ever runs. Here's what's happening and how to fix it.
It was a rainy Sunday. The kind of Sunday that makes you want to stay inside with Claude Code and a good book. But I knew this wasn't going to be an ordinary Sunday the minute GLM-5 showed up and sweet-talked its way into my Claude Code CLI. Claude Code, though, wasn't having it. You see, when you point Claude Code at a local base URL for local inference, you're inviting a poltergeist into your terminal.
For a weekend project, my mission was to set up a local version of GLM-5 as a coding agent on my new M3 Ultra. My reasons for deciding to run my local quantized GLM-5 in Claude Code are documented in my companion article, Finding my Frontier: Cloud free coding on GLM-5.
I thought this would be straightforward. Claude Code has an ANTHROPIC_BASE_URL env var, llama-server has an Anthropic Messages API endpoint, so on paper it's a walk in the park. But once I had it set up, it segfaulted immediately, before my prompt even reached the model. My technical investigation led to some very interesting findings.
Claude Code has growing support for running open source models, and the open source community is embracing it too. Ollama lets you launch models directly in Claude Code, and some frontier-class open source models recommend Claude Code as the primary way to access them. These integrations are typically optimized for cloud-hosted versions of the models, though, not local inference. I love the Claude Code CLI, and the idea of having some of its coolest features baked into your open source coding setup is very tempting. But my job today is to dampen your enthusiasm.
See my companion article, Finding my Frontier: Cloud free coding on GLM-5, for the full OpenCode setup guide and the MLX vs GGUF performance story.
After the crash, I hit llama-server's Anthropic Messages API directly, and GLM-5 handled tool calling without a problem:
curl -s 'http://localhost:8080/v1/messages' \
  -H 'Content-Type: application/json' \
  -H 'x-api-key: none' \
  -H 'anthropic-version: 2023-06-01' \
  -d '{
    "model": "test",
    "max_tokens": 50,
    "tools": [{
      "name": "get_weather",
      "description": "Get weather for a location",
      "input_schema": {
        "type": "object",
        "properties": {
          "location": {"type": "string"}
        },
        "required": ["location"]
      }
    }],
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}]
  }'
This is 164 input tokens, 50 output tokens, and a prompt reply (pun intended) in 4.7 seconds. A 744B model doing structured tool calling on consumer hardware. The model isn't the problem here. Then I pointed Claude Code at the same server:
ANTHROPIC_BASE_URL="http://localhost:8080" \
ANTHROPIC_API_KEY="none" \
claude --model GLM-5-UD-IQ2_XXS-00001-of-00006.gguf
Dead server. Not even a useful error message.
To see what was happening under the surface, I dropped a logging proxy between Claude Code and llama-server. I needed to see the exact moment the handshake turned into a death spiral.
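If you want to capture a trace like the one below yourself, a bare-bones logging proxy is enough to start with. Here's a minimal sketch, not the proxy I ended up with (that comes later), and the port and log format are arbitrary choices: it prints the path, model, and tool count of every POST, forwards the request to llama-server, and relays whatever comes back.

#!/usr/bin/env python3
"""Bare-bones logging proxy for diagnosing Claude Code -> llama-server traffic."""
import json
from http.server import HTTPServer, BaseHTTPRequestHandler
from urllib.request import Request, urlopen
from urllib.error import HTTPError

TARGET = "http://127.0.0.1:8080"  # llama-server

class LoggingProxy(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        try:
            data = json.loads(body)
        except json.JSONDecodeError:
            data = {}
        print(f"POST {self.path} | model={data.get('model', '?')} "
              f"| tools={len(data.get('tools', []))}", flush=True)
        # Forward the request as-is and relay the response (or the error).
        fwd_headers = {k: v for k, v in self.headers.items()
                       if k.lower() in ("content-type", "x-api-key",
                                        "anthropic-version", "authorization")}
        try:
            resp = urlopen(Request(f"{TARGET}{self.path}", data=body,
                                   method="POST", headers=fwd_headers), timeout=600)
            status, payload = resp.status, resp.read()
        except HTTPError as e:
            status, payload = e.code, (e.read() if e.fp else b"")
        except OSError as e:
            status, payload = 502, str(e).encode()
        print(f"  -> {status}", flush=True)
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):
        pass

HTTPServer(("127.0.0.1", 9090), LoggingProxy).serve_forever()

Point ANTHROPIC_BASE_URL at port 9090 instead of 8080, and every request Claude Code makes shows up in the terminal.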
The logs revealed a massacre.
[1] POST /v1/messages | model=claude-haiku-4-5-20251001 | tools=0 → 200 OK
[2] POST /v1/messages | model=claude-haiku-4-5-20251001 | tools=0 → 200 OK
[3] POST /v1/messages/count_tokens | model=GLM-5... | tools=1 → intercepted
[4] POST /v1/messages/count_tokens | model=GLM-5... | tools=1 → intercepted
...
[8] POST /v1/messages | model=claude-haiku-4-5-20251001 | tools=1 → CRASH (segfault)
[9+] Everything after → Connection refused
This revealed three separate problems. Any one of them kills the server on its own.
What on earth was Haiku doing there? I checked every configuration file; I knew for sure I hadn’t invited it.
As it turns out, Claude Code is a creature of habit. It sends internal requests to claude-haiku-4-5-20251001 for housekeeping stuff (things like generating conversation titles, filtering tools, other background tasks). When you set ANTHROPIC_BASE_URL, all of those get routed to your local server.
In one session I counted 37 Haiku requests before the actual inference request even got sent. Title generation, tool checking for each of 30+ MCP tools, all hitting a server that has never even heard of Haiku.
But that wasn't all. Before the actual inference request, Claude Code hits /v1/messages/count_tokens with one request per tool group. This endpoint doesn't exist in llama-server, so it returns a 404 that Claude Code doesn't handle gracefully.
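You can confirm the missing endpoint without Claude Code in the loop. A quick probe, standard library only (on my llama-server build this path isn't served, so expect a 404; the exact status may vary by build):

# Probe llama-server for the count_tokens endpoint Claude Code expects.
import json
from urllib.request import Request, urlopen
from urllib.error import HTTPError

payload = json.dumps({
    "model": "test",
    "messages": [{"role": "user", "content": "hello"}],
}).encode()
req = Request("http://localhost:8080/v1/messages/count_tokens", data=payload,
              headers={"Content-Type": "application/json",
                       "anthropic-version": "2023-06-01"})
try:
    resp = urlopen(req, timeout=30)
    print(resp.status, resp.read().decode())
except HTTPError as e:
    print(e.code, e.read().decode())  # a 404 here is the problem described above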
The gasoline on this particular fire is one of Claude Code's best features, but it's a concurrency mismatch for poor little llama-server: Haiku calls into the ether, count_tokens calls, and a parallel request to run the inference for your prompt, all at once. A single-slot llama-server can't handle concurrent requests, which results in, you guessed it, a croaked-out "se-egfault" just before the server's untimely demise (I might have watched too many British police procedurals).
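If you want to see the concurrency problem in isolation, you can do what Claude Code effectively does and fire two requests at llama-server at once. A rough sketch only; whether the second request queues, errors, or takes the server down will depend on your build and --parallel setting, so treat it as a probe rather than a guaranteed crash reproduction.

# Fire two /v1/messages requests at llama-server concurrently -- roughly the
# request pattern Claude Code produces. Behaviour depends on your server config.
import json, threading
from urllib.request import Request, urlopen

def hit(tag):
    body = json.dumps({
        "model": "test", "max_tokens": 20,
        "messages": [{"role": "user", "content": f"hello from {tag}"}],
    }).encode()
    req = Request("http://localhost:8080/v1/messages", data=body,
                  headers={"Content-Type": "application/json",
                           "anthropic-version": "2023-06-01"})
    try:
        print(tag, urlopen(req, timeout=600).status)
    except OSError as e:
        print(tag, "failed:", e)

threads = [threading.Thread(target=hit, args=(f"req{i}",)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()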
The GLM-5 inference request (in this case a simple "hello"), which is actually the one I cared about, never made it to the server. It was stuck behind crashed Haiku calls and preflight requests hitting endpoints that aren't there.
Okay, I admit, my fix was hacky. But it worked. Instead of waiting for upstream fixes, I wrote a proxy that sits between Claude Code and llama-server. It does three things: fakes all Haiku responses, intercepts count_tokens, and serializes real requests so they don't flood the server. Here's the walkthrough.
Standard library only. The proxy listens on port 9090 and forwards real requests to llama-server on 8080. All real inference requests go through a single-threaded queue so the server only ever sees one at a time.
#!/usr/bin/env python3
"""
Smart proxy for Claude Code -> llama-server.
Serializes requests, intercepts count_tokens, fakes Haiku calls.
"""
import json, threading, queue, time
from http.server import HTTPServer, BaseHTTPRequestHandler
from urllib.request import Request, urlopen
from urllib.error import HTTPError
TARGET = "http://127.0.0.1:8080"
request_queue = queue.Queue()
response_slots = {}
slot_lock = threading.Lock()
request_timestamps = {}
This is the single-file line to llama-server. Requests go into the queue; this worker thread sends them one at a time and stashes each response so the original handler can pick it up.
def worker():
    while True:
        req_id, method, path, headers, body = request_queue.get()
        t_start = time.time()
        try:
            req = Request(f"{TARGET}{path}", data=body, method=method)
            for k, v in headers.items():
                req.add_header(k, v)
            resp = urlopen(req, timeout=600)
            resp_data = resp.read()
            resp_headers = dict(resp.getheaders())
            elapsed = time.time() - t_start
            print(f"[{req_id}] <- {resp.status} | {elapsed:.1f}s", flush=True)
            with slot_lock:
                response_slots[req_id] = ("ok", resp.status, resp_headers, resp_data)
        except HTTPError as e:
            error_body = e.read() if e.fp else b""
            with slot_lock:
                response_slots[req_id] = ("http_error", e.code, {}, error_body)
        except Exception as e:
            with slot_lock:
                response_slots[req_id] = ("error", 502, {}, str(e).encode())
        finally:
            request_timestamps.pop(req_id, None)
            request_queue.task_done()

threading.Thread(target=worker, daemon=True).start()
req_counter = 0
counter_lock = threading.Lock()
When Claude Code sends a Haiku request (title generation, tool filtering, etc.), we don't bother the model. We just send back a minimal valid Anthropic Messages API response. Claude Code gets what it needs, the model never knows it happened.
def fake_response(handler, req_id, model, text):
    """Return a minimal Anthropic Messages API response."""
    fake = {
        "id": f"msg_{req_id}", "type": "message", "role": "assistant",
        "content": [{"type": "text", "text": text}],
        "model": model, "stop_reason": "end_turn", "stop_sequence": None,
        "usage": {"input_tokens": 10, "output_tokens": 1}
    }
    body = json.dumps(fake).encode()
    handler.send_response(200)
    handler.send_header("Content-Type", "application/json")
    handler.send_header("Content-Length", str(len(body)))
    handler.end_headers()
    handler.wfile.write(body)
This is where the routing logic lives. Every POST gets inspected and sent down one of three paths:
- count_tokens requests get a fake token estimate and never touch the server.
- Haiku requests get a canned reply via the fake_response helper above.
- Everything else is real inference and goes into the queue.

class SmartProxy(BaseHTTPRequestHandler):
    def do_POST(self):
        global req_counter
        with counter_lock:
            req_counter += 1
            req_id = req_counter
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        data = json.loads(body)
        model = data.get("model", "?")
        tools = data.get("tools", [])

        # 1. Intercept count_tokens
        if "count_tokens" in self.path:
            estimated = 500 * max(len(tools), 1)
            resp = json.dumps({"input_tokens": estimated}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(resp)))
            self.end_headers()
            self.wfile.write(resp)
            return

        # 2. Fake ALL Haiku calls
        if "haiku" in model.lower():
            system = data.get("system", [])
            is_title = False
            if isinstance(system, list):
                for b in system:
                    if isinstance(b, dict) and "new topic" in b.get("text", "").lower():
                        is_title = True
            elif isinstance(system, str) and "new topic" in system.lower():
                is_title = True
            if is_title:
                fake_response(self, req_id, model,
                              '{"isNewTopic": true, "title": "GLM-5 Chat"}')
            else:
                fake_response(self, req_id, model, "OK")
            return

        # 3. Real requests: serialize through queue
        print(f"[{req_id}] {model[:30]} | {len(tools)} tools -> queued", flush=True)
        headers_dict = {}
        for h in ["Content-Type", "Authorization", "x-api-key", "anthropic-version"]:
            if self.headers.get(h):
                headers_dict[h] = self.headers[h]
        request_timestamps[req_id] = time.time()
        request_queue.put((req_id, "POST", self.path, headers_dict, body))
        while True:
            time.sleep(0.05)
            with slot_lock:
                if req_id in response_slots:
                    result = response_slots.pop(req_id)
                    break
        status_type, code, resp_headers, resp_data = result
        self.send_response(code)
        for k, v in resp_headers.items():
            if k.lower() not in ("transfer-encoding", "content-length"):
                self.send_header(k, v)
        self.send_header("Content-Length", str(len(resp_data)))
        self.end_headers()
        self.wfile.write(resp_data)

    def log_message(self, *args):
        pass

HTTPServer(("127.0.0.1", 9090), SmartProxy).serve_forever()
Save the whole thing as claude-proxy.py and run it with python3 claude-proxy.py. That's it.
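Before pointing Claude Code at it, it's worth smoke-testing the proxy directly. This replays the same tool-calling request as the earlier curl test, just aimed at port 9090 so it flows through the queue and on to llama-server (a sketch, standard library only):

# Smoke test: send the earlier get_weather tool-calling request through the proxy.
import json
from urllib.request import Request, urlopen

body = json.dumps({
    "model": "test", "max_tokens": 50,
    "tools": [{
        "name": "get_weather",
        "description": "Get weather for a location",
        "input_schema": {"type": "object",
                         "properties": {"location": {"type": "string"}},
                         "required": ["location"]},
    }],
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
}).encode()
req = Request("http://localhost:9090/v1/messages", data=body,
              headers={"Content-Type": "application/json",
                       "x-api-key": "none",
                       "anthropic-version": "2023-06-01"})
print(urlopen(req, timeout=600).read().decode())

If the proxy logs the request as queued and a response comes back, the plumbing works.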
With the proxy in place, the picture changes completely. Claude Code's request flow goes from 42 chaotic requests to this:
[1] haiku title gen → fake response (instant)
[2] GLM-5 | 23 tools → queued
[2] ← 200 | 17.8s
[3] haiku title gen → fake response (instant)
The per-turn timings:

| Turn | TTFT (prefill time / tokens) | Generation (time / tokens) | Total | Notes |
|---|---|---|---|---|
| 1st (cold cache) | 336.6s / 24,974 tokens | 13.7s / 133 tok | 350.3s | Full prefill, tool defs + system prompt |
| 2nd (warm cache) | 0.1s / 1 token | 17.0s / 165 tok | 17.1s | Prompt cache hit |
| 3rd | 2.2s / 14 tokens | 15.6s / 151 tok | 17.8s | Near-instant prefill |
| 4th | 3.4s / 96 tokens | 10.8s / 104 tok | 14.1s | Stable |
First turn is 5.6 minutes. Every turn after that: 2-3 seconds to first token.
The first turn is slower than OpenCode (350s vs 100s) because Claude Code sends ~25K tokens of tool definitions (23 tools including Playwright, Figma, and the built-in ones like Read, Write, Bash, Glob, Grep, etc.) compared to OpenCode's ~10K. But llama-server's prompt cache means you only pay that cost once. After the first turn the server sees the 25K token prefix hasn't changed and skips straight to the new tokens.
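If you want to watch the cache work outside of Claude Code, time the same long-prefix request twice through the proxy. A rough sketch: the oversized system prompt stands in for Claude Code's tool definitions (the Anthropic spec allows a plain string here), and the exact speedup depends on your llama-server version and cache settings, but on a cache hit the second prefill should collapse to almost nothing.

# Send the same long-prefix request twice and compare wall-clock time.
import json, time
from urllib.request import Request, urlopen

LONG_PREFIX = "You are a coding agent. " * 2000  # stand-in for a big tool-definition prefix

def ask(question):
    body = json.dumps({
        "model": "test", "max_tokens": 32,
        "system": LONG_PREFIX,
        "messages": [{"role": "user", "content": question}],
    }).encode()
    req = Request("http://localhost:9090/v1/messages", data=body,
                  headers={"Content-Type": "application/json",
                           "anthropic-version": "2023-06-01"})
    t0 = time.time()
    urlopen(req, timeout=1200).read()
    return time.time() - t0

print(f"cold: {ask('hello'):.1f}s")
print(f"warm: {ask('hello again'):.1f}s")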
Three terminals:
# Terminal 1: llama-server
llama-server --model GLM-5-UD-IQ2_XXS-00001-of-00006.gguf \
--ctx-size 65536 --parallel 1 --port 8080
# Terminal 2: proxy
python3 claude-proxy.py
# Terminal 3: Claude Code
ANTHROPIC_BASE_URL="http://localhost:9090" \
ANTHROPIC_API_KEY="none" \
claude --model GLM-5-UD-IQ2_XXS-00001-of-00006.gguf
First turn will take ~6 minutes. Be patient. After that: ~15 seconds.
Claude Code's ANTHROPIC_BASE_URL feature technically supports custom endpoints. But the implementation assumes a cloud-scale API server on the other end. One that can handle parallel requests, implements every endpoint in the Anthropic spec, and doesn't mind servicing dozens of lightweight Haiku calls alongside heavyweight inference.
That's fine for cloud infrastructure. It's a completely broken assumption for a single-slot local server running a 225GB model. Local model support exists on paper but crashes in practice, and the failure mode (immediate segfault, no useful error message) makes it nearly impossible to diagnose without building your own proxy.
This proxy is a workaround, not a fix. The real solution would be for coding agents to detect local endpoints and skip the background services that assume cloud-scale infrastructure. Until then, 180 lines of Python bridge the gap.
But even with the proxy working, I still wouldn't recommend this as your daily coding setup. Claude Code was purpose-built for a specialized agentic flow that works really well with Anthropic models. Giving it to your local LLM as a hand-me-down is going to end in tears and segfaults (which you now hopefully know how to fix). Coding with this setup felt janky at best. If you want to run a local model as a coding agent, OpenCode is a much better fit. I wrote about that setup here.
So, is this the future of development? Will cloud models always be ahead of the open source local community?
Is anyone else running Claude Code with local LLMs for production work, or do you still fall back to the cloud when the "poltergeists" start acting up?
Hardware: M3 Ultra Mac Studio, 512GB | Model: unsloth/GLM-5-GGUF IQ2_XXS (225GB) | Server: llama.cpp with Metal | Proxy: claude-proxy.py