
Megan Folsom

GLM-5 runs locally on an M3 Ultra as a coding agent through OpenCode, but the default tooling made it unusable. Here's what actually worked.
I unboxed my new M3 Ultra Mac Studio over the weekend, and the first thing I wanted to do with it was try to fit a frontier-sized model on it. I watched some YouTube videos (too many YouTube videos of people in their homelabs. Hello, Network Chuck and Alex Ziskind.) and got all fired up with visions of pretty beehive dashboards and the sound of MLX Metal screaming down 512GB RAM highways.
When Z.ai unexpectedly dropped GLM-5, within hours all the home-labbers (is that a word?) were running headlines like "GLM-5 replaces Opus 4.6!" I saw those headlines and thought: this is my frontier. GLM-5 would be the model to break in my new hardware and spawn minion models that would herald a new era of local agentic coding and lifestyle enhancements (no. not through openclaw. Of course not through openclaw. I'd use something...else...secure and clawless...refuses to make eye contact).
My opening searches for how to run this model on my hardware were almost immediately rewarded by the discovery of a community MLX build of GLM-5! I pulled it down to my local drive and set it up to run on OpenCode. It took days. What seemed like days. Not to set it up. That was super quick. To get the response back from my first prompt (which was the word "hello"). I persisted, and actually sat there while the hours turned to days and GLM-5 took upwards of 30 minutes to respond to each and every prompt. I stuck it out, though, and prompted it to build its own WebUI, front-loading the prompt with a detailed waterfall-style requirements doc. This was painful to watch and required me to don my infinite patience cloak, but all the magic cloaks in the world couldn't hide the truth: the MLX path was a dead end for agent use.
One inarguable trait of mine is that I'm stubborn. I've always believed "Dead End" signs are just suggestions if you have the right vehicle. For my next move, I decided to try Unsloth's guide to running locally with a quantized GGUF. The guide was geared towards a GPU setup rather than macOS on Apple Silicon, so I had to improvise and ask Claude Code for help in spots.
Once I got this running on OpenCode, I clocked 20 full minutes for the time to first token. My immediate thought was that my beefy Mac Studio just wasn't enough to run this model, quantized or not. But I was seeing a few clues that told me this wasn't an issue with the hardware or the model. For one thing, a direct request to llama-server via curl came back in under 5 seconds. This pointed to OpenCode as the culprit. But it was actually more nuanced than that. I decided to try running it in the Claude Code CLI to see if another CLI would be better. That proved to be its own dead end, but what I learned from it helped me figure out what the issue was with OpenCode. If you're curious, you can find that writeup here: The Ghost in the CLI: Why Claude Code Kills Local Inference.
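(For the curious: that "direct request" is just hitting llama-server's OpenAI-compatible endpoint from the terminal. A minimal sketch, assuming the server is on port 8080 as in the setup later in this post:)

```bash
# A direct request to llama-server, bypassing the coding CLI entirely.
# Assumes the server is already running on port 8080 (see the setup section below).
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "hello"}],
        "max_tokens": 32
      }'
```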
The proxy server I ran to capture data on my Claude Code run hinted at what was happening with OpenCode too. The CLI was sending over 10K tokens of invisible overhead (tool definitions, policy prompts, etc.) with every request, and though OpenCode does have some support for parallelizing the tool definitions, llama.cpp apparently doesn't: they were getting processed as one giant 10K-token prompt on the server side. Once I figured this out, I realized GLM-5 might respond faster once that prompt was cached, and that turned out to be true.
Since you're still here and sat through the whole story, you probably want the technical tea. The least I can do is give you the complete guide to how I'm now running a frontier-class model with reasonable speed and writing code with it on my local computer. Cloud-free.
GLM-5 is a Mixture-of-Experts model, so it only activates a fraction of its 744B parameters per token. At IQ2_XXS quantization it fits in 225GB. Plenty of headroom on this machine.
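(Quick arithmetic on those two numbers: 225GB spread over 744B parameters comes out to roughly 2.5 bits per weight on average, which is what an IQ2-class quant buys you.)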
The MLX setup I used was the community 4-bit build (390GB), served through mlx-lm's server. MLX is Apple's own ML framework, purpose-built for Apple Silicon, and it was still painfully slow. Here's how it stacks up against the Unsloth GGUF (IQ2_XXS, 225GB) through llama-server:
| Setup | Model Size | First Turn | Subsequent Turns | Generation Speed |
|---|---|---|---|---|
| MLX (mlx-lm) | 390GB | ~20 min | ~20 min | ~0.5 tok/s |
| GGUF (llama-server) | 225GB | ~10-20 min | 2.6s | 14 tok/s |
Yes, the GGUF had a smaller footprint, but that didn't fully explain the painful slowness of the mlx-lm server.
I believe there were two elements causing this slowness:
**Prompt caching.** One of the key features of llama-server is that it caches the KV state of the prompt prefix between turns. The first turn chews through the full prompt; that's where the ~10-20 minutes goes (more on why it's so big in the next section). The second turn recognizes the prefix hasn't changed, skips prefill, and you're generating in 0.3 seconds.
mlx-lm does have prompt caching features (mlx_lm.cache_prompt, prompt_cache in the Python API, etc.) but the server mode (mlx_lm.server) never actually cached the prompt prefix between HTTP requests in my testing. Every turn paid the full prefill cost no matter how far into the conversation I was. There are known bugs around this: mlx-lm #259 reports different logits on repeat prompts, and LM Studio hit similar KV cache issues with their MLX engine. But a broken prompt cache triggers a domino effect as your context window builds up. Without working prompt caching in server mode, each turn reprocesses the entire conversation history from scratch (system prompt + tool definitions + every prior message), so response times just keep climbing the longer your session runs.
Fair warning: the first GGUF turn also takes 10-20 minutes, so it looks identical to the MLX problem. Don't give up. Send a second message. That's when you'll see the difference.
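If you want to watch the cache kick in outside of any coding CLI, a rough way to do it (a sketch, assuming the llama-server setup from later in this post, listening on port 8080) is to time two requests that share a long prefix:

```bash
# Rough illustration of prefix caching (assumes llama-server on :8080).
# Build a long shared system prompt, then time two requests that differ only in
# the final user message. The second one should skip most of the prefill work.
SYSTEM=$(printf 'You are a meticulous coding assistant. %.0s' {1..300})

ask() {
  curl -s http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d "{\"messages\": [{\"role\": \"system\", \"content\": \"$SYSTEM\"},
                        {\"role\": \"user\", \"content\": \"$1\"}],
         \"max_tokens\": 16}" > /dev/null
}

time ask "hello"        # cold: pays full prefill on the long system prompt
time ask "hello again"  # warm: the shared prefix should come out of the KV cache
```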
**The invisible CLI prompt.** For your first simple "hello" message, here's what actually gets sent:

```
POST /v1/chat/completions
Messages: 2
Tools field entries: 11
System message length: 10,082 chars
Tools field size: 36,614 chars
Total prompt tokens: 10,643
```
Over 10,000 tokens of system prompt and tool definitions before my message even shows up. The tools do go into the proper tools field, but llama-server's OpenAI endpoint serializes all of that into the prompt template as text, so every token has to go through prefill. That's the giant 10K-token prompt I mentioned earlier.
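To make the shape of that request concrete, here's a hypothetical, heavily trimmed sketch of what an agent CLI sends (the real thing carries 11 tool definitions and a ~10,000-character system prompt; the tool shown here is made up for illustration):

```bash
# Hypothetical, trimmed-down sketch of an agent-style request. llama-server renders
# the "tools" array into the text prompt via the chat template, so it all hits prefill.
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [
          {"role": "system", "content": "<long agent instructions go here>"},
          {"role": "user", "content": "hello"}
        ],
        "tools": [
          {
            "type": "function",
            "function": {
              "name": "read_file",
              "description": "Read a file from the workspace",
              "parameters": {
                "type": "object",
                "properties": { "path": { "type": "string" } },
                "required": ["path"]
              }
            }
          }
        ],
        "max_tokens": 256
      }'
```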
This is actually more bearable if you think of it as a one-time cost.
| Turn | Prefill | Generation | Total |
|---|---|---|---|
| 1st (cold) | 97.0s / 10,623 tok | 3.9s / 54 tok | 100.9s |
| 2nd (cached) | 0.3s / 5 tok | 2.3s / 33 tok | 2.6s |
For a coding session that goes dozens of turns, 98 seconds of cold start is nothing. The prompt cache makes those 10,000+ tokens invisible after the first message. Just be aware that anytime you trigger a new uncached prompt, you'll pay this cost again; reviewing an existing codebase would be really slow here. In my mind, though, this is one giant leap for nerdy woman-kind: a frontier-class model running on my homelab. I'll don my infinite patience cloak, and I probably won't have to wear it for very long. Maybe I'll make a YouTube video in my homelab. Just kidding. Let's face it, I'm not as photogenic as Network Chuck or Alex Ziskind.
First, pull down the quantized weights from Hugging Face:

```bash
pip install huggingface_hub
huggingface-cli download unsloth/GLM-5-GGUF \
  --local-dir ~/Models/GLM-5-GGUF \
  --include "*UD-IQ2_XXS*"
```
Six shards, ~225GB total. You need enough RAM for the model plus KV cache, so realistically 300GB+ (I had 512GB).
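A quick sanity check once the download finishes (paths assume the huggingface-cli command above):

```bash
# Confirm all six shards landed and eyeball the total size (~225GB).
ls -lh ~/Models/GLM-5-GGUF/UD-IQ2_XXS/*.gguf
du -ch ~/Models/GLM-5-GGUF/UD-IQ2_XXS/*.gguf | tail -1
```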
```bash
# Build from source with Metal support
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
  -DBUILD_SHARED_LIBS=OFF -DGGML_METAL=ON
cmake --build llama.cpp/build --config Release -j

# Run
llama.cpp/build/bin/llama-server \
  --model ~/Models/GLM-5-GGUF/UD-IQ2_XXS/GLM-5-UD-IQ2_XXS-00001-of-00006.gguf \
  --ctx-size 65536 --parallel 1 --port 8080
```
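Loading 225GB of shards takes a while on its own, so it's worth confirming the server is actually ready before wiring in OpenCode; llama-server exposes a simple health endpoint for that:

```bash
# Reports a loading status while the shards are being read into memory,
# and an OK status once the server is ready to take requests.
curl -s http://localhost:8080/health
```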
Notes:
- --ctx-size 65536 gives you room for tool definitions + conversation history
- --parallel 1 keeps memory usage predictable with a single inference slot

Add to ~/.config/opencode/opencode.json:
```json
{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "llama.cpp": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "llama.cpp (local)",
      "options": {
        "baseURL": "http://localhost:8080/v1"
      },
      "models": {
        "GLM-5-UD-IQ2_XXS-00001-of-00006.gguf": {
          "name": "GLM-5 IQ2_XXS",
          "limit": { "context": 128000, "output": 8192 }
        }
      }
    }
  }
}
```
Then launch OpenCode with the local model:

```bash
opencode -m "llama.cpp/GLM-5-UD-IQ2_XXS-00001-of-00006.gguf"
```
First message will take a while. Go get coffee. Play Doom. This will be worth it.
Claude Code also supports local models via ANTHROPIC_BASE_URL, and llama-server has an Anthropic Messages API endpoint. But Claude Code sends a bunch of internal background requests that crash a single-slot local server before your prompt ever reaches the model. I wrote a proxy to deal with it, and debugging that proxy is actually how I figured out the 10K token overhead issue. That's a whole separate writeup: The Ghost in the CLI: Why Claude Code Kills Local Inference.
My journey to GLM-5 was exactly like that sentence sounds: a trip to a far-off planet full of mysterious black holes, scientific conundrums, and strange alien symbols. Most of all, it was about the long passage of time. Running on an MLX server technically worked but was unusable: prompt caching never kicked in, and the growing context meant each turn was slower than the last. Running an Unsloth quantization was the best choice for me, even though you can't hide from the tool tax. Unlike with zippy cloud models, you will notice the 10K tokens of invisible overhead, because they dominate your first local interaction (and any uncached prompt thereafter).
For me this was really about adjusting my expectations. But I'm here to tell you, if you have the hardware to run it, save your tokens and get your Doom game ready. You might just unlock a doorway to the future.
Hardware: M3 Ultra Mac Studio, 512GB | Model: unsloth/GLM-5-GGUF IQ2_XXS (225GB) | Server: llama.cpp with Metal | Agent: OpenCode