Hieu Pham
You finally set up a local AI agent to help you tackle your dev backlog (if you haven't yet, check out my guide on how to run NemoClaw with a local LLM & connect to Telegram).
The goal is simple: feed it your local codebase so it can help you refactor complex components, map out new business logic, or write comprehensive unit tests—all without sending proprietary company code to an external API.
You fire up an agentic framework like NemoClaw on your RTX 4080, paste in your prompt, and... the agent completely loses its mind.
Instead of writing code, it either ghosts you, dumps a wall of unformatted JSON into your terminal, or gets trapped in an infinite 3-second retry loop until the session crashes.
After spending a full day digging through API logs, I realized this isn't a network bug. It is a fundamental flaw in how local agent frameworks handle context windows, and it affects almost every developer trying to build private AI workflows.
If your local agent is stuck in an infinite loop or timing out, here is the exact architectural bottleneck causing it, and how to permanently fix it.
Frameworks like NemoClaw, AutoGen, and LangChain operate on a "Reasoning and Acting" (ReAct) loop. To make the AI autonomous, the framework secretly injects a massive set of invisible system instructions, tool schemas, and strict JSON formatting rules before it even attaches your actual question.
By the time you ask the agent to review a few hundred lines of code, your total prompt size easily explodes past 12,000 tokens.
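To see how fast that happens, here is a back-of-the-envelope sketch using the common rule of thumb of roughly 4 characters per token. The character counts are illustrative assumptions, not measured values:

```shell
# Back-of-the-envelope prompt size, assuming ~4 characters per token.
# Both character counts below are illustrative, not measured.
SYSTEM_PROMPT_CHARS=20000   # injected instructions, tool schemas, JSON rules
USER_CODE_CHARS=28000       # a few hundred lines of pasted source code
TOTAL_TOKENS=$(( (SYSTEM_PROMPT_CHARS + USER_CODE_CHARS) / 4 ))
echo "Approximate prompt size: ${TOTAL_TOKENS} tokens"   # -> "Approximate prompt size: 12000 tokens"
```

Even with conservative numbers, the invisible framework overhead alone can dwarf the code you actually pasted in.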
Here is where the pipeline breaks: Ollama defaults to a 4,096-token context window. When your 12,000-token prompt arrives, the oldest tokens are silently truncated, and since the framework's system instructions and tool schemas sit at the very top of the prompt, they are the first to go. The model never sees its formatting rules, so it replies in malformed JSON, the framework rejects the reply and retries, and the loop never ends.
The obvious fix that doesn't work (OLLAMA_NUM_CTX)
Your first instinct as a lead developer is probably to just restart the server and force a larger context window via an environment variable: OLLAMA_NUM_CTX=16384 ollama serve.
This will not work. Most agent frameworks communicate with Ollama via the OpenAI compatibility endpoint (/v1/chat/completions). If the client framework doesn't explicitly declare a custom context size in its JSON payload, that specific endpoint completely ignores your environment variable and forces the model back to its 4k default.
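For comparison, Ollama's native /api/chat endpoint does let each request declare its own context size via an options.num_ctx field. A framework speaking the native API could send a payload along these lines (the message content here is purely illustrative):

```json
{
  "model": "qwen2.5:14b",
  "messages": [
    { "role": "user", "content": "Review this function for edge cases..." }
  ],
  "options": { "num_ctx": 16384 }
}
```

But since most agent frameworks only speak the OpenAI-compatible dialect, you can't count on that field ever being sent.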
To fix this, you have to stop fighting the endpoint and bake the larger context window directly into the model's DNA.
First, you need a highly capable "Instruct" model. With 16GB of VRAM on an RTX 4080, you have the perfect amount of hardware headroom to run a brilliant mid-weight model (like qwen2.5:14b) and a massive 16k context window without spilling over into agonizingly slow system RAM.
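A rough VRAM budget shows why this combination fits. The qwen2.5:14b download weighs roughly 9 GB at its default quantization; the KV-cache and overhead figures below are my own illustrative estimates, not measurements:

```shell
# Rough VRAM budget for qwen2.5:14b with a 16k context (estimates only).
MODEL_GB=9        # qwen2.5:14b is ~9 GB at its default quantization
KV_CACHE_GB=3     # assumed KV-cache cost of a 16k context at this size
OVERHEAD_GB=1     # assumed runtime/driver overhead
echo "Estimated usage: $(( MODEL_GB + KV_CACHE_GB + OVERHEAD_GB )) GB of 16 GB VRAM"
```

That leaves headroom to spare, which is exactly what keeps inference on the GPU instead of spilling into system RAM.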
In your terminal, create a custom Ollama model with the 16k limit hardcoded using a Modelfile:
echo "FROM qwen2.5:14b" > Modelfile
echo "PARAMETER num_ctx 16384" >> Modelfile
ollama create qwen14b-agent-16k -f Modelfile
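If you prefer writing the Modelfile in one shot, a heredoc produces the identical file and is easier to extend once you start adding more PARAMETER lines:

```shell
# Same Modelfile as the chained echo commands, written as one heredoc.
cat > Modelfile <<'EOF'
FROM qwen2.5:14b
PARAMETER num_ctx 16384
EOF
cat Modelfile
```

Either way, the 16k limit now lives inside the model definition, so every client that loads qwen14b-agent-16k gets it automatically.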
Next, tell your framework's API gateway to route all inference to your newly minted, wide-context model. For OpenShell/NemoClaw, it looks like this:
openshell inference set --provider ollama --model qwen14b-agent-16k --no-verify
Because your agent just spent the last 20 minutes screaming at itself in broken JSON, its session history is deeply corrupted. If you don't wipe it, the memory manager will crash trying to read the garbage data on your next prompt. Clear out the session storage before testing again.
# For NemoClaw users:
rm /sandbox/.openclaw-data/agents/main/sessions/*
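If you would rather keep the broken transcripts around for inspection, here is a sketch of an archive-then-wipe variant. It is written against a scratch directory so you can dry-run it safely; point SESSIONS at the real NemoClaw path shown above when you are ready:

```shell
# Dry-run against a scratch directory; point SESSIONS at
# /sandbox/.openclaw-data/agents/main/sessions for the real wipe.
SESSIONS=/tmp/demo-sessions
mkdir -p "$SESSIONS"
echo '{"role":"assistant","content":"{broken' > "$SESSIONS/session-1.json"

# Archive the corrupted history first, then clear it out.
mkdir -p /tmp/session-backup
cp -r "$SESSIONS"/. /tmp/session-backup/
rm -rf "${SESSIONS:?}"/*
ls -A "$SESSIONS" | wc -l   # prints 0 once the sessions are cleared
```

The `${SESSIONS:?}` guard makes the `rm` abort loudly if the variable is ever empty, instead of globbing the filesystem root.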
Because the massive system prompt is no longer being decapitated, the 14b model perfectly understands the framework's JSON instructions. It can hold its tool schemas, its system prompt, and your entire codebase in its head simultaneously.
It executes its tool calls seamlessly and replies in natural language in just a few seconds.
You now have a lightning-fast, fully autonomous local agent running securely on your own hardware, taking full advantage of that 16GB of VRAM.
Have you tried pushing the limits of your GPU with local agent frameworks? Let me know your stack in the comments!