We run 4 autonomous AI agents on a single NVIDIA RTX 3060 Ti with 8GB VRAM. 13.2 tok/s inference, 105 daily tasks, 99.9% uptime. Here's the complete hardware setup, performance tuning, and lessons learned from 30 days of production.
"An RTX 3060 Ti isn't a gaming card. It's a $300 AI agent server that never sleeps."
Most teams deploying AI agents rely entirely on cloud APIs — OpenAI, Anthropic, Google. Every request costs money. Every request leaves your network. Every request depends on someone else's uptime.
We took a different path. At Ultra Lab, we run a fleet of 4 autonomous AI agents on a single NVIDIA RTX 3060 Ti. Local inference. Full privacy. Near-zero marginal cost.
This article covers the complete setup: hardware, model selection, performance tuning, and 30 days of production data.
| Component | Spec |
|---|---|
| GPU | NVIDIA RTX 3060 Ti (8GB GDDR6X) |
| VRAM | 8GB (5.7GB used by model) |
| CUDA Cores | 4,864 |
| Environment | WSL2 Ubuntu on Windows 10 |
| Inference Engine | Ollama |
| Cost | ~$300 USD (used) |
The RTX 3060 Ti hits a sweet spot: enough VRAM for quantized 7B models, enough CUDA cores for decent throughput, and cheap enough that the ROI is measured in weeks, not years.
We built a custom model on top of Qwen 2.5 7B:
- Base: qwen2.5:7b (Q4_K_M quantization)
- Size: 4.7 GB
- Context: 16,384 tokens
- Custom: brand knowledge, product data, and voice guidelines baked into the system prompt
The custom Modelfile injects our brand identity, product knowledge, and communication style directly into the model. Each agent gets the same base model but different system prompts for their role.
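Based on that description, a minimal version of the Modelfile might look like this (the system prompt content and model name are illustrative, not our actual prompt):

```
# Hypothetical sketch of the custom Modelfile described above
FROM qwen2.5:7b

# Match the production context window
PARAMETER num_ctx 16384

SYSTEM """
You are an Ultra Lab agent.
[brand knowledge, product data, and voice guidelines go here]
"""
```

Building it is one command: `ollama create ultralab:7b -f Modelfile`. Each agent then layers its role-specific system prompt on top of the same base.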
We tested Qwen 3 8B. Results:
| Model | Size | VRAM | Speed | Verdict |
|---|---|---|---|---|
| qwen2.5:7b (Q4_K_M) | 4.7 GB | 5.7 GB | 13.2 tok/s | Production |
| qwen3:8b | 4.9 GB | >8 GB | 1.7 tok/s | Rejected |
Qwen 3 8B was 7.8x slower. The model barely fits in 8GB VRAM, forcing massive CPU offload. It also has a "thinking mode" that adds overhead to every response. Not worth it.
Lesson: On 8GB VRAM, a well-tuned 7B model beats a squeezed 8B model every time.
After extensive tuning, here's what we achieve in production:
- Throughput: 13.2 tokens/second
- GPU offload: 88% (5.7 of 6.5 GB model memory held in VRAM)
- Context: 16,384 tokens
- Latency: first token in ~200ms
- Uptime: 99.9% (systemd auto-restart)
We initially set context to 32,768 tokens. Performance collapsed to 0.1 tok/s — a 132x slowdown. The model was spilling into system RAM, and the CPU-GPU data transfer became the bottleneck.
Dropping to 16,384 tokens restored full speed. For agent tasks (social media posts, engagement replies, content generation), 16K context is more than sufficient.
Rule: On 8GB VRAM with a 7B model, never exceed 16K context.
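A back-of-envelope KV-cache estimate explains the cliff. Here is a sketch assuming Qwen 2.5 7B's published architecture (28 layers, 4 KV heads via grouped-query attention, head dimension 128, fp16 cache); these parameters are assumptions about the base model, not measurements from our setup:

```python
def kv_cache_bytes(ctx_len: int, n_layers: int = 28, n_kv_heads: int = 4,
                   head_dim: int = 128, dtype_bytes: int = 2) -> int:
    """Estimate KV-cache size: keys + values, per layer, per KV head."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * ctx_len

print(f"16K: {kv_cache_bytes(16_384) / 2**30:.2f} GiB")  # 16K: 0.88 GiB
print(f"32K: {kv_cache_bytes(32_768) / 2**30:.2f} GiB")  # 32K: 1.75 GiB
```

With ~5.7 GB of weights already resident, the extra ~0.9 GB of cache at 32K plus CUDA overhead and activation buffers is enough to push past the 8 GB budget, which matches the spill to system RAM we observed.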
Each agent has a distinct role and operates autonomously:
| Agent | Role | Platform | Daily Tasks |
|---|---|---|---|
| UltraLabTW | CEO & brand strategist | Moltbook, Discord | 8 posts + strategy |
| MindThreadBot | Social media specialist | Moltbook | 8 posts + engagement |
| UltraProbeBot | AI security researcher | Moltbook | 8 posts + research |
| UltraAdvisor | Financial advisory | Moltbook | Content + analysis |
```
┌─────────────────────────────────────┐
│        OpenClaw Gateway             │
│        (Port 18789)                 │
├─────────────────────────────────────┤
│                                     │
│ ┌──────────┐  ┌──────────────────┐ │
│ │ Ollama   │  │ 25 Systemd       │ │
│ │ Server   │  │ Timers           │ │
│ │          │  │                  │ │
│ │ ultralab │  │ 62 Scripts       │ │
│ │ :7b      │  │ (bash + node)    │ │
│ │          │  │                  │ │
│ │ RTX 3060 │  │ 19 Intelligence  │ │
│ │ Ti CUDA  │  │ Files (.md)      │ │
│ └──────────┘  └──────────────────┘ │
│                                     │
│ ┌──────────────────────────────┐   │
│ │ 4 Agent Workspaces           │   │
│ │ (isolated context per agent) │   │
│ └──────────────────────────────┘   │
└─────────────────────────────────────┘
```
All agents share the same GPU. Ollama handles request queuing — one inference at a time with OLLAMA_NUM_PARALLEL=1. With 25 timers staggered across the day, GPU contention is minimal.
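With systemd, staggering is just a per-timer schedule offset. A hypothetical unit for one of the 25 timers (the unit name and exact schedule are illustrative):

```ini
# /etc/systemd/system/ultralab-post.timer  (hypothetical name)
[Unit]
Description=Scheduled agent post

[Timer]
# Offset this agent's minute so its requests don't collide with other timers
OnCalendar=*-*-* 09,12,15,18:07:00
RandomizedDelaySec=120
Persistent=true

[Install]
WantedBy=timers.target
```

`Persistent=true` means a task missed during a reboot fires once the machine is back up, which matters when no human is watching.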
The reliability comes from systemd. Ollama runs as a system service with performance overrides:
```ini
# /etc/systemd/system/ollama.service.d/performance.conf
[Service]
Environment="OLLAMA_KEEP_ALIVE=2h"
Environment="OLLAMA_NUM_PARALLEL=1"
Environment="OLLAMA_MAX_LOADED_MODELS=1"
```
Key settings:
- OLLAMA_KEEP_ALIVE=2h keeps the model resident in VRAM between bursts, so agents skip the ~8-second model reload on every task.
- OLLAMA_NUM_PARALLEL=1 serializes inference; concurrent agent requests queue instead of competing for VRAM.
- OLLAMA_MAX_LOADED_MODELS=1 guarantees a second model can never evict the production model.
A systemd timer runs every 10 minutes to verify both Ollama and the OpenClaw gateway are responsive:
```bash
# Health check: verify Ollama responds
curl -sf http://localhost:11434/api/tags > /dev/null || systemctl restart ollama
```
If Ollama hangs (rare but happens after ~72 hours of continuous operation), it gets automatically restarted. Model reloads in ~8 seconds. No human intervention needed.
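The 10-minute cadence maps to a standard systemd timer/service pair; a sketch, with unit names and script path as illustrative assumptions:

```ini
# /etc/systemd/system/ollama-health.timer  (hypothetical name)
[Timer]
OnCalendar=*:0/10

[Install]
WantedBy=timers.target
```

```ini
# /etc/systemd/system/ollama-health.service  (hypothetical name)
[Service]
Type=oneshot
ExecStart=/usr/local/bin/ollama-health.sh
```

`Type=oneshot` keeps the check fire-and-forget: the script runs, restarts Ollama if needed, and exits.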
After one month of continuous operation:
| Metric | Value |
|---|---|
| Total inference requests | ~3,150 |
| Average daily tasks | 105 |
| GPU utilization (avg) | 12-18% |
| Peak GPU utilization | 94% (during batch generation) |
| Downtime incidents | 2 (auto-recovered) |
| Manual interventions | 0 |
| Electricity cost | ~$5/month |
| Cloud API cost | $0 |
The GPU sits idle most of the time. Agent tasks are bursty — a few seconds of 90%+ GPU utilization, then minutes of idle. This is fine. The RTX 3060 Ti draws ~200W under load but idles at ~15W.
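The ~$5/month figure checks out on the back of an envelope. The duty cycle and electricity rate below are assumptions, not measurements from the article:

```python
# Rough monthly power-cost estimate for a bursty inference workload
IDLE_W, LOAD_W = 15, 200      # idle vs. load draw, per the article
HOURS_PER_MONTH = 24 * 30
LOAD_FRACTION = 0.15          # assume ~15% of wall-clock time under load
RATE_USD_PER_KWH = 0.16       # assumed local electricity rate

avg_watts = (1 - LOAD_FRACTION) * IDLE_W + LOAD_FRACTION * LOAD_W
kwh = avg_watts * HOURS_PER_MONTH / 1000
print(f"{kwh:.1f} kWh -> ${kwh * RATE_USD_PER_KWH:.2f}/month")  # 30.8 kWh -> $4.92/month
```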
For our workload of ~3,150 requests/month (avg 200 tokens per response):
| Option | Monthly Cost | Privacy | Latency | Uptime Control |
|---|---|---|---|---|
| RTX 3060 Ti (local) | ~$5 (electricity) | Full | 200ms | You |
| OpenAI GPT-4o-mini | ~$6.30 | None | 500ms+ | OpenAI |
| Anthropic Haiku | ~$7.90 | None | 400ms+ | Anthropic |
| Google Gemini Flash | Free (1,500 RPD) | None | 300ms+ | Google |
| RunPod (A40 hourly) | ~$50+ | Partial | 150ms | RunPod |
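The volume behind these estimates follows directly from the production stats (105 daily tasks, ~200 tokens per response):

```python
# Monthly inference volume from the article's own numbers
daily_tasks = 105
requests_per_month = daily_tasks * 30        # matches the ~3,150 in the stats table
output_tokens = requests_per_month * 200     # avg 200 tokens per response
print(requests_per_month, output_tokens)     # 3150 630000
```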
For low-volume agent workloads, the cost difference is small. The real advantages of local inference are:
- Privacy: prompts and data never leave your network.
- Latency: ~200ms to first token, with no external round trip.
- Control: no rate limits, quota changes, or surprise bills; uptime is yours to manage.
Downloading a new model while another is serving requests drops throughput from 13.2 tok/s to 0.1 tok/s. Ollama uses the same GPU memory bus for both operations.
Fix: Only download models during maintenance windows when no cron jobs are running.
Setting contextWindow: 32768 in our agent config looked reasonable. Performance was 132x slower. The KV cache exceeded VRAM and spilled to system RAM.
Fix: contextWindow: 16384. Test your actual VRAM budget before deploying.
OpenClaw has three places to set timeouts: agents.defaults.timeoutSeconds, payload.timeoutSeconds, and --timeout on cron commands. Only the first one actually controls the timeout.
Fix: Always set agents.defaults.timeoutSeconds in openclaw.json. Ignore the rest.
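Putting the two config fixes together, the relevant slice of openclaw.json might look like this (the field nesting and the 300-second value are assumptions for illustration; only the setting names come from our setup):

```json
{
  "agents": {
    "defaults": {
      "timeoutSeconds": 300,
      "contextWindow": 16384
    }
  }
}
```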
Before switching to local inference, we used Gemini API. A key created from a billing-enabled Google Cloud project accumulated $127.80 in 7 days. Thinking tokens cost $3.50/1M — 47x more expensive than input tokens.
Fix: We migrated all agent inference to local Ollama. Zero API cost. Zero billing surprises.
Local inference is right for you if your workload is bursty and low-volume (ours averages ~105 tasks/day), your tasks fit a quantized 7B model with 16K context, and privacy or cost predictability matters to you.
It's NOT right if you need frontier-model quality, sustained high-throughput or parallel inference, or you have no capacity to run your own hardware.
Our roadmap includes:
The RTX 3060 Ti proved that production AI agent infrastructure doesn't require cloud GPUs or enterprise hardware. A $300 card, good engineering, and systemd reliability are enough to run a business.
Our agent fleet architecture is open source:
Ultra Lab builds AI products. We run 4 live products including an autonomous agent fleet powered by NVIDIA GPU inference. Learn more at ultralab.tw.
Originally published on Ultra Lab — we build AI products that run autonomously.
Try UltraProbe free — our AI security scanner checks your website for vulnerabilities in 30 seconds: ultralab.tw/probe