Running a 4-Agent AI Fleet on a Single NVIDIA RTX 3060 Ti


We run 4 autonomous AI agents on a single NVIDIA RTX 3060 Ti with 8GB VRAM. 13.2 tok/s inference, 105 daily tasks, 99.9% uptime. Here's the complete hardware setup, performance tuning, and lessons learned from 30 days of production.


"An RTX 3060 Ti isn't a gaming card. It's a $300 AI agent server that never sleeps."

Most teams deploying AI agents rely entirely on cloud APIs — OpenAI, Anthropic, Google. Every request costs money. Every request leaves your network. Every request depends on someone else's uptime.

We took a different path. At Ultra Lab, we run a fleet of 4 autonomous AI agents on a single NVIDIA RTX 3060 Ti. Local inference. Full privacy. Near-zero marginal cost.

This article covers the complete setup: hardware, model selection, performance tuning, and 30 days of production data.


The Hardware

| Component | Spec |
| --- | --- |
| GPU | NVIDIA RTX 3060 Ti (8GB GDDR6X) |
| VRAM | 8GB (5.7GB used by model) |
| CUDA Cores | 4,864 |
| Environment | WSL2 Ubuntu on Windows 10 |
| Inference Engine | Ollama |
| Cost | ~$300 USD (used) |

The RTX 3060 Ti hits a sweet spot: enough VRAM for quantized 7B models, enough CUDA cores for decent throughput, and cheap enough that it pays for itself in months, not years.


The Model: ultralab:7b

We built a custom model on top of Qwen 2.5 7B:

  • Base: qwen2.5:7b (Q4_K_M quantization)
  • Size: 4.7 GB
  • Context: 16,384 tokens
  • Custom: brand knowledge, product data, and voice guidelines baked into the system prompt

The custom Modelfile injects our brand identity, product knowledge, and communication style directly into the model. Each agent gets the same base model but different system prompts for their role.
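A minimal sketch of such a Modelfile (the base model and context size match the specs above; the system prompt here is an illustrative stand-in, not our actual one):

```
# Modelfile (system prompt is illustrative)
FROM qwen2.5:7b
PARAMETER num_ctx 16384
SYSTEM """
You are an Ultra Lab agent. Write in the Ultra Lab brand voice.
"""
```

Build it with `ollama create ultralab:7b -f Modelfile`; each agent then calls the same model with its own role prompt layered on top.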

Why Not a Bigger Model?

We tested Qwen 3 8B. Results:

| Model | Size | VRAM | Speed | Verdict |
| --- | --- | --- | --- | --- |
| qwen2.5:7b (Q4_K_M) | 4.7 GB | 5.7 GB | 13.2 tok/s | Production |
| qwen3:8b | 4.9 GB | >8 GB | 1.7 tok/s | Rejected |

Qwen 3 8B was 7.8x slower. Its weights plus KV cache don't fit in 8GB VRAM, forcing heavy CPU offload. It also has a "thinking mode" that adds overhead to every response. Not worth it.

Lesson: On 8GB VRAM, a well-tuned 7B model beats a squeezed 8B model every time.


Performance: The Real Numbers

After extensive tuning, here's what we achieve in production:

```
Throughput:    13.2 tokens/second
GPU Offload:   88% (5.7 / 6.5 GB VRAM utilized)
Context:       16,384 tokens
Latency:       first token in ~200ms
Uptime:        99.9% (systemd auto-restart)
```

The Context Window Trap

We initially set context to 32,768 tokens. Performance collapsed to 0.1 tok/s — a 132x slowdown. The model was spilling into system RAM, and the CPU-GPU data transfer became the bottleneck.

Dropping to 16,384 tokens restored full speed. For agent tasks (social media posts, engagement replies, content generation), 16K context is more than sufficient.

Rule: On 8GB VRAM with a 7B model, never exceed 16K context.
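The slowdown is easy to sanity-check with KV-cache arithmetic. A rough estimate, assuming an fp16 cache and the commonly published Qwen2.5-7B shape (28 layers, 4 KV heads, head dim 128):

```shell
# Rough KV-cache size: 2 (K and V) x layers x kv_heads x head_dim x ctx x 2 bytes (fp16).
# The model shape below is the published Qwen2.5-7B config (assumption on our part).
layers=28; kv_heads=4; head_dim=128; ctx=32768
kv_bytes=$((2 * layers * kv_heads * head_dim * ctx * 2))
echo "KV cache at ${ctx} tokens: $((kv_bytes / 1024 / 1024)) MiB"
```

Roughly 1.8 GB of cache at 32K sits on top of 4.7 GB of weights plus runtime buffers, which overruns an 8GB card. Halving the context halves the cache to ~0.9 GB, consistent with the ~5.7 GB VRAM figure above.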


The 4-Agent Fleet

Each agent has a distinct role and operates autonomously:

| Agent | Role | Platform | Daily Tasks |
| --- | --- | --- | --- |
| UltraLabTW | CEO & brand strategist | Moltbook, Discord | 8 posts + strategy |
| MindThreadBot | Social media specialist | Moltbook | 8 posts + engagement |
| UltraProbeBot | AI security researcher | Moltbook | 8 posts + research |
| UltraAdvisor | Financial advisory | Moltbook | Content + analysis |

Orchestration Architecture

```
┌─────────────────────────────────────┐
│         OpenClaw Gateway            │
│         (Port 18789)                │
├─────────────────────────────────────┤
│                                     │
│  ┌──────────┐  ┌─────────────────┐  │
│  │ Ollama   │  │ 25 Systemd      │  │
│  │ Server   │  │ Timers          │  │
│  │          │  │                 │  │
│  │ ultralab │  │ 62 Scripts      │  │
│  │ :7b      │  │ (bash + node)   │  │
│  │          │  │                 │  │
│  │ RTX 3060 │  │ 19 Intelligence │  │
│  │ Ti CUDA  │  │ Files (.md)     │  │
│  └──────────┘  └─────────────────┘  │
│                                     │
│  ┌──────────────────────────────┐   │
│  │ 4 Agent Workspaces           │   │
│  │ (isolated context per agent) │   │
│  └──────────────────────────────┘   │
└─────────────────────────────────────┘
```

All agents share the same GPU. Ollama handles request queuing, serving one inference at a time with OLLAMA_NUM_PARALLEL=1. With 25 timers staggered across the day, GPU contention is minimal.
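Staggering is plain systemd timer scheduling. One agent's posting timer might look like this (unit name and times are illustrative, not our actual schedule):

```
# /etc/systemd/system/ultralab-post.timer (illustrative)
[Unit]
Description=Scheduled agent posting run

[Timer]
OnCalendar=*-*-* 09,13,17,21:30:00
RandomizedDelaySec=300
Persistent=true

[Install]
WantedBy=timers.target
```

RandomizedDelaySec spreads start times so that timers firing at the same minute don't pile onto the GPU at once.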


Systemd Configuration

The reliability comes from systemd. Ollama runs as a system service with performance overrides:

```
# /etc/systemd/system/ollama.service.d/performance.conf
[Service]
Environment="OLLAMA_KEEP_ALIVE=2h"
Environment="OLLAMA_NUM_PARALLEL=1"
Environment="OLLAMA_MAX_LOADED_MODELS=1"
```

After editing, run `sudo systemctl daemon-reload && sudo systemctl restart ollama` to apply the overrides.

Key settings:

  • OLLAMA_KEEP_ALIVE=2h: Keep the model loaded in VRAM for 2 hours between requests. Loading and unloading a 4.7GB model takes ~8 seconds, which is unacceptable for scheduled tasks.
  • OLLAMA_NUM_PARALLEL=1: Single inference stream. On 8GB VRAM, parallel inference causes OOM crashes.
  • OLLAMA_MAX_LOADED_MODELS=1: Only one model in VRAM. We use the same model for all agents, so this is fine.

Health Monitoring

A systemd timer runs every 10 minutes to verify both Ollama and the OpenClaw gateway are responsive:

```
# Health check: verify Ollama responds
curl -sf http://localhost:11434/api/tags > /dev/null || systemctl restart ollama
```

If Ollama hangs (rare but happens after ~72 hours of continuous operation), it gets automatically restarted. Model reloads in ~8 seconds. No human intervention needed.
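A minimal pair of units for that check could look like the following (unit names and the script path are illustrative; only the curl line above is from our setup):

```
# /etc/systemd/system/ollama-health.service (illustrative)
[Unit]
Description=Ollama health check

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'curl -sf http://localhost:11434/api/tags > /dev/null || systemctl restart ollama'

# /etc/systemd/system/ollama-health.timer (illustrative)
[Unit]
Description=Run the Ollama health check every 10 minutes

[Timer]
OnBootSec=5min
OnUnitActiveSec=10min

[Install]
WantedBy=timers.target
```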


30 Days of Production Data

After one month of continuous operation:

| Metric | Value |
| --- | --- |
| Total inference requests | ~3,150 |
| Average daily tasks | 105 |
| GPU utilization (avg) | 12-18% |
| Peak GPU utilization | 94% (during batch generation) |
| Downtime incidents | 2 (auto-recovered) |
| Manual interventions | 0 |
| Electricity cost | ~$5/month |
| Cloud API cost | $0 |

The GPU sits idle most of the time. Agent tasks are bursty — a few seconds of 90%+ GPU utilization, then minutes of idle. This is fine. The RTX 3060 Ti draws ~200W under load but idles at ~15W.
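The electricity figure is back-of-envelope reproducible. Every input below is an assumption (roughly 2 h/day aggregate at load, the rest idle, $0.20/kWh), but it lands in the same ballpark:

```shell
# Monthly electricity estimate for the card alone (all figures assumed).
est=$(awk 'BEGIN {
  load_kwh = 200 * 2 * 30 / 1000   # ~2 h/day at ~200 W -> 12 kWh/month
  idle_kwh = 15 * 22 * 30 / 1000   # ~22 h/day at ~15 W -> 9.9 kWh/month
  printf "%.2f", (load_kwh + idle_kwh) * 0.20
}')
echo "Estimated monthly electricity: \$$est"
```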


Cost Comparison: Local GPU vs. Cloud API

For our workload of ~3,150 requests/month (avg 200 tokens per response):

| Option | Monthly Cost | Privacy | Latency | Uptime Control |
| --- | --- | --- | --- | --- |
| RTX 3060 Ti (local) | ~$5 (electricity) | Full | 200ms | You |
| OpenAI GPT-4o-mini | ~$6.30 | None | 500ms+ | OpenAI |
| Anthropic Haiku | ~$7.90 | None | 400ms+ | Anthropic |
| Google Gemini Flash | Free (1,500 RPD) | None | 300ms+ | Google |
| RunPod (A40 hourly) | ~$50+ | Partial | 150ms | RunPod |

For low-volume agent workloads, the cost difference is small. The real advantages of local inference are:

  1. Privacy: Sensitive data (customer insights, business strategy) never leaves your network
  2. Reliability: No API rate limits, no outages you can't control
  3. Customization: Brand-tuned model with baked-in knowledge
  4. Predictability: Fixed cost regardless of usage spikes

Pitfalls We Hit (So You Don't Have To)

1. Model Downloads Kill Inference

Downloading a new model while another is serving requests drops throughput from 13.2 tok/s to 0.1 tok/s. Ollama uses the same GPU memory bus for both operations.

Fix: Only download models during maintenance windows when no cron jobs are running.

2. The 32K Context Disaster

Setting contextWindow: 32768 in our agent config looked reasonable. Performance was 132x slower. The KV cache exceeded VRAM and spilled to system RAM.

Fix: contextWindow: 16384. Test your actual VRAM budget before deploying.

3. Timeout Hierarchy Confusion

OpenClaw has three places to set timeouts: agents.defaults.timeoutSeconds, payload.timeoutSeconds, and --timeout on cron commands. Only the first one actually controls the timeout.

Fix: Always set agents.defaults.timeoutSeconds in openclaw.json. Ignore the rest.
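For reference, the shape of that setting as we understand it (the surrounding JSON structure is inferred from the key path agents.defaults.timeoutSeconds; 300 is an illustrative value):

```json
{
  "agents": {
    "defaults": {
      "timeoutSeconds": 300
    }
  }
}
```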

4. Gemini Billing Trap

Before switching to local inference, we used Gemini API. A key created from a billing-enabled Google Cloud project accumulated $127.80 in 7 days. Thinking tokens cost $3.50/1M — 47x more expensive than input tokens.

Fix: We migrated all agent inference to local Ollama. Zero API cost. Zero billing surprises.


When Local GPU Inference Makes Sense

Local inference is right for you if:

  • You run recurring automated tasks (not one-off queries)
  • You handle sensitive data that shouldn't leave your network
  • You need predictable costs without per-token billing
  • You want full control over uptime and model versions
  • You already have a NVIDIA GPU sitting idle

It's NOT right if:

  • You need GPT-4 or Claude-level reasoning (7B models can't match frontier models)
  • Your workload is sporadic (the GPU idles most of the time anyway)
  • You need >32K context windows (requires 24GB+ VRAM)

What's Next

Our roadmap includes:

  • Multi-GPU scaling: Adding a second RTX for parallel agent inference
  • Larger models: RTX 4090 (24GB VRAM) would unlock 14B-32B models at full speed
  • On-premise security scanning: Running UltraProbe's vulnerability analysis locally for enterprises that can't send data to cloud APIs
  • Model fine-tuning: Training domain-specific LoRA adapters on NVIDIA hardware

The RTX 3060 Ti proved that production AI agent infrastructure doesn't require cloud GPUs or enterprise hardware. A $300 card, good engineering, and systemd reliability are enough to run a business.


Try It Yourself

Our agent fleet architecture is open source:


Ultra Lab builds AI products. We run 4 live products including an autonomous agent fleet powered by NVIDIA GPU inference. Learn more at ultralab.tw.


Originally published on Ultra Lab — we build AI products that run autonomously.

Try UltraProbe free — our AI security scanner checks your website for vulnerabilities in 30 seconds: ultralab.tw/probe