Running a 4-Agent AI Fleet on a Single NVIDIA RTX 3060 Ti


We run 4 autonomous AI agents on a single NVIDIA RTX 3060 Ti with 8GB VRAM. 13.2 tok/s inference, 105 daily tasks, 99.9% uptime. Here's the complete hardware setup, performance tuning, and lessons learned from 30 days of production.


"An RTX 3060 Ti isn't a gaming card. It's a $300 AI agent server that never sleeps."

Most teams deploying AI agents rely entirely on cloud APIs — OpenAI, Anthropic, Google. Every request costs money. Every request leaves your network. Every request depends on someone else's uptime.

We took a different path. At Ultra Lab, we run a fleet of 4 autonomous AI agents on a single NVIDIA RTX 3060 Ti. Local inference. Full privacy. Near-zero marginal cost.

This article covers the complete setup: hardware, model selection, performance tuning, and 30 days of production data.


The Hardware

| Component | Spec |
| --- | --- |
| GPU | NVIDIA RTX 3060 Ti (8GB GDDR6X) |
| VRAM | 8GB (5.7GB used by model) |
| CUDA Cores | 4,864 |
| Environment | WSL2 Ubuntu on Windows 10 |
| Inference Engine | Ollama |
| Cost | ~$300 USD (used) |

The RTX 3060 Ti hits a sweet spot: enough VRAM for quantized 7B models, enough CUDA cores for decent throughput, and cheap enough that it pays for itself in months, not years.


The Model: ultralab:7b

We built a custom model on top of Qwen 2.5 7B:

  • Base: qwen2.5:7b (Q4_K_M quantization)
  • Size: 4.7 GB
  • Context: 16,384 tokens
  • Custom: brand knowledge, product data, and voice guidelines baked into the system prompt

The custom Modelfile injects our brand identity, product knowledge, and communication style directly into the model. Each agent gets the same base model but different system prompts for their role.
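A minimal sketch of such a Modelfile (the base model and context size match the specs above; the system prompt here is an illustrative stand-in, not our actual one):

```
# Modelfile (system prompt is illustrative)
FROM qwen2.5:7b
PARAMETER num_ctx 16384
SYSTEM """
You are an Ultra Lab agent. Write in the Ultra Lab brand voice.
"""
```

Build it with `ollama create ultralab:7b -f Modelfile`; each agent then calls the same model with its own role prompt layered on top.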

Why Not a Bigger Model?

We tested Qwen 3 8B. Results:

| Model | Size | VRAM | Speed | Verdict |
| --- | --- | --- | --- | --- |
| qwen2.5:7b (Q4_K_M) | 4.7 GB | 5.7 GB | 13.2 tok/s | Production |
| qwen3:8b | 4.9 GB | >8 GB | 1.7 tok/s | Rejected |

Qwen 3 8B was 7.8x slower. Its weights plus KV cache don't fit in 8GB VRAM, forcing heavy CPU offload. It also has a "thinking mode" that adds overhead to every response. Not worth it.

Lesson: On 8GB VRAM, a well-tuned 7B model beats a squeezed 8B model every time.


Performance: The Real Numbers

After extensive tuning, here's what we achieve in production:

```
Throughput:    13.2 tokens/second
GPU Offload:   88% (5.7 / 6.5 GB VRAM utilized)
Context:       16,384 tokens
Latency:       first token in ~200ms
Uptime:        99.9% (systemd auto-restart)
```

The Context Window Trap

We initially set context to 32,768 tokens. Performance collapsed to 0.1 tok/s — a 132x slowdown. The model was spilling into system RAM, and the CPU-GPU data transfer became the bottleneck.

Dropping to 16,384 tokens restored full speed. For agent tasks (social media posts, engagement replies, content generation), 16K context is more than sufficient.

Rule: On 8GB VRAM with a 7B model, never exceed 16K context.
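The slowdown is easy to sanity-check with KV-cache arithmetic. A rough estimate, assuming an fp16 cache and the commonly published Qwen2.5-7B shape (28 layers, 4 KV heads, head dim 128):

```shell
# Rough KV-cache size: 2 (K and V) x layers x kv_heads x head_dim x ctx x 2 bytes (fp16).
# The model shape below is the published Qwen2.5-7B config (assumption on our part).
layers=28; kv_heads=4; head_dim=128; ctx=32768
kv_bytes=$((2 * layers * kv_heads * head_dim * ctx * 2))
echo "KV cache at ${ctx} tokens: $((kv_bytes / 1024 / 1024)) MiB"
```

Roughly 1.8 GB of cache at 32K sits on top of 4.7 GB of weights plus runtime buffers, which overruns an 8GB card. Halving the context halves the cache to ~0.9 GB, consistent with the ~5.7 GB VRAM figure above.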


The 4-Agent Fleet

Each agent has a distinct role and operates autonomously:

| Agent | Role | Platform | Daily Tasks |
| --- | --- | --- | --- |
| UltraLabTW | CEO & brand strategist | Moltbook, Discord | 8 posts + strategy |
| MindThreadBot | Social media specialist | Moltbook | 8 posts + engagement |
| UltraProbeBot | AI security researcher | Moltbook | 8 posts + research |
| UltraAdvisor | Financial advisory | Moltbook | Content + analysis |

Orchestration Architecture

```
┌─────────────────────────────────────┐
│         OpenClaw Gateway            │
│         (Port 18789)                │
├─────────────────────────────────────┤
│                                     │
│  ┌──────────┐  ┌─────────────────┐  │
│  │ Ollama   │  │ 25 Systemd      │  │
│  │ Server   │  │ Timers          │  │
│  │          │  │                 │  │
│  │ ultralab │  │ 62 Scripts      │  │
│  │ :7b      │  │ (bash + node)   │  │
│  │          │  │                 │  │
│  │ RTX 3060 │  │ 19 Intelligence │  │
│  │ Ti CUDA  │  │ Files (.md)     │  │
│  └──────────┘  └─────────────────┘  │
│                                     │
│  ┌──────────────────────────────┐   │
│  │ 4 Agent Workspaces           │   │
│  │ (isolated context per agent) │   │
│  └──────────────────────────────┘   │
└─────────────────────────────────────┘
```

All agents share the same GPU. Ollama handles request queuing, serving one inference at a time with OLLAMA_NUM_PARALLEL=1. With 25 timers staggered across the day, GPU contention is minimal.
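Staggering is plain systemd timer scheduling. One agent's posting timer might look like this (unit name and times are illustrative, not our actual schedule):

```
# /etc/systemd/system/ultralab-post.timer (illustrative)
[Unit]
Description=Scheduled agent posting run

[Timer]
OnCalendar=*-*-* 09,13,17,21:30:00
RandomizedDelaySec=300
Persistent=true

[Install]
WantedBy=timers.target
```

RandomizedDelaySec spreads start times so that timers firing at the same minute don't pile onto the GPU at once.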


Systemd Configuration

The reliability comes from systemd. Ollama runs as a system service with performance overrides:

```
# /etc/systemd/system/ollama.service.d/performance.conf
[Service]
Environment="OLLAMA_KEEP_ALIVE=2h"
Environment="OLLAMA_NUM_PARALLEL=1"
Environment="OLLAMA_MAX_LOADED_MODELS=1"
```

After editing, run `sudo systemctl daemon-reload && sudo systemctl restart ollama` to apply the overrides.

Key settings:

  • OLLAMA_KEEP_ALIVE=2h: Keep the model loaded in VRAM for 2 hours between requests. Loading and unloading a 4.7GB model takes ~8 seconds, which is unacceptable for scheduled tasks.
  • OLLAMA_NUM_PARALLEL=1: Single inference stream. On 8GB VRAM, parallel inference causes OOM crashes.
  • OLLAMA_MAX_LOADED_MODELS=1: Only one model in VRAM. We use the same model for all agents, so this is fine.

Health Monitoring

A systemd timer runs every 10 minutes to verify both Ollama and the OpenClaw gateway are responsive:

```
# Health check: verify Ollama responds
curl -sf http://localhost:11434/api/tags > /dev/null || systemctl restart ollama
```

If Ollama hangs (rare but happens after ~72 hours of continuous operation), it gets automatically restarted. Model reloads in ~8 seconds. No human intervention needed.
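A minimal pair of units for that check could look like the following (unit names and the script path are illustrative; only the curl line above is from our setup):

```
# /etc/systemd/system/ollama-health.service (illustrative)
[Unit]
Description=Ollama health check

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'curl -sf http://localhost:11434/api/tags > /dev/null || systemctl restart ollama'

# /etc/systemd/system/ollama-health.timer (illustrative)
[Unit]
Description=Run the Ollama health check every 10 minutes

[Timer]
OnBootSec=5min
OnUnitActiveSec=10min

[Install]
WantedBy=timers.target
```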


30 Days of Production Data

After one month of continuous operation:

| Metric | Value |
| --- | --- |
| Total inference requests | ~3,150 |
| Average daily tasks | 105 |
| GPU utilization (avg) | 12-18% |
| Peak GPU utilization | 94% (during batch generation) |
| Downtime incidents | 2 (auto-recovered) |
| Manual interventions | 0 |
| Electricity cost | ~$5/month |
| Cloud API cost | $0 |

The GPU sits idle most of the time. Agent tasks are bursty — a few seconds of 90%+ GPU utilization, then minutes of idle. This is fine. The RTX 3060 Ti draws ~200W under load but idles at ~15W.
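The electricity figure is back-of-envelope reproducible. Every input below is an assumption (roughly 2 h/day aggregate at load, the rest idle, $0.20/kWh), but it lands in the same ballpark:

```shell
# Monthly electricity estimate for the card alone (all figures assumed).
est=$(awk 'BEGIN {
  load_kwh = 200 * 2 * 30 / 1000   # ~2 h/day at ~200 W -> 12 kWh/month
  idle_kwh = 15 * 22 * 30 / 1000   # ~22 h/day at ~15 W -> 9.9 kWh/month
  printf "%.2f", (load_kwh + idle_kwh) * 0.20
}')
echo "Estimated monthly electricity: \$$est"
```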


Cost Comparison: Local GPU vs. Cloud API

For our workload of ~3,150 requests/month (avg 200 tokens per response):

| Option | Monthly Cost | Privacy | Latency | Uptime Control |
| --- | --- | --- | --- | --- |
| RTX 3060 Ti (local) | ~$5 (electricity) | Full | 200ms | You |
| OpenAI GPT-4o-mini | ~$6.30 | None | 500ms+ | OpenAI |
| Anthropic Haiku | ~$7.90 | None | 400ms+ | Anthropic |
| Google Gemini Flash | Free (1,500 RPD) | None | 300ms+ | Google |
| RunPod (A40 hourly) | ~$50+ | Partial | 150ms | RunPod |

For low-volume agent workloads, the cost difference is small. The real advantages of local inference are:

  1. Privacy: Sensitive data (customer insights, business strategy) never leaves your network
  2. Reliability: No API rate limits, no outages you can't control
  3. Customization: Brand-tuned model with baked-in knowledge
  4. Predictability: Fixed cost regardless of usage spikes

Pitfalls We Hit (So You Don't Have To)

1. Model Downloads Kill Inference

Downloading a new model while another is serving requests drops throughput from 13.2 tok/s to 0.1 tok/s. Ollama uses the same GPU memory bus for both operations.

Fix: Only download models during maintenance windows when no cron jobs are running.

2. The 32K Context Disaster

Setting contextWindow: 32768 in our agent config looked reasonable. Performance was 132x slower. The KV cache exceeded VRAM and spilled to system RAM.

Fix: contextWindow: 16384. Test your actual VRAM budget before deploying.

3. Timeout Hierarchy Confusion

OpenClaw has three places to set timeouts: agents.defaults.timeoutSeconds, payload.timeoutSeconds, and --timeout on cron commands. Only the first one actually controls the timeout.

Fix: Always set agents.defaults.timeoutSeconds in openclaw.json. Ignore the rest.
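For reference, the shape of that setting as we understand it (the surrounding JSON structure is inferred from the key path agents.defaults.timeoutSeconds; 300 is an illustrative value):

```json
{
  "agents": {
    "defaults": {
      "timeoutSeconds": 300
    }
  }
}
```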

4. Gemini Billing Trap

Before switching to local inference, we used Gemini API. A key created from a billing-enabled Google Cloud project accumulated $127.80 in 7 days. Thinking tokens cost $3.50/1M — 47x more expensive than input tokens.

Fix: We migrated all agent inference to local Ollama. Zero API cost. Zero billing surprises.


When Local GPU Inference Makes Sense

Local inference is right for you if:

  • You run recurring automated tasks (not one-off queries)
  • You handle sensitive data that shouldn't leave your network
  • You need predictable costs without per-token billing
  • You want full control over uptime and model versions
  • You already have a NVIDIA GPU sitting idle

It's NOT right if:

  • You need GPT-4 or Claude-level reasoning (7B models can't match frontier models)
  • Your workload is sporadic (the GPU idles most of the time anyway)
  • You need >32K context windows (requires 24GB+ VRAM)

What's Next

Our roadmap includes:

  • Multi-GPU scaling: Adding a second RTX for parallel agent inference
  • Larger models: RTX 4090 (24GB VRAM) would unlock 14B-32B models at full speed
  • On-premise security scanning: Running UltraProbe's vulnerability analysis locally for enterprises that can't send data to cloud APIs
  • Model fine-tuning: Training domain-specific LoRA adapters on NVIDIA hardware

The RTX 3060 Ti proved that production AI agent infrastructure doesn't require cloud GPUs or enterprise hardware. A $300 card, good engineering, and systemd reliability are enough to run a business.


Try It Yourself

Our agent fleet architecture is open source:


Ultra Lab builds AI products. We run 4 live products including an autonomous agent fleet powered by NVIDIA GPU inference. Learn more at ultralab.tw.


Originally published on Ultra Lab — we build AI products that run autonomously.

Try UltraProbe free — our AI security scanner checks your website for vulnerabilities in 30 seconds: ultralab.tw/probe