The Day the Veltrix Config Layer Broke at 11k QPS

Lillian Dube

The Problem We Were Actually Solving

In 2023 we ran a real-time treasure-hunt engine for a live event with 250k concurrent players. The backend was a Go service backed by a single Redis cluster and a PostgreSQL read-replica. We chose Veltrix as the configuration layer because its YAML templates promised one-click horizontal scaling. What the docs didn't mention was where the scaling inflection point sits, or what happens when you hit it.

We watched Prometheus show p99 latency climb from 42 ms to 1.2 s when the Redis op count crossed 85k requests/s. The error budget vaporized in under 90 seconds. The stack trace pointed to veltrix-operator v0.14.7: it reloaded every configuration object on every pod restart, and the pod restart storm during an auto-scaling event triggered a 2 MB config file transfer for each new pod. Network egress from the config store spiked to 120 MB/s, which all but saturates a 1 Gbps uplink (125 MB/s). Load balancers started marking pods Unhealthy. The request success rate dropped to 72%, and the hunt leaderboard froze for 3 minutes while the cluster fought to stabilize.
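The egress figures above can be sanity-checked with back-of-envelope arithmetic. The sketch below uses only the numbers quoted in this section (2 MB per fetch, 120 MB/s egress, 1 Gbps uplink). The function names are illustrative, and it assumes decimal units and ignores protocol overhead.

```python
# Back-of-envelope check of the egress numbers quoted above.
# Assumptions: decimal units (1 MB = 10**6 bytes), no protocol overhead.

CONFIG_SIZE_MB = 2.0          # config payload per pod start (from the post)
OBSERVED_EGRESS_MB_S = 120.0  # observed egress from the config store
UPLINK_GBPS = 1.0             # uplink capacity


def uplink_capacity_mb_s(gbps: float) -> float:
    """Convert a link speed in Gbit/s to MB/s (8 bits per byte)."""
    return gbps * 1000 / 8


def implied_fetches_per_s(egress_mb_s: float, config_mb: float) -> float:
    """Config fetches per second needed to generate the given egress."""
    return egress_mb_s / config_mb


capacity = uplink_capacity_mb_s(UPLINK_GBPS)
utilization = OBSERVED_EGRESS_MB_S / capacity
fetch_rate = implied_fetches_per_s(OBSERVED_EGRESS_MB_S, CONFIG_SIZE_MB)

print(f"uplink capacity: {capacity:.0f} MB/s")
print(f"utilization: {utilization:.0%}")
print(f"implied config fetches: {fetch_rate:.0f} per second")
```

Under these assumptions, 120 MB/s corresponds to about 60 full config fetches per second, which is the kind of rate a restart storm produces when pods fetch repeatedly rather than once.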

The pod churn also revealed that the Veltrix CRD controller was running a full reconciliation loop for every pod every 5 seconds. With 180 pods at peak, that meant 36 reconciliations per second, each touching the Kubernetes API at a rate our etcd cluster couldn't absorb. etcd leader election time climbed from 12 ms to 800 ms, and the API server started throttling us at 50 req/s. The treasure-hunt engine now had a bottleneck we never designed for: the config layer.
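The reconciliation arithmetic is simple enough to model before an event. The sketch below reproduces the 36 reconciliations per second from the figures above and compares the resulting API load with the 50 req/s throttle. The number of API calls per reconciliation is not given in this post, so `calls_per_reconcile` is a labeled assumption you would need to measure.

```python
# Reconciliation load as a function of pod count and loop interval.
# PODS_AT_PEAK, RECONCILE_INTERVAL_S and API_THROTTLE_REQ_S come from the post;
# calls_per_reconcile is an ASSUMPTION (the post does not state it).

PODS_AT_PEAK = 180
RECONCILE_INTERVAL_S = 5
API_THROTTLE_REQ_S = 50


def reconcile_rate(pods: int, interval_s: int) -> float:
    """Reconciliations per second when every pod is reconciled each interval."""
    return pods / interval_s


def api_load(pods: int, interval_s: int, calls_per_reconcile: int) -> float:
    """Kubernetes API requests per second generated by the reconcile loop."""
    return reconcile_rate(pods, interval_s) * calls_per_reconcile


def max_pods_under_throttle(throttle_req_s: int, interval_s: int,
                            calls_per_reconcile: int) -> int:
    """Largest pod count whose reconcile traffic stays within the throttle."""
    return (throttle_req_s * interval_s) // calls_per_reconcile


for calls in (1, 2, 3):  # hypothetical values for calls per reconciliation
    load = api_load(PODS_AT_PEAK, RECONCILE_INTERVAL_S, calls)
    limit = max_pods_under_throttle(API_THROTTLE_REQ_S, RECONCILE_INTERVAL_S, calls)
    status = "over" if load > API_THROTTLE_REQ_S else "within"
    print(f"{calls} call(s)/reconcile: {load:.0f} req/s ({status} throttle), "
          f"max pods {limit}")
```

If each reconciliation makes even two API calls, 180 pods already exceed a 50 req/s throttle in this model, so the pod count at which the controller becomes the bottleneck can be estimated ahead of time.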

What We Tried First (And Why It Failed)

We blamed Redis first. We sharded to three clusters with client-side routing, but the p99 latency only improved to 850 ms. Next we tried scaling Veltrix replicas from 3 to 12. The extra pods doubled the etcd traffic and drove the API server into 429 Too Many Requests. We turned off the CRD controller for 90 seconds as a test, and the system stabilized instantly, which showed that the controller, not the cluster, was the point of failure.

We also tried gzipping the config tarball. The transfer size dropped from 2 MB to 450 KB, but the decompression latency added 300 ms on cold pods, so p99 went from bad to unusable. We tried local-caching the config in an emptyDir volume, but the cache stampede during rolling updates caused 200 pods to read the same object simultaneously, overwhelming the API server again.

Finally we tried switching to ConfigMaps and Helm. Helm templating at runtime introduced 400 ms latency per pod start, and the ConfigMap size limit of 1 MB capped us at 32 pods before we hit errors. The hunt ended with 190k players still connected but the leaderboard lagging behind by 27 seconds. We lost the real-time feel, and player complaints spiked.

The Architecture Decision

We ripped out Veltrix and replaced it with a bespoke Config Service written in Rust. The new service exposes a single gRPC endpoint that streams only the configuration fragments a pod actually needs via a selector. The selector is a label query: each pod advertises its hunt instance ID as a label, and the Config Service pushes only the hunt-specific config.
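To make the selector idea concrete, here is a minimal sketch of equality-based label matching. The real service is written in Rust and streams over gRPC; this Python version only illustrates the filtering step. The `ConfigFragment` type, the `hunt-instance` label key, and the fragment names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ConfigFragment:
    """One unit of configuration, tagged with labels (hypothetical shape)."""
    name: str
    labels: Dict[str, str]
    payload: bytes


def matches(selector: Dict[str, str], labels: Dict[str, str]) -> bool:
    """Equality selector: every selector key/value must appear in labels.

    An empty selector matches nothing, so a pod that advertises no labels
    never receives the whole config set by accident.
    """
    return bool(selector) and all(labels.get(k) == v for k, v in selector.items())


def select_fragments(fragments: List[ConfigFragment],
                     selector: Dict[str, str]) -> List[ConfigFragment]:
    """Return only the fragments a pod with this selector should receive."""
    return [f for f in fragments if matches(selector, f.labels)]


fragments = [
    ConfigFragment("rules", {"hunt-instance": "hunt-a"}, b"..."),
    ConfigFragment("map", {"hunt-instance": "hunt-a"}, b"..."),
    ConfigFragment("rules", {"hunt-instance": "hunt-b"}, b"..."),
]

pod_selector = {"hunt-instance": "hunt-a"}  # label the pod advertises
selected = select_fragments(fragments, pod_selector)
print([f.name for f in selected])
```

A pod labeled for `hunt-a` receives two fragments here instead of three. At scale, that is the difference between shipping the full config to every pod and shipping only the hunt-specific slice.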

We chose gRPC over REST because it runs over HTTP/2 and multiplexes streams on a single TCP connection, which cuts the connection churn that Veltrix triggered. We added a 10-second debounce on pod restarts so rapid churn doesn't reset the stream. We moved the config store from etcd to FoundationDB to handle the write volume without leader elections. FoundationDB gave us linearizable reads at 5k ops/s per node, replacing the throttled Kubernetes API.
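One way to read the 10-second debounce is that a stream reset for a pod only fires once that pod has been quiet for the full window, and every new restart pushes the deadline back. The sketch below implements that interpretation. The class and method names are hypothetical, and time is passed in explicitly so the behavior is easy to test.

```python
from typing import Dict, List


class RestartDebouncer:
    """Debounce restart events per pod (illustrative, not the production code)."""

    def __init__(self, window_s: float = 10.0) -> None:
        self.window_s = window_s
        self._deadlines: Dict[str, float] = {}

    def notify_restart(self, pod_id: str, now: float) -> None:
        """Record a restart; any pending reset for this pod is pushed back."""
        self._deadlines[pod_id] = now + self.window_s

    def pop_due(self, now: float) -> List[str]:
        """Return pods whose quiet period has elapsed, and forget them."""
        due = sorted(p for p, d in self._deadlines.items() if d <= now)
        for pod_id in due:
            del self._deadlines[pod_id]
        return due


debouncer = RestartDebouncer(window_s=10.0)

# Three restarts in quick succession: at t=0 s, t=4 s and t=8 s.
for t in (0.0, 4.0, 8.0):
    debouncer.notify_restart("pod-a", now=t)

early = debouncer.pop_due(now=12.0)   # still inside the window of the last restart
settled = debouncer.pop_due(now=18.0) # 10 s after the last restart
print(early, settled)
```

In this model, three restarts within eight seconds produce a single reset at t=18 s rather than three separate ones.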

We also introduced a two-tier cache in front of the Config Service: an L1 in-process LRU with 1000 entries (each 512 bytes) and an L2 on each Kubernetes node using a hostPath volume backed by a 1 GB tmpfs. The tmpfs gave us 100k reads/s per node without touching disk. The cache keys are the hunt instance ID plus a 64-bit version vector, so we can atomically swap the entire config for a hunt in under 50 ms. The version vector is incremented only when the hunt leader changes the rules, so the cache invalidation rate is low.
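The lookup path through the two tiers can be sketched as follows. This is a simplified model, not the production code: L2 is a plain dictionary standing in for the tmpfs-backed store on each node, `fetch` stands in for a call to the Config Service, and all names are hypothetical. The key is the pair (hunt instance ID, version), so a version bump yields a new key rather than an in-place update.

```python
from collections import OrderedDict
from typing import Callable, Dict, Optional, Tuple

CacheKey = Tuple[str, int]  # (hunt instance ID, version number)


class LRUCache:
    """Fixed-capacity in-process LRU (the L1 tier)."""

    def __init__(self, capacity: int = 1000) -> None:
        self.capacity = capacity
        self._data: "OrderedDict[CacheKey, bytes]" = OrderedDict()

    def get(self, key: CacheKey) -> Optional[bytes]:
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key: CacheKey, value: bytes) -> None:
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict least recently used


class TwoTierCache:
    """L1 (per process) in front of L2 (per node) in front of the origin."""

    def __init__(self, l1: LRUCache, l2: Dict[CacheKey, bytes],
                 fetch: Callable[[str, int], bytes]) -> None:
        self.l1, self.l2, self.fetch = l1, l2, fetch
        self.stats = {"l1": 0, "l2": 0, "origin": 0}

    def get(self, hunt_id: str, version: int) -> bytes:
        key = (hunt_id, version)
        value = self.l1.get(key)
        if value is not None:
            self.stats["l1"] += 1
            return value
        value = self.l2.get(key)
        if value is not None:
            self.stats["l2"] += 1
        else:
            value = self.fetch(hunt_id, version)  # stand-in for the Config Service
            self.l2[key] = value
            self.stats["origin"] += 1
        self.l1.put(key, value)
        return value


def fake_fetch(hunt_id: str, version: int) -> bytes:
    """Hypothetical origin: returns a recognizable payload."""
    return f"{hunt_id}@v{version}".encode()


node_l2: Dict[CacheKey, bytes] = {}  # shared by all pods on one node
cache = TwoTierCache(LRUCache(capacity=1000), node_l2, fake_fetch)

cache.get("hunt-42", 1)  # miss in both tiers, goes to origin
cache.get("hunt-42", 1)  # served from L1
cache.get("hunt-42", 2)  # version bump: new key, goes to origin again
print(cache.stats)
```

Because the version is part of the key, readers switch to the new config by asking for the new version, and stale entries simply age out of the LRU. A second pod on the same node would start with an empty L1 but hit the shared L2.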

The Config Service itself runs at 3 replicas for fault tolerance. We tuned the gRPC keepalive to 30 seconds so NAT timeouts don't kill the stream. The end-to-end latency from a config change to a pod receiving the update is now 45 ms p99, down from 1.2 s.

What The Numbers Said After

After the swap we replayed the same traffic pattern in staging. The Redis cluster handled 11k ops/s with 35 ms p99 latency, roughly 34x better than the 1.2 s p99 we saw under Veltrix. FoundationDB sustained 7k ops/s with 14 ms reads, well below its 25k ops/s limit. The Config Service CPU usage stayed under 15% across the three replicas, and memory never exceeded 180 MB per pod. The hunt leaderboard updated every 2 seconds with zero freezes.

In production during the next event, we handled 310k concurrent players with 480 Redis shards and six Config Service replicas. The p99 latency stayed at 45 ms, and the error budget remained intact. The etcd cluster handled 1800 API calls/s with 11 ms latency and no throttling. The kube-apiserver log volume dropped from 4 GB/day to 600 MB/day because we removed the Veltrix controller.

What I Would Do Differently

I would not have trusted Veltrix's promise of one-click scaling without measuring the operator overhead first. If I had benchmarked the operator's reconciliation rate under pod churn before the event, I would have seen the etcd saturation coming well before we reached the 36 reconciliations per second we hit at peak.

I would also have isolated the operator from the control plane. Running it in the same cluster that it scales was a mistake: a single misbehaving replica can DoS the API server it depends on. Next time I would deploy the Config Service as a separate cluster with