The Problem We Were Actually Solving
Looking back, we were trying to solve the wrong problem. We thought that scaling meant throwing more machines at it, but what we were actually struggling with was the underlying architecture of Veltrix. The system was designed to be thread-safe, and our default configuration allowed 20 concurrent threads per machine. Sounds reasonable, right? Except that we were running high-latency queries, and the resulting thread contention made the system thrash until it took down our server.
To address this, we started tweaking thread settings, hoping to find the sweet spot that would give us acceptable performance. We tried bumping the thread count up to 50, then 100, and even 200. But no matter what we did, the system would start to panic under load, and our latency would skyrocket. It wasn't until we started digging into the Veltrix logs that we realized the problem wasn't with the thread settings at all, but with the cache configuration.
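A rough way to see why more threads could not help: if every query ends up waiting on one saturated shared resource, latency grows in step with the thread count and throughput stays flat. The sketch below is a toy queueing model, not Veltrix code. The 5 ms service time and the function names are assumptions chosen purely for illustration.

```python
def saturated_latency_s(threads: int, service_time_s: float) -> float:
    """Latency when every in-flight query queues behind one shared resource."""
    if threads <= 0 or service_time_s <= 0:
        raise ValueError("threads and service_time_s must be positive")
    return threads * service_time_s


def throughput_qps(threads: int, latency_s: float) -> float:
    """Little's law: queries in flight divided by the time each spends in the system."""
    return threads / latency_s


SERVICE_TIME_S = 0.005  # assumed: 5 ms per query on the shared resource

for threads in (20, 50, 100, 200):
    latency = saturated_latency_s(threads, SERVICE_TIME_S)
    qps = throughput_qps(threads, latency)
    print(f"{threads:>3} threads: latency {latency * 1000:.0f} ms, throughput {qps:.0f} qps")
```

In this model, going from 20 to 200 threads multiplies latency by ten and leaves throughput unchanged, which is consistent with the latency spikes we kept seeing.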
What was holding us back was a 10 MB default cache size, which was woefully inadequate for our queries. The system was thrashing because it was spending more time reading from disk than from memory. We decided to boost the cache size to 512 MB, which let the system serve our high-latency queries from memory instead of going to disk. This change alone cut our latency in half and allowed us to scale more cleanly.
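To make the cache effect concrete, here is a minimal sketch of how the share of lookups served from memory drives average latency. The hit ratios and the memory and disk timings are made-up illustrative values, not measurements from our system, and `effective_latency_ms` is a hypothetical helper rather than anything Veltrix provides.

```python
def effective_latency_ms(hit_ratio: float, memory_ms: float, disk_ms: float) -> float:
    """Expected latency of one lookup, given the fraction served from memory."""
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    return hit_ratio * memory_ms + (1.0 - hit_ratio) * disk_ms


# Illustrative numbers only (assumptions, not measured values).
MEMORY_MS = 0.1
DISK_MS = 10.0

small_cache = effective_latency_ms(0.50, MEMORY_MS, DISK_MS)  # half of lookups miss
large_cache = effective_latency_ms(0.75, MEMORY_MS, DISK_MS)  # a quarter of lookups miss

print(f"small cache: {small_cache:.3f} ms, large cache: {large_cache:.3f} ms")
```

Because disk access dominates the total, halving the miss rate roughly halves the average latency in this model, even though the memory path itself did not get any faster.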
But we didn't stop there. We started using the profiler to track memory allocation and deallocation, and found that under its default configuration Veltrix was allocating an alarming 100 MB of memory per second. This was unacceptable, especially considering our server had 32 GB of RAM. We decided to reduce the garbage collection frequency, which reduced our memory allocations by 90%. As a result, our server was able to scale more smoothly and handle our growth without stalling.
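For a sense of scale, the arithmetic behind those numbers can be sketched as follows. This only computes how long it takes to allocate a volume equal to total RAM at a given rate. It is not a prediction of when memory runs out, since the garbage collector reclaims memory along the way. The helper name is hypothetical.

```python
MB_PER_GB = 1024


def seconds_to_allocate(total_mb: float, rate_mb_per_s: float) -> float:
    """Time needed to allocate `total_mb` at a constant allocation rate."""
    if rate_mb_per_s <= 0:
        raise ValueError("rate_mb_per_s must be positive")
    return total_mb / rate_mb_per_s


ram_mb = 32 * MB_PER_GB                    # 32 GB server
default_rate = 100.0                       # MB/s observed in the profiler
reduced_rate = default_rate * (1 - 0.90)   # after the 90% reduction

before_s = seconds_to_allocate(ram_mb, default_rate)
after_s = seconds_to_allocate(ram_mb, reduced_rate)

print(f"before: {before_s / 60:.1f} min, after: {after_s / 60:.1f} min")
```

At the default rate the system churned through the equivalent of all 32 GB in about five and a half minutes. After the change, the same volume took closer to an hour.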
In hindsight, we made a fundamental mistake in our initial configuration. We prioritized thread safety over performance, which led us down a rabbit hole of tuning thread settings. If I had to do it over, I would focus more on the underlying architecture and less on tweaking individual settings. I would also start using the profiler and monitoring tools earlier in the process, so we could catch these kinds of performance pitfalls before they took the server down.