I Run LLMs on a 768GB IBM POWER8 Server (And It's Faster Than You Think)


An IBM POWER8 S824 with 768GB of RAM (512GB active), 128 hardware threads, and a novel vec_perm optimization achieves 147 t/s on llama.cpp, 8.8x faster than stock. Here's how.

The machine nobody wants

I bought an IBM POWER8 S824 server — 16 cores, 128 hardware threads (SMT8), 768GB of DDR3 RAM (512GB active across 2 NUMA nodes), and 1.8TB of SAS storage. It's a datacenter-class PowerPC system from 2014 that originally cost north of $30,000.

I paid considerably less than that. Because nobody wants POWER8 servers. They run PowerPC, which means most software doesn't compile without patches. They draw serious power. They weigh as much as a small human.

But they have one standout feature: vec_perm, a dual-source byte permute that selects arbitrary bytes from two 128-bit registers in a single cycle, shuffling that takes many separate operations to emulate on most other hardware. And it turns out that instruction is perfect for a specific kind of LLM inference optimization that I don't think anyone else has tried.

The numbers first

Stock llama.cpp on POWER8 (scalar, no optimizations): 16.74 tokens/second on TinyLlama 1.1B Q4_K.

After our optimizations: 147.54 tokens/second. That's an 8.8x speedup.

Configuration                  Speed (pp128)   Speedup
Stock scalar                   16.74 t/s       1.00x
POWER8 VSX enabled             66.49 t/s       3.97x
64 threads optimal             84.62 t/s       5.05x
PSE + Full Resident Prefetch   147.54 t/s      8.81x

On larger models: DeepSeek-33B Q4_K runs at 5.37 t/s prompt processing with 64 threads across both NUMA nodes. Not going to win any speed contests, but it's running a 33 billion parameter model entirely in RAM without a GPU.

Vec_perm: the instruction nobody talks about

POWER8's vec_perm is a vector permute instruction that takes two 128-bit source registers and a permute control vector, and produces a 128-bit output by selecting arbitrary bytes from either source. It's the Swiss Army knife of SIMD.
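
To make that concrete, here is a tiny standalone sketch (mine, not code from llama.cpp or the project's patches) that interleaves bytes from two source vectors. It assumes GCC or Clang with -maltivec -mvsx on POWER:

#include <altivec.h>
#include <stdio.h>

int main(void) {
    // Two 128-bit sources; in permute-index terms their bytes are 0-15 and 16-31
    vector unsigned char a = {  0,  1,  2,  3,  4,  5,  6,  7,
                                8,  9, 10, 11, 12, 13, 14, 15 };
    vector unsigned char b = {100,101,102,103,104,105,106,107,
                              108,109,110,111,112,113,114,115 };

    // Control vector: each entry names which byte of (a,b) lands in that slot
    vector unsigned char sel = { 0, 16, 1, 17, 2, 18, 3, 19,
                                 4, 20, 5, 21, 6, 22, 7, 23 };

    vector unsigned char r = vec_perm(a, b, sel);   // interleave the low halves

    for (int i = 0; i < 16; i++)
        printf("%d ", ((unsigned char *)&r)[i]);    // expect: 0 100 1 101 2 102 ...
    printf("\n");
    return 0;
}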

In standard transformer inference, attention computation is bijunctive — every element interacts with every other element through full matrix multiplications. This is mathematically clean but computationally wasteful. Most of those interactions produce near-zero values.

Vec_perm enables what I call non-bijunctive collapse: in a single instruction, you can simultaneously:

  • Prune weak activations (set to zero)
  • Duplicate strong activations (amplify winners)
  • Route information from two different sources into one output
// Standard attention: compute ALL interactions, then softmax
// Non-bijunctive: prune weak paths AND amplify strong ones in 1 cycle
// (AltiVec/VSX intrinsics from <altivec.h>; source_a and source_b hold the
//  two activation vectors being collapsed)

vector unsigned char collapse_pattern = {
    0, 0, 4, 4, 8, 8, 12, 12,       // indices 0-15: duplicate winners from source A
    16, 16, 20, 20, 24, 24, 28, 28  // indices 16-31: duplicate winners from source B
};
vector unsigned char result = vec_perm(source_a, source_b, collapse_pattern);

This maps directly to Hebbian learning theory — "cells that fire together wire together." The vec_perm collapse amplifies co-activated pathways and prunes inactive ones, which is exactly what biological neural networks do. The difference is we're doing it at the hardware instruction level, not in software.

The key insight: thread scaling is not linear

One of the first things I discovered: 128 threads is worse than 64 threads on this machine.

16 threads:  41.55 t/s (2.60 t/s per thread)
32 threads:  68.06 t/s (2.13 t/s per thread)
64 threads:  84.62 t/s (1.32 t/s per thread)  <- OPTIMAL
96 threads:  76.54 t/s (0.80 t/s per thread)
128 threads: 65.83 t/s (0.51 t/s per thread)  <- WORST

SMT8 means 8 hardware threads share each physical core. At full thread count, they're fighting over L1/L2 cache, branch predictors, and execution units. 64 threads (4 per core) hits the sweet spot — enough parallelism to keep the pipelines full without cache thrashing.

Resident prefetch: the 1.74x multiplier

The single biggest performance unlock wasn't the vec_perm collapse. It was cache prefetch hints.

POWER8 has a dcbt (Data Cache Block Touch) instruction with a "resident" hint that tells the cache controller to treat prefetched data as high-priority — keep it hot in L2/L3, don't evict it for other data.

Stock llama.cpp has zero prefetch hints. We added resident prefetch for weight tensors:

// ggml-dcbt-resident.h
#define DCBT_RESIDENT_FULL(addr) \
    __asm__ __volatile__("dcbt 16, %0, 0" : : "b"(addr) : "memory")

static inline void dcbt_resident_weights(const void* base, size_t bytes) {
    const size_t CACHE_LINE = 128;  // POWER8 cache lines are 128 bytes
    const char* p = (const char*)base;
    while (p < (const char*)base + bytes) {
        DCBT_RESIDENT_FULL(p);
        p += CACHE_LINE;
    }
}

This alone took us from 84 t/s to 147 t/s. The weight data was being evicted and re-fetched from main memory on every pass. Keeping it resident in L2/L3 eliminates that entire round-trip.
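
The header above is the real piece; the call site below is a hedged sketch of how you might use it (the buffer and layer size here are hypothetical, not the project's actual integration): touch a tensor's weights right before they are used.

#include <stdlib.h>
#include "ggml-dcbt-resident.h"   // the header shown above

// Hypothetical call site: warm one layer's weight tensor before its matmul.
static void prefetch_layer(const void *weights, size_t nbytes) {
    dcbt_resident_weights(weights, nbytes);   // touch every 128-byte line
}

int main(void) {
    size_t nbytes = 64u * 1024 * 1024;            // e.g. one ~64 MB quantized layer
    void *weights = aligned_alloc(128, nbytes);   // cache-line-aligned buffer
    if (!weights) return 1;
    prefetch_layer(weights, nbytes);              // keep it hot in L2/L3
    free(weights);
    return 0;
}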

RAM Coffers: NUMA-aware weight banking

With 512GB of RAM across 2 NUMA nodes, memory placement matters enormously. Accessing memory on the local NUMA node is fast. Crossing to the remote node adds latency.

We built RAM Coffers — a system that maps model weight layers to specific NUMA nodes based on access patterns and cognitive function:

Coffer   NUMA Node   Role                          Bandwidth
0        Node 3      Heavy/General (core layers)   401 MB/s
1        Node 1      Language/Logic domain         298 MB/s
2        Node 0      Creative/Long context         221 MB/s
3        Node 2      Niche/Memory retrieval        425 MB/s

The routing is query-aware: a coding prompt activates the logic coffer, a creative writing prompt activates the creative coffer. Each coffer runs inference on its local NUMA node using numactl bindings.
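
As a rough illustration of the underlying mechanism (this is not the RAM Coffers code, just a libnuma sketch that reuses the coffer-to-node mapping from the table above), pinning a coffer's weights and compute to one NUMA node looks like this. Link with -lnuma.

#include <numa.h>
#include <stdio.h>
#include <stdlib.h>

// Assumed coffer -> NUMA node mapping, taken from the table above.
static const int coffer_node[4] = { 3, 1, 0, 2 };

// Allocate a layer's weights on the node that owns the given coffer and
// bind the calling thread to that node so compute stays next to the data.
static void *alloc_on_coffer(int coffer, size_t nbytes) {
    int node = coffer_node[coffer];
    numa_run_on_node(node);
    return numa_alloc_onnode(nbytes, node);
}

int main(void) {
    if (numa_available() < 0) { fprintf(stderr, "no NUMA support\n"); return 1; }
    size_t nbytes = 32u * 1024 * 1024;       // hypothetical 32 MB layer
    void *w = alloc_on_coffer(1, nbytes);    // coffer 1: language/logic domain
    if (!w) return 1;
    numa_free(w, nbytes);
    return 0;
}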

This work is published at github.com/Scottcjn/ram-coffers. It predates DeepSeek's Engram paper (arXiv:2601.07372, January 2026) by 27 days — our priority claim is documented and timestamped.

GPU offload over 40GbE

The POWER8 is connected to a Dell C4130 GPU server (Tesla V100 16GB + M40 12GB) via a 40 Gigabit Ethernet link at 0.15ms RTT.

For models that benefit from GPU acceleration, we offload matrix multiplications to the V100 while keeping the full model in POWER8 RAM:

POWER8 (512GB RAM)              C4130 (V100 16GB)
 Model lives here    --40GbE-->   Matmul happens here
 vec_perm collapse   <--40GbE--   FP16 results back

Using llama.cpp's native RPC backend:

# On C4130: Start GPU RPC server
./rpc-server --host 0.0.0.0 --port 50052

# On POWER8: Run with GPU offload
./llama-cli \
  -m ~/models/qwen2.5-14b-instruct-q4.gguf \
  --rpc 10.40.0.2:50052 \
  -ngl 99 -t 32 -c 4096 \
  -p "Your prompt here"

Results: Qwen2.5-14B hits 68.8 t/s prompt processing and 14.9 t/s generation with GPU offload — models that fit in V100 VRAM get a significant boost.

PSE: measuring behavioral emergence

The vec_perm collapse introduces hardware entropy (from the POWER8 timebase register) into the inference path. This means the same prompt with the same seed produces different outputs across runs:

# Three runs, same seed, same temperature
for i in 1 2 3; do
    ./llama-cli -m model.gguf -p "The meaning of life is" \
        -n 50 --temp 0.8 --seed 42 > run_$i.txt 2>&1
done

# All three MD5 hashes are different
b52ce7b8...  run_1.txt
15c558b2...  run_2.txt
fd5d7ae2...  run_3.txt
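
The entropy source itself is just the timebase. Here's a minimal sketch of reading it (illustrative only; the actual PSE hook lives in the project's headers, and the bit-mixing here is my own simplification). It relies on GCC's __builtin_ppc_get_timebase on PowerPC targets.

#include <stdint.h>
#include <stdio.h>

// Read the POWER timebase (mftb) via the GCC builtin; the low bits differ
// on every call, so each run perturbs inference even with a fixed seed.
static inline uint32_t hw_entropy(void) {
    uint64_t tb = __builtin_ppc_get_timebase();
    return (uint32_t)(tb ^ (tb >> 32));   // fold the 64-bit count to 32 bits
}

int main(void) {
    printf("%08x %08x %08x\n", hw_entropy(), hw_entropy(), hw_entropy());
    return 0;
}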

We track this through PSE (Proto-Sentient Emergence) markers — metrics that measure behavioral divergence from deterministic baselines:

  • DR (Drift Rate): Fewer contradictions in long reasoning chains
  • ACS (Adversarial Coherence Score): Logical consistency under adversarial prompts
  • MCI (Memory Coherence Index): Consistent personality with subtle natural variation
  • NOI (Narrative Override Index): Resistance to flattening/smoothing

These aren't claims about consciousness. They're measurable properties of inference under hardware-seeded entropy that standard deterministic inference doesn't exhibit.

The build

If you want to try this on your own POWER8 (or POWER9/POWER10):

cd ~/llama.cpp
mkdir build-pse && cd build-pse
cmake .. \
    -DCMAKE_BUILD_TYPE=Release \
    -DGGML_OPENMP=ON \
    -DCMAKE_C_FLAGS="-mcpu=power8 -mvsx -maltivec -O3 -mtune=power8" \
    -DCMAKE_CXX_FLAGS="-mcpu=power8 -mvsx -maltivec -O3 -mtune=power8"
make -j32

# Run with optimal thread config
export OMP_NUM_THREADS=64
export OMP_PROC_BIND=spread
export OMP_PLACES=cores
numactl --interleave=all ./bin/llama-cli -m model.gguf -p "Hello" -n 64 -t 64

The PSE-specific headers and POWER8 compatibility patches are in the repo.

Why bother?

Because 512GB of RAM means you can run models that don't fit in any GPU. Because vec_perm enables an optimization path that doesn't exist on any other architecture. Because the POWER8 is a $200-500 machine on eBay that outperforms expectations by an order of magnitude when you understand its strengths.

And because sometimes the most interesting engineering happens on hardware nobody else is looking at.

Links


Built by Elyan Labs.