I ended up getting a modern 26B MoE model (Gemma 4) running at reading speed on an old recycled server with a single Xeon E5-2620 v4 and 128GB of DDR3 RAM (and no GPU). It took a lot of work, but it actually worked out somehow.
I've also linked the quants at the end, but they're not gonna run unless you use the ik_llama-cpp fork I mention, see other posts for more details.
I'm not an ML engineer, so I'm by no means an expert, and the server is busy acting as a Nix cache, but if you have any question, I can try to answer, but best effort.
You say it runs "at reading speed". Have you benchmarked it?
Noted, and agree (it looks like it has also already been clicked, which I dislike). I honestly I need to redo the themes.
> You say it runs "at reading speed". Have you benchmarked it?
At some point a few weeks ago, yes I think so, but I didn't write it down for some reason... so I'll have to find a time when it's not busy and do it again without a noisy system. Right now the system is noisy, but that said doing it like this:
llama-cli --model gemma-4-26B-A4B-it-Q8_0.gguf --model-draft gemma-4-26B-A4B-t-assistant-GGUF/wikitext-2-raw_ik-llama-mtp_drafter-conservative/gemma-4-26B-A4B-it-assistant-Q8_0.gguf --spec-type mtp --draft-max 3 --draft-p-min 0.0 --color -sm graph -smgs -sas -mea 256 --split-mode-f32 --temp 0.7 --cpu-moe -t 8 --flash-attn on --mla-use 3 --merge-up-gate-experts --special --mlock --run-time-repack --spec-autotune --no-kv-offload --parallel 8 --jinja -p "Why is the sky blue?" -n 128
Gives:
llama_print_timings: load time = 83911.65 ms
llama_print_timings: sample time = 26.99 ms / 128 runs ( 0.21 ms per token, 4742.15 tokens per second)
llama_print_timings: prompt eval time = 343.41 ms / 7 tokens ( 49.06 ms per token, 20.38 tokens per second)
llama_print_timings: eval time = 10639.36 ms / 127 runs ( 83.77 ms per token, 11.94 tokens per second)
llama_print_timings: total time = 11114.98 ms / 134 tokens
So 11.94 tokens per second while it's also playing binary cache and CI builder.When I do it properly, I'll add it to the blog as well!
numactl --membind=1
so it is constrained to one of the memory sticks which speeds up token generation a little.
Model name: Intel(R) Xeon(R) CPU E3-1265L V2 @ 2.50GHz
Mainboard Product Name: P8Z77 WS
GPU 05:00.0 VGA compatible controller: NVIDIA Corporation AD106 [GeForce RTX 4060 Ti 16GB] (rev a1)
05:00.1 Audio device: NVIDIA Corporation AD106M High Definition Audio Controller (rev a1)
Memory: 32GBThis works.
So you'd change the invocation slightly here, but a lot of things you can potentially reuse.
That said, the Gemma 4 E4B models have so far in my experience been... not great when it comes to long context, but they are very passable for basic tasks, and even seem surprisingly okay at tool calls.
Totally just vibes based, I think it goes up to 20+ tps when it's not under load (and that's me trying to be conservative). For context, reading speed at 250 wpm would be around 5 to 6 tokens per second.