5.6x throughput on Kimi K2.6 by speculating less (huggingface.co)
Scaling is linear at 15.8 tok/s per slot; latency is constant. The repo has a command launcher, a Dockerfile, and a benchmark tool. Known limitations: BF16 KV only (FP8 crashes due to an AITER 384-expert constraint).
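
For anyone wanting to sanity-check what "linear" implies, here is a rough sketch of the arithmetic, assuming the 15.8 tok/s per-slot rate really does hold flat as slots are added. The function name and the slot counts are my own illustration, not something from the repo or its benchmark tool.

```python
PER_SLOT_TOK_S = 15.8  # per-slot rate quoted in the post


def aggregate_throughput(slots: int, per_slot: float = PER_SLOT_TOK_S) -> float:
    """Total tok/s if every slot sustains the same rate (the linear-scaling assumption)."""
    if slots < 0:
        raise ValueError("slots must be non-negative")
    return slots * per_slot


if __name__ == "__main__":
    # Slot counts are arbitrary examples, not measured configurations.
    for n in (1, 8, 32):
        print(f"{n:>3} slots -> {aggregate_throughput(n):.1f} tok/s")
```

This only restates the claim as a formula; it says nothing about where the scaling stops being linear in practice.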