5.6x throughput on Kimi K2.6 by speculating less (huggingface.co)

11 points by florianleibert 73 days ago | 2 comments

Hi! I open-sourced a serving config that takes Kimi K2.6 from 90 tok/s to 508 tok/s on 8xMI300X. Same weights, zero quality loss.

Scaling is linear at 15.8 tok/s per slot, and latency is constant. The repo has a command launcher, a Dockerfile, and a benchmark tool. Known limitation: BF16 KV cache only (FP8 crashes due to an AITER 384-expert constraint).
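As a quick sanity check on the figures above, here is a minimal sketch of the linear-scaling arithmetic. Only the three numbers (90, 508, and 15.8 tok/s) come from the post; the function names are hypothetical, and the slot count is not stated in the post, so the value computed here is only what the numbers imply under a strictly linear model.

```python
# Figures quoted in the post.
BASELINE_TOK_S = 90.0    # throughput before the config change
REPORTED_TOK_S = 508.0   # throughput after, on 8xMI300X
PER_SLOT_TOK_S = 15.8    # per-slot rate under linear scaling


def aggregate_throughput(slots: int, per_slot: float = PER_SLOT_TOK_S) -> float:
    """Total tok/s if every concurrent slot runs at the same per-slot rate."""
    return slots * per_slot


def implied_slots(total: float, per_slot: float = PER_SLOT_TOK_S) -> float:
    """Slot count implied by a total throughput (an inference, not a stated figure)."""
    return total / per_slot


# Ratio of the reported throughput to the baseline.
speedup = REPORTED_TOK_S / BASELINE_TOK_S

if __name__ == "__main__":
    print(f"speedup: {speedup:.2f}x")
    print(f"implied concurrent slots: {implied_slots(REPORTED_TOK_S):.1f}")
    print(f"throughput at 32 slots: {aggregate_throughput(32):.1f} tok/s")
```

The ratio matches the 5.6x in the title, and the reported total corresponds to roughly 32 concurrent slots if the per-slot rate really stays constant.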

latchkey 73 days ago

This is a massive speedup and entirely open source!