LLM inference engine from scratch in C++ – why output tokens cost 5x (anirudhsathiya.com)
Like when someone mentioned vLLM's paged attention: I knew virtual memory paging, but had no idea someone had applied the same idea to KV cache allocation on GPUs.
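For anyone else who hadn't seen the paging analogy spelled out, here is a minimal CPU-side sketch of the idea: the KV cache is cut into fixed-size blocks, and each sequence keeps a block table that maps logical token positions to physical blocks, much like a page table. Everything here (`PagedKVAllocator`, `kBlockTokens`, the method names) is hypothetical and made up for illustration; it is not vLLM's API or the linked project's code, and it only tracks block indices rather than real GPU memory.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <unordered_map>
#include <utility>
#include <vector>

// Assumed block size: number of tokens whose K/V entries fit in one block.
constexpr std::size_t kBlockTokens = 16;

// Hypothetical allocator that hands out fixed-size KV blocks on demand.
class PagedKVAllocator {
public:
    explicit PagedKVAllocator(std::size_t num_blocks) {
        // Every physical block starts free; blocks are popped from the back.
        for (std::size_t i = num_blocks; i > 0; --i) free_.push_back(i - 1);
    }

    // Reserve a slot for one more token of sequence `seq`.
    // Returns (physical block, offset in block), or nullopt if no block is free.
    std::optional<std::pair<std::size_t, std::size_t>> append_token(int seq) {
        Sequence& s = seqs_[seq];
        const std::size_t offset = s.num_tokens % kBlockTokens;
        if (offset == 0) {
            // The last block is full (or there is none yet): map a new one.
            if (free_.empty()) return std::nullopt;
            s.block_table.push_back(free_.back());
            free_.pop_back();
        }
        ++s.num_tokens;
        return std::make_pair(s.block_table.back(), offset);
    }

    // Translate a logical token position to its physical slot,
    // the same way a page table translates a virtual address.
    std::pair<std::size_t, std::size_t> lookup(int seq, std::size_t pos) const {
        const Sequence& s = seqs_.at(seq);
        assert(pos < s.num_tokens);
        return {s.block_table[pos / kBlockTokens], pos % kBlockTokens};
    }

    // Return all blocks of a finished sequence to the free list.
    void release(int seq) {
        auto it = seqs_.find(seq);
        if (it == seqs_.end()) return;
        for (std::size_t b : it->second.block_table) free_.push_back(b);
        seqs_.erase(it);
    }

    std::size_t free_blocks() const { return free_.size(); }

private:
    struct Sequence {
        std::vector<std::size_t> block_table;  // logical block -> physical block
        std::size_t num_tokens = 0;
    };
    std::vector<std::size_t> free_;
    std::unordered_map<int, Sequence> seqs_;
};
```

The point of the design is that a sequence never needs one contiguous region sized for its maximum length: it wastes at most one partially filled block, and freed blocks can be reused by any other sequence.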
GitHub link to the project: https://github.com/Anirudh171202/WhiteLotus
UHHHH...