We built SiMM because LLM context lengths are growing much faster than GPU memory. Long Chain-of-Thought reasoning and multi-turn agents keep making prompts longer: according to OpenRouter’s State of AI 2025, average context length has grown about 4× in the past year.

This creates two problems in inference systems:

• Slow TTFT (time to first token): long contexts make prefill expensive
• High GPU memory cost: the KV cache quickly exhausts HBM

Instead of recomputing long prompts or keeping the entire KV cache in GPU memory, we explored a different approach: treat the KV cache as a distributed memory system.

SiMM is an open-source distributed KV cache engine for LLM inference. It stores the KV cache in a high-speed RDMA-backed memory pool and lets engines like SGLang and vLLM reuse cached states across requests. For prompt prefixes that are already cached, this turns prefill from a compute-heavy step into a fast I/O lookup.

In our tests with long-context multi-turn workloads:

• 3.1× speedup vs. no cache
• 2.1× speedup vs. local CPU cache
• up to 9× lower KV I/O latency

SiMM scales horizontally across nodes and fully utilizes RDMA NIC bandwidth.

GitHub: https://github.com/scitix/SiMM
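
To make the "prefill becomes a lookup" idea concrete, here is a minimal sketch of prefix-based KV reuse across requests. It is not SiMM's API: `ToyKVPool`, `block_keys`, `longest_cached_prefix`, and the block size of 4 tokens are all hypothetical names and values chosen for illustration, and an in-process dict stands in for the RDMA-backed pool. The assumption is that KV blocks are keyed by a hash chained over all preceding tokens, so a key matches only when the whole prefix matches.

```python
import hashlib
from typing import Dict, List, Optional, Tuple

BLOCK_TOKENS = 4  # hypothetical block size, kept tiny for the demo


def block_keys(tokens: List[int], block: int = BLOCK_TOKENS) -> List[str]:
    """Return one chained hash key per full block of tokens.

    Each key depends on the previous key, so it identifies the entire
    prefix up to and including that block, not just the block itself.
    """
    keys: List[str] = []
    prev = b""
    full = len(tokens) - len(tokens) % block  # ignore the trailing partial block
    for start in range(0, full, block):
        chunk = ",".join(map(str, tokens[start:start + block])).encode()
        prev = hashlib.sha256(prev + b"|" + chunk).digest()
        keys.append(prev.hex())
    return keys


class ToyKVPool:
    """In-memory stand-in for a remote KV pool (assumption, not SiMM's interface)."""

    def __init__(self) -> None:
        self._store: Dict[str, bytes] = {}

    def put(self, key: str, kv_block: bytes) -> None:
        self._store[key] = kv_block

    def get(self, key: str) -> Optional[bytes]:
        return self._store.get(key)


def longest_cached_prefix(pool: ToyKVPool, tokens: List[int]) -> Tuple[int, List[bytes]]:
    """Fetch cached KV blocks until the first miss.

    Returns the number of prompt tokens covered by the cache and the blocks
    themselves; only the remaining tokens would still need prefill compute.
    """
    blocks: List[bytes] = []
    for key in block_keys(tokens):
        kv = pool.get(key)
        if kv is None:
            break
        blocks.append(kv)
    return len(blocks) * BLOCK_TOKENS, blocks


if __name__ == "__main__":
    pool = ToyKVPool()
    turn1 = list(range(10))  # first request: 10 prompt tokens
    for i, key in enumerate(block_keys(turn1)):
        pool.put(key, f"kv-block-{i}".encode())  # placeholder for real KV tensors

    turn2 = turn1 + [100, 101, 102]  # follow-up turn extends the same prefix
    reused, _ = longest_cached_prefix(pool, turn2)
    print(f"reused {reused} of {len(turn2)} tokens from the pool")
```

In this toy run the second turn shares its first two full blocks with the first turn, so 8 of its 13 tokens come from the pool and only the remaining 5 would need prefill. In a multi-turn conversation the shared prefix is usually most of the prompt, which is why replacing recomputation with a fetch can pay off when the fetch path is fast.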