Show HN: Llm.sql – Run a 640MB LLM on SQLite, with 210MB peak RSS and 7.4 tok/s

8 points by aldielshala 69 days ago | 2 comments

Hi HN,

I built llm.sql, an LLM inference framework that reimagines the LLM execution pipeline as a series of structured SQL queries atop SQLite.

The motivation: Edge LLMs are getting better, but hardware remains a bottleneck, especially RAM (size and bandwidth).

When available memory is smaller than the model plus its KV cache, the OS incurs page faults and evicts and reloads pages using LRU-like strategies, degrading throughput in ways that are hard to notice and even harder to debug. But the memory access pattern during LLM inference is deterministic: we know exactly which weights are needed and when. That makes even Bélády's optimal page replacement algorithm (evict the page whose next use is furthest in the future), normally impractical because it requires knowing future accesses, applicable here.
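To see why a known access order matters, here is a small Python simulation (not code from llm.sql) that counts page faults under LRU and under Bélády's MIN policy. It assumes a hypothetical workload in which each decoding step touches 24 "pages" (think: one per transformer layer) in the same fixed order, with room for only 8 in memory. The layer count, cache size, and function names are illustrative assumptions.

```python
from collections import OrderedDict

def lru_misses(trace, capacity):
    """Count page faults when the least recently used page is evicted."""
    cache = OrderedDict()
    misses = 0
    for page in trace:
        if page in cache:
            cache.move_to_end(page)  # refresh: now the most recently used
            continue
        misses += 1
        if len(cache) >= capacity:
            cache.popitem(last=False)  # evict the least recently used page
        cache[page] = True
    return misses

def next_use(trace, start, page):
    """Index of the next access to `page` at or after `start`, or infinity if none."""
    for j in range(start, len(trace)):
        if trace[j] == page:
            return j
    return float("inf")

def belady_misses(trace, capacity):
    """Count page faults under Bélády's MIN: evict the page whose next use is furthest away."""
    cache = set()
    misses = 0
    for i, page in enumerate(trace):
        if page in cache:
            continue
        misses += 1
        if len(cache) >= capacity:
            # Requires knowing the future trace, which is exactly what inference provides.
            victim = max(cache, key=lambda p: next_use(trace, i + 1, p))
            cache.remove(victim)
        cache.add(page)
    return misses

# Hypothetical workload: each of 10 decoding steps touches 24 layers in order,
# and only 8 layers fit in memory at once.
num_layers, steps, capacity = 24, 10, 8
trace = [layer for _ in range(steps) for layer in range(num_layers)]
print("LRU misses:   ", lru_misses(trace, capacity))  # 240
print("Belady misses:", belady_misses(trace, capacity))
```

With a cyclic access pattern larger than the cache, LRU always evicts the cached page that will be needed soonest, so in this trace all 240 accesses miss. MIN uses the known future to keep the pages that will be reused soonest and misses on only a fraction of accesses.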

So instead of letting the OS manage memory, llm.sql takes over:

- Model parameters are stored in SQLite BLOB tables

- Computational logic is implemented as SQLite C extensions

- Memory management is handled explicitly, not by the OS

- Zero heavy dependencies. No PyTorch, no Transformers. Just Python, C, or C++
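The first two bullets could look roughly like this. llm.sql implements its operators as SQLite C extensions; this sketch instead uses Python's built-in `sqlite3` module and `Connection.create_function` as a stand-in so it runs without compiling anything. The `weights` schema, the per-row INT8 layout with a float scale, and the `dot_i8` function are all hypothetical, not llm.sql's actual storage format or API.

```python
import sqlite3
from array import array

def dot_i8(w_blob, scale, x_blob):
    """Dequantize one INT8 weight row (BLOB) and dot it with a float32 input vector (BLOB)."""
    w = array("b")
    w.frombytes(w_blob)
    x = array("f")
    x.frombytes(x_blob)
    return scale * sum(wi * xi for wi, xi in zip(w, x))

conn = sqlite3.connect(":memory:")
conn.create_function("dot_i8", 3, dot_i8)  # Python stand-in for a C extension function
conn.execute(
    "CREATE TABLE weights (layer INTEGER, row INTEGER, scale REAL, data BLOB, "
    "PRIMARY KEY (layer, row))"
)

# Toy 2x3 INT8 matrix for layer 0, one BLOB per row plus a per-row scale.
rows = [
    (0, 0, 0.5, array("b", [1, 2, 3]).tobytes()),
    (0, 1, 0.25, array("b", [-4, 0, 4]).tobytes()),
]
conn.executemany("INSERT INTO weights VALUES (?, ?, ?, ?)", rows)

# Matrix-vector product for layer 0: one output element per selected row.
x = array("f", [1.0, 2.0, 3.0]).tobytes()
y = [v for (v,) in conn.execute(
    "SELECT dot_i8(data, scale, ?) FROM weights WHERE layer = ? ORDER BY row", (x, 0)
)]
print(y)  # [7.0, 2.0]
```

Each output element becomes one row of a `SELECT`, and the `WHERE layer = ?` filter lets SQLite use the primary-key index to read only that layer's rows.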

This gives us explicit, deterministic control over what's in memory at each step of inference.
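One way to picture that control is a loader that walks layers in their known order and holds at most a fixed window of them in memory, prefetching the next layer while dropping the one just used. This is a hypothetical Python sketch: the `layers` table, `stream_layers`, and the `window` parameter are illustrative names, and it counts bytes held in Python objects rather than measuring process RSS.

```python
import sqlite3

def stream_layers(conn, num_layers, window=2):
    """Visit layers in their known order, keeping at most `window` layer blobs resident.

    Returns (peak_resident_bytes, total_bytes_read).
    """
    resident = {}  # layer index -> blob currently held in memory
    peak = total = 0
    for i in range(num_layers):
        # Load the current layer and prefetch ahead while the window allows it.
        for j in range(i, min(i + window, num_layers)):
            if j not in resident:
                (blob,) = conn.execute(
                    "SELECT data FROM layers WHERE id = ?", (j,)
                ).fetchone()
                resident[j] = blob
                total += len(blob)
        peak = max(peak, sum(len(b) for b in resident.values()))
        # ... compute with resident[i] here ...
        del resident[i]  # layer i is not needed again this pass: release it now
    return peak, total

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE layers (id INTEGER PRIMARY KEY, data BLOB)")
layer_bytes, num_layers = 100_000, 24  # hypothetical sizes
conn.executemany(
    "INSERT INTO layers VALUES (?, ?)",
    [(i, bytes(layer_bytes)) for i in range(num_layers)],
)
peak, total = stream_layers(conn, num_layers)
print(peak, total)  # 200000 2400000
```

With a window of 2, peak residency stays at two layers however many layers the model has, while each layer is still read exactly once per pass.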

Results:

Running Qwen2.5-0.5B-INT8 (~640MB model) with a peak RSS of ~210MB and a throughput of 7.40 tokens/s.
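A quick back-of-the-envelope check of these numbers, in Python. The last figure is a rough lower bound resting on two assumptions the post does not state: a dense forward pass reads every weight once per token, and peak RSS caps how many weights can stay resident between tokens.

```python
model_mb = 640      # approximate model size (from the post)
peak_rss_mb = 210   # reported peak RSS (from the post)
tok_per_s = 7.40    # reported throughput (from the post)

ms_per_token = 1000 / tok_per_s
rss_fraction = peak_rss_mb / model_mb

# Assumption: every weight is used once per token and at most peak_rss_mb of it
# stays resident, so at least the remainder is (re)loaded into the process for
# each token, whether from disk or from the OS page cache.
min_reload_mb_per_token = model_mb - peak_rss_mb
min_reload_mb_per_s = min_reload_mb_per_token * tok_per_s

print(f"{ms_per_token:.1f} ms per token")                # 135.1 ms per token
print(f"peak RSS is {rss_fraction:.0%} of model size")   # peak RSS is 33% of model size
print(f">= {min_reload_mb_per_token} MB reloaded per token, ~{min_reload_mb_per_s:.0f} MB/s")
# >= 430 MB reloaded per token, ~3182 MB/s
```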

Alpha version is available on GitHub: https://github.com/xuxianghong12/llm.sql

I'm the developer, happy to answer any technical questions about the design and implementation.

benlimanto 69 days ago

This is a good example. How far can you push it on edge devices? Any use cases like a Raspberry Pi?

aldielshala 69 days ago

Haven't tested on a Pi yet; llm.sql is still in alpha and focused on validating that SQLite can actually work for LLM inference and on profiling memory usage. That said, a 210MB peak RSS should fit comfortably on a Pi. In theory, any device that runs SQLite (which is almost every device) could run llm.sql. Planning to benchmark across different hardware as the project matures.