How the new Raspberry Pi AI Hat supercharges LLMs at the edge (blog.novusteck.com)
LLM inference is bottlenecked mainly by RAM bandwidth and by how much RAM you have. Generating each token requires reading the whole model, pulling it piece by piece from RAM to the CPU, where some relatively small calculations are applied.
Having a separate NPU like this connected via PCIe makes LLMs much slower, since you're bottlenecked by a PCIe 3.0 x1 connection instead of your full memory bandwidth.
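A rough back-of-the-envelope sketch of that argument, in Python. It assumes decoding is purely bandwidth-bound and that every weight crosses the link once per token, so it only gives an upper bound. The PCIe figure follows from the 8 GT/s line rate and 128b/130b encoding; the LPDDR4X figure and the 1 GB model size are assumptions picked for illustration, not measurements, and the function name is made up. If the accelerator keeps the weights in its own onboard memory, the link stops being the limit and this estimate does not apply.

```python
def max_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on decode speed if each token streams all weights once."""
    return bandwidth_gb_s / model_size_gb


# PCIe 3.0 x1: 8 GT/s on one lane with 128b/130b encoding,
# before any protocol overhead.
PCIE3_X1_GB_S = 8e9 * (128 / 130) / 8 / 1e9

# Assumption for illustration: a 32-bit LPDDR4X bus at 4267 MT/s
# (theoretical peak, real throughput will be lower).
LPDDR4X_GB_S = 4267e6 * 4 / 1e9

# Hypothetical model: about 1 GB of quantized weights.
MODEL_GB = 1.0

for name, bandwidth in [("PCIe 3.0 x1", PCIE3_X1_GB_S), ("LPDDR4X", LPDDR4X_GB_S)]:
    bound = max_tokens_per_second(bandwidth, MODEL_GB)
    print(f"{name}: {bandwidth:.2f} GB/s, at most {bound:.1f} tokens/s")
```

The ratio between the two bounds is just the ratio of the bandwidths, so it holds for any model size as long as the model fits in memory.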
But this article is poor. The later part in particular, which lists the benefits of the AI accelerator, reads like it was written by ChatGPT: it has a formal tone, it is wordy, and it repeats basic facts already covered in the article.
"edge" - it's like embedded, but with 5 layers of abstraction and abysmal performance.
hth.
Others mention the article itself looks AI generated. I didn't spot that, but it would explain some things.