How Taalas “prints” LLM onto a chip?(anuragk.com) |
So how does this Taalas chip work? Analog compute by putting the weights/multipliers on the cross-bars? Transistors in the sub-threshold region? Something else?
Since model size determines die size, and die size has absolute limits as well as a correlation with yield, eventually it hits physical and economic limits. There was also some discussion about ganging chips.
Also the defect rate grows as the chip grows. It seems like there might be room for innovation in fault tolerance here, compared to a CPU where a randomly flipped bit can be catastrophic.
*Framework sells laptops and parts such that in theory users can own a ~~ship~~ laptop of Theseus over time without having to buy a whole new laptop when something breaks or needs an upgrade.
If this happens, womp womp, recall the misaligned LLMs and learn from the mistake. It's part of running a hardware business as opposed to a software one.
I can't imagine they'd go for a full production run before at least testing a couple chips and finding issues.
Wow. Massively ignorant take. A modern GPU is an amazing feat of engineering, particularly in making computation more efficient (low power/high throughput).
Then proceeds to explain, wrongly, how inference is supposedly implemented and draws conclusions from there ...
I had written this post to have a higher-level understanding of traditional vs Taalas's inference. So it does abstract away a lot of things.
How disruptive dot com was depends on where you were.
Current open weight models < 20B are already capable of being useful. With even 1K tokens/second, they would change what it means to interact with them or for models to interact with the computer.
Taalas promises 10x higher throughput, while being 10x cheaper and using 10x less electricity.
Looks like a good value proposition.
In full precision, yes. But this Taalas chip uses a heavily quantized version (the article calls it "3/6 bit quant", probably similar to Q4_K_M). You don't even need a GPU to run that with reasonable performance; a CPU is fine.
Roof! Roof!
Exciting times.
dwata: Entirely Local Financial Data Extraction from Emails Using Ministral 3 3B with Ollama: https://youtu.be/LVT-jYlvM18
I think they used block quantization: one can enumerate all possible blocks for all (sorted) permutations of coefficients and, for each layer, place only those blocks that are needed there. For 3-bit coefficients and a block size of 4 coefficients, only 330 different blocks are needed.
Matrices in Llama 3.1 are 4096x4096, i.e. 16M coefficients. They can be compressed into only 330 blocks, if we assume that all permutations of coefficients occur, plus a network that applies the correct permutations to inputs and outputs.
Assuming that blocks are the most area-consuming part, we have a per-block budget of about 250 thousand transistors, or 30 thousand 2-input NAND gates per block.
250K transistors per block * 330 blocks / 16M coefficients = about 5 transistors per coefficient.
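To sanity-check the numbers in this estimate, here is a minimal Python sketch. It assumes a "sorted block" means a multiset of 4 values drawn from the 8 levels of a 3-bit coefficient, which is my reading of the comment and not something Taalas has confirmed; the variable names are made up for illustration.

```python
from itertools import combinations_with_replacement
from math import comb

BITS = 3         # bits per coefficient, as in the comment
BLOCK_SIZE = 4   # coefficients per block, as in the comment

levels = range(2 ** BITS)  # the 8 values a 3-bit coefficient can take

# Assumption: a "sorted block" is a multiset of BLOCK_SIZE values; the order
# inside a block is handled by permuting inputs/outputs, not by the block.
sorted_blocks = list(combinations_with_replacement(levels, BLOCK_SIZE))
num_blocks = len(sorted_blocks)

# Closed form: multisets of size k drawn from n items = C(n + k - 1, k).
assert num_blocks == comb(len(levels) + BLOCK_SIZE - 1, BLOCK_SIZE)

# Transistor budget per coefficient, using the comment's own figures.
transistors_per_block = 250_000
coefficients = 4096 * 4096
transistors_per_coefficient = transistors_per_block * num_blocks / coefficients

print(num_blocks)                             # 330
print(round(transistors_per_coefficient, 1))  # 4.9
```

So the 330 figure matches C(11, 4), and the budget works out to roughly 5 transistors per coefficient for one 4096x4096 matrix.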
Looks very, very doable.
It does look doable even for FP4 - these are 3-bit coefficients in disguise.
https://kilthub.cmu.edu/articles/thesis/Modern_Gate_Array_De...
And they are likely doing something similar to put their LLMs in silicon. I would believe a 10x improvement in electricity use along with it being much faster.
The idea is that you can create a sea of generalized standard cells and it makes for a gate array at the manufacturing layer. This was also done 20 or so years ago, it was called a "structured ASIC".
I'd be curious to see if they use the LUT design of traditional structured ASICs or figured out what I did: you can use standard cells to do the same thing and use regular tools/PDKs to make it.
(I have my guesses as to what that is, but I admittedly don't know enough about that particular part of the field to give anything but a guess).
Other than the obvious costs (but Taalas seems to be bringing back the structured ASIC era so costs shouldn't be that high [1]), I'm curious why this isn't getting much attention from larger companies. Of course, this wouldn't be useful for training models but as the models further improve, I can totally see this inside fully local + ultrafast + ultra efficient processors.
Imagine a slot on your computer where you physically pop out and replace the chip with different models, sort of like a Nintendo DS.
Models would be available as USB plug-in devices. A dense < 20B model may be the best assistant we need for personal use. It is like graphic cards again.
I hope lots of vendors will take note. Open weight models are abundant now. Even at a few thousand tokens/second, low buying cost and low operating cost, this is massive.
For dense LLMs, like llama-3.1-8B, you profit a lot from having all the weights available close to the actual multiply-accumulate hardware.
With MoE, it is rather like a memory lookup. Instead of a 1:1 pairing of MACs to stored weights, you suddenly are forced to have a large memory block next to a small MAC block. And once this mismatch becomes large enough, there is a huge gain by using a highly optimized memory process for the memory instead of mask ROM.
At that point we are back to a chiplet approach...
They use Optical Circuit Switches, operating via MEMS mirrors, to create highly reconfigurable, high-bandwidth 3D torus topologies. The OCS fabric allows 4,096 chips to be connected in a single pod, with the ability to dynamically rewire the cluster to match the communication patterns of specific MoE models.
The 3D torus connects 64-chip cubes with 6 neighbors each. TPUv4 also contains 2 SparseCores which specialize in handling high-bandwidth, non-contiguous memory accesses.
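For anyone who wants to see what "6 neighbors each" means in a torus, here is a small Python sketch. The 4x4x4 cube shape and the 16x16x16 pod shape are my assumptions, chosen only because they multiply out to the 64 and 4,096 figures above; the real fabric can be rewired into other shapes.

```python
def torus_neighbors(coord, dims):
    """Return the 6 wraparound neighbors of `coord` in a 3D torus of shape `dims`."""
    neighbors = []
    for axis in range(3):
        for step in (-1, 1):
            n = list(coord)
            # Modulo gives the wraparound link that turns a mesh into a torus.
            n[axis] = (n[axis] + step) % dims[axis]
            neighbors.append(tuple(n))
    return neighbors

CUBE = (4, 4, 4)     # assumed cube shape: 4 * 4 * 4 = 64 chips
POD = (16, 16, 16)   # assumed pod shape: 16 ** 3 = 4096 chips

chips_per_cube = CUBE[0] * CUBE[1] * CUBE[2]
chips_per_pod = POD[0] * POD[1] * POD[2]
cubes_per_pod = chips_per_pod // chips_per_cube

corner = torus_neighbors((0, 0, 0), POD)
print(cubes_per_pod)  # 64
print(corner)         # includes (15, 0, 0): the corner chip wraps around
```

Even a "corner" chip has six distinct neighbors, because each axis wraps around to the far side.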
Of course this is a datacenter-level system, not something on a chip for your PC, but I just want to express the scale here.
*ed: SpareCores to SparseCores
I feel printing ASIC is the main block here.
Another commenter mentioned how we keep cycling between local and server-based compute/storage as the dominant approach, and the cycle itself seems to be almost a law of nature. Nonetheless, regardless of where we're currently at in the cycle, there will always be both large and small players who want everything on-prem as much as possible.
I didn't explore the actual manufacturing process.
From some announcements 2 years ago, it seems like they missed their initial schedule by a year, if that's indicative of anything.
For their hardware to make sense a couple of things would need to be true: 1. A model is good enough for a given usecase that there is no need to update/change it for 3-5 years. Note they need to redo their HW-Pipeline if even the weights change. 2. This application is also highly latency-sensitive and benefits from power efficiency. 3. That application is large enough in scale to warrant doing all this instead of running on last-gen hardware.
Maybe some edge-computing and non-civilian use-cases might fit that, but given the lifespan of models, I wonder if most companies wouldn't consider something like this too high-risk.
But maybe some non-text applications, like TTS, audio/video gen, might actually be a good fit.
Llama 3.1 is about 2 years old at this point. Taking two months to convert a model that only updates every 2 years is very fast.
This doesn't sound remotely possible, but I am here to be convinced.
Except they say it's fully digital, so not an analog multiplier
If the chip is designed as the article says, they should be able to do 1 token per clock cycle...
And whilst I'm sure the propagation time is long through all that logic, it should still be able to do tens of millions of tokens per second...
Planned obsolescence? /s
Jokes aside, they can make the "LLM chip" removable. I know almost nothing is replaceable in MacBooks, but this could be an exception.
"Large Parameter Set Computation Accelerator Using Memory with Parameter Encoding" [2]
"Mask Programmable ROM Using Shared Connections" [3]
The "single transistor multiply" could be multiplication by routing, not arithmetic. Patent [2] describes an accelerator where, if weights are 4-bit (16 possible values), you pre-compute all 16 products (input x each possible value) with a shared multiplier bank, then use a hardwired mesh to route the correct result to each weight's location. The abstract says it directly: multiplier circuits produce a set of outputs, readable cells store addresses associated with parameter values, and a selection circuit picks the right output. The per-weight "readable cell" would then just be an access transistor that passes through the right pre-computed product. If that reading is correct, it's consistent with the CEO telling EE Times compute is "fully digital" [4], and explains why 4-bit matters so much: 16 multipliers to broadcast is tractable, 256 (8-bit) is not.
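A toy Python sketch of that "multiplication by routing" reading, in case it helps. This is only an illustration of the idea as described in the abstract, not Taalas's actual design; the function name, the signed 4-bit value table and the example codes are all hypothetical.

```python
def routed_matvec(weight_codes, x, levels):
    """Compute y = W @ x where W[i][j] = levels[weight_codes[i][j]].

    Each input is multiplied once per possible weight value (the shared
    multiplier bank); every weight position then only selects one product.
    """
    # Shared multiplier bank: len(levels) products per input element.
    products = [[xj * level for level in levels] for xj in x]
    y = []
    for row in weight_codes:
        # "Routing": the stored code picks a pre-computed product, so no
        # multiplication happens at the weight's location.
        y.append(sum(products[j][code] for j, code in enumerate(row)))
    return y

LEVELS = list(range(-8, 8))             # assumed signed 4-bit table, 16 entries
weight_codes = [[0, 15, 8], [9, 9, 3]]  # hypothetical codes indexing LEVELS
x = [2, -1, 5]

routed = routed_matvec(weight_codes, x, LEVELS)
reference = [sum(LEVELS[c] * xj for c, xj in zip(row, x)) for row in weight_codes]
print(routed)  # [-23, -24]
```

The number of real multiplies is 16 per input element regardless of how many rows the matrix has, which is why the bit width (16 vs 256 candidate products) would dominate the cost.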
The same patent reportedly describes the connectivity mesh as configurable via top metal masks, referred to as "saving the model in the mask ROM of the system." If so, the base die is identical across models, with only top metal layers changing to encode weights-as-connectivity and dataflow schedule.
Patent [3] covers high-density multibit mask ROM using shared drain and gate connections with mask-programmable vias, possibly how they hit the density for 8B parameters on one 815 mm² die.
If roughly right, some testable predictions: performance very sensitive to quantization bitwidth; near-zero external memory bandwidth dependence; fine-tuning limited to what fits in the SRAM sidecar.
Caveat: the specific implementation details beyond the abstracts are based on Deep Research's analysis of the full patent texts, not my own reading, so could be off. But the abstracts and public descriptions line up well.
[1] https://www.nextplatform.com/2026/02/19/taalas-etches-ai-mod...
[2] https://patents.google.com/patent/WO2025147771A1/en
[3] https://patents.google.com/patent/WO2025217724A1/en
[4] https://www.eetimes.com/taalas-specializes-to-extremes-for-e...
The single transistor multiply is intriguing.
I'd assume they are layers of FMA operating in the log domain.
But everything tells me that would be too noisy and error prone to work.
On the other hand my mind is completely biased to the digital world.
If they stay in the log domain and use a resistor network for multiplication, and the transistor is just exponentiating for the addition that seems genuinely ingenious.
Mulling it over, actually the noise probably doesn't matter. It'll average to 0.
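For reference, the log-domain trick in this guess looks like the following in Python. It is a purely numerical sketch, assuming strictly positive values (signs would need separate handling), and says nothing about how an analog circuit would actually realize it.

```python
import math

def log_domain_dot(weights, inputs):
    """Dot product of positive values where each multiply is an add of logs."""
    total = 0.0
    for w, x in zip(weights, inputs):
        log_product = math.log(w) + math.log(x)  # multiply == add in the log domain
        total += math.exp(log_product)           # exponentiate before accumulating
    return total

weights = [0.5, 2.0, 4.0]
inputs = [8.0, 3.0, 0.25]

direct = sum(w * x for w, x in zip(weights, inputs))
via_logs = log_domain_dot(weights, inputs)
print(direct)  # 11.0
```

The multiply is cheap (an addition), but the accumulate step needs an exponentiation per term, which is the part the comment assigns to the transistor.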
It's essentially compute and memory baked together.
I don't know much about the area of research so can't tell if it's innovative but it does seem compelling!
However, [1] provides the following description: "Taalas’ density is also helped by an innovation which stores a 4-bit model parameter and does multiplication on a single transistor, Bajic said (he declined to give further details but confirmed that compute is still fully digital)."
[1] https://www.eetimes.com/taalas-specializes-to-extremes-for-e...
Some would call it a multi-gate transistor, whilst others would call it multiple transistors in a row...
Perhaps mask manufacturers?
I distrust the notion. The bar of "good enough" seems to be bolted to "like today's frontier models", and frontier model performance only ever goes up.
Unless someone finds a way to turn these things into a BIOS module.
There would be model size constraints and what quality they can achieve under those constraints.
Would be interesting if it didn't make sense to develop traditional video codecs anymore.
The current video<->latents networks (part of the generative AI model for video) don't optimize just for compression. And you probably wouldn't want variable size input in an actual video codec anyway.
(Of course excluding any cosmic rays / bit flips)?
I didn't see an editable temperature parameter on their chatjimmy demo site -- only a topK.
I can see two potential reasons:
1) Most of the big players seem convinced that AI is going to continue to improve at the rate it did in 2025; if their assumption is somehow correct, by the time any chip entered mass production it would be obsolete.
2) The business model of the big players is to sell expensive subscriptions, and train on and sell the data you give it. Chips that allow for relatively inexpensive offline AI aren't conducive to that.
Guess who acqui-hired Groq to push this into GPUs?
The name GPU has been an anachronism for a couple of years now.
Cloud-based AI providers (OpenAI, etc.) are today's AOL.
It’s for cloud-based servers.
And it produced fake headlines and summaries including the threat of lawsuits from involved person(s).
Apple usually waits until somebody else has refined a technology to "invent" it, but I guess they couldn't wait for this one.
Time is money and when you're competing with multiple companies with little margin for error you'll focus all your effort into releasing things quickly.
This chip is "only" a performance boost. It will unlock a lot of potential, but startups can't divide their attention like this. Big companies like Google are surely already investigating this avenue, but they might lack hardware expertise.
I would be shocked if Google isn’t working on this right now. They build their own TPUs, this is an extremely obvious direction from there.
(And there are plenty of interesting co-design questions that only the frontier labs can dabble with; Taalas is stuck working around architectural quirks like “top-8 MoE”, Google can just rework the architecture hyperparameters to whatever gets best results in silico.)
(Still compelling!)
With these speeds you can run it over USB2, though maybe power is limiting.
In fact, I was wondering whether robots of the future could have such slots, so they can use different models depending on the task they're given. Like a hardware MoE.
Is this accurate? I don't know enough about hardware, but perhaps someone could clarify: how hard would it be to reverse engineer this to "leak" the model weights? Is it even possible?
There are some labs that sell access to their models (mistral, cohere, etc) without having their models open. I could see a world where more companies can do this if this turns out to be a viable way. Even to end customers, if reverse engineering is deemed impossible. You could have a device that does most of the inference locally and only "call home" when stumped (think alexa with local processing for intent detection and cloud processing for the rest, but better).
I doubt it would scale linearly, but for home use 170 tokens/s at 2.5W would be cool; 17 tokens/s at 0.25W would be awesome.
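Under the linear-scaling assumption (which, as said, may well not hold), both operating points cost the same energy per token. A quick Python check, with helper names made up for illustration:

```python
import math

def joules_per_token(power_watts, tokens_per_second):
    """Energy per generated token, assuming constant power draw while generating."""
    return power_watts / tokens_per_second

home = joules_per_token(2.5, 170)   # the 2.5 W operating point
tiny = joules_per_token(0.25, 17)   # the 0.25 W operating point

# One watt-hour is 3600 joules.
tokens_per_watt_hour = 3600 / home

print(round(home * 1000, 1))        # 14.7 (millijoules per token)
print(round(tokens_per_watt_hour))  # 244800
```

That is roughly 15 mJ per token, or about a quarter of a million tokens per watt-hour, if the efficiency really did carry over to the smaller power budgets.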
On the other hand, this may be a step towards positronic brains (https://en.wikipedia.org/wiki/Positronic_brain)
Generally, you use an ASIC to perform a specific task. In this case, I think the takeaway is that the LLM functionality here is performance-sensitive, and has enough utility as-is to justify an ASIC.
I think burning the weights into the gates is kinda new.
("Weights to gates." "Weighted gates"? "Gated weights"?)
It’s also not that different than how TPUs work where they have special registers in their PEs for weights.
We transitioned from software on CPUs to fixed GPU hardware... But then we transitioned back to software running on GPUs! So there's no way you can say "of course this is the future".
To your point, it's neat tech, but the limitations are obvious since 'printing' only one LLM ensures further concentration of power. In other words, history repeats itself.
I don't expect it's like super commercially viable today, but for sure things need to trend to radically more efficient AI solutions.
I think the interesting point is the transition time. When is it ROI-positive to tape out a chip for your new model? There’s a bunch of fun infra to build to make this process cheaper/faster and I imagine MoE will bring some challenges.
Taalas of course builds base chips that are already closely tailored for a particular type of model. They aim to generate the final chips, with the model weights baked into ROMs, within two months after the weights become available. They hope that the hardware will be profitable for at least some customers, even if the model is only good enough for a year. Assuming they do get superior speed and energy efficiency, this may be a good idea.
[1] https://www.sciencedirect.com/science/article/pii/S138376212...
[2] https://arxiv.org/abs/2506.22772
You can synthesize a logic circuit that is as complex as it gets to have a certain accuracy.
Deep differentiable logic networks, in my experience, do not scale well for larger (more inputs) logic elements. One still has to apply logic optimization and synthesis afterwards. So why not synthesize one's own approximate circuit to the accuracy one desires?
EDIT: just in case, I define agent as inference unit with specific preloaded context, in this case, at this speed they don’t have to be async - they may run in sequence in multiple iterations.
(The chips also cost tens of thousands of dollars each)
I doubt anyone would have the skills, wallet, and tools to RE one of these and extract model weights to run them on other hardware. Maybe state actors like the Chinese government or similar could pull that off.
To be fair, 2.5kW does sound too much for a single 3x3cm chip, it would probably melt.
Yeah, though I suppose once we get properly 3d silicon I would not be surprised at power rating for that, 3cm^3 would be something to behold.
FPGAs don’t scale; if they did, all GPUs would’ve been replaced by FPGAs for graphics a long time ago.
You use an FPGA when spinning a custom ASIC doesn’t make financial sense and a generic processor such as a CPU or GPU is overkill.
Arguably the middle ground here are TPUs, just taking the most efficient parts of a “GPU” when it comes to these workloads but still relying on memory access in every step of the computation.
The reason no one is building large FPGAs is that there is no market for them.
If an H200-scale FPGA were viable, we would have one.
But give that time (e.g. microfluidics) - something interesting is that it would be extra hard to use all layers at once, but NN might be a good fit, imagining that computation will be sparse (subsets activating simultaneously)...
Will your comment age well? We'll see.
We might all be surprised if (somehow, ternary logic?) models come down drastically in size. It doesn't have to be the hardware getting more dense.
Also, offline access is still a necessity for many usecases. If you have something like an autocomplete feature that stops working when you're on the subway, the change in UX between offline and online makes the feature more disruptive than helpful.
The A18 iPhone chip has 15b transistors for the GPU and CPU; the Taalas ASIC has 53b transistors dedicated to inference alone. If it's anything like NPUs, almost all vendors will bypass the baked-in silicon to use GPU acceleration past a certain point. It makes much more sense to ship a CUDA-style flexible GPGPU architecture.
Dedicated inference ASICs are a dead end. You can't reprogram them, you can't finetune them, and they won't keep any of their resale value. Outside cruise missiles it's hard to imagine where such a disposable technology would be desirable.
Ever wondered why those stupid "they secretly nerfed the model!" myths persist? Why users report that "model got dumber", even if benchmarks stay consistent, even if you're on the inference side yourself and know with certainty that they are actually being served the same inference over the same exact weights on the same hardware quantized the same way?
Because user demands rise over time, always.
Users get a new flashy model, and it impresses them. It can do things the old model couldn't. Then they push it, and learn its limitations and quirks as they use it. And then it feels like it "got dumber" - because they got more aggressive about using it, got better at spotting all the ways it was always dumb in.
It's a treadmill, and you pretty much have to keep improving the models just to stay ahead of user expectations.
I have seen this with ChatGPT progression from 4o to 5.2 applied to the newest model. Old prompts stop working reliably, different hallucination modes etc.
If you baked one of these into a smart speaker that could call tools to control lights and play music, it will still be able to do that when Llama 4 or 5 or 6 comes out.
Edit: assuming model owners will let this happen, which they won't.
In the real world, there are talking refrigerators that don't need to know how to recite Shakespeare.
But sure, the next generation could be much smaller. It doesn't require battery cells, (much) heat management, or ruggedization, all of which put hard limits on how much you can miniaturise power banks.
But as you said, the next generations are very likely to shrink (especially with them saying they want to do top of the line models in 2 generations), and with architecture improvements it could probably get much smaller.
Nowadays, your average cellphone has more computing power than those behemoths.
I have a micro SD card with 256GB capacity, and I think they are up to 2TB. On a device the size of a fingernail.
The form factor should be anything but thumbdrive.
I think you completely miss the UX point here. In 1997 CRT screens were mainstream, LCD was in the early stage, phones had antennas. In 2007 an iPhone with an LCD touch screen changed the UX of computing forever. The tech that we see today is a precursor of technology that will dominate tomorrow. Today local inference is painful and expensive, and it consumes a lot of energy. NPUs/GPUs solve nothing here, and they will always be less effective than hardwired models - by design. So the only question is when the consumer performance expectation for open-weight models will cross the price curve of specialized chips. It may happen earlier than for generic NPUs.
AI being static weights is already challenged with the frequent model updates we already see - but may even be a relic once we find a new architecture.
And then it'll increasingly make sense to build such a chip into laptops, smartphones, wearables. Not for high-end tasks, but to drive the everyday bread-and-butter tasks.
[1] although security might be a big enough reason for upgrades to still be required
For a 2.5 kW server? I don't see it happening; your money and electricity are better spent on CUDA compute.
I don’t see any reason why this should not drop to 100-300W at peak with maybe 100W*h of daily usage on smartphones.