How Taalas “prints” LLM onto a chip?

How Taalas “prints” LLM onto a chip? (anuragk.com)

429 points by beAroundHere 131 days ago | 256 comments

thesz 130 days ago |

8B coefficients are packed into 53B transistors, about 6.6 transistors per coefficient. A two-input NAND gate takes 4 transistors and a register takes about the same. So one coefficient gets processed (multiplied by its input, with the result added to a sum) with less than two two-input NAND gates' worth of transistors.

I think they used block quantization: one can enumerate all possible blocks, treating each block as a sorted combination of coefficients, and for each layer place only the blocks that are needed there. For 3-bit coefficients and a block size of 4 coefficients, only 330 different blocks are needed.
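The 330 figure can be checked by counting multisets: a sorted block is 4 values drawn, with repetition, from the 8 values a 3-bit coefficient can take. A minimal sketch, assuming the ordering inside a block is handled elsewhere (for example by the permutation network described in this comment); the constant names are illustrative only:

```python
from itertools import combinations_with_replacement
from math import comb

LEVELS = 2 ** 3   # 3-bit coefficients -> 8 distinct values
BLOCK_SIZE = 4    # coefficients per block

# Each sorted block is a multiset of BLOCK_SIZE values chosen from LEVELS.
blocks = list(combinations_with_replacement(range(LEVELS), BLOCK_SIZE))

# Closed form for multisets: C(LEVELS + BLOCK_SIZE - 1, BLOCK_SIZE)
closed_form = comb(LEVELS + BLOCK_SIZE - 1, BLOCK_SIZE)

print(len(blocks), closed_form)
```

Both the enumeration and the closed form give the same count, so the number does not depend on how the blocks are listed.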

Matrices in Llama 3.1 are 4096x4096, i.e. 16M coefficients. They can be compressed into only 330 blocks, if we assume that all of the coefficients' permutations are present and that there is a network that correctly permutes inputs and outputs.

Assuming that blocks are the most area-consuming part, we have a per-block transistor budget of about 250 thousand transistors, or 30 thousand 2-input NAND gates per block.

250K transistors per block * 330 blocks / 16M coefficients = about 5 transistors per coefficient.
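The two ratios in this comment can be reproduced with a few lines of arithmetic. This is only a sketch of the back-of-the-envelope estimate: the 250K-transistor block budget is the comment's own assumption, and the variable names are made up for illustration.

```python
# Whole-chip ratio, using the figures quoted in the comment.
TRANSISTORS_TOTAL = 53e9
COEFFICIENTS_TOTAL = 8e9
per_coeff_chip = TRANSISTORS_TOTAL / COEFFICIENTS_TOTAL

# Per-matrix ratio under the block-quantization hypothesis.
BLOCK_BUDGET = 250_000          # transistors per block (assumed in the comment)
NUM_BLOCKS = 330                # distinct sorted blocks
MATRIX_COEFFS = 4096 * 4096     # one 4096x4096 matrix
per_coeff_matrix = BLOCK_BUDGET * NUM_BLOCKS / MATRIX_COEFFS

print(f"{per_coeff_chip:.2f} transistors/coefficient across the chip")
print(f"{per_coeff_matrix:.2f} transistors/coefficient per matrix")
```

The per-matrix figure lands just under 5, below the whole-chip ratio of roughly 6.6, which is what makes the estimate look feasible.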

Looks very, very doable.

It does look doable even for FP4 - these are 3-bit coefficients in disguise.

amelius 130 days ago | |

I'm looking forward to the model.toVHDL() method in PyTorch.

sowbug 130 days ago | | |

Ugh, quick, everyone start panic-buying FPGAs now.

p0u4a 130 days ago | | |

Pretty close to what you describe: https://github.com/fastmachinelearning/hls4ml

Simboo 130 days ago | | |

Deep Differentiable Logic Gate Networks

androiddrew 130 days ago | | |

Is this a thing?

cpldcpu 130 days ago | |

They mentioned that they are using strong quantization (iirc 3-bit) and that the model was degraded by that. Also, they don't have to use transistors to store the bits.

amelius 130 days ago | | |

I think they are talking about the transistors that apply the weights to the inputs.

mirekrusin 130 days ago | | |

gpt-oss is FP4. They're saying they'll next try a mid-size one (I'm guessing gpt-oss-20b) and then a large one (I'm guessing gpt-oss-120b), as their hardware is FP4-friendly.

cyanydeez 130 days ago | |

What's the theoretical full wafer-scale model they could produce?

kop316 130 days ago |

Ohh neat! A generalized version of this was the topic of my PhD dissertation:

https://kilthub.cmu.edu/articles/thesis/Modern_Gate_Array_De...

And they are likely doing something similar to put their LLMs in silicon. I would believe a 10x improvement in power efficiency along with it being much faster.

The idea is that you can create a sea of generalized standard cells and it makes for a gate array at the manufacturing layer. This was also done 20 or so years ago, it was called a "structured ASIC".

I'd be curious to see if they use the LUT design of traditional structured ASICs or figured out what I did: you can use standard cells to do the same thing and use regular tools/PDKs to make it.

fho 130 days ago | |

I think their "4-bit multiplier with a single transistor" bit is hinting at them using transistors in the sub-threshold regime.

kop316 130 days ago | | |

So something that you can do with PDKs is add your own custom standard cell and tell the EDA tools to use them. This is actually pretty smart, this way you can use most of the foundry cells (which have been extensively validated) and focus on things like this "magic multiplier", that you will have to manually validate. This also makes porting across tech nodes easier if you manage only a handful of custom cells versus a completely custom design.

(I have my guesses as to what that is, but I admittedly don't know enough about that particular part of the field to give anything but a guess).

Hello9999901 130 days ago |

This would be a very interesting future. I can imagine Gemma 5 Mini running locally on hardware, or a hard-coded "AI core" like an ALU or media processor that supports particular encoding mechanisms like H.264, AV1, etc.

Other than the obvious costs (but Taalas seems to be bringing back the structured ASIC era so costs shouldn't be that high [1]), I'm curious why this isn't getting much attention from larger companies. Of course, this wouldn't be useful for training models but as the models further improve, I can totally see this inside fully local + ultrafast + ultra efficient processors.

[1] https://en.wikipedia.org/wiki/Structured_ASIC_platform

owenpalmer 130 days ago |

> Kinda like a CD-ROM/Game cartridge, or a printed book, it only holds one model and cannot be rewritten.

Imagine a slot on your computer where you physically pop out and replace the chip with different models, sort of like a Nintendo DS.

bsenftner 130 days ago |

I'm surprised people are surprised. Of course this is possible, and of course this is the future. This has been demonstrated already: why do you think we even have GPUs at all?! Because we did this exact same transition from running in software to largely running in hardware for all 2D and 3D Computer Graphics. And these LLMs are practically the same math, it's all just obvious and inevitable, if you're paying attention to what we have, what we do to have what we have.

brainless 130 days ago |

If we can print ASICs at low cost, this will change how we work with models.

Models would be available as USB plug-in devices. A dense < 20B model may be the best assistant we need for personal use. It is like graphics cards again.

I hope lots of vendors will take note. Open weight models are abundant now. Even at a few thousand tokens/second, low buying cost and low operating cost, this is massive.

cpldcpu 130 days ago |

I wonder how well this works with MoE architectures?

For dense LLMs, like llama-3.1-8B, you profit a lot from having all the weights available close to the actual multiply-accumulate hardware.

With MoE, it is rather like a memory lookup. Instead of a 1:1 pairing of MACs to stored weights, you suddenly are forced to have a large memory block next to a small MAC block. And once this mismatch becomes large enough, there is a huge gain by using a highly optimized memory process for the memory instead of mask ROM.

At that point we are back to a chiplet approach...
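The mismatch can be put in rough numbers. A minimal sketch, assuming a simple top-k routed MoE in which every expert is the same size; the 8-expert, top-2 configuration is a hypothetical example and not a description of any particular model:

```python
def expert_utilization(num_experts: int, top_k: int) -> float:
    """Fraction of hardwired expert weights that do useful work per token."""
    if not 0 < top_k <= num_experts:
        raise ValueError("top_k must be between 1 and num_experts")
    return top_k / num_experts


def weights_per_active_mac(num_experts: int, top_k: int) -> float:
    """Stored expert weights per multiply-accumulate actually used per token."""
    if not 0 < top_k <= num_experts:
        raise ValueError("top_k must be between 1 and num_experts")
    return num_experts / top_k


# Dense model: every weight is used for every token.
print(expert_utilization(1, 1))
# Hypothetical MoE with 8 experts and top-2 routing.
print(expert_utilization(8, 2), weights_per_active_mac(8, 2))
```

As the ratio of stored weights to active MACs grows, pairing each weight with its own compute makes less sense and a dense memory array becomes the more natural fit.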

pests 130 days ago | |

For comparison I wanted to write on how Google handles MoE archs with its TPUv4 arch.

They use Optical Circuit Switches, operating via MEMS mirrors, to create highly reconfigurable, high-bandwidth 3D torus topologies. The OCS fabric allows 4,096 chips to be connected in a single pod, with the ability to dynamically rewire the cluster to match the communication patterns of specific MoE models.

The 3D torus connects 64-chip cubes with 6 neighbors each. TPUv4 also contains 2 SparseCores which specialize in handling high-bandwidth, non-contiguous memory accesses.
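To make the "6 neighbors" concrete, here is a small sketch of neighbor addressing in a 3D torus with wraparound links. The 16x16x16 shape is just one way to arrange 4,096 chips and is an assumption for illustration; the function name is hypothetical.

```python
def torus_neighbors(coord, dims):
    """Return the 6 wraparound neighbors of a node in a 3D torus."""
    neighbors = []
    for axis in range(3):
        for step in (-1, 1):
            n = list(coord)
            # Modulo gives the wraparound link at the edges of the torus.
            n[axis] = (n[axis] + step) % dims[axis]
            neighbors.append(tuple(n))
    return neighbors


CUBE = (4, 4, 4)      # 4 * 4 * 4 = 64 chips per cube
POD = (16, 16, 16)    # 16 * 16 * 16 = 4,096 chips (one possible shape)

print(torus_neighbors((0, 0, 0), POD))
```

A chip at the corner of the grid still has six neighbors because the links wrap around, which is what distinguishes a torus from a plain mesh.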

Of course this is a DC level system, not something on a chip for your pc, but just want to express the scale here.

*ed: SpareCores to SparseCores

brainless 130 days ago | |

If each of the expert models were etched in silicon, it would still have a massive speed boost, wouldn't it?

I feel printing ASICs is the main blocker here.

ramshanker 130 days ago |

I can imagine this becoming a mainstream PCIe expansion card. Like back in the day, when we had a separate graphics card, audio card, etc. Now an AI card. So to upgrade the PC to the latest model, we could buy a new card, load up the drivers and boom, intelligence upgrade for the PC. This would be so cool.

slfnflctd 130 days ago | |

This is exactly what's going to happen. Assuming no civilization-crippling or Great Filter events, anyway. At this point I fail to see how it could go any other way. The path has already been traveled, and governments (along with many other large organizations) will demand this functionality for themselves, which will eventually have a consumer market as well.

Another commenter mentioned how we keep cycling between local and server-based compute/storage as the dominant approach, and the cycle itself seems to be almost a law of nature. Nonetheless, regardless of where we're currently at in the cycle, there will always be both large and small players who want everything on-prem as much as possible.

odyssey7 130 days ago |

Quick! We have to approve all the nuclear plants for AI now, before efficiency from optimization shows up

rustybolt 130 days ago |

Note that this doesn't answer the question in the title, it merely asks it.

beAroundHere 130 days ago | |

Yeah, I had written the blog to wrap my head around the idea of 'how would someone even be printing Weights on a chip?' 'Or how to even start to think in that direction?'.

I didn't explore the actual manufacturing process.

pixelmelt 130 days ago | | |

You should add an RSS feed so I can follow it!

alcasa 130 days ago | |

Frankly the most critical question is whether they can really take shortcuts on DV etc., which are the main reasons nobody else tapes out new chips for every model. Note that their current architecture only allows some LoRA-adapter-based fine-tuning; even a model with an updated cutoff date would require new masks etc. Which is kind of insane, but props to them if they can make it work.

From some announcements 2 years ago, it seems like they missed their initial schedule by a year, if that's indicative of anything.

For their hardware to make sense a couple of things would need to be true: 1. A model is good enough for a given usecase that there is no need to update/change it for 3-5 years. Note they need to redo their HW-Pipeline if even the weights change. 2. This application is also highly latency-sensitive and benefits from power efficiency. 3. That application is large enough in scale to warrant doing all this instead of running on last-gen hardware.

Maybe some edge-computing and non-civilian use-cases might fit that, but given the lifespan of models, I wonder if most companies wouldn't consider something like this too high-risk.

But maybe some non-text applications, like TTS, audio/video gen, might actually be a good fit.

K0balt 130 days ago | | |

TTS, speech recognition, ocr/document parsing, Vision-language-action models, vehicle control, things like that do seem to be the ideal applications. Latency constraints limit the utility of larger models in many applications.

qoez 130 days ago |

> It took them two months, to develop chip for Llama 3.1 8B. In the AI world where one week is a year, it's super slow. But in a world of custom chips, this is supposed to be insanely fast.

Llama 3.1 is like 2 years old at this point. Taking two months to convert a model that only updates every 2 years is very fast.

ac29 130 days ago | |

2 months of design work is fast, but how much time does fabrication, packaging, testing add? And that just gets you chips, whatever products incorporate them also need to be built and tested.

wmf 130 days ago | |

It only looks that way because Llama failed. Good models like Qwen are shipping every 6 months.

peteforde 130 days ago |

I would appreciate some clarification on the "store 4 bits of data with one transistor" part.

This doesn't sound remotely possible, but I am here to be convinced.

ajb 130 days ago | |

They declined to say: https://www.eetimes.com/taalas-specializes-to-extremes-for-e...

Except they say it's fully digital, so not an analog multiplier

tyingq 130 days ago | | |

Fully digital, no analog, 4 bits fit into one transistor. Hmm. In one clock cycle?

briansm 130 days ago |

I wonder if you could use the same technique (RAM models as ROM) for something like Whisper Speech-to-text, where the models are much smaller (around a Gigabyte) for a super-efficient single-chip speech recognition solution with tons of context knowledge.

JLO64 130 days ago | |

Right now I have to wait 10 minutes at a time for the 2+ hour-long transcriptions I've uploaded to Voxtral to process. The speed-up here could be immense and worthwhile to so many customers of these products.

londons_explore 130 days ago |

So why only 30,000 tokens per second?

If the chip is designed as the article says, they should be able to do 1 token per clock cycle...

And whilst I'm sure the propagation time is long through all that logic, it should still be able to do tens of millions of tokens per second...

wmf 130 days ago | |

You still need to do a forward pass per token. With massive batching and full pipelining you might be able to break the dependencies and output one token per cycle but clearly they aren't doing that.
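A toy throughput model illustrates the dependency. It assumes a pipeline in which each token takes a fixed number of cycles to traverse, and in which a sequence cannot start its next token until the previous one has come out the other end. The clock rate and depth below are hypothetical and are not Taalas figures.

```python
def tokens_per_second(clock_hz: int, pipeline_depth: int, sequences_in_flight: int) -> float:
    """Aggregate tokens/s for an autoregressive pipeline.

    Each sequence occupies at most one pipeline slot at a time, so at most
    `pipeline_depth` sequences can keep the hardware fully busy.
    """
    busy_slots = min(sequences_in_flight, pipeline_depth)
    return clock_hz * busy_slots / pipeline_depth


CLOCK_HZ = 1_000_000_000   # hypothetical 1 GHz clock
DEPTH = 1_000              # hypothetical cycles for one forward pass

print(tokens_per_second(CLOCK_HZ, DEPTH, 1))      # single user: latency bound
print(tokens_per_second(CLOCK_HZ, DEPTH, 1_000))  # pipeline full: one token per cycle
```

In this model, one token per cycle is only reachable when as many independent sequences are in flight as there are pipeline slots; a single user sees the clock rate divided by the depth.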

amelius 130 days ago | | |

More aggressive pipelining will probably be the next step.

menaerus 130 days ago | |

Reading from and to memory alone takes much more than a clock cycle.

kioku 130 days ago |

I’m just wondering how this translates to computer manufacturers like Apple. Could we have these kinds of chips built directly into computers within three years? With insanely fast, local on-demand performance comparable to today’s models?

xattt 130 days ago | |

Is it possible to supplement the model with a diff for updates on modular memory, or would that severely impact perf?

mips_avatar 130 days ago | | |

I imagine you could do something like a LoRA.
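For readers unfamiliar with the idea, a LoRA-style update leaves the large weight matrix W untouched and adds a low-rank correction B(Ax). A minimal pure-Python sketch follows; the assumption that W would be the hardwired part while A and B sit in a small writable memory is an illustration, not something confirmed in this thread, and the function names are hypothetical.

```python
def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lora_forward(w_frozen, a, b, x):
    """Compute y = W x + B (A x).

    w_frozen: d_out x d_in, never modified (the part that could be fixed in silicon)
    a:        r x d_in,     small and writable
    b:        d_out x r,    small and writable
    """
    base = matvec(w_frozen, x)
    delta = matvec(b, matvec(a, x))
    return [p + q for p, q in zip(base, delta)]


W = [[1, 0], [0, 1]]   # frozen 2x2 weights
A = [[1, 1]]           # rank-1 down-projection
B = [[2], [0]]         # rank-1 up-projection
print(lora_forward(W, A, B, [3, 4]))
```

The adapter only needs r * (d_in + d_out) extra parameters per matrix, which is why it is a plausible "diff" on top of fixed weights.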

baq 130 days ago | | |

this design at 7 transistors per weight is 99.9% burnt in the silicon forever.

arisAlexis 130 days ago | |

and run an outdated model for 3 years while progress is exponential? what is the point of that

ivan_gammel 130 days ago | | |

When output is good enough, other considerations become more important. Most people on this planet cannot afford even an AI subscription, and cost of tokens is prohibitive to many low margin businesses. Privacy and personalization matter too, data sovereignty is a hot topic. Besides, we already see how focus has shifted to orchestration, which can be done on CPU and is cheap - software optimizations may compensate hardware deficiencies, so it’s not going to be frozen. I think the market for local hardware inference is bigger than for clouds, and it’s going to repeat Android vs iOS story.

padjo 130 days ago | | |

Is progress still exponential? Feels like it's flattening to me. It is hard to quantify, but if you could get Opus 4.2 to work at the speed of the Taalas demo and run locally, I feel like I'd get an awful lot done.

sowbug 130 days ago | | |

Bake in a Genius Bar employee, trained on your model's hardware, whose entire reason for existence is to fix your computer when it breaks. If it takes an extra 50 cents of die space but saves Apple a dollar of support costs over the lifetime of the device, it's worth it.

r0b05 130 days ago | | |

Yeah, the space moves so quickly that I would not want to couple the hardware with a model that might be outdated in a month. There are some interesting talking points but a general purpose programmable asic makes more sense to me.

RobertDeNiro 130 days ago | | |

It won’t stay exponential forever.

selcuka 130 days ago | | |

> what is the point of that

Planned obsolescence? /s

Jokes aside, they can make the "LLM chip" removable. I know almost nothing is replaceable in MacBooks, but this could be an exception.

punnerud 130 days ago |

Could we all get bigger FPGAs and load the model onto it using the same technique?

generuso 130 days ago | |

You could [1], but it is not very cheap -- the 32GB development board with the FPGA used in the article used to cost about $16K.

[1] https://arxiv.org/abs/2401.03868

fercircularbuf 130 days ago | |

I thought about this exact question yesterday. Curious to know why we couldn't, if it isn't feasible. Would allow one to upgrade to the next model without fabricating all new hardware.

wmf 130 days ago | |

FPGAs have really low density so that would be ridiculously inefficient, probably requiring ~100 FPGAs to load the model. You'd be better off with Groq.

menaerus 130 days ago | | |

Not sure what you're on but I think what you said is incorrect. You can use a high-density HBM-enabled FPGA with (LP)DDR5 and a sufficient number of logic elements to implement the inference. The reason we don't see it in action is most likely that such FPGAs are insanely expensive and not as available off-the-shelf as GPUs are.

sowbug 130 days ago | |

FPGAs aren't very power-efficient. You could do it, but the numbers wouldn't add up for anything but prototyping.

abrichr 130 days ago |

ChatGPT Deep Research dug through Taalas' WIPO patent filings and public reporting to piece together a hypothesis. Next Platform notes at least 14 patents filed [1]. The two most relevant:

"Large Parameter Set Computation Accelerator Using Memory with Parameter Encoding" [2]

"Mask Programmable ROM Using Shared Connections" [3]

The "single transistor multiply" could be multiplication by routing, not arithmetic. Patent [2] describes an accelerator where, if weights are 4-bit (16 possible values), you pre-compute all 16 products (input x each possible value) with a shared multiplier bank, then use a hardwired mesh to route the correct result to each weight's location. The abstract says it directly: multiplier circuits produce a set of outputs, readable cells store addresses associated with parameter values, and a selection circuit picks the right output. The per-weight "readable cell" would then just be an access transistor that passes through the right pre-computed product. If that reading is correct, it's consistent with the CEO telling EE Times compute is "fully digital" [4], and explains why 4-bit matters so much: 16 multipliers to broadcast is tractable, 256 (8-bit) is not.
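The "multiplication by routing" reading can be sketched in software. This is only an illustration of the hypothesis above, not the patented circuit: the signed 4-bit encoding and all names are assumptions. Each input is multiplied by all 16 possible weight values once, and every weight then merely selects one of those precomputed products.

```python
# 16 signed values for a 4-bit weight; the actual encoding is an assumption.
LEVELS = list(range(-8, 8))


def routed_matvec(weight_codes, inputs):
    """weight_codes[j][i] is an index into LEVELS for output j and input i."""
    # Shared multiplier bank: 16 products per input, computed once.
    product_bank = [[x * level for level in LEVELS] for x in inputs]
    # Per-weight "routing": select a precomputed product, then accumulate.
    return [
        sum(product_bank[i][code] for i, code in enumerate(row))
        for row in weight_codes
    ]


def direct_matvec(weight_codes, inputs):
    """Reference result using an ordinary multiply per weight."""
    return [
        sum(LEVELS[code] * x for code, x in zip(row, inputs))
        for row in weight_codes
    ]


codes = [[0, 15, 8], [3, 3, 12]]
xs = [2, -1, 5]
print(routed_matvec(codes, xs), direct_matvec(codes, xs))
```

The number of real multiplications depends on the number of inputs times 16, not on the number of weights, which is why the scheme would favor 4-bit weights over 8-bit ones (256 products per input).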

The same patent reportedly describes the connectivity mesh as configurable via top metal masks, referred to as "saving the model in the mask ROM of the system." If so, the base die is identical across models, with only top metal layers changing to encode weights-as-connectivity and dataflow schedule.

Patent [3] covers high-density multibit mask ROM using shared drain and gate connections with mask-programmable vias, possibly how they hit the density for 8B parameters on one 815 mm² die.

If roughly right, some testable predictions: performance very sensitive to quantization bitwidth; near-zero external memory bandwidth dependence; fine-tuning limited to what fits in the SRAM sidecar.

Caveat: the specific implementation details beyond the abstracts are based on Deep Research's analysis of the full patent texts, not my own reading, so could be off. But the abstracts and public descriptions line up well.

[1] https://www.nextplatform.com/2026/02/19/taalas-etches-ai-mod...

[2] https://patents.google.com/patent/WO2025147771A1/en

[3] https://patents.google.com/patent/WO2025217724A1/en

[4] https://www.eetimes.com/taalas-specializes-to-extremes-for-e...

rustyhancock 130 days ago |

Edit: reading the below it looks like I'm quite wrong here but I've left the comment...

The single transistor multiply is intriguing.

I'd assume they are layers of FMA operating in the log domain.

But everything tells me that would be too noisy and error prone to work.

On the other hand my mind is completely biased to the digital world.

If they stay in the log domain and use a resistor network for multiplication, and the transistor is just exponentiating for the addition that seems genuinely ingenious.

Mulling it over, actually the noise probably doesn't matter. It'll average to 0.

It's essentially compute and memory baked together.

I don't know much about the area of research so can't tell if it's innovative but it does seem compelling!

generuso 130 days ago | |

The document referenced in the blog does not say anything about the single transistor multiply.

However, [1] provides the following description: "Taalas’ density is also helped by an innovation which stores a 4-bit model parameter and does multiplication on a single transistor, Bajic said (he declined to give further details but confirmed that compute is still fully digital)."

[1] https://www.eetimes.com/taalas-specializes-to-extremes-for-e...

londons_explore 130 days ago | | |

It'll be different gates on the transistor for the different bits, and you power only one set depending on which bit of the result you wish to calculate.

Some would call it a multi-gate transistor, whilst others would call it multiple transistors in a row...

rustyhancock 130 days ago | | |

That's much more informative, I think my original comment is quite off the mark then.

jsjdjrjdjdjrn 130 days ago | |

I'd expect this is analog multiplication with voltage levels being ADC'd out for the bits they want. If you think about it, it makes the whole thing very analog.

jsjdjrjdjdjrn 130 days ago | | |

Note: reading further down, my speculation is wrong.

m101 130 days ago |

So if we assume this is the future, the useful life of many semiconductors will fall substantially. What part of the semiconductor supply chain would have pricing power in a world of producing many more different designs?

Perhaps mask manufacturers?

ivan_gammel 130 days ago | |

It might be not that bad. “Good enough” open-weight models are almost there, the focus may shift to agentic workflows and effective prompting. The lifecycle of a model chip will be comparable to smartphones, getting longer and longer, with orchestration software being responsible for faster innovation cycles.

ACCount37 130 days ago | | |

"Good enough" open weights models were "almost there" since 2022.

I distrust the notion. The bar of "good enough" seems to be bolted to "like today's frontier models", and frontier model performance only ever goes up.

m101 130 days ago | | |

If you’re running at 17k tokens / s what is the point of multiple agents?

atentaten 130 days ago |

Does this mean computer boards will someday have one or more slots for an AI chip? Or peripheral devices containing AI models, which can be plugged into a computer's high-speed port?

sowbug 130 days ago | |

It doesn't even need to be high speed. A minimal chip would have four pins: VCC, GND, TX, and RX. Even one-dollar microcontrollers can handle megabit-speed serial connections, which is fast enough for LLM communication.
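A quick estimate supports this. The sketch below assumes roughly 4 bytes of text per token and standard 8N1 UART framing (10 bits on the wire per byte); the 17k tokens/s rate is the figure mentioned elsewhere in this thread, and the other numbers are assumptions.

```python
TOKENS_PER_SECOND = 17_000    # rate mentioned elsewhere in the thread
BYTES_PER_TOKEN = 4           # assumption: about 4 bytes of text per token
BITS_PER_BYTE_ON_WIRE = 10    # 8N1 framing: start bit + 8 data bits + stop bit

required_bps = TOKENS_PER_SECOND * BYTES_PER_TOKEN * BITS_PER_BYTE_ON_WIRE
print(f"{required_bps / 1e6:.2f} Mbit/s needed for the output stream")
```

Under these assumptions the output stream stays below 1 Mbit/s, so a megabit-class serial link would be enough for text in and text out.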

cyanydeez 130 days ago | |

Probably more like either a USB sidecar or a PCIe drop-in. I don't think they'll return to a world of dedicated coprocessors.

Unless someone finds a way to turn these things into a BIOS module.

coppsilgold 130 days ago |

How feasible would it be to integrate a neural video codec into the SoC/GPU silicon?

There would be model size constraints and what quality they can achieve under those constraints.

Would be interesting if it didn't make sense to develop traditional video codecs anymore.

The current video<->latents networks (part of the generative AI model for video) don't optimize just for compression. And you probably wouldn't want variable size input in an actual video codec anyway.

kinduff 130 days ago |

Very nice read, thank you for sharing this. It is so well written.

midnitewarrior 130 days ago |

If model makers adopt an LTS model with an extended EOL for certain model versions, these chips would make that very affordable.

albert_e 130 days ago |

Does this offer truly "deterministic" responses when temperature is set to zero?

(Of course excluding any cosmic rays / bit flips)?

I didn't see an editable temperature parameter on their chatjimmy demo site -- only a topK.
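On the sampling side, at least, a topK of 1 is equivalent to temperature zero: both reduce to picking the highest-scoring token. A minimal sketch of a generic top-k sampler follows; it is not the demo's actual implementation, and whether the hardware produces bit-identical logits on every run is a separate question that this does not address.

```python
import math
import random


def sample_token(logits, top_k, temperature, rng):
    """Pick a token id from the top_k highest logits."""
    # Rank token ids by logit, breaking ties by lower id so the order is stable.
    ranked = sorted(range(len(logits)), key=lambda i: (-logits[i], i))[:top_k]
    if temperature == 0 or top_k == 1:
        return ranked[0]  # greedy choice: no randomness involved
    best = logits[ranked[0]]
    weights = [math.exp((logits[i] - best) / temperature) for i in ranked]
    return rng.choices(ranked, weights=weights, k=1)[0]


logits = [0.1, 2.0, 1.5, -1.0]
print(sample_token(logits, top_k=1, temperature=1.0, rng=random.Random(0)))
print(sample_token(logits, top_k=3, temperature=0, rng=random.Random(1)))
```

With identical logits, both calls return the same token regardless of the random seed.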

TensorToad 130 days ago |

Super low latency inference might be helpful in applications like quant trading. However, in an era where a frontier model becomes outdated after 6 months, I wonder how useful it can be.

TensorToad 130 days ago | |

Also, quant trading probably cares more about embedding the content than about generating output tokens.

Archit3ch 130 days ago |

The next frontier is power efficiency.

So how does this Taalas chip work? Analog compute by putting the weights/multipliers on the cross-bars? Transistors in the sub-threshold region? Something else?

708145_ 130 days ago |

Is Taalas' approach scalable to larger models?

sowbug 130 days ago | |

The top comment on Friday's discussion does some math on die size. https://news.ycombinator.com/item?id=47086634

Since model size determines die size, and die size has absolute limits as well as a correlation with yield, eventually it hits physical and economic limits. There was also some discussion about ganging chips.

shwaj 130 days ago | |

From what I read here, the required chip size would scale linearly with the number of model weights. That alone puts a ceiling on the size of model.

Also the defect rate grows as the chip grows. It seems like there might be room for innovation in fault tolerance here, compared to a CPU where a randomly flipped bit can be catastrophic.
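Both effects can be put into a rough sketch. It assumes area scales strictly linearly from the 815 mm² die for the 8B model mentioned elsewhere in this thread, uses the common 26 mm x 33 mm single-exposure field as the die-size ceiling, and applies a simple Poisson yield model with a hypothetical defect density. The 70B model size is likewise just an example.

```python
import math

DIE_AREA_MM2 = 815             # 8B-parameter die size mentioned in the thread
PARAMS_BILLIONS = 8
RETICLE_LIMIT_MM2 = 26 * 33    # 858 mm^2 single-exposure field


def area_for(params_billions: float) -> float:
    """Silicon area if area scales linearly with weight count (assumption)."""
    return DIE_AREA_MM2 / PARAMS_BILLIONS * params_billions


def dies_needed(params_billions: float) -> int:
    """Minimum number of reticle-limited dies to hold the model."""
    return math.ceil(area_for(params_billions) / RETICLE_LIMIT_MM2)


def poisson_yield(area_mm2: float, defects_per_cm2: float) -> float:
    """Fraction of defect-free dies under a simple Poisson model."""
    return math.exp(-(area_mm2 / 100.0) * defects_per_cm2)


print(dies_needed(8), dies_needed(70))
print(poisson_yield(815, 0.1), poisson_yield(400, 0.1))  # 0.1/cm^2 is hypothetical
```

The Poisson model treats any single defect as fatal; the fault-tolerance idea in this comment amounts to relaxing that assumption for weight storage.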

konaraddi 130 days ago |

Imagine a Framework* laptop with these kinds of chips that could be swapped out as models get better over time

*Framework sells laptops and parts such that in theory users can own a ~~ship~~ laptop of Theseus over time without having to buy a whole new laptop when something breaks or needs upgrade.

dev1ycan 130 days ago |

Thank god, I hope this reduces prices of RAM and GPUs

jabedude 130 days ago |

Just me or does this seem incredibly frightening to anyone else? Imagine printing a misaligned LLM this way and never being able to update the HW to run a different (aligned) model.

Liftyee 130 days ago | |

It frightens me no more than the possibility of building a flawed airplane or a computer that overheats (looking at you, NVIDIA 12-pin) and "never being able to update the HW". Product recalls and redesigns exist for a reason.

If this happens, womp womp, recall the misaligned LLMs and learn from the mistake. It's part of running a hardware business as opposed to a software one.

I can't imagine they'd go for a full production run before at least testing a couple chips and finding issues.

sowbug 130 days ago | |

The S in IoT is for security.

moralestapia 130 days ago |

>HOW NVIDIA GPUs process stuff? (Inefficiency 101)

Wow. Massively ignorant take. A modern GPU is an amazing feat of engineering, particularly when it comes to making computation more efficient (low power/high throughput).

Then proceeds to explain, wrongly, how inference is supposedly implemented and draws conclusions from there ...

beAroundHere 130 days ago | |

Hey, can you please point out and explain the inaccuracies in the article?

I had written this post to get a higher-level understanding of traditional vs Taalas's inference. So it does abstract away lots of things.

wmf 130 days ago | |

Arguably DRAM-based GPUs/TPUs are quite inefficient for inference compared to SRAM-based Groq/Cerebras. GPUs are highly optimized but they still lose to different architectures that are better suited for inference.

imtringued 130 days ago | |

The way modern Nvidia GPUs perform inference is that they have a processor (tensor memory accelerator) that directly performs tensor memory operations which directly concedes that GPGPU as a paradigm is too inefficient for matrix multiplication.

trebligdivad 130 days ago |

Hmm, I guess you'll get this pile of used boards, which is not great in terms of waste; but I guess they will get reused for a few generations. A problem is that it doesn't seem to be just the chips that would be thrown away but the whole board, which gets silly.

throwaway85825 130 days ago |

Few customers value tokens anywhere near what it costs the big API vendors. When the bubble pops the only survivors will be whoever can offer tokens at as close to zero cost as possible. Also whoever is selling hardware for local AI.

ramraj07 130 days ago | |

To those who use AI to get real work done in real products we build, we very much appreciate the value of each token given how much operational overhead it offsets. A bubble pop, if one does indeed happen, would at best be as disruptive as the dot-com bust.

throwaway85825 129 days ago | | |

It's a full employment program for security engineers.

How disruptive dot com was depends on where you were.

lm28469 130 days ago |

Who's going to pay for custom chips when they shit out new models every two weeks and their deluded CEOs keep promising AGI in two release cycles?

spyder 130 days ago | |

It all depends on how cheap they can get. And another interesting thought: what if you could stack them? For example, you have a base model module, then new ones come out that can work together with the old ones and expand their capabilities.

brainless 130 days ago | |

New GPUs come out all the time. New phones come out (if you count all the manufacturers) all the time. We do not need to always buy the new one.

Current open weight models < 20B are already capable of being useful. With even 1K tokens/second, they would change what it means to interact with them or for models to interact with the computer.

lm28469 130 days ago | | |

hm yeah I guess if they stick to shitty models it works out, I was talking about the models people use to actually do things instead of shitposting from openclaw and getting reminders about their next dentist appointment.

villgax 130 days ago |

This read itself is slop lol, literally dances around the term printing as if it's some inkjet printer

sargun 130 days ago |

Isn’t the highly connected nature of the model layers problematic to build into a physical layer?