Accelerating Gemma 4: faster inference with multi-token prediction drafters

Accelerating Gemma 4: faster inference with multi-token prediction drafters (blog.google)

687 points by amrrs 62 days ago | 330 comments

Speculative decoding is an amazingly clever invention that almost seems too good to be true (faster inference with zero degradation in quality relative to the main model). The core idea is: if you can find a way to generate a small run of draft next tokens with a smaller model that have a reasonable likelihood of being correct, it's fast to check that they are actually correct with the main model because you can run the checks in parallel. And if you think about it, a lot of next tokens are pretty obvious in certain situations (e.g. it doesn't take a frontier model to guess the likely next token in "United States of...", and a lot of code is boilerplate and easy to predict from previous code sections).

I always encourage folks who are interested in LLM internals to read up on speculative decoding (both the basic version and the more advanced MTP), and if you have time, try and implement your own version of it (writing the core without a coding agent, to begin with!)

zmmmmm 61 days ago | |

> it's fast to check that they are actually correct with the main model because you can run the checks in parallel.

Can you give an intuition as to why it's faster? I would have thought regardless how many you run in parallel, the successful check has to execute the full model to generate the full sequence so you will have exactly the same time needed? Or is it by process of elimination so it terminates early once it eliminates the non-viable choices? (in which case, how do you guarantee the correct output was speculatively generated at all to be the last survivor?)

janalsncm 61 days ago | | |

The small draft model proposes a sequence of tokens d1 d2 d3.

The big target model calculates

P(d1)

P(d2|d1)

P(d3|d1 d2)

In parallel. If we were just greedy decoding it would be simple. Just stop when the draft model doesn’t predict the most likely token as judged by the target model. At that point, append the correct token from the target model and kick off both models again in parallel.
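The greedy case described above can be sketched as a single verification step. This is a toy illustration, not any library's API: `accept_greedy` and its arguments are hypothetical names, and it assumes the target model returns one prediction per draft position plus one extra prediction after the last draft token.

```python
def accept_greedy(draft_tokens, target_argmax):
    """Return the tokens to append after one verification pass.

    draft_tokens: k tokens proposed by the draft model.
    target_argmax: k + 1 tokens; entry i is the target model's most likely
        token given the context plus draft_tokens[:i] (assumed layout).
    """
    accepted = []
    for drafted, wanted in zip(draft_tokens, target_argmax):
        if drafted != wanted:
            # First mismatch: keep the target's token instead and stop.
            accepted.append(wanted)
            return accepted
        accepted.append(drafted)
    # Every draft token matched, so the extra target prediction comes for free.
    accepted.append(target_argmax[len(draft_tokens)])
    return accepted
```

For example, with a draft of `["States", "of", "Europe"]` and target predictions `["States", "of", "America", "."]`, the first two draft tokens are kept and the third is replaced by the target's choice.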

In practice we aren’t using greedy decoding. We are sampling and we need to match the target model’s distribution. To do this, we accept tokens from the draft model probabilistically, which is possible because we have the logits of both the draft model and the target at that point. The ratio of their softmax probabilities is used for this.
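A minimal sketch of the probabilistic acceptance step, assuming the standard rule from the speculative decoding papers (accept a drafted token x with probability min(1, p(x)/q(x)), otherwise resample from the normalized positive part of p - q). The comment above only mentions the ratio; the residual resampling and all function names here are illustrative assumptions. Here `p` is the target distribution and `q` the draft distribution over a tiny vocabulary.

```python
import random

def residual(p, q):
    """Normalized max(0, p - q): the distribution to resample from after a rejection."""
    raw = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(raw)
    # total is 0 only when p == q, in which case a rejection cannot happen.
    return [r / total for r in raw] if total > 0 else list(p)

def accept_or_resample(token, p, q, rng=random):
    """Accept the drafted token with probability min(1, p/q), else resample."""
    if rng.random() < min(1.0, p[token] / q[token]):
        return token
    return rng.choices(range(len(p)), weights=residual(p, q))[0]

def output_distribution(p, q):
    """Exact distribution of the emitted token when the draft is sampled from q."""
    accept = [qi * min(1.0, pi / qi) if qi > 0 else 0.0 for pi, qi in zip(p, q)]
    reject_mass = 1.0 - sum(accept)
    res = residual(p, q)
    return [a + reject_mass * r for a, r in zip(accept, res)]
```

`output_distribution` is only there to check the claim: for any `p` and `q` it should reproduce `p`, which is why the scheme matches the target model's distribution even when the draft model is poor.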

You are right that actually accepting tokens has to happen sequentially but that’s a heck of a lot faster than a forward pass.

miki123211 61 days ago | | |

To add to what others have said here, this is due to the memory hierarchy.

GPUs have different kinds of memory: there's fast-but-small memory and slow-but-large memory.

Conceptually, you can imagine the process of LLM inference as transferring some weights from slow memory to fast memory, doing some calculations on those weights, discarding them from fast memory once the computation is done, loading in the next portion, and so on, until you're fully done.

You can do calculations for multiple tokens in parallel, but to calculate what token n is, you need to already know all the previous tokens 1..(n-1). Therefore, if you don't have spec decoding, you go one token at a time. If you do, you assume that the next tokens actually are what the smaller model gave you, discarding the results in case you were wrong.

With speculative decoding, you can basically load the weights once and apply them to multiple tokens instead of just one, because of the assumption of what the next tokens are that you're making. This decreases the amount of data that has to go between slow and fast memory. As the decode stage[1] is bottlenecked by memory bandwidth and not compute speed, more efficient use of this bandwidth increases your token generation speed.

As another poster said, this idea is closely related to batching. In batching, you re-use the same weights to serve multiple requests. In speculative decoding, you re-use them to accelerate a single one. If you have many users, care only about how many tokens per second your GPUs produce in general, and don't care at all about per-user speed, speculative decoding won't do anything for you.

[1] There are two stages in LLM inference: prefill and decode. In prefill, you do calculations on the tokens of the prompt, prefilling the KV cache to accelerate attention computations at decode time. Because you have access to all the tokens of the prompt, you can process everything in parallel and use your weights very efficiently. Your bottleneck here is the computation units and not memory bandwidth. In decode, you don't know what your future tokens will be, so you can only go one at a time as explained above. In a way, speculative decoding turns decode into a little prefill.
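A back-of-the-envelope version of this argument, as a sketch. All hardware and model numbers below are hypothetical, chosen only for illustration, and the estimate ignores KV cache reads and activations: it only compares the time to stream the weights once against the time to do the arithmetic for the tokens in the pass.

```python
def step_time_s(tokens_in_pass, weight_bytes, flops_per_token, bandwidth_bps, compute_flops):
    """Roofline-style estimate: a pass takes as long as its slower resource."""
    memory_time = weight_bytes / bandwidth_bps  # weights are streamed once per pass
    compute_time = tokens_in_pass * flops_per_token / compute_flops
    return max(memory_time, compute_time)

# Hypothetical numbers, not measurements of any real GPU or model.
WEIGHT_BYTES = 60e9       # roughly 30B parameters at 2 bytes each
FLOPS_PER_TOKEN = 60e9    # roughly 2 FLOPs per parameter per token
BANDWIDTH = 1e12          # 1 TB/s
COMPUTE = 200e12          # 200 TFLOP/s

one_token = step_time_s(1, WEIGHT_BYTES, FLOPS_PER_TOKEN, BANDWIDTH, COMPUTE)
five_tokens = step_time_s(5, WEIGHT_BYTES, FLOPS_PER_TOKEN, BANDWIDTH, COMPUTE)
```

Under these assumptions both passes are limited by the same weight-loading time, so checking five tokens costs about as much wall-clock time as producing one.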

fulafel 61 days ago | | |

AIUI you run the checks of several predicted tokens in lockstep, and the computation for each token is served by the same data loaded from memory. In normal execution, each token would depend on the previous one, precluding the parallelization and causing much more per-token memory traffic.

So this is a case of trading off idle compute capacity that's waiting for the bottleneck (memory access).

mike_hearn 61 days ago | | |

An obscure fact about the transformer architecture is that it more or less computes the most likely next token for every single token in the context window at once. This is because the KV cache values needed to predict the next token are needed for every token, and the attention modules do nearly all the work, so once you have computed the KVs, running them through the last sections to get the target probabilities is nearly free.

The reason it's designed this way is a bit subtle but it has the advantage during training that you can use a single block of 10 tokens to generate 9 training examples in parallel, so it's highly efficient. This efficiency is basically the main benefit of transformers - the algorithm parallelizes really well and that's what allowed the scale up to large language models as opposed to the previous reality of just language models.
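The "10 tokens give 9 training examples" point can be made concrete with a small sketch. The helper name and the sample sentence are made up for illustration; real training works on token ids and computes all positions in one batched pass rather than building prefixes explicitly.

```python
def next_token_examples(tokens):
    """Pair every prefix of the block with the token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

block = ["The", "United", "States", "of", "America", "is", "a", "country", "in", "North"]
examples = next_token_examples(block)
```

Here `examples` holds nine (prefix, next token) pairs, starting with `(["The"], "United")`.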

The blog post does discuss why MTP is faster but it's maybe a bit hard to understand if you haven't studied LLM internals. During inference the hardware has arithmetic units idling because they spend so much time waiting for the weight matrices to get moved closer to the processors. Because data movement and computation can be overlapped, if you can reuse the same loaded data for multiple calculations at once you're winning - it's free latency-wise because you're just exploiting previously idle resources (it's not free in terms of energy).

Speculative decoding and MTP exploit this to run the model in parallel on several tokens at once. Say your context window contains "The United". The KV cache has been populated by the main model for this set of tokens. The draft model is given "The United" and predicts " States of America" in one forward pass (this part where it can predict multiple tokens at once with a single pass is the MTP part). Then the main model is given the KV cache from last time along with " States of America". In its own forward pass it can then compute in parallel the completions of all of "The United", "The United States", "The United States of" and "The United States of America" (the last one might be an eos token indicating it wants to stop talking). That's the speculative decoding part.

Now you decode the main model at each position (look at the token probabilities and pick one according to some decoding strategy). It's possible the main model didn't pick " States" at all, or picked " States", but then its prediction diverged e.g. if it wants to say "The United States is a country". So you just select the tokens that match and toss all the tokens starting from the one that didn't. Repeat.

The parallelism comes almost for free because the same weight matrices can be reused multiple times before they're swapped out for the next.

mungoman2 62 days ago | |

Naively it seems odd that running multiple checks in parallel is faster than just running the autoregressive model multiple times in series. It’s the same amount of compute right?

But I think the key is that in the standard autoregressive case we get memory bandwidth bound, so there are tons of idle compute resources. And so checking multiple tokens is cheap because we can batch and thus reuse the read weights for multiple tokens.

The verification step is similar to a prefill with a small batch size. The difference is what we do with the generated logits.

libraryofbabel 62 days ago | | |

That’s correct, and yes - not less compute total on the main model (actually slightly more, since checking failed draft tokens costs you compute), but faster because inference is memory-bandwidth bound. And like you I also think of it as like a “mini prefill” (but on top of the existing KV cache, of course); the code is very similar to prefill if you implement a simple toy version yourself.

Most of the complexity in implementing a simple toy version comes from having to get the KV cache back into a good state for the next cycle (e.g. if only the first half of your draft tokens were correct).
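A toy sketch of that bookkeeping, assuming the simplest possible design: the verification pass writes cache entries for every draft token, and the entries for rejected tokens are then truncated. `ToyKVCache` is a hypothetical stand-in; a real cache holds key/value tensors per layer, and real runtimes may handle rollback differently.

```python
class ToyKVCache:
    """Stand-in for a KV cache: one entry per processed token position."""

    def __init__(self):
        self.entries = []

    def extend(self, tokens):
        # A real cache stores key/value tensors; here the token itself is the entry.
        self.entries.extend(tokens)

    def truncate(self, length):
        # Drop entries that were written for rejected draft tokens.
        del self.entries[length:]


def verify_and_rollback(cache, draft_tokens, num_accepted):
    """Write all draft tokens during verification, then keep only the accepted prefix."""
    committed = len(cache.entries)
    cache.extend(draft_tokens)                # the verification pass fills the cache for every draft token
    cache.truncate(committed + num_accepted)  # roll back the rejected tail
    return cache.entries
```

In this sketch the corrected token chosen by the main model is not in the cache yet; it would be processed at the start of the next cycle.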

zozbot234 62 days ago | | |

> But I think the key is that in the standard autoregressive case we get memory bandwidth bound, so there are tons of idle compute resources.

Right, this is the same way batching works. It's "free" until we exhaust available compute resources, at which point decode throughput becomes compute bound. (This is a good place to be, because scaling out compute is a lot easier than adding fast VRAM.) This is why MTP is mostly useful when you have one or few users, which means compute is abundant. When you're running large batches you're better off using that compute to grow your batch size.
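As a rough sketch of where "free" ends, one can estimate how many tokens a single pass can process before compute time catches up with weight-loading time. The hardware numbers are hypothetical and the model is deliberately crude (about 2 FLOPs per parameter per token, no KV cache traffic), so treat the result as an order of magnitude at best.

```python
def compute_bound_threshold(bytes_per_param, bandwidth_bps, compute_flops, flops_per_param=2.0):
    """Tokens per pass at which compute time equals weight-loading time.

    memory_time  = params * bytes_per_param / bandwidth
    compute_time = tokens * params * flops_per_param / compute
    Setting the two equal, the parameter count cancels out.
    """
    return (bytes_per_param * compute_flops) / (flops_per_param * bandwidth_bps)

# Hypothetical accelerator: 1 TB/s memory bandwidth, 200 TFLOP/s, 2-byte weights.
threshold = compute_bound_threshold(bytes_per_param=2.0, bandwidth_bps=1e12, compute_flops=200e12)
```

With these made-up numbers the threshold is about 200 tokens per pass. That budget is shared between batch size and draft length, which is one way to see why drafting matters less once the batch is already large.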

Of course, batch size is usually limited by things like bulky KV caches. So perhaps MTP has some residual use in that setting. But if you're sharing cached context in a subagent swarm, or running a model like the recent DeepSeek V4 with its tiny KV cache, you can go a lot further in processing a larger batch.

m12k 61 days ago | |

So we've basically taken the concept of branch prediction from CPUs and applied it to LLMs?

c7b 61 days ago | | |

The concept of predicting future elements in a series is not specific to CS. It's older than computers.

kpw94 61 days ago | | |

Speculative execution techniques in software & hardware exist everywhere,

- Speculative multi threading

- Data Value Speculation

- Speculative Memory Disambiguation

- Runahead Execution

- Speculative Prefetching

- Multi-path (Dual-path) Execution (goes beyond branch prediction by computing both paths)

- Optimistic Concurrency Control (for database transactions etc)

mike_hearn 61 days ago | | |

Maybe at very high level of abstraction, but there's no branching involved.

fragmede 61 days ago | | |

Well, the TPUs they're running on don't have branch prediction, so that had to end up somewhere in the stack.

alfiedotwtf 61 days ago | |

Maybe it’s just me, but I feel like the LLM crowd are re-discovering Coding and Compression all over again.

algoth1 61 days ago | |

That’s basically the original gpt5 routing idea but done right

manas96 61 days ago | |

so in essence is it trading memory for speed?

HarHarVeryFunny 61 days ago | | |

Seems more like trading FLOPs for speed.

If you are just generating as usual with the main model then you're sequentially generating A -> AB -> ABC.

If I'm understanding correctly, what speculative decoding is doing is first (= more FLOPs) using a different small/fast (but less accurate) model to generate this ABC (you hope) sequence, then use the main model to now verify it in parallel (A + AB + ABC in parallel) rather than generate it sequentially. Assuming you had the FLOPs available to really do this in parallel, then this parallel verification vs sequential generation is what gives you the speed up.
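A rough way to quantify this trade, under simplifying assumptions that are not from the thread: each draft token is accepted independently with probability `alpha`, the drafter proposes `gamma` tokens one step at a time, and one draft step costs `draft_cost` of a main-model pass. A verification pass then emits the accepted prefix plus one token from the main model, so the expected count is the sum of `alpha**i` for i from 0 to `gamma`.

```python
def expected_tokens_per_pass(alpha, gamma):
    """Expected tokens emitted per main-model pass: sum of alpha**i for i = 0..gamma."""
    return sum(alpha ** i for i in range(gamma + 1))

def estimated_speedup(alpha, gamma, draft_cost):
    """Expected tokens per cycle divided by the relative cost of a cycle.

    A cycle is gamma draft steps plus one verification pass, costing
    gamma * draft_cost + 1 in units of one main-model pass.
    """
    return expected_tokens_per_pass(alpha, gamma) / (gamma * draft_cost + 1)
```

With `alpha = 0` the estimate drops below 1, because the drafting time is pure overhead. With a high acceptance rate and a cheap drafter it approaches `gamma + 1`. An MTP drafter that proposes several tokens in one pass would have a different cost term.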

WarmWash 62 days ago |

I don't see it talked about much, but Gemma (and Gemini) use far fewer tokens per output than other models, while still staying within arm's reach of top benchmark performance.

It's not uncommon to see a gemma vs qwen comparison, where qwen does a bit better, but spent 22 minutes on the task, while gemma aligned the buttons wrong, but only spent 4 minutes on the same prompt. So taken at face value, gemma is now underperforming leading open models by 5-10%, but doing it in 1/10th the time.

zdw 62 days ago |

MTP support is being added to llama.cpp, at least for the Qwen models (https://github.com/ggml-org/llama.cpp/pull/20533) and I'd imagine Gemma 4 will come soon.

The performance uplift on local/self-hosted models in both quality and speed has been amazing in the last few months.

msp26 62 days ago |

Google is singlehandedly carrying western open source models. Gemma 4 31B is fantastic.

However, it is a little painful to try to fit the best possible version into 24GB vram with vision + this drafter soon. My build doesn't support any more GPUs and I believe I would want another 4090 (overpriced) for best performance or otherwise just replace it altogether.

srigi 62 days ago | |

You could keep multimodal projector (understanding of audio, images & PDFs) in system RAM with `--no-mmproj-offload` in llama.cpp. Of course, then it is not accelerated with GPU, but you save its VRAM.

msp26 61 days ago | | |

Interesting, I might try that, thanks!

ActorNightly 62 days ago | |

Qwen is still better than Gemma though. Also you can tune it more for different tasks, which means that you can prioritize thinking and accuracy versus inference speed.

SwellJoe 62 days ago | | |

Qwen is better at some things (code, in particular), but Gemma has better prose and better vision. At least, it feels that way to me.

MikeTheGreat 62 days ago | | |

Genuine question: how do you tune it?

I thought "fine-tuning" meant training it on additional data to add additional facts / knowledge? I might be mistaking your use of the word "tune", though :)

redman25 62 days ago | | |

It’s a heck of a lot faster too.

2ndorderthought 62 days ago | | |

Yes I would just go with qwen.

skybrian 62 days ago |

Watching the computer write text sort of reminds me of using a modem to call a BBS in the old days. This seems like going from 300 baud to 1200 - a significant improvement, but still pretty slow, and someday we will wonder how we put up with it.

aleksiy123 62 days ago |

I’m starting to think that Google's strategy is a bit different from that of the other frontier providers.

Focusing more on performance to compute efficiency over pure performance. And maybe that’s why Gemini is (seemingly) lagging behind?

Other providers hitting capacity and hitting the limits subsidising their inference.

Google strategy seems to be about scaling and distributing these models to their existing billions of users.

christina97 62 days ago |

I recently set up the 26B A4B model on vLLM on an RTX3090 (4-bit) after a hiatus from local models. Just completely blown away by the speed and quality you can get now for sub-$1k investment.

I tried first with Qwen but it was unstable and had ridiculously long thinking traces!

aimxhaisse 62 days ago | |

It even fits on a 3060 with turboquant / Q4 at decent speed (40T/s) for ~200$ (:

2ndorderthought 62 days ago | |

Some of the early quants for qwen3.6 were broken. It's still finicky but with a little hand holding it's crazy.

Local models are the future it's awesome

jszymborski 62 days ago | |

The A4B model is blazing fast and the model is super good at general inquiries. Notably worse than Qwen 3.6 for coding tasks but that says more about the Qwen model.

maille 62 days ago | | |

Bad at coding, but would it be good at code review?

moffkalast 62 days ago | |

The 31B is surprisingly fast too, for a dense model. Runs tg at least twice as fast as it ought to on my machine when compared to other 30B, probably due to the hybrid attention I guess. Ingestion is somewhat slower though.

Patrick_Devine 62 days ago |

In my testing the Gemma 4 31b model had the biggest speed boost in Ollama w/ the MLX runner for coding tasks (at about 2x). Unfortunately you'll need a pretty beefy Mac to run it because quantization really hurts the acceptance rate. The three other smaller models didn't perform as well because the validation time of the draft model ate up most of the performance gains. I'm still trying to tune things to see if I can get better performance.

You can try it out with Ollama 0.23.1 by running `ollama run gemma4:31b-coding-mtp-bf16`.

these 62 days ago |

Has anyone managed to get this to work in LM Studio? They've got an option in the UI, but it never seems to allow me to enable it.

dvt 62 days ago | |

It's not implemented in mlx[1] yet (or llama.cpp[2]), so it may take a while.

[1] https://github.com/ml-explore/mlx-lm/pull/990

[2] https://github.com/ggml-org/llama.cpp/pull/22673

AlphaSite 62 days ago | |

Yes. Make sure you’re not using the Gemma sparse models since they don’t have a small model to use. Also I removed all the image models from the workspace.

adrian_b 62 days ago | | |

I do not know what you mean by sparse models.

All 4 gemma-4-*-it models, regardless of whether they are dense models or MoE models, have associated small models for MTP, whose names are obtained by adding the "-assistant" suffix.

https://huggingface.co/google/gemma-4-E2B-it-assistant

https://huggingface.co/google/gemma-4-E4B-it-assistant

https://huggingface.co/google/gemma-4-26B-A4B-it-assistant

https://huggingface.co/google/gemma-4-31B-it-assistant

Havoc 62 days ago | |

Normally when LM Studio doesn't like it it's because of the presence of mmproj files in the folder. Sometimes removing them helps it show up.

They're somehow connected to vision & block speculative decode...don't ask me how/why though

For gemma specifically had more luck with speculative using the llama-server route than lm studio

svachalek 62 days ago | |

I've gotten it to work with other models. They've got to be perfectly aligned usually, in terms of provider, quantization etc. Might be a bit before you can get a matched set.

julianlam 62 days ago |

Really excited to try this once it is merged into llama.cpp.

Gemma 4 26B-A4B is much quicker on my setup vs Qwen3.6-35B-A3B (by about 3x), so the thought of a 1.5x speedup is tantalizing.

Have tried draft models to limited success (the smaller 3B draft model in addition to a dense 14B Ministral model introduced too much overhead already)

VHRanger 62 days ago | |

On vllm with a 5090 I get 120-180TPS with the awq 4 bit quant + MTP speculative decoding

For gemma4 26B, same quantization, I get >200TPS.

Also note that qwen is extremely inefficient in reasoning; the reasoning chains are ~3x longer than gemma on average

DoctorOetker 61 days ago |

Why is a separate MTP model even necessary?

An LLM forward inference doesn't just predict token vectors for the new last token:

In diagrams the forward pass is typically depicted as taking input token vectors <t1, t2, t3, ... t98, t99, t100> (here native context being 100 for didactic purposes) and generating output token vectors <t2, t3, t4, ..., t99, t100, t101>.

As far as I understand that is didactically only semi correct, it correctly depicts the locations of tokens in the input and output string, but actually the token vector at the t2 output position is NOT identical to the t2 vector from the input, but a token vector which after softmax gives P(t2 | t1).

And output token position t5 actually corresponds to P(t5 | t1,t2,t3,t4). I.e. the forward inference is modelling the statistical conditional N-gram function from inputs to outputs, from the bigram conditional probability P(t2 | t1) all the way up to P(t101 | t1, t2, t3, ..., t98, t99, t100).

Suppose you want to take bigger steps, nothing prevents one from calculating the forward function by sliding a fixed (committed output string) to the left not 1 position but say 10 positions, and then using the last 10 predictions as the new output prediction. That doesn't need a new MTP model. Perhaps it would take some careful modification to ensure the same original output distributions as if the tokens were generated one at a time, but this hints at the possibility.

One could also slide to the left 5 positions twice, not committing to all 10 new tokens at once but only committing to the 5 oldest values of the 10 new values, and using the uncommitted 5 last values as input vectors for the next invocation, so the model can push the new 5 vectors towards its final committed output vector value in 2 steps for better convergence...

Is there any reason multitoken prediction doesn't work this way, or is there some aspect of the conditional N-gram interpretation of LLM models that I am miscomprehending?

regexorcist 62 days ago |

Sounds like a game changer if I see that kind of speed up on my hardware. So far I've preferred Qwen 3.6 because of its better tool handling, even though Gemma 4 is faster, but I saw they've updated the model template and that's supposed to be better now. Looking forward to trying this with llama.cpp.

ch_sm 62 days ago | |

gemma4 has a specific problem with toolcalls that affects most runtimes. fixes for ollama and vllm are being worked on right now

adrian_b 62 days ago | | |

The chat templates of all Gemma 4 models have been updated 7 days ago, to fix some bugs related to invoking tools.

So any tests done with models that have not been updated during the last days are no longer relevant and they must be repeated after updating the models and regenerating any other file formats, like GGUF files.

apexalpha 62 days ago | | |

I read somewhere you need to drop temp to 0.1 on gemma for tools.

Not sure why (too amateur sorry).

Though I think qwen was natively trained on toolcalling.

vhiremath4 62 days ago |

So this is like branch prediction for operating systems? Except we have probability baked into the model itself so it’s even more reliable.

Lihh27 62 days ago | |

similar idea, but the failure mode is better. a branch mispredict burns cycles. a bad guess here usually just means no bonus tokens. https://arxiv.org/abs/2211.17192

TOMDM 62 days ago | | |

As long as you're not bound on parallelism or bandwidth then it's "free", but if you're constrained on either resource then your lighter predictor model just needs to save you more cycles than it congests on average.

dchftcs 62 days ago | | |

A bad guess still costs cycles, but the penalty is smaller compared to branch mispredict in the current state. But if we have some kind of pipelining, like if we have something that assumed the speculative decode is correct, then it'll be expensive again.

mchusma 62 days ago |

I find it puzzling Google doesn’t actively promote its own cloud for inference of Gemma 4. Open source is great, love it. But shouldn’t Google want me to be able to use and pay for it through Gemini and vertex?

recsv-heredoc 62 days ago |

CloudFlare offers excellent service for many of the open-weights models. It's fast, cheap and simple to set up. Can highly suggest as an LLM provider.

They serve gemma-4-26b-a4b-it.

brikym 62 days ago | |

It doesn't seem that compelling to me. I can get the gpt-oss models cheaper from the openrouter nitro providers like groq and cerebras. The model you mention on Cloudflare infra is the same price through open router or directly.

andruby 62 days ago | |

They do indeed. See https://developers.cloudflare.com/workers-ai/models/ They seem to allow some free usage without user account. Do they list limits anywhere?

nalinidash 62 days ago |

technical details are here: https://x.com/googlegemma/status/2051694045869879749

netdur 62 days ago |

I am getting 21 t/s on Fold 7, 21 x 1.8 = 37.8 t/s compared to M1 Max's 54 t/s, that is impressive

AbuAssar 62 days ago |

these are the updated models:

google/gemma-4-31B-it-assistant

google/gemma-4-26B-A4B-it-assistant

google/gemma-4-E4B-it-assistant

google/gemma-4-E2B-it-assistant

sigmar 62 days ago | |

for anyone wanting a glossary to explain the naming scheme here:

E4B = 4B effective parameters (using per-layer embeddings)

E2B = 2B (like above)

it = instruction tuned (rlhf and all that jazz)

assistant = Multi-token drafters (the new 2x speed up)

qiine 61 days ago | | |

> assistant

naming still hard I see

el_isma 62 days ago |

How is this different from the speculative decoding that we had before?

You could pair a big and small model like qwen 32b with qwen 4b and have that same dynamic of the small model generating tokens and the big one "certifying" them.

The blog says something about re-using the big model's data?

adrian_b 62 days ago | |

Multi token prediction is the same thing as speculative decoding. This is mentioned in the Google pages describing their MTP implementation.

Google has now provided small models for each of the previous Gemma 4 models, e.g. "gemma-4-26B-A4B-it-assistant" for "gemma-4-26B-A4B-it".

The difference vs. Qwen is that here each small model is not some general-purpose smaller model, but a model that has been optimized specifically for this task, to predict the output of the bigger model with which it is paired.

This specialization and optimization of the Google "gemma-4-*-assistant" models ensures that they are much smaller and thus much faster than general-purpose small models.

fulafel 61 days ago | | |

Multi-token prediction is a refined form of speculative decoding.

Researchers at Google came up with Speculative decoding in 2022: https://research.google/blog/looking-back-at-speculative-dec... (Fast Inference from Transformers via Speculative Decoding - Yaniv Leviathan, Matan Kalman, Yossi Matias)

Researchers at Meta came up with MTP, a smarter way of doing speculative decoding in 2024: https://arxiv.org/abs/2404.19737 (Better & Faster Large Language Models via Multi-token Prediction Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Rozière, David Lopez-Paz, Gabriel Synnaeve)

DeepSeek V3 shipped MTP in a product first, in 2024: https://arxiv.org/abs/2412.19437 (DeepSeek-V3 Technical Report, 100+ authors)

julianlam 62 days ago | | |

So then these models could be used by llama.cpp today with the -md switch?

Interesting, must try tomorrow.

OneDeuxTriSeiGo 62 days ago | |

As far as I can tell MTP is unique from regular speculative decode because the small model is trained to consume and operate on the big model's hidden state for prediction.

dchftcs 61 days ago | |

It's the same speculative decoding. The news is that it came out for a popular local model.

ActorNightly 62 days ago |

I found that Gemma 4:26b makes way more mistakes compared to Qwen and Gemma 3. Gemma3 27b QAT was my goto for some time as this was quite fast. Qwen is still king for a balance of accuracy and inference speed.

Gemma:31b was more accurate but speed was horrendous.

nolist_policy 62 days ago |

Works great in the latest version of Google AI Edge Gallery: https://github.com/google-ai-edge/gallery/releases

brikym 62 days ago |

I wonder what latency and tok/s this model on Groq or Cerebras would be capable of. I have a couple LLM driven games [1][2] where speed is really important to the experience. Currently the best performance I can get is the gpt-oss models on Groq or Cerebras but they need quite a bit of extra context and tools to correct for mistakes. I'm making a bet I'll be able to get the same performance much cheaper in the next few months.

[1] https://sleuththetruth.com [2] https://lextension.net/

fulafel 61 days ago |

Looks like DeepSeek did this as well since V3: https://deepwiki.com/deepseek-ai/DeepSeek-V3/4.4-multi-token...

Credit for the MTP technique is due to https://arxiv.org/abs/2404.19737 from 2024:

Better & Faster Large Language Models via Multi-token Prediction Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Rozière, David Lopez-Paz, Gabriel Synnaeve

wrxd 62 days ago |

I'm not sure I understand how this works: https://huggingface.co/google/gemma-4-E4B-it-assistant has 78.8M parameters while the standard variant https://huggingface.co/google/gemma-4-E4B-it has 8B parameters.

Is gemma-4-E4B-it-assistant a model I can use stand-alone or a model I need to use in combination with gemma-4-E4B-it?

gunalx 62 days ago | |

You need the regular gemma model as well. You can think of this as a really small distillation of the original. Useless on its own because it is often wrong, but it is right more often than not. And because verifying a draft with a transformer model can be done faster than generating with it, we can effectively speed up by using this draft model and only doing the compute where it was wrong.

This is an oversimplification, but tldr you need both yes.
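To put the two parameter counts from the question in perspective, here is a small arithmetic sketch of the weight storage involved. It assumes both models are stored in bf16 (2 bytes per parameter), which the thread does not state, and it ignores the KV cache and activations.

```python
BYTES_PER_PARAM_BF16 = 2  # bf16 stores each parameter in 16 bits

def weight_gb(num_params, bytes_per_param=BYTES_PER_PARAM_BF16):
    """Weight storage in gigabytes (10**9 bytes), ignoring KV cache and activations."""
    return num_params * bytes_per_param / 1e9

drafter_gb = weight_gb(78.8e6)  # parameter count quoted above for the E4B drafter
target_gb = weight_gb(8e9)      # parameter count quoted above for gemma-4-E4B-it
overhead = drafter_gb / target_gb
```

Under that assumption the drafter's weights add roughly 0.16 GB on top of about 16 GB, which is just under 1% extra.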

wrxd 62 days ago | | |

Thank you!

I already played with Gemma4 on oMLX a while ago. When I have some time I'll check if it supports running MTP models and play a bit more

disiplus 62 days ago |

nice, will run it later against qwen3.6 27b, the speed was one of the reasons why I was running qwen and not gemma. the difference was big, there is some magic that happens when you have more than 100tps.

sporkland 61 days ago |

Is there any current research on whether, as agents w/tools start dominating LLM use, making models smaller / less single-shot, more like efficient engines that can process a lot of context, and feeding a lot more into context windows is going to be more of a path forward vs trying to memorize the world?

Like smaller models that show effectiveness on problems with verifiable rewards when run in a loop with external grounding context?

danborn26 61 days ago |

Multi-token prediction is exactly what we need for practical local inference. The speedup makes running these models on edge devices much more viable.

zkmon 61 days ago |

The "how to get started" asks you to read "documentation" which turns out to be a sales blurb. Am I missing something?

pu_pe 62 days ago |

So much faster inference with no quality degradation? All that for just some small memory overhead (drafter models are <1B it seems)?

tarruda 62 days ago | |

They also published draft models for E4B and E2B. For those, the draft models are only 78m parameters: https://huggingface.co/google/gemma-4-E4B-it-assistant

furyofantares 62 days ago | |

Is it really no quality degradation?

I'm curious where my understanding is wrong, but I didn't think you necessarily got the exact same output with how I understand speculative decoding to be used. I thought that if the small model produces tokens that are "good enough", meaning within the top few tokens the larger model produces, they're accepted.

I thought it doesn't necessarily have to produce the exact same token the larger model would have produced to be accepted (and that requiring this would reduce the hit rate by a lot.) Just one the top model could have produced with whatever top-k and temperature settings.

Klaus23 62 days ago | | |

It really is. This is because LLMs with a single output/user are strongly bandwidth limited. Although the hardware can generate multiple tokens simultaneously, it is slowed down if the tokens depend on each other, as is the case with regular text generation.

The draft model essentially predicts the next token quickly, enabling you to start generating the subsequent token in parallel. If the guess is right, the second generated token is correct. If it is wrong, the second generated token is also potentially wrong, so it must be generated again using the correct prior token obtained through the big model.

A poor draft model will simply slow down the process without affecting the output.
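The claim that a poor drafter costs speed but not correctness can be checked with a toy loop. Everything here is hypothetical: `target_next` is a deterministic stand-in for greedy decoding with the big model, and the list comprehension that builds `wanted` stands in for what a real system computes in one parallel pass. This covers greedy decoding only; sampling needs a probabilistic acceptance rule.

```python
def target_next(context):
    """Hypothetical target model: a deterministic next token computed from the context."""
    return (sum(context) * 31 + len(context)) % 7

def good_draft(context):
    return target_next(context)            # always agrees with the target

def bad_draft(context):
    return (target_next(context) + 1) % 7  # never agrees with the target

def speculative_generate(prompt, draft, num_tokens, k=4):
    out = list(prompt)
    passes = 0
    while len(out) - len(prompt) < num_tokens:
        # Draft k tokens autoregressively with the cheap model.
        drafted = []
        for _ in range(k):
            drafted.append(draft(out + drafted))
        # One verification pass: the target's choice at each of the k + 1 positions.
        wanted = [target_next(out + drafted[:i]) for i in range(k + 1)]
        passes += 1
        for i in range(k):
            if drafted[i] != wanted[i]:
                out.append(wanted[i])      # replace the first wrong draft token and stop
                break
            out.append(drafted[i])
        else:
            out.append(wanted[k])          # all drafts accepted: take the bonus token
    return out[len(prompt):len(prompt) + num_tokens], passes

def plain_generate(prompt, num_tokens):
    out = list(prompt)
    for _ in range(num_tokens):
        out.append(target_next(out))
    return out[len(prompt):]
```

Running `speculative_generate` with `good_draft` and with `bad_draft` gives the same tokens as `plain_generate`; only the number of target passes differs.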

petu 62 days ago | | |

Speculative decoding batches multiple completions on all possible outcomes (0/1/2 draft tokens accepted) and sees if big model deviates at any point -- thus verifying each token. So there's no difference in output.

coder543 62 days ago | |

MTP requires a separate KV cache, so there is more memory overhead than just the weights of the MTP model, but it's a manageable amount.

a_e_k 62 days ago | | |

From the linked post, it didn't read like a separate KV cache was needed:

> The draft models seamlessly utilize the target model's activations and share its KV cache, meaning they don't have to waste time recalculating context the larger model has already figured out.

moffkalast 62 days ago | |

It's based on taking advantage of spare compute if you have it. A tiny model generates a few steps ahead first, then the large one runs batch inference on all of those at once as if you are at that point in time. If they all check out afterwards it jumps ahead, otherwise it discards and goes onto the next one.

Not sure about this implementation, but conceptually it only works well on very capable GPUs for very predictable output. Typical speedup is about 30%; I'm not sure how Google is claiming 250%, which is ridiculous.

And if you don't have enough compute, then you get negative speedup from all the extra overhead.
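A back-of-the-envelope sketch of where both the large and the negative speedups can come from. It assumes each drafted token is accepted independently with probability `alpha`, that verifying `k` drafted tokens costs the same as one ordinary target pass, and that one draft step costs `draft_cost` target passes. All three are simplifying assumptions, and the function names are hypothetical.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per target pass with k drafted tokens.

    Geometric series 1 + alpha + ... + alpha**k: the accepted drafts plus
    the one token the target always contributes.
    """
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, draft_cost: float) -> float:
    """Tokens per unit time relative to plain decoding (1.0 means no gain)."""
    cost_per_round = k * draft_cost + 1.0  # k draft steps plus one target pass
    return expected_tokens_per_pass(alpha, k) / cost_per_round


predictable = estimated_speedup(alpha=0.8, k=4, draft_cost=0.05)
unpredictable = estimated_speedup(alpha=0.1, k=4, draft_cost=0.3)
```

Under these assumptions, an 80% acceptance rate with a cheap drafter gives roughly 2.8x, while a 10% acceptance rate with a relatively expensive drafter gives roughly 0.5x, which is the negative speedup mentioned above.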

ac29 62 days ago | |

Memory and compute/energy overhead

julianlam 62 days ago |

Does this mean there will be new Gemma 4 models released with MTP, or are they already available in existing models + quants?

adrian_b 62 days ago | |

For each of the 4 gemma-4-*-it models, an associated small model, gemma-4-*-it-assistant, has been published for use with MTP.

If a GGUF file is generated for MTP, it must include both the big model and the small model. There was a reference in another comment to a PR for llama.cpp, which also included updates for the Python program used for conversion from the safetensors files, which presumably can handle the combining of the two paired Gemma 4 models.

jug 62 days ago | |

They have now been released on e.g. Hugging Face with the model suffix "-assistant".

joakleaf 62 days ago |

Seems like a pull request for vLLM was just approved a few minutes ago:

https://github.com/vllm-project/vllm/pull/41745

("Add Gemma4 MTP speculative decoding support")

great_psy 61 days ago |

This might be silly, but since the assistant models are so much smaller than the full models, what if we just use those smaller models?

Any idea how much worse they will be? Or is the issue that their error will really diverge as you accept more of their tokens?

amdivia 61 days ago | |

I think they'll be much worse on their own.

Predicting "America" in "The United States of ..." is a different task from predicting the whole sentence.

So the small model is laying the blocks, and the bigger model is cementing them in place or kicking them down. The bigger model's course correction is what keeps the smaller model's predictions relatively on track.

zozbot234 61 days ago | |

I assume these are just output layers that are trained on the hidden state from the larger model - that's how MTP works. It's not a separate drafting model.

WASDx 61 days ago | |

gemma-4-31B-it-assistant is a 0.5B model, so its performance would likely be comparable to other models of that size.

sigmar 62 days ago |

>try them directly on Google AI Edge Gallery for Android or iOS.

I'm not seeing any update to the app on my android phone... maybe later today?

>We’ve published an in-depth technical explainer

I was expecting a PDF link, but this goes to a brief article on twitter/X. lol, okay...

nolist_policy 62 days ago | |

It's up on GitHub: https://github.com/google-ai-edge/gallery/releases

brcmthrowaway 62 days ago |

Is Google's local model strategy aimed at taking the big AI cloud labs down a notch?

whoahwio 62 days ago | |

dumping money into Gemma and shorting new data center buildouts is a level of Corporate Vision that ends up in an HBS case study

tannhaeuser 62 days ago |

Tested gemma4 26 MoE 4-bit quantized GGUF on llama.cpp following these guides with mmap'd I/O on a 16GB MBP and it was unbearably slow (0.0 t/s).

deskamess 62 days ago |

Did DeepSeek come up with MTP? It was listed prominently in their recent paper as being carried forward from the previous release.

logickkk1 62 days ago | |

i think this is mixing two separate ideas. MTP is the training-side piece. speculative decoding is the inference trick. DeepSeek V3 used MTP as an auxiliary loss. the 2022 Google paper is speculative decoding. now Google is combining them. https://arxiv.org/abs/2404.19737

deskamess 62 days ago | | |

Oh... so MTP is not speculative decoding? The (T)oken (P)rediction made me think it was on the inference side. I shall read the paper.

Edit: Ok, I understand now. You are saying that MTP has two aspects. 1) The training (for the mini-models to generate tokens), and 2) The actual speculative decoding implementation on the inference side (which uses those trained mini-models).

woadwarrior01 61 days ago |

The Qwen 3.5, 3.6 and Kimi 2.5, 2.6 models also have multi-token prediction heads baked into their model weights.

shay_ker 62 days ago |

curious that they are doing speculative decoding and not baking MTP into the model, like Nemotron

https://docs.nvidia.com/megatron-core/developer-guide/0.15.0...

zargon 62 days ago | |

They're using the term speculative decoding but doing MTP. It's the same thing as Nemotron, but Google removed the MTP heads from the original safetensors release. (They were not removed from the LiteRT format.)

simianwords 62 days ago |

Gemma 4 is really a beast. The 31B version is totally usable like for cases when I'm bored without internet

imrozim 62 days ago |

3x faster inference means cheaper API costs too. For a solo dev building AI this matters a lot

ydj 62 days ago | |

Not necessarily. Servers serving the model likely have enough traffic that they are batching decodes already. MTP reduces latency and increases efficiency only when the server can’t batch enough concurrent streams to be compute bound rather than memory bound.
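A rough roofline-style sketch of the memory-bound versus compute-bound point. Everything numeric here is a made-up assumption for illustration (a 31B-parameter model at 2 bytes per parameter, 1 TB/s of memory bandwidth, 200 TFLOP/s of compute), and the model ignores KV-cache traffic and attention cost. The function names are hypothetical.

```python
# Hypothetical hardware and model numbers, for illustration only.
PARAMS = 31e9            # model parameters
BYTES_PER_PARAM = 2.0    # 16-bit weights
BANDWIDTH = 1e12         # bytes per second of memory bandwidth
FLOPS = 2e14             # floating-point operations per second


def step_time(tokens_in_step: int) -> float:
    """Time for one forward pass that processes tokens_in_step tokens."""
    # Weights are streamed from memory once per pass, however many tokens ride along.
    memory_time = PARAMS * BYTES_PER_PARAM / BANDWIDTH
    # Roughly 2 FLOPs per parameter per token for the matrix multiplies.
    compute_time = 2.0 * PARAMS * tokens_in_step / FLOPS
    return max(memory_time, compute_time)


def relative_step_cost(batch_size: int, k: int) -> float:
    """Cost of verifying k drafted tokens per stream, relative to a plain decode step."""
    return step_time(batch_size * (k + 1)) / step_time(batch_size)


single_user = relative_step_cost(batch_size=1, k=4)
busy_server = relative_step_cost(batch_size=256, k=4)
```

With these made-up numbers the crossover is around 200 tokens per step. A single stream verifies 4 extra tokens at no extra step cost, while a server already batching 256 streams pays about 5x per step, so the drafted tokens are no longer free.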

imrozim 62 days ago | | |

Fair, I didn't think about batching. Makes more sense for self-hosted models then.

larnon 62 days ago |

Anyone tried this with vLLM yet? I am confused on how to turn this on tbh.

ThouYS 62 days ago |

don't know about this guy, but qwen3.6:27b with the UD 4bit quant and little-coder/pi has been amazing. the first local LLM experience that can do actual meaningful work

brcmthrowaway 62 days ago | |

What is UD?

ac29 62 days ago | | |

Unsloth Dynamic, just some branding from Unsloth for their quants (other people use similar techniques)

OliverSmith34 61 days ago |

The best iOS inferencing model comes from Google.

Alonski 62 days ago |

This is sort of similar to Ethereum and maybe a bit of zero knowledge proofs but with the LLM handling both sides.

noashavit 62 days ago |

Gemma4:e4b is a huge upgrade

franze 62 days ago |

if someone wants to work with gemma and not deal with ollama or configs - there is (my baby) https://airplane-ai.franzai.com/

Beta but useable

CharlesW 62 days ago | |

LM Studio (for example) is free, can you pitch me on your USP vs. it?

franze 62 days ago | | |

ease of install (one download), zero configuration, zero online access by design - there will never be web search, never any kind of tracking, your prompts stay on your device - you can totally put in user data, confidential contracts, ...

plus over time the harness - coming version has a hotkey for screen capture, next release will have support for native excel, docx export

there is value in being offline by design

franze 62 days ago | |

biggest pain is currently waiting for apple to approve the next release with updated mac app store screenshots

m3kw9 62 days ago |

ok so? Anyone got a verdict/review?