Gemma 4 12B: A unified, encoder-free multimodal model (blog.google)

167 points by rvz 1 hour ago | 60 comments

minimaxir 1 hour ago |

The big story here is the encoder-free part, which I still don't fully understand.

> Vision: We replaced Gemma 4’s vision encoder with a lightweight embedding module consisting of a single matrix multiplication, positional embedding and normalizations.

That's technically encoding, just without using a dedicated model for it like SigLIP? The Developer's Guide elaborates: it's still a 35M layer, and I'm curious whether that is robust enough. https://developers.googleblog.com/gemma-4-12b-the-developer-...

> Small enough to run locally on consumer laptops with 16GB of RAM, it unlocks powerful multimodal and agentic experiences right on your machine.

I am assuming that involves quantization, which due to the quality loss makes that statement somewhat misleading IMO.

spott 4 minutes ago | |

This is just early fusion basically.

FAIR did this 2 years ago now: https://arxiv.org/abs/2405.09818

I've been waiting for something like this to be released since then.

The annoying thing is that Chameleon was multimodal on the output side too, based on the same principles, but this model is multimodal on the input side only... (I'm curious how they did pre-training without multimodal outputs as well. I wonder if they just chopped them off rather than support image output.)

georgehm 26 minutes ago | |

Embedded within that developer page is a good explainer of the encoder-free architecture: https://newsletter.maartengrootendorst.com/p/a-visual-guide-...

mchinen 11 minutes ago | |

The audio side is even more interesting, as it seems they totally got rid of the positional embedding and are just doing a single linear transform to match the LLM input dimension, and that's it.

> Audio: We simplified audio processing even further. We removed the audio encoder entirely and projected the raw audio signal into the same dimensional space as text tokens.
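
A minimal sketch of what "a single linear transform" on raw audio could look like, assuming the waveform is cut into fixed-length, non-overlapping frames before projection. The frame length, the dimensions, and the names `embed_audio` and `w_proj` are hypothetical; the quote does not say how the signal is actually framed.

```python
import numpy as np

def embed_audio(waveform: np.ndarray, frame_len: int, w_proj: np.ndarray) -> np.ndarray:
    """Project raw audio frames into d_model with one matrix multiplication.

    waveform: (num_samples,) float array
    w_proj:   (frame_len, d_model) projection matrix (learned in a real model)
    returns:  (num_frames, d_model) array, one "token" per frame
    """
    n_frames = len(waveform) // frame_len
    # Drop the ragged tail and group samples into fixed-length frames.
    frames = waveform[: n_frames * frame_len].reshape(n_frames, frame_len)
    # The only transform: raw samples -> token-sized vectors.
    return frames @ w_proj

# Illustrative numbers only: 1 second of 16 kHz audio, 20 ms frames, d_model = 64.
rng = np.random.default_rng(0)
waveform = rng.standard_normal(16000)
w_proj = rng.standard_normal((320, 64)) * 0.02
audio_tokens = embed_audio(waveform, frame_len=320, w_proj=w_proj)
print(audio_tokens.shape)  # (50, 64)
```

Note that there is no positional term here at all, which matches the reading in the comment above; any ordering information would have to come from the LLM backbone itself.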

kristjansson 1 hour ago | |

> quantization

12B means ~12 GB at 8 bits/param (basically lossless) and ~6 GB at 4 bits/param (the generally accepted 'pretty close' level). Not too bad?

But TBD how well the base model performs before thinking too much about quantization
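
The arithmetic behind those figures can be sketched as follows. It counts weights only and assumes every parameter is stored at the same bit width; KV cache, activations, and runtime overhead come on top. The helper name `weight_memory_gb` is illustrative.

```python
def weight_memory_gb(params_billions: float, bits_per_param: float) -> float:
    """Approximate weight storage in GB (10^9 bytes), weights only."""
    total_bytes = params_billions * 1e9 * bits_per_param / 8
    return total_bytes / 1e9

for bits in (16, 8, 4):
    print(f"12B @ {bits} bits/param: {weight_memory_gb(12, bits):.1f} GB")
# 12B @ 16 bits/param: 24.0 GB
# 12B @ 8 bits/param: 12.0 GB
# 12B @ 4 bits/param: 6.0 GB
```

Real quantized files tend to be somewhat larger than this estimate, since some tensors are usually kept at higher precision and quantization scales take space too.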

jszymborski 1 hour ago | |

Totally agree that it is "encoding" in the general sense, but I think they are referring to the lack of an "encoder" neural network.

minimaxir 58 minutes ago | | |

In hindsight I may have been pedantic.

matja 34 minutes ago | |

One side effect is that the separate .mmproj file (Multi-Modal Projection encoder) is no longer needed when using the model with llama.cpp etc.

pferdone 3 minutes ago | | |

But do I have the option to run it 'text only'?

rao-v 12 minutes ago | |

Encoder-free is huge for running on SBCs etc. Often the encoding time is a significant fraction of generation time if you are using a VLM as an all-purpose vision model.

reactordev 59 minutes ago | |

It actually works well because, unlike with encoders, the latent space is trained on that initial layer, so it “knows” what to do with that sparse density. I’ve been using gemma4-12b with Flux2 and its ability to reason on visual input is pretty good. That said, each model is good in its own way, so YMMV, but overall it’s about as solid as Qwen, just with a more advanced architecture.

wolttam 57 minutes ago | |

I think the idea is that the model is seeing embeddings that map directly to underlying pixel data, rather than being fed semantically rich embeddings from an encoder model which itself had seen the raw pixel data.
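
To make that concrete, here is a rough sketch of an encoder-free vision front end in the spirit of the blog's description ("a single matrix multiplication, positional embedding and normalizations"). Everything specific is an assumption for illustration: the patch size, the dimensions, the use of summed row and column position tables, and the RMS-style normalization. It is not Gemma's actual implementation.

```python
import numpy as np

def embed_patches(image, patch, w_proj, pos_x, pos_y, eps=1e-6):
    """Turn raw pixels into token-sized vectors without an encoder network.

    image:  (H, W, C) float array
    w_proj: (patch * patch * C, d_model) projection matrix
    pos_x:  (W // patch, d_model) per-column position table (assumed)
    pos_y:  (H // patch, d_model) per-row position table (assumed)
    returns (num_patches, d_model)
    """
    h, w, c = image.shape
    gh, gw = h // patch, w // patch
    # Cut the image into non-overlapping patch x patch tiles and flatten each one.
    tiles = image[: gh * patch, : gw * patch].reshape(gh, patch, gw, patch, c)
    tiles = tiles.transpose(0, 2, 1, 3, 4).reshape(gh * gw, patch * patch * c)
    # The single matrix multiplication: raw pixel values -> d_model.
    tokens = tiles @ w_proj
    # Position: look up one vector by column and one by row, then add both.
    rows, cols = np.divmod(np.arange(gh * gw), gw)
    tokens = tokens + pos_x[cols] + pos_y[rows]
    # RMS-style normalization with no learned scale (an assumption).
    rms = np.sqrt(np.mean(tokens ** 2, axis=-1, keepdims=True) + eps)
    return tokens / rms

rng = np.random.default_rng(0)
d_model, patch = 64, 16
image = rng.random((64, 48, 3))  # toy 64x48 RGB image
w_proj = rng.standard_normal((patch * patch * 3, d_model)) * 0.02
pos_x = rng.standard_normal((48 // patch, d_model)) * 0.02
pos_y = rng.standard_normal((64 // patch, d_model)) * 0.02
tokens = embed_patches(image, patch, w_proj, pos_x, pos_y)
print(tokens.shape)  # (12, 64)
```

Each output row is a linear function of one tile's pixels plus a position term, so the LLM backbone has to learn any higher-level visual features itself rather than receiving them from a pretrained encoder.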

LarsDu88 59 minutes ago | |

Well, it's a really simple encoder, I guess.

GaggiX 1 hour ago | |

> That's technically encoding

Isn't that just projecting the patches into the d_model-sized vectors that the model takes?

> I am assuming that involves quantization

12B model in 16GB seems very reasonable to me, int8 is top quality for running models.

minimaxir 55 minutes ago | | |

The guide describes it as projection although there is apparently an extra step: "A factorized coordinate lookup (X and Y matrices) attaches spatial location information directly to the input."

12B at int8 would take up 12 GB of memory, or 75% of system memory, which technically fits within 16GB, but the OS will not like that.

lxgr 22 minutes ago |

Am I missing something or are the Ollama versions of this (https://ollama.com/library/gemma4/tags) text-only for now?

philipkglass 20 minutes ago | |

Since ollama has diverged from llama.cpp, it will take a bit of time for ollama to support multi-modality. If you're using plain llama.cpp it looks like a PR has already merged for this model with vision and audio support:

https://github.com/ggml-org/llama.cpp/pull/24077

ethanpil 56 minutes ago |

What's Google's business case for releasing open models? Don't get me wrong, I am grateful and appreciative of these releases. I'm trying to understand how it fits into their bigger picture as a for-profit company. Are they not helping competitors build on the novel technology they have developed?

Is it simply goodwill and/or marketing? Or am I missing something strategic?

ComputerGuru 24 minutes ago |

Quite aside from the architectural changes, I suppose this is the answer to why Google had such a glaring hole in the (pretrained) Gemma4 model lineup between the Gemma4 4b and Gemma4 26b models!

A model that comfortably fits in 16GB of VRAM (allowing room for context) is a welcome upgrade.

Havoc 27 minutes ago |

Quite a niche release. The MoE outperforms it on score and will likely be faster thanks to lower active weights. So this really only makes sense for specific RAM-constrained applications that can’t fit a quantized MoE.

dist-epoch 22 minutes ago | |

The un-quantized MoE outperforms it.

But between the 4-bit 26B-A3B and the 8-bit 12B, which have roughly the same (V)RAM requirement, it's unclear which one will win, especially given one is MoE and the other dense.

All the launch benchmarks are at 16 bit.
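
A back-of-the-envelope sketch of that trade-off, assuming "A3B" means roughly 3B parameters are active per token (the usual reading of the name) and that every weight is stored at the stated bit width. The numbers ignore KV cache, shared layers, and quantization overhead, so treat them as rough bounds rather than measurements.

```python
def gigabytes(params_billions: float, bits_per_param: float) -> float:
    """Billions of parameters at a given bit width -> GB (10^9 bytes)."""
    return params_billions * bits_per_param / 8

# Hypothetical configurations matching the comparison in this sub-thread.
CONFIGS = {
    "26B-A3B MoE @ 4-bit": {"total_b": 26, "active_b": 3, "bits": 4},
    "12B dense @ 8-bit": {"total_b": 12, "active_b": 12, "bits": 8},
}

def footprint(cfg: dict) -> dict:
    return {
        # All weights must be resident, whether or not they fire on a given token.
        "resident_gb": gigabytes(cfg["total_b"], cfg["bits"]),
        # Only active weights are read per generated token.
        "read_per_token_gb": gigabytes(cfg["active_b"], cfg["bits"]),
    }

for name, cfg in CONFIGS.items():
    f = footprint(cfg)
    print(f"{name}: {f['resident_gb']:.1f} GB resident, "
          f"{f['read_per_token_gb']:.1f} GB read per token")
# 26B-A3B MoE @ 4-bit: 13.0 GB resident, 1.5 GB read per token
# 12B dense @ 8-bit: 12.0 GB resident, 12.0 GB read per token
```

Under these assumptions the two occupy similar memory, while the MoE touches far fewer bytes per token, which is why it would likely generate faster. Which one gives better answers at those bit widths is the open question.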

Zambyte 43 minutes ago |

Is this Mac only? Or is that an Ollama issue that it only supports this release of models on Mac? It seems like every tag with the MLX badge is only supported on Mac[0], and that includes all of the tags in this release.

[0] https://ollama.com/library/gemma4/tags

Edit: MLX being Mac-only is independent of the model being MLX (and therefore Mac) only. The latter is what I am asking about.

embedding-shape 37 minutes ago | |

MLX is quite literally macOS-specific technology, for other platforms you want non-MLX.

I was sure "MLX" stood for "Metal-something-something" but somehow can't find any reference to that. Anyway, "Metal" is hardware-accelerated graphics on Apple platforms, FWIW.

Edit: about the actual release on Ollama, if you're on non-Apple hardware you probably want the NVFP4 variant ("gemma4:12b-nvfp4"), which was uploaded 45 minutes ago, especially if you have a recent NVIDIA GPU.

sambaumann 12 minutes ago | | |

I still get "this model requires macOS" when trying to pull that one

jasonjmcghee 14 minutes ago | |

There's a CUDA backend for MLX now. Not sure about the maturity.

jw1224 33 minutes ago | |

MLX is Apple’s own machine learning framework, designed for Apple Silicon: https://opensource.apple.com/projects/mlx/

dwa3592 46 minutes ago |

This is a pretty good update. The demo video is a bit funny though: the tester asks it to turn the release into bullet points. Okay, the model obliges. Then the tester says to draft an email with this content. BAM! The LLM turns the content from bullets back into passages even though it was not asked to, undoing the last good thing it did. I am not sure if it's email etiquette to avoid bullets in an email.

BiraIgnacio 23 minutes ago |

Using an embedder instead of an encoder is quite clever. Not sure who came up with that first, but it's a cool idea.

randomNumber7 43 minutes ago |

> Novel unified architecture: No multimodal encoders. The vision and audio inputs flow directly into the LLM backbone.

I would be interested in how this actually works. I couldn't find a description of the model architecture (and I did check the links in the Google blog)

claysmithr 24 minutes ago |

I don’t see the download in LM Studio.

djyde 40 minutes ago |

What are the use cases for these small models? Is there anyone using models of this scale in their daily life who could share their experience?

robgough 50 seconds ago | |

I've got a home-built dictation app that uses a local model to clear up the text and fix grammar. It was super easy to build. I’m extending it to capture meeting notes and summarise too. All on-device.

I saw a little app the other day, I think someone posted on here, that looks at your screenshot and renames the file based off the contents of the file.

There's tons of little examples like that. For a lot of use cases, you really don't need the frontier models.

properbrew 1 minute ago | |

I think small models have a very good niche for specific tasks. I utilise a fine-tuned Phi-4 model (smaller than this one) that fits in about 3.5 GB of RAM (not VRAM) for the document processing side of things for the desktop app I develop (a bit of a shameless plug - whistle-enterprise.com).

If you have a very specific idea for local model use you can find a way to make it work very well, you don't even need to have a graphics card or NPU chip. You just have to be extremely constrained in how it's used. I think as a generic chatbot they're not great, I'd use a hosted SOTA model and I'm a big fan of local LLMs myself.

Aachen 23 minutes ago | |

"Small" models are the ones I can run myself on my own terms. LLMs aren't useful enough for me to justify spending hundreds of euros on a GPU with 16GB VRAM or something, and that's assuming I have the rest of the desktop just lying around. Back when I checked (before the RAM price hike), these models weren't meaningfully better than 4-8GB ones anyway. You'd have to go for the top-tier cards at 24 or 32 GB iirc to get something vaguely in the direction of the SaaS versions, and that was absolutely out of my budget. Even if that changed, so have hardware prices, so it'd probably still work out the same.

Xiol 24 minutes ago | |

I've yet to see someone answer a question like this with a decent, useful answer.

digdugdirk 28 minutes ago |

I do enjoy the immediate out of touch signaling with the "runs on your 16gb vram laptop" line. Because everyone has a laptop with 16gb vram, or can just pop out and buy a new one, right?

vehemenz 10 minutes ago | |

This comment has me a bit confused.

Consumers were complaining about the standard 8GB with the early 2020 refresh of MacBook Pros, many OSes ago. Sure, it might be workable for many tasks (as evidenced by the recent sales of the MacBook Neo), but users with a mere 8GB shouldn't have expectations of LLM performance. Even 16GB feels like a stretch.

zuminator 49 minutes ago |

How does it compare with e4b, aside from being larger?

anonova 34 minutes ago | |

There's a comparison of all the Gemma 4 models (+ Gemma 3 27B) on the Huggingface model card: https://huggingface.co/google/gemma-4-12B-it#benchmark-resul...

thomasjb 41 minutes ago | |

That's what I want to know too. A smarter E4B that's happy in opencode would be a good selfhosted model for me

nickandbro 1 hour ago |

Wow, Google is becoming the new pre-Llama 4 Meta when it comes to releasing open-weights models.

embedding-shape 57 minutes ago | |

I dunno, it feels a bit unfair to companies that actually do FOSS releases (Gemma 4 being released under the Apache 2.0 license) to compare them to a company that has never done any FOSS releases and has mostly done proprietary "available to download" releases.

seba_dos1 44 minutes ago | | |

Note that a binary released under Apache 2.0 license does not yet make it FOSS.

brianwawok 43 minutes ago | |

Every other Google model I have tried felt very weak compared to Qwen models. I don't have a ton of use cases for multimodal though, so it's very possible this is a fantastic multimodal model.

wongarsu 20 minutes ago | | |

Gemma 4 27b and 32b feel pretty capable for text and vision. Comparable with Qwen, maybe a bit better on tool-calling-heavy tasks.

I am not overly impressed with the smaller Gemma models. And Gemma 3 was a bit of a mixed bag: great at some things, bad at most others.

redman25 54 minutes ago | |

IDK, this model release is a bit disappointing considering the community has been chomping at the bit for the 124ba4b model. There was some leaked info about it, but people suspect it was not released because it was too close to Gemini Flash in performance.

jdelman 37 minutes ago |

I can’t help but wonder if this is the basis of the model they’ve helped tune for Apple.