Llama 3.2: Revolutionizing edge AI and vision with open, customizable models

Llama 3.2: Revolutionizing edge AI and vision with open, customizable models(ai.meta.com)

924 points by nmwnmw 1 year ago | 328 comments

simonw 1 year ago |

I'm absolutely amazed at how capable the new 1B model is, considering it's just a 1.3GB download (for the Ollama GGUF version).

I tried running a full codebase through it (since it can handle 128,000 tokens) and asking it to summarize the code - it did a surprisingly decent job, incomplete but still unbelievable for a model that tiny: https://gist.github.com/simonw/64c5f5b111fe473999144932bef42...

More of my notes here: https://simonwillison.net/2024/Sep/25/llama-32/

I've been trying out the larger image models to using the versions hosted on https://lmarena.ai/ - navigate to "Direct Chat" and you can select them from the dropdown and upload images to run prompts.

GaggiX 1 year ago | |

Llama 3.2 vision models don't seem that great if they have to compare them to Claude 3 Haiku or GPT4o-mini. For an open alternative I would use Qwen-2-72B model, it's smaller than the 90B and seems to perform quite better. Also Qwen2-VL-7B as an alternative to Llama-3.2-11B, smaller, better in visual benchmarks and also Apache 2.0.

Molmo models: https://huggingface.co/collections/allenai/molmo-66f379e6fe3..., also seem to perform better than Llama-3.2 models while being smaller and Apache 2.0.

f38zf5vdt 1 year ago | | |

1. Ignore the benchmarks. I've been A/Bing 11B today with Molmo 72B [1], which itself has an ELO neck-and-neck with GPT4o, and it's even. Because everyone in open source tends to train on validation benchmarks, you really can not trust them.

2. The method of tokenization/adapter is novel and uses many fewer tokens than all comparable CLIP/SigLIP-adapter models, making it _much_ faster. Attention is O(n^2) on memory/compute per sequence length.

[1] https://molmo.allenai.org/blog

dannyobrien 1 year ago | | |

What interface do you use for a locally-run Qwen2-VL-7B? Inspired by Simon Willison's research[1], I have tried it out on Hugging Face[2]. Its handwriting recognition seems fantastic, but I haven't figured out how to run it locally yet.

[1] https://simonwillison.net/2024/Sep/4/qwen2-vl/ [2] https://huggingface.co/spaces/GanymedeNil/Qwen2-VL-7B

faangguyindia 1 year ago | |

If you are in US, you get 1 billion tokens a DAY with Gemini (Google) completely free of cost.

Gemini Flash is fast with upto 4 million token context.

Gemini Flash 002 improved in math and logical abilities surpassing Claude and Gpt 4o

You can simply use Gemini Flash for Code Completion, git review tool and many more.

a2128 1 year ago | | |

Is this sustainable though, or are they just trying really hard to attract users? If I build all of my tooling on it, will they start charging me thousands of dollars next year once the subsidies dry up? With a local model running with open source software, at least I can know that as long as my computer can still compute, the model will still run just as well and just as fast as it did on day 1, and cost the same amount of electricity

nycdatasci 1 year ago | | |

This is great for experimentation, but as others have pointed out recently there are persistent issues with Gemini that prevent use in actual products. The recitation/self-sensoring issue results in random failures:

https://github.com/google/generative-ai-docs/issues/257

airspresso 1 year ago | | |

Free of cost != free open model. Free of cost means all your requests are logged for Google to use as training data and whatnot.

Llama3.2 on the other hand runs locally, no data is ever sent to a 3rd party, so I can freely use it to summarize all my notes regardless of one of them being from my most recent therapy session and another being my thoughts on how to solve a delicate problem involving politics at work. I don't need to pre-classify all the input to make sure it's safe to share. Same with images, I can use Llama3.2 11B locally to interpret any photo I've taken without having to worry about getting consent from the people in the photo to share it with a 3rd party, or whether the photo is of my passport for some application I had to file or a receipt of something I bought that I don't want Google to train their next vision model OCR on.

TL;DR - Google free of cost models are irrelevant when talking about local models.

Deathmax 1 year ago | | |

The free tier API isn't US-only, Google has removed the free tier restriction for UK/EEA countries for a while now, with the added bonus of not training on your data if making a request from the UK/CH/EEA.

hobofan 1 year ago | | |

Not locked to the US, you get 1 billion tokens per month per model with Mistral since their recent announcement: https://mistral.ai/news/september-24-release/ (1 request per second is quite a harsh rate limit, but hey, free is free)

I'm pretty excited what all the services adopting free tiers is going to do to the landscape, as that should allow for a lot more experimentation and a lot more hobby projects transitioning into full-time projects, that previously felt a lot more risky/unpredictable with pricing.

jackbravo 1 year ago | |

I saw that you mention https://github.com/simonw/llm/. Hadn't seen this before. What is its purpose? And why not use ollama instead?

dannyobrien 1 year ago | | |

llm is Simon's command line front-end to a lot of the llm apis, local and cloud-based. Along with aider-chat, it's my main interface to any LLM work -- it works well with a chat model, one-off queries, and piping text or output into a llm chain. For people who live on the command line, or are just put-off by web interfaces, it's a godsend.

About the only thing I need to look further abroad for is when I'm working multi-modally -- I know Simon and the community are mainly noodling over the best command line UX for that: https://github.com/simonw/llm/issues/331

jerieljan 1 year ago | | |

It looks like a multi-purpose utility in the terminal for bridging together the terminal, your scripts or programs to both local and remote LLM providers.

And it looks very handy! I'll use this myself because I do want to invoke OpenAI and other cloud providers just like I do in ollama and piping things around and this accomplishes that, and more.

https://llm.datasette.io/en/stable/

I guess you can also accomplish similar results if you're just looking for `/chat/completions` and such if you configured something like LiteLLM and connecting that to ollama and any other service.

flakiness 1 year ago | | |

There is a recent podcast episode with the tool's author https://newsletter.pragmaticengineer.com/p/ai-tools-for-soft...

It's worth listening to learn abouut the context on how that tool is used.

magicalhippo 1 year ago | |

I'm new to this game. I played with Gemma 2 9B in an agent-like role before and was pleasantly surprised. I just tried some of the same prompts with Llama 3.2 3B and found it doesn't stick to my instructions very well.

Since I'm a n00b, does this just mean Llama 3.2 3B instruct was "tuned more softly" than Gemma 2 instruct? That is, could one expect to be able to further fine-tune it to more closely follow instructions?

forgingahead 1 year ago | |

What are people using to check token length of code bases? I'd like to point certain app folders to a local LLM, but no idea how that stuff is calculated? Seems like some strategic prompting (eg: this is a rails app, here is the folder structure with file names, and btw here are the actual files to parse) would be more efficient than just giving it the full app folder? No point giving it stuff from /lib and /vendor for the most part I reckon.

simonw 1 year ago | | |

I use my https://github.com/simonw/ttok command for that - you can pipe stuff into it for a token count.

Unfortunately it only uses the OpenAI tokenizers at the moment (via tiktoken), so counts for other models may be inaccurate. I find they tend to be close enough though.

xyc 1 year ago | | |

You can use llama.cpp server's tokenize endpoint to tokenize and count the tokens: https://github.com/ggerganov/llama.cpp/blob/master/examples/...

sumedh 1 year ago | | |

You can try Gemini Token count. https://ai.google.dev/api/tokens

lowyek 1 year ago | |

Hi simon, is there a way to run the vision model easily on my mac locally?

simonw 1 year ago | | |

Not that I’ve seen so far, but Ollama are pending a solution for that “soon”.

theaniketmaurya 1 year ago | | |

You can run it with LitServe (MPS GPU), here is the code - https://lightning.ai/lightning-ai/studios/deploy-llama-3-2-v...

foxhop 1 year ago | |

The llama 3.0, 3.1, & 3.2 all use the TikToken tokenizer which is the open source openai tokenizer.

littlestymaar 1 year ago | | |

GP is talking about context windows, not the number of token used by the tokenizer.

TZubiri 1 year ago | |

This obsession with using AI to help with programming is short sighted.

We discover gold and you think of gold pickaxes.

Carrok 1 year ago | | |

If we make this an analogy to video games, gold pickaxes can usually mine more gold much faster.

What could be short sighted about using tools to improve your daily work?

opdahl 1 year ago |

I'm blown away with just how open the Llama team at Meta is. It is nice to see that they are not only giving access to the models, but they at the same time are open about how they built them. I don't know how the future is going to go in the terms of models, but I sure am grateful that Meta has taken this position, and are pushing more openness.

a_wild_dandan 1 year ago |

"The Llama jumped over the ______!" (Fence? River? Wall? Synagogue?)

With 1-hot encoding, the answer is "wall", with 100% probability. Oh, you gave plausibility to "fence" too? WRONG! ENJOY MORE PENALTY, SCRUB!

I believe this unforgiving dynamic is why model distillation works well. The original teacher model had to learn via the "hot or cold" game on text answers. But when the child instead imitates the teacher's predictions, it learns semantically rich answers. That strikes me as vastly more compute-efficient. So to me, it makes sense why these Llama 3.2 edge models punch so far above their weight(s). But it still blows my mind thinking how far models have advanced from a year or two ago. Kudos to Meta for these releases.

adtac 1 year ago | |

>WRONG! ENJOY MORE PENALTY, SCRUB!

Is that true tho? During training, the model predicts {"wall": 0.65, "fence": 0.25, "river": 0.03}. Then backprop modifies the weights such that it produces {"wall": 0.67, "fence": 0.24, "river": 0.02} next time.

But it does that with a much richer feedback than WRONG! because we're also telling the model how much more likely "fence" is than "wall" in an indirect way. It's likely most of the neurons that supported "wall" also supported "fence", so the average neuron that supported "river" gets penalised much more than a neuron that supported "fence".

I agree that distillation is more efficient for exactly the same reason, but I think even models as old as GPT-3 use this trick to work as well as they do.

snovv_crash 1 year ago | | |

You are in violent agreement with GP.

croes 1 year ago | |

Isn't jumping over a fence more likely than jumping over a wall?

refulgentis 1 year ago | |

They don't, they're playing "hide the #s" a bit. Llama 3.2 3B is definitively worse than Phi-3 from May, both on any given metric and in an hour of playing with the 2, trying to justify moving to Llama 3.2 at 3B, given I'm adding Llama 3.2 at 1B.

grahamj 1 year ago | |

I would have went with “moon”

illwrks 1 year ago | |

Moat

whimsicalism 1 year ago | |

yeah i mean that is exactly why distillation works. if you just were one hotting it would be the same as training on same dataset

alanzhuly 1 year ago |

Llama3.2 3B feels a lot better than other models with same size (e.g. Gemma2, Phi3.5-mini models).

For anyone looking for a simple way to test Llama3.2 3B locally with UI, Install nexa-sdk(https://github.com/NexaAI/nexa-sdk) and type in terminal:

nexa run llama3.2 --streamlit

Disclaimer: I am from Nexa AI and nexa-sdk is an open-sourced. We'd love your feedback.

alfredgg 1 year ago | |

It's a great tool. Thanks!

I had to test it with Llama3.1 and was really easy. At a first glance Llama3.2 didn't seem available. The command you provided did not work, raising "An error occurred while pulling the model: not enough values to unpack (expected 2, got 1)".

alanzhuly 1 year ago | | |

Thanks for reporting. We are investigating this issue. Could you help submit an issue to our GitHub and provide a screenshot of the terminal (with pip show nexaai)? This could help us reproduce this issue faster. Much appreciated!

mikestaub 1 year ago | |

or https://chat.webllm.ai/

grahamj 1 year ago | |

or grab lmstudio

Zuiii 1 year ago | | |

For people who really care about open source, this is not.

freedomben 1 year ago |

If anyone else is looking for the bigger models on ollama and wondering where they are, the Ollama blog post answered that for me. The are "coming soon" so they just aren't ready quite yet[1]. I was a little worried when I couldn't find them but sounds like we just need to be patient.

[1]: https://ollama.com/blog/llama3.2

Patrick_Devine 1 year ago | |

We're working on it. There are already draft PRs up in the GH repo. We're still working out some kinks though.

xena 1 year ago | |

As a rule of thumb with AI stuff: it either works instantly, or wait a day or two.

refulgentis 1 year ago | |

ollama is "just" llama.cpp underneath, I recommend switching to LM Studio or Jan, they don't have this issue of proprietary wrapper that obfuscates, you can just use any ol GGUF

lolinder 1 year ago | | |

What proprietary wrapper? Isn't Ollama entirely open source?

calgoo 1 year ago | | |

I use gguf in ollama on a daily basis, so not sure what the issue is? Just wrap it in a modelfile and done!

moffkalast 1 year ago |

I've just tested the 1B and 3B at Q8, some interesting bits:

- The 1B is extremely coherent (feels something like maybe Mistral 7B at 4 bits), and with flash attention and 4 bit KV cache it only uses about 4.2 GB of VRAM for 128k context

- A Pi 5 runs the 1B at 8.4 tok/s, haven't tested the 3B yet but it might need a lower quant to fit it and with 9T training tokens it'll probably degrade pretty badly

- The 3B is a certified Gemma-2-2B killer

Given that llama.cpp doesn't support any multimodality (they removed the old implementation), it might be a while before the 11B and 90B become runnable. Doesn't seem like they outperform Qwen-2-VL at vision benchmarks though.

Patrick_Devine 1 year ago | |

Hoping to get this out soon w/ Ollama. Just working out a couple of last kinks. The 11b model is legit good though, particularly for tasks like OCR. It can actually read my cursive handwriting.

jsarv 1 year ago | | |

Naah, Qwen2-VL-7b still is much much better than 11b model for handwritten OCR from what i have tested. The 11b model hallucinates in case of handwritten OCR.

dhbradshaw 1 year ago |

Tried out 3B on ollama, asking questions in optics, bio, and rust.

It's super fast with a lot of knowledge, a large context and great understanding. Really impressive model.

kingkongjaffa 1 year ago |

llama3.2:3b-instruct-q8_0 is performing better than 3.1 8b-q4 on my macbookpro M1. It's faster and the results are better. It answered a few riddles and thought experiments better despite being 3b vs 8b.

I just removed my install of 3.1-8b.

my ollama list is currently:

$ ollama list

NAME ID SIZE MODIFIED

llama3.2:3b-instruct-q8_0 e410b836fe61 3.4 GB 2 hours ago

gemma2:9b-instruct-q4_1 5bfc4cf059e2 6.0 GB 3 days ago

phi3.5:3.8b-mini-instruct-q8_0 8b50e8e1e216 4.1 GB 3 days ago

mxbai-embed-large:latest 468836162de7 669 MB 3 months ago

PhilippGille 1 year ago | |

Aren't the _0 quantizations considered deprecated and _K_S or _K_M preferable?

https://github.com/ollama/ollama/issues/5425

Patrick_Devine 1 year ago | | |

For _K_S definitely not. We quantized 3b with q4_K_M since we were getting good results out of it. Officially Meta has only talked about quantization for 405b and hasn't given any actual guidance for what the "best" quantization should be for the smaller models. With The 1b model we didn't see good results with any of the 4b quantizations and went with q8_0 as the default.

taneq 1 year ago | |

For a second I read that as “it just removed my install of 3.1-8b” :D

fragmede 1 year ago | | |

https://github.com/KillianLucas/open-interpreter/

aryehof 1 year ago | |

On what basis do you use these different models?

kingkongjaffa 1 year ago | | |

mxbai is for embeddings for RAG.

The others are for text generation / instruction following, for various writing tasks.

kgeist 1 year ago |

Tried the 1B model with the "think step by step" prompt.

It gets "which is larger: 9.11 or 9.9?" right if it manages to mention that decimals need to be compared first in its step-by-step thinking. If it skips mentioning decimals, then it says 9.11 is larger.

It gets the strawberry question wrong even after enumerating all the letters correctly, probably because it can't properly count.

altruios 1 year ago | |

My understanding is the way the tokenization works prevents the LMM from being able to count occurrences of words or individual characters.

khafra 1 year ago | |

Of course, in many contexts, it is correct to put 9.11 after 9.9--software versioning does it that way, for example.

KeplerBoy 1 year ago | | |

That's why it's an interesting question and why it struggles so hard.

A good answer would explain that and state both results if the context is not hundred percent clear.

vergessenmir 1 year ago | |

What is the "think step by step" prompt? An example would be great, Is this part of the system prompt?

potatoman22 1 year ago | | |

It's appending "think step-by-step" to the end of the prompt to elicit a chain-of-thought response. See: https://arxiv.org/abs/2205.11916

bick_nyers 1 year ago | |

Does anyone know of a CoT dataset somewhere for finetuning? I would think exposing it to that type of modality during a finetune/lora would help.

JohnHammersley 1 year ago |

Ollama post: https://ollama.com/blog/llama3.2

getcrunk 1 year ago |

Still no 14/30b parameter models since llama 2. Seriously killing real usability for power users/diy.

The 7/8B models are great for poc and moving to edge for minor use cases … but there’s a big and empty gap till 70b that most people can’t run.

The tin foil hat in me is saying this is the compromise the powers that be have agreed too. Basically being “open” but practically gimped for average joe techie. Basically arms control

luke-stanley 1 year ago | |

The Llama 3.2 11B multimodal model is a bit less than 14B but smaller models can do more these days, and Meta are not the only ones making models. The 70B model has been pruned down by NVIDIA if I recall correctly. The 405B model also will be shrunk down and can presumably be used to strengthen smaller models. I'm not convinced by your shiny hat.

swader999 1 year ago | |

You don't need an F-15 to play at least, a decent sniper rifle will do. You can still practise even with a pellet gun. I'm running 70b models on my M2 max with 96 ram. Even larger models sort of work, although I haven't really put much time into anything above 70b.

int_19h 1 year ago | | |

With a 128Gb Mac, you can even run 405b at 1-bit quantization - it's large enough that even with the considerable quality drop that entails, it still appears to be smarter than 70b.

foxhop 1 year ago | |

4090 has 24G

So we really need ~40B or G model (two cards) or like a ~20B with some room for context window.

5090 has ??G - still unreleased

regularfry 1 year ago | | |

Qwen2.5 has a 32B release, and quantised at q5_k_m it *just about" completely fills a 4090.

It's a good model, too.

arnaudsm 1 year ago |

Is there an up-to-date leaderboard with multiple LLM benchmarks?

Livebench and Lmsys are weeks behind and sometimes refuse to add some major models. And press releases like this cherry pick their benchmarks and ignore better models like qwen2.5.

If it doesn't exist I'm willing to create it

threatripper 1 year ago | |

https://artificialanalysis.ai/leaderboards/models

"LLM Leaderboard - Comparison of GPT-4o, Llama 3, Mistral, Gemini and over 30 models

Comparison and ranking the performance of over 30 AI models (LLMs) across key metrics including quality, price, performance and speed (output speed - tokens per second & latency - TTFT), context window & others. For more details including relating to our methodology, see our FAQs."

gdiamos 1 year ago |

Llama 3.2 includes a 1B parameter model. This should be 8x higher throughput for data pipelines. In our experience, smaller models are just fine for simple tasks like reading paragraphs from PDF documents.

kombine 1 year ago |

Are these models suitable for Code assistance - as an alternative to Cursor or Copilot?

bboygravity 1 year ago | |

I use Continue on VScode, works well with Ollama and llama3.1 (but obviously not as good as Claude).

Ey7NFZ3P0nzAe 1 year ago |

Interesting that its scores are somewhat helow Pixtral 12B https://mistral.ai/news/pixtral-12b/

gunalx 1 year ago |

3b was pretty good at multimodal (Norwegian) still a lot of gibberish at times, and way more sensitive than 8b but more usable than Gemma 2 2b at multi modal, fine at my python list sorter with args standard question. But 90b vision just refuses all my actually useful tasks like helping recreate the images in html or do anything useful with the image data other than describing it. Have not gotten as stuck with 70b or openai before. Insane amount of refusals all the time.

resters 1 year ago |

This is great! Does anyone know if the llama models are trained to do function calling like openAI models are? And/or are there any function calling training datasets?

refulgentis 1 year ago | |

Yes (rationale: 3.1 was, would be strange to rollback.)

In general, you'll do a ton of damage by constraining token generation to valid JSON - I've seen models as small as 800M handle JSON with that. It's ~impossible to train constraining into it with remotely the same reliability -- you have to erase a ton of conversational training that makes it say ex. "Sure! Here's the JSON you requested:"

noahbp 1 year ago | | |

What kind of damage is done by constraining token generation to valid JSON?

Closi 1 year ago | | |

What about OpenAI Structured Outputs? This seems to do exactly this.

TmpstsTrrctta 1 year ago | |

They mention tool calling in the link for the smaller models, and compare to 8B levels of function calling in benchmarks here:

https://news.ycombinator.com/item?id=41651126

ushakov 1 year ago | |

yes, but only the text-only models!

https://www.llama.com/docs/model-cards-and-prompt-formats/ll...

zackangelo 1 year ago | | |

This is incorrect:

> With text-only inputs, the Llama 3.2 Vision Models can do tool-calling exactly like their Llama 3.1 Text Model counterparts. You can use either the system or user prompts to provide the function definitions.

> Currently the vision models don’t support tool-calling with text+image inputs.

They support it, but not when an image is submitted in the prompt. I'd be curious to see what the model does. Meta typically sets conservative expectations around this type of behavior (e.g., they say that the 3.1 8b model won't do multiple tool calls, but in my experience it does so just fine).

winddude 1 year ago | | |

the vision models can also do tool calling according to the docs, but with text-only inputs, maybe that's what you meant ~ <https://www.llama.com/docs/model-cards-and-prompt-formats/ll...>

l5870uoo9y 1 year ago |

> These models are enabled on day one for Qualcomm and MediaTek hardware and optimized for Arm processors.

Do they require GPU or can they be deployed on VPS with dedicated CPU?

KeplerBoy 1 year ago | |

Doesn't require a GPU, it will just be faster with a GPU.

chriskanan 1 year ago |

The assessments of visual capability really need to be more robust. They are still using datasets like VQAv2, which while providing some insight, have many issues. There are many newer datasets that serve as much more robust tests and that are less prone to being affected by linguistic bias.

I'd like to see more head-to-head comparisons with community created multi-modal LLMs as done in these papers:

https://arxiv.org/abs/2408.05334

https://arxiv.org/abs/2408.03326

I look forward to reading the technical report, once its available. I couldn't find a link to one, yet.

Jackson__ 1 year ago | |

Looking at their benchmark results and my own experience with their 11B vision model, I think while not perfect they represent the model well.

Meaning it's doing impressively bad compared to other models I've tried in similar sizes(for vision).

sgt 1 year ago |

Anyone on HN running models on their own local machines, like smaller Llama models or such? Or something else?

grahamj 1 year ago | |

Doesn’t everyone? X) it’s super easy now with ollama + openwebui or an all in 1 like mlstudio

sgt 1 year ago | | |

Was just concerned I don't have enough RAM. I have 16GB (M2 Pro). Got amazing mem bandwidth though (800GB/s)

karpatic 1 year ago | |

For sure dude! Top comment thread is all about using ollama and other ways to get that done.

404mm 1 year ago |

Can anyone recommend a webUI client for ollama?

Ey7NFZ3P0nzAe 1 year ago | |

Open webui has promising aspects, the same authors are pushing for "pipelines" which are a standard for how inputs and outputs are modified on the fly for different purposes.

iKlsR 1 year ago | |

openwebui

404mm 1 year ago | | |

Nice one. Thank you .. it looks like ChatGPT (not that there’s anything wrong with that)

fungi 1 year ago | |

ive been using https://github.com/valiantlynx/ollama-docker which comes with https://github.com/open-webui/open-webui

papascrubs 1 year ago | |

https://get.big-agi.com/

xrd 1 year ago |

I'm currently fighting with a fastapi python app deployed to render. It's interesting because I'm struggling to see how I encode the image and send it using curl. Their example sends directly from the browser and uses a data uri.

But, this is relevant because I'm curious how this new model allows image inputs. Do you paste a base64 image into the prompt?

It feels like these models can start not only providing the text generation backend, but start to replace the infrastructure for the API as well.

Can you input images without something in front of it like openwebui?

josephernest 1 year ago |

Can it run with llama-cpp-python? If so, where can we find and download the gguf files? Are they distributed directly by meta, or are they converted to gguf format by third parties?

thimabi 1 year ago |

Does anyone know how these models fare in terms of multilingual real-world usage? I’ve used previous iterations of llama models and they all seemed to be lacking in that regard.

aussieguy1234 1 year ago |

When using meta.ai, its able to generate images as well as understand them. Has this also been open sourced or just a GPT4o style ability to see images?

desireco42 1 year ago |

I have to say that running this model locally I was pleasantly suprised how well it ran, it doesn't use as much resources and produce decent output, comparable to ChatGPT, it is not quite as OpenAI but for a lot of tasks, since it doesn't burden the computer, it can be used with local model.

Next I want to try to use Aider with it and see how this would work.

GaggiX 1 year ago |

The 90B seem to perform pretty weak on visual tasks compare to Qwen2-VL-72B: https://huggingface.co/Qwen/Qwen2-VL-72B-Instruct, or am I missing something?

notpublic 1 year ago |

Llama-3.2-11B-Vision-Instruct does an excellent job extracting/answering questions from screenshots. It is even able to answer questions based on information buried inside a flowchart. How is this even possible??

Ey7NFZ3P0nzAe 1 year ago | |

Because they trained the text model. Then froze the weights. Then trained a vision model on text image pairs of progressively higher quality. Then trained an adapter to align their latent spaces. So it became smart on text then gain a new input sense magically without changing its weights

ComputerGuru 1 year ago | | |

Is this - at a reasonable guess - what most believe OpenAI did with 4o?

vintermann 1 year ago | |

Oh, this is promising. It's not surprising to me: image models have been very oriented towards photography and scene understanding rather than understanding symbolic information in images (like text or diagrams), but I always thought that it should be possible to make the model better at the latter, for instance by training it more on historical handwritten documents.

faangguyindia 1 year ago | |

How good it is at comic reading?

bboygravity 1 year ago | |

magic

sk11001 1 year ago |

Can one of thse models be run on a single machine? What specs do you need?

Y_Y 1 year ago | |

Absolutely! They have a billion-parameter model that will run on my first computer if we quantize it to 1.5 bits. But realistically yes, if you can fit in total ram you can run it slowly, if you can fit it in gpu ram you can probably run it fast enough to chat.

sumedh 1 year ago | |

The 8B models run fine on a M1 pro 16GB.

bombi 1 year ago |

Is Termux enough to run the 1B model on Android?

brrrrrm 1 year ago | |

depends on your phone, but try a couple of these variants with ollama https://ollama.com/library/llama3.2/tags

e.g. `ollama run llama3.2:1b-instruct-q4_0`

oulipo 1 year ago |

Can the 3B run on a M1 macbook? It seems that it hogs all the memory. The 1B runs fine

Tepix 1 year ago | |

It all depends on the amount of RAM. Since you're not mentioning it, you probably have only the default 8GB? Still, the 3B should run even on an 8GB machine.

oulipo 1 year ago | | |

I tried and it was so slow the system became unuseable

bottlepalm 1 year ago |

What mobile devices can the smaller models run on? iPhone, Android?

jillion 1 year ago | |

apparently so, but im trying to find a working example / some details on what specific iOS / android devices are capable of running this

troupo 1 year ago |

It's hallucinating so badly, it's kinda hilarious

Literally everything about the quote below is wrong. The problem with these "amazing ground-breaking models" is that you never know which knowledge they actually possess.

--- start quote ---

>>> what is the last version of Ash Framework for Elixir you know My training data goes up to December 2023, and according to the Ash Framework releases, the last stable version that I'm familiar with is:

* *Ash 2.14*: Released on April 8, 2021.

However, it's worth noting that Ash has been merged into Elixir itself, and most of its features have been integrated into the Elixir language and standard libraries.

In fact, Ash was one of the first frameworks to be included in Elixir, and it played a significant role in shaping the language. Many of Ash's core concepts, such as `defstruct/1` and `defmodule/1`, were later adopted by Elixir itself.

...

Keep in mind that Ash is no longer actively maintained as a separate framework, but its legacy continues to influence Elixir and its community.

--- end quote ---

borromakot 1 year ago | |

Wildly incorrect

gdiamos 1 year ago |

Do inference frameworks like vllm support vision?

woodson 1 year ago | |

Yes, vLLM does (though marked experimental): https://docs.vllm.ai/en/latest/models/vlm.html

theaniketmaurya 1 year ago | |

You can run with LitServe. here is the code - https://lightning.ai/lightning-ai/studios/deploy-llama-3-2-v...

stogot 1 year ago |

Surprised no mention of audio?

edude03 1 year ago | |

was surprised by this as well

ofermend 1 year ago |

Great release. Models just added to Hallucination Leaderboard: https://github.com/vectara/hallucination-leaderboard.

TL;DR: * 90B-Vision: 4.3% hallucination rate * 11B-Vision: 5.5% hallucination rate

dharma1 1 year ago |

are these better than qwen at codegen?

taytus 1 year ago |

meta.ai still running on 3.1

84adam 1 year ago |

excited for this

sva_ 1 year ago |

Curious about the multimodal model's architecture. But alas, when I try to request access

> Llama 3.2 Multimodal is not available in your region.

It sounds like they input the continuous output of an image encoder into a transformer, similar to transfusion[0]? Does someone know where to find more details?

Edit:

> Regarding the licensing terms, Llama 3.2 comes with a very similar license to Llama 3.1, with one key difference in the acceptable use policy: any individual domiciled in, or a company with a principal place of business in, the European Union is not being granted the license rights to use multimodal models included in Llama 3.2. [1]

What a bummer.

0. https://www.arxiv.org/abs/2408.11039

1. https://huggingface.co/blog/llama32#llama-32-license-changes...

minimaxir 1 year ago |

Off topic/meta, but the Llama 3.2 news topic received many, many HN submissions and upvotes but never made it to the front page: the fact that it's on the front page now indicates that moderators intervened to rescue it: https://news.ycombinator.com/from?site=meta.com (showdead on)

If there's an algorithmic penalty against the news for whatever reason, that may be a flaw in the HN ranking algorithm.

makin 1 year ago | |

The main issue was that Meta quickly took down the first announcement, and the only remaining working submission was the information-sparse HuggingFace link. By the time the other links were back up, it was too late. Perfect opportunity for a rescue.

senko 1 year ago | |

Yeah I submitted what turned out to be a dupe but I could never find the original, probably was buried at the time. Then a few hours later it miraculously (re?)appeared.

AIUI exact dupes just get counted as upvotes, which hasn’t happened in my case.

nmwnmw 1 year ago |

- Llama 3.2 introduces small vision LLMs (11B and 90B parameters) and lightweight text-only models (1B and 3B) for edge/mobile devices, with the smaller models supporting 128K token context.

- The 11B and 90B vision models are competitive with leading closed models like Claude 3 Haiku on image understanding tasks, while being open and customizable.

- Llama 3.2 comes with official Llama Stack distributions to simplify deployment across environments (cloud, on-prem, edge), including support for RAG and safety features.

- The lightweight 1B and 3B models are optimized for on-device use cases like summarization and instruction following.

monkfish328 1 year ago |

Zuckerberg has never liked having Android/iOs as gatekeepers i.e. "platforms" for his apps.

He's hoping to control AI as the next platform through which users interact with apps. Free AI is then fine if the surplus value created by not having a gatekeeper to his apps exceeds the cost of the free AI.

That's the strategy. No values here - just strategy folks.

jsemrau 1 year ago | |

Agents are the new Apps

acedTrex 1 year ago | |

I mean, just because he is not doing this as a perfectly altruistic gesture does not mean the broader ecosystem does not benefit from him doing it

monkfish328 1 year ago | | |

For sure

TheAceOfHearts 1 year ago |

I still can't access the hosted model at meta.ai from Puerto Rico, despite us being U.S. citizens. I don't know what Meta has against us.

Could someone try giving the 90b model this word search problem [0] and tell me how it performs? So far with every model I've tried, none has ever managed to find a single word correctly.

[0] https://imgur.com/i9Ps1v6

daemonologist 1 year ago | |

Both Llama 3.2 90B and Claude 3.5 Sonnet can find "turkey" and "spoon", probably because they're left-to-right. Llama gave approximate locations for each and Claude gave precise but slightly incorrect locations. Further prompting to look for diagonal and right-to-left words returned plausible but incorrect responses, slightly more plausible from Claude than Llama. (In this test I cropped the word search to just the letter grid, and asked the model to find any English words related to soup.)

Anyways, I think there just isn't a lot of non-right-to-left English in the training data. A word search is pretty different from the usual completion, chat, and QA tasks these models are oriented towards; you might be able to get somewhere with fine-tuning though.

gunalx 1 year ago | |

Try and find where the words are in this word puzzle undefined

''' There are two words in this word puzzle: "soup" and "mix". The word "soup" is located in the top row, and the word "mix" is located in the bottom row. ''' Edit: Tried a bit more probing like asking it to find spoon or any other word. It just makes up a row and column.

paxys 1 year ago | |

Non US citizens can access the model just fine, if that's what you are implying.

TheAceOfHearts 1 year ago | | |

I'm not implying anything. It's just frustrating that despite being a US territory with US citizens, PR isn't allowed to use this service without any explanation.

Workaccount2 1 year ago | |

This is likely because the models use OCR on images with text, and once parsed the word search doesn't make sense anymore.

Would be interesting to see a model just working on raw input though.

simonw 1 year ago | | |

Image models such as Llama 3.2 11B and 90B (and the Claude 3 series, and Microsoft Phi-3.5-vision-instruct, and PaliGemma, and GPT-4o) don't run OCR as a separate step. Everything they do is from that raw vision model.

alexcpn 1 year ago |

In KungfuPanda there is this line that the Panda says "I love KungFuuuuuuuu", well I normally don't tell like this, but when I saw this and (starting to use this), I feel like yelling"I like Metaaaaa or is it LLAMMMAAA or is it Open source.. or is it this cool ecosystem which gives such value for free...

404mm 1 year ago |

Newbie question, what size model would be needed to have a 10x software engineer skills and no knowledge of the human kind (ie, no need to know how to make a pizza or sequence your DNA). Is there such a model?

keyle 1 year ago | |

No, not yet. And such LLM wouldn't speak back in English or French without some "knowledge of the human kind" as you put it.

pants2 1 year ago | |

Most code is grounded in real-world concepts somehow. Imagine an engineer at Domino's asking it to write an ordering app. Now your model needs to know what goes in to a pizza.

_lvbh 1 year ago | |

10x relative to what? I’ve seen bad developers use AI to 10x their productivity but they still couldn’t come anywhere close to a good developer without AI (granted, this was at a hackathon on pretty advanced optimization research. Maybe there’s more impact on lower skilled tasks)

acedTrex 1 year ago | | |

A bad dev using AI is now 10 times more productive at writing bad code

palisade 1 year ago | |

Not yet. But, Nvidia's CEO announced a few months ago that we're about 5 years away. And, OpenAI just this week announced Super Intelligence is up to 2000 days (e.g. around 5 years) away.

latentsea 1 year ago | |

So long as you don't mind glue in your pizza...

faangguyindia 1 year ago | |

Try codegemma.

Or Gemini Flash for code completion and generation.