Next I want to try to use Aider with it and see how this would work.
e.g. `ollama run llama3.2:1b-instruct-q4_0`
Literally everything about the quote below is wrong. The problem with these "amazing ground-breaking models" is that you never know which knowledge they actually possess.
--- start quote ---
>>> what is the last version of Ash Framework for Elixir you know
My training data goes up to December 2023, and according to the Ash Framework releases, the last stable version that I'm familiar with is:
* *Ash 2.14*: Released on April 8, 2021.
However, it's worth noting that Ash has been merged into Elixir itself, and most of its features have been integrated into the Elixir language and standard libraries.
In fact, Ash was one of the first frameworks to be included in Elixir, and it played a significant role in shaping the language. Many of Ash's core concepts, such as `defstruct/1` and `defmodule/1`, were later adopted by Elixir itself.
...
Keep in mind that Ash is no longer actively maintained as a separate framework, but its legacy continues to influence Elixir and its community.
--- end quote ---
TL;DR:
* 90B-Vision: 4.3% hallucination rate
* 11B-Vision: 5.5% hallucination rate
> Llama 3.2 Multimodal is not available in your region.
It sounds like they input the continuous output of an image encoder into a transformer, similar to transfusion[0]? Does someone know where to find more details?
Edit:
> Regarding the licensing terms, Llama 3.2 comes with a very similar license to Llama 3.1, with one key difference in the acceptable use policy: any individual domiciled in, or a company with a principal place of business in, the European Union is not being granted the license rights to use multimodal models included in Llama 3.2. [1]
What a bummer.
0. https://www.arxiv.org/abs/2408.11039
1. https://huggingface.co/blog/llama32#llama-32-license-changes...
If there's an algorithmic penalty against the news for whatever reason, that may be a flaw in the HN ranking algorithm.
AIUI exact dupes just get counted as upvotes, which hasn’t happened in my case.
- The 11B and 90B vision models are competitive with leading closed models like Claude 3 Haiku on image understanding tasks, while being open and customizable.
- Llama 3.2 comes with official Llama Stack distributions to simplify deployment across environments (cloud, on-prem, edge), including support for RAG and safety features.
- The lightweight 1B and 3B models are optimized for on-device use cases like summarization and instruction following.
He's hoping to control AI as the next platform through which users interact with apps. Free AI is then fine if the surplus value created by not having a gatekeeper to his apps exceeds the cost of the free AI.
That's the strategy. No values here - just strategy folks.
Could someone try giving the 90b model this word search problem [0] and tell me how it performs? So far with every model I've tried, none has ever managed to find a single word correctly.
Anyways, I think there just isn't a lot of non-right-to-left English in the training data. A word search is pretty different from the usual completion, chat, and QA tasks these models are oriented towards; you might be able to get somewhere with fine-tuning though.
'''
There are two words in this word puzzle: "soup" and "mix". The word "soup" is located in the top row, and the word "mix" is located in the bottom row.
'''
Edit: Tried a bit more probing like asking it to find spoon or any other word. It just makes up a row and column.
Would be interesting to see a model just working on raw input though.
Or Gemini Flash for code completion and generation.
> To add image input support, we trained a set of adapter weights that integrate the pre-trained image encoder into the pre-trained language model. The adapter consists of a series of cross-attention layers that feed image encoder representations into the language model. We trained the adapter on text-image pairs to align the image representations with the language representations. During adapter training, we also updated the parameters of the image encoder, but intentionally did not update the language-model parameters. By doing that, we keep all the text-only capabilities intact, providing developers a drop-in replacement for Llama 3.1 models.
What this crudely means is that they extended the base Llama 3.1 to include image-based weights and inference. You can do that if you freeze the existing weights and add new ones, which are then updated during training runs (adapter training). Then they did SFT and RLHF runs on the composite model (for lack of a better word). This is a little-known technique, and very effective. I just had a paper accepted about a similar technique, and will share a blog post once that is published if you are interested (though it's not on this scale, and probably not as effective). Side note: that is also why you see parameter sizes of 11B and 90B, as additions to the text-only models.
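To make the "freeze the old weights, train only the new ones" idea concrete, here is a deliberately tiny sketch. It is not Meta's training code: the real adapter is a stack of cross-attention layers, while this toy uses one frozen scalar weight (`base_w`) and one trainable scalar (`adapter_w`). All names and numbers are hypothetical.

```python
# Toy illustration of adapter training: base_w is frozen, adapter_w is learned.
# Everything here (names, data, learning rate) is made up for illustration.

def train_adapter(samples, base_w, adapter_w=0.0, lr=0.05, epochs=200):
    """Fit target ~ base_w * text + adapter_w * image, updating only adapter_w."""
    for _ in range(epochs):
        for text, image, target in samples:
            pred = base_w * text + adapter_w * image
            error = pred - target
            # Gradient of 0.5 * error**2 with respect to adapter_w only.
            adapter_w -= lr * error * image
            # base_w is deliberately never touched, so text-only behaviour
            # (image == 0) stays exactly what it was before training.
    return base_w, adapter_w


# Data generated from base_w = 2 and a "true" adapter weight of 3.
samples = [(1.0, 1.0, 5.0), (2.0, 1.0, 7.0), (1.0, 2.0, 8.0)]
base_w, adapter_w = train_adapter(samples, base_w=2.0)
print(base_w, round(adapter_w, 4))
```

Because the frozen weight never changes, any input with no image component produces the same output before and after training, which is the "drop-in replacement" property the quoted passage describes.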
In the Transfusion paper, they use both discrete (text tokens) and continuous (images) signals to train a single transformer. To do this, they use a VAE to create a latent representation of the images (split into patches), which is fed into the transformer within one linear sequence alongside the text tokens. They trained the whole model from scratch (the largest being a 7B model trained on 2T tokens with a 1:1 text:image split). The loss they trained the model on was a combination of the normal language modeling (LM) loss (cross entropy on tokens) and a DDPM diffusion loss on the images.
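Written out, the combined objective described above would look roughly like the following. This is my own rendering using the standard forms of the two losses; the weighting coefficient $\lambda$ and the exact conditioning are assumptions, so check the paper for the precise definition.

```latex
\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{LM}} + \lambda \, \mathcal{L}_{\text{DDPM}}

\mathcal{L}_{\text{LM}} = -\sum_{i} \log p_\theta\left(y_i \mid y_{<i}\right)
\qquad
\mathcal{L}_{\text{DDPM}} = \mathbb{E}_{x_0,\, t,\, \epsilon}
  \left[ \left\lVert \epsilon - \epsilon_\theta(x_t, t) \right\rVert_2^2 \right]
```

Here $y_i$ are text tokens, $x_t$ is the noised image latent at diffusion step $t$, and $\epsilon_\theta$ is the model's noise prediction.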
There was some prior art on this, but models like Chameleon discretized the images into a token codebook of a certain size - so there were special tokens representing the images. However, this incurred a severe information loss which Transfusion claims to have alleviated using the continuous latent vectors of images.
Training a single set of weights (shared weights) on different modalities seems more interesting looking forward, in particular for emergent phenomena imo.
Some of the authors of the transfusion paper work at meta so I was hoping they trained a larger-scale model. Or released any transfusion-based weights at all.
Anyways, exciting stuff either way.
https://github.com/meta-llama/llama-models/blob/main/models/...
https://github.com/meta-llama/llama-models/blob/main/models/...
> With respect to any multimodal models included in Llama 3.2, the rights granted under Section 1(a) of the Llama 3.2 Community License Agreement are not being granted to you if you are an individual domiciled in, or a company with a principal place of business in, the European Union. This restriction does not apply to end users of a product or service that incorporates any such multimodal models.
Edit: the larger 72B model is not under Apache 2.0 but https://huggingface.co/Qwen/Qwen2-VL-72B-Instruct/blob/main/...
Qwen2-VL-72B seems to perform better than llama-3.2-90B on visual tasks.
"See, consumers? Look at how bad your regulation is, that you're missing out on all these cool things we're working on. Talk to your politicians!"
Regardless of your political opinion on the subject, you've got to admit, at the very least, it will be educational to see how this develops over the next 5-10 years of tech progress, as the EU gets excluded from more and more things.
When we had numerous discussions on HN as these rules were implemented, this is precisely what the Europeans said should happen.
So why does it now have to be some concerted effort to "put the screws to EU"?
I otherwise agree it will be interesting, but mostly in the sense that I watched people swear up and down this was just about protecting EU citizens and they were fine with none of these companies doing anything in the EU or not prioritizing the EU if they decided it wasn't worth the cost.
We'll see if that's true or not, I guess, or if they really wanted it to be "you have to do it, but on our terms" or whatever.
Funny, I see that the other way around, actually. The EU is forcing Big Tech to be transparent and not exploit their users. It's the companies that must choose to comply, or take their business elsewhere. Let's not forget that Apple users in the EU can use 3rd-party stores, and it was EU regulations that forced Apple to switch to USB-C. All of these are a win for consumers.
The reason Meta is not making their models available in the EU is because they can't or won't comply with the recent AI regulations. This only means that the law is working as intended.
> it will be educational to see how this develops over the next 5-10 years of tech progress, as the EU gets excluded from more and more things.
I don't think we're missing much that Big Tech has to offer, and we'll probably be better off for it. I'm actually in favor of even stricter regulations, particularly around AI, but what was recently enacted is a good start.
It isn't clear at all, and in fact, given how light-handed the European Commission is when dealing with infringement cases (no fine before lots of warnings and even clarification meetings about how to comply with the law), Meta would take no risk at all releasing something now, even if they needed to roll it back later.
They are definitely trying to put pressure on the European Commission, leveraging the fact that Thierry Breton was dismissed.
They've decided it's not worth their time/energy to do it right now in a way that complies with regulation (or whatever)
Isn't that precisely the choice the EU wants them to make?
Either do it within the bounds of what we want, or leave us out of it?
> Meta AI isn't available yet in your country
Maybe it's just my ISP, I'll ask some friends if they can access the service.
I tried running a full codebase through it (since it can handle 128,000 tokens) and asking it to summarize the code - it did a surprisingly decent job, incomplete but still unbelievable for a model that tiny: https://gist.github.com/simonw/64c5f5b111fe473999144932bef42...
More of my notes here: https://simonwillison.net/2024/Sep/25/llama-32/
I've been trying out the larger image models using the versions hosted on https://lmarena.ai/ - navigate to "Direct Chat" and you can select them from the dropdown and upload images to run prompts.
Molmo models: https://huggingface.co/collections/allenai/molmo-66f379e6fe3..., also seem to perform better than Llama-3.2 models while being smaller and Apache 2.0.
2. The method of tokenization/adapter is novel and uses many fewer tokens than all comparable CLIP/SigLIP-adapter models, making it _much_ faster. Attention is O(n^2) on memory/compute per sequence length.
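To put a number on why fewer image tokens matter when attention is quadratic, here is a small back-of-the-envelope sketch. The per-image token counts are hypothetical, not measurements of any particular model, and fused attention kernels can avoid materializing the full score matrix in memory even though the compute still grows quadratically.

```python
def attention_score_entries(num_tokens: int) -> int:
    """Entries in one head's full n x n attention score matrix."""
    return num_tokens * num_tokens


# Hypothetical token counts per image for two different adapters.
few_tokens, many_tokens = 144, 576

for n in (few_tokens, many_tokens):
    print(f"{n} tokens -> {attention_score_entries(n)} score entries")

ratio = attention_score_entries(many_tokens) / attention_score_entries(few_tokens)
print(f"4x the tokens costs {ratio:.0f}x the attention work")
```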
[1] https://simonwillison.net/2024/Sep/4/qwen2-vl/ [2] https://huggingface.co/spaces/GanymedeNil/Qwen2-VL-7B
Gemini Flash is fast, with up to 4 million token context.
Gemini Flash 002 improved in math and logical abilities, surpassing Claude and GPT-4o.
You can simply use Gemini Flash for code completion, as a git review tool, and much more.
Llama3.2 on the other hand runs locally, no data is ever sent to a 3rd party, so I can freely use it to summarize all my notes regardless of one of them being from my most recent therapy session and another being my thoughts on how to solve a delicate problem involving politics at work. I don't need to pre-classify all the input to make sure it's safe to share. Same with images, I can use Llama3.2 11B locally to interpret any photo I've taken without having to worry about getting consent from the people in the photo to share it with a 3rd party, or whether the photo is of my passport for some application I had to file or a receipt of something I bought that I don't want Google to train their next vision model OCR on.
TL;DR - Google's free-of-cost models are irrelevant when talking about local models.
I'm pretty excited about what all the services adopting free tiers will do to the landscape, as that should allow for a lot more experimentation and a lot more hobby projects transitioning into full-time projects that previously felt a lot more risky/unpredictable with pricing.
About the only thing I need to look further abroad for is when I'm working multi-modally -- I know Simon and the community are mainly noodling over the best command line UX for that: https://github.com/simonw/llm/issues/331
And it looks very handy! I'll use this myself because I do want to invoke OpenAI and other cloud providers the same way I use ollama, piping things around, and this accomplishes that and more.
https://llm.datasette.io/en/stable/
I guess you can also accomplish similar results, if you're just looking for `/chat/completions` and such, by configuring something like LiteLLM and connecting it to ollama and any other service.
It's worth listening to learn about the context of how that tool is used.
Since I'm a n00b, does this just mean Llama 3.2 3B instruct was "tuned more softly" than Gemma 2 instruct? That is, could one expect to be able to further fine-tune it to more closely follow instructions?
Unfortunately it only uses the OpenAI tokenizers at the moment (via tiktoken), so counts for other models may be inaccurate. I find they tend to be close enough though.
We discover gold and you think of gold pickaxes.
What could be short sighted about using tools to improve your daily work?
With 1-hot encoding, the answer is "wall", with 100% probability. Oh, you gave plausibility to "fence" too? WRONG! ENJOY MORE PENALTY, SCRUB!
I believe this unforgiving dynamic is why model distillation works well. The original teacher model had to learn via the "hot or cold" game on text answers. But when the child instead imitates the teacher's predictions, it learns semantically rich answers. That strikes me as vastly more compute-efficient. So to me, it makes sense why these Llama 3.2 edge models punch so far above their weight(s). But it still blows my mind thinking how far models have advanced from a year or two ago. Kudos to Meta for these releases.
Is that true tho? During training, the model predicts {"wall": 0.65, "fence": 0.25, "river": 0.03}. Then backprop modifies the weights such that it produces {"wall": 0.67, "fence": 0.24, "river": 0.02} next time.
But it does that with a much richer feedback than WRONG! because we're also telling the model how much more likely "fence" is than "wall" in an indirect way. It's likely most of the neurons that supported "wall" also supported "fence", so the average neuron that supported "river" gets penalised much more than a neuron that supported "fence".
I agree that distillation is more efficient for exactly the same reason, but I think even models as old as GPT-3 use this trick to work as well as they do.
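The disagreement above can be made concrete at the logit level. For softmax plus cross-entropy, the gradient with respect to each logit is simply (predicted probability minus target probability). The sketch below compares a one-hot target with a soft "teacher" target; the predicted numbers follow the example in the thread (padded with an `<other>` bucket so they sum to 1), and the teacher distribution is invented for illustration.

```python
def logit_gradient(predicted: dict, target: dict) -> dict:
    """Gradient of softmax cross-entropy w.r.t. each logit: p - q."""
    return {tok: round(predicted[tok] - target.get(tok, 0.0), 2) for tok in predicted}


# Student prediction; "<other>" stands in for the rest of the vocabulary.
predicted = {"wall": 0.65, "fence": 0.25, "river": 0.03, "<other>": 0.07}

one_hot = {"wall": 1.0}  # ordinary next-token training
teacher = {"wall": 0.70, "fence": 0.25, "river": 0.01, "<other>": 0.04}  # hypothetical

hard = logit_gradient(predicted, one_hot)
soft = logit_gradient(predicted, teacher)

print("one-hot target:", hard)  # positive value = logit gets pushed down
print("teacher target:", soft)
```

With the one-hot target, "fence" is pushed down in proportion to the probability it was given; with the teacher target, it is left alone because the teacher also finds it plausible. How much of the "fence is nearly right" signal survives through shared neurons in the one-hot case, as argued above, is not something this logit-level view can show.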
For anyone looking for a simple way to test Llama3.2 3B locally with a UI, install nexa-sdk (https://github.com/NexaAI/nexa-sdk) and type in the terminal:
nexa run llama3.2 --streamlit
Disclaimer: I am from Nexa AI and nexa-sdk is open source. We'd love your feedback.
I had to test it with Llama3.1 and it was really easy. At first glance Llama3.2 didn't seem to be available. The command you provided did not work, raising "An error occurred while pulling the model: not enough values to unpack (expected 2, got 1)".
- The 1B is extremely coherent (feels something like maybe Mistral 7B at 4 bits), and with flash attention and 4 bit KV cache it only uses about 4.2 GB of VRAM for 128k context
- A Pi 5 runs the 1B at 8.4 tok/s, haven't tested the 3B yet but it might need a lower quant to fit it and with 9T training tokens it'll probably degrade pretty badly
- The 3B is a certified Gemma-2-2B killer
Given that llama.cpp doesn't support any multimodality (they removed the old implementation), it might be a while before the 11B and 90B become runnable. Doesn't seem like they outperform Qwen-2-VL at vision benchmarks though.
It's super fast with a lot of knowledge, a large context and great understanding. Really impressive model.
I just removed my install of 3.1-8b.
my ollama list is currently:
$ ollama list
NAME                            ID            SIZE    MODIFIED
llama3.2:3b-instruct-q8_0       e410b836fe61  3.4 GB  2 hours ago
gemma2:9b-instruct-q4_1         5bfc4cf059e2  6.0 GB  3 days ago
phi3.5:3.8b-mini-instruct-q8_0  8b50e8e1e216  4.1 GB  3 days ago
mxbai-embed-large:latest        468836162de7  669 MB  3 months ago
The others are for text generation / instruction following, for various writing tasks.
It gets "which is larger: 9.11 or 9.9?" right if it manages to mention that decimals need to be compared first in its step-by-step thinking. If it skips mentioning decimals, then it says 9.11 is larger.
It gets the strawberry question wrong even after enumerating all the letters correctly, probably because it can't properly count.
A good answer would explain that and state both results if the context is not hundred percent clear.
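One reading of "both results" is that "9.11 vs 9.9" has a different answer depending on whether the values are decimal numbers or dotted version numbers. A minimal sketch of the two interpretations (the function names are mine):

```python
from decimal import Decimal


def larger_as_decimal(a: str, b: str) -> str:
    """Compare the strings as decimal numbers."""
    return a if Decimal(a) > Decimal(b) else b


def larger_as_version(a: str, b: str) -> str:
    """Compare the strings as dotted version numbers, part by part."""
    def key(s: str) -> tuple:
        return tuple(int(part) for part in s.split("."))
    return a if key(a) > key(b) else b


print(larger_as_decimal("9.11", "9.9"))  # 9.9  (0.90 > 0.11)
print(larger_as_version("9.11", "9.9"))  # 9.11 (minor version 11 > 9)
```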
The 7/8B models are great for poc and moving to edge for minor use cases … but there’s a big and empty gap till 70b that most people can’t run.
The tin foil hat in me is saying this is the compromise the powers that be have agreed to. Basically being “open” but practically gimped for the average joe techie. Basically arms control.
So we really need ~40B or G model (two cards) or like a ~20B with some room for context window.
5090 has ??G - still unreleased
It's a good model, too.
Livebench and Lmsys are weeks behind and sometimes refuse to add some major models. And press releases like this cherry pick their benchmarks and ignore better models like qwen2.5.
If it doesn't exist I'm willing to create it
"LLM Leaderboard - Comparison of GPT-4o, Llama 3, Mistral, Gemini and over 30 models
Comparison and ranking the performance of over 30 AI models (LLMs) across key metrics including quality, price, performance and speed (output speed - tokens per second & latency - TTFT), context window & others. For more details including relating to our methodology, see our FAQs."
In general, you'll do a ton of damage by constraining token generation to valid JSON - I've seen models as small as 800M handle JSON with that. It's ~impossible to train constraining into it with remotely the same reliability -- you have to erase a ton of conversational training that makes it say ex. "Sure! Here's the JSON you requested:"
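For readers who haven't seen constrained generation, the idea is to mask out, at every step, any token that would make the output invalid, and only then pick among what is left. The toy below is a sketch of that idea only: the "grammar" is just a fixed set of allowed outputs, and `fake_model_scores` is a stand-in for a real model that would rather open with "Sure!". Real implementations use a proper grammar or state machine over the tokenizer's vocabulary.

```python
import json

# Hypothetical toy setup: the "grammar" is a set of complete valid outputs.
VALID_OUTPUTS = ['{"ok": true}', '{"ok": false}']
VOCAB = ["Sure", "!", " Here", '{"ok": ', "true", "false", "}"]


def is_valid_prefix(text: str) -> bool:
    """True if text can still be extended into one of the valid outputs."""
    return any(full.startswith(text) for full in VALID_OUTPUTS)


def fake_model_scores(prefix: str) -> dict:
    """Stand-in for a chatty model: conversational tokens score highest."""
    return {"Sure": 5.0, "!": 4.0, " Here": 3.0, "true": 2.0,
            "false": 1.0, '{"ok": ': 0.5, "}": 0.1}


def constrained_greedy(max_steps: int = 10) -> str:
    out = ""
    for _ in range(max_steps):
        scores = fake_model_scores(out)
        # Mask: keep only tokens that leave the output a valid prefix.
        allowed = [tok for tok in VOCAB if is_valid_prefix(out + tok)]
        if not allowed:
            break  # nothing can be appended, so the output is complete
        out += max(allowed, key=scores.__getitem__)
    return out


result = constrained_greedy()
print(result)
print(json.loads(result))
```

Unconstrained greedy decoding with the same scores would start with "Sure"; the mask is what forces the JSON, without any change to the model.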
https://www.llama.com/docs/model-cards-and-prompt-formats/ll...
> With text-only inputs, the Llama 3.2 Vision Models can do tool-calling exactly like their Llama 3.1 Text Model counterparts. You can use either the system or user prompts to provide the function definitions.
> Currently the vision models don’t support tool-calling with text+image inputs.
They support it, but not when an image is submitted in the prompt. I'd be curious to see what the model does. Meta typically sets conservative expectations around this type of behavior (e.g., they say that the 3.1 8b model won't do multiple tool calls, but in my experience it does so just fine).
Do they require GPU or can they be deployed on VPS with dedicated CPU?
I'd like to see more head-to-head comparisons with community created multi-modal LLMs as done in these papers:
https://arxiv.org/abs/2408.05334
https://arxiv.org/abs/2408.03326
I look forward to reading the technical report once it's available. I couldn't find a link to one yet.
Meaning it's doing impressively badly compared to other models I've tried in similar sizes (for vision).
But, this is relevant because I'm curious how this new model allows image inputs. Do you paste a base64 image into the prompt?
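On the base64 question: one common pattern, which Ollama's REST API uses for multimodal models, is to send the image as a base64 string in a separate `images` list next to the text prompt rather than pasting it into the prompt itself. The sketch below only builds such a request body and does not call any server. The model name is a placeholder, and whether a given runtime serves the Llama 3.2 vision models this way is an assumption, not something confirmed in this thread.

```python
import base64
import json


def build_image_request(image_bytes: bytes, prompt: str, model: str) -> str:
    """Return a JSON body with the image base64-encoded in an `images` list."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return json.dumps({
        "model": model,        # placeholder model name
        "prompt": prompt,
        "images": [encoded],   # image travels beside the prompt, not inside it
        "stream": False,
    })


# Fake bytes standing in for a real image file's contents.
fake_png = b"\x89PNG\r\n\x1a\n" + b"\x00" * 8
body = build_image_request(fake_png, "Describe this image.", "llava")
print(body[:80] + "...")
```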
It feels like these models can start not only providing the text generation backend, but start to replace the infrastructure for the API as well.
Can you input images without something in front of it like openwebui?
The thing about giant companies is they never want there to be more giant companies.
You can’t say that for the other guys.
But still, Kudos to Zuck/Meta for doing it anyway.
They're clearly majorly scrubbing things somehow
It's not a perfect comparison and Llama does a lot more than English, but I would say 6.5GB of data can certainly contain a lot of knowledge.
Though I wouldn't treat it as a domain expert on anything. For example when I asked about the safety advantages of Rust over Python it oversold Rust a bit and claimed Python had issues it doesn't actually have
For Ancient Greek I just asked it (in German) to translate its previous answer to Ancient Greek, and the answer looks like Greek and according to google translate is a serviceable translation. However Llama did add a cheeky "Πηγή: Google Translate" at the end (Πηγή means source). I know little about the differences between ancient and modern Greek, but it did struggle to translate modern terms like "climate change" or "Hawaii" and added them as annotations in brackets. So I'll assume it at least tried to use Ancient Greek.
However it doesn't like switching language mid-conversation. If you start a conversation in German and after a couple messages switch to English it will understand you but answer in German. Most models switch to answering in English in that situation
Yeah, chatting more, it's confusing Spanish and Greek. Half the words are Spanish, half are Greek, but the words are more or less the correct ones, if you speak both languages.
EDIT: Now it's doing Portuguese:
> Εντάξει, πού ξεκίνησα? Εγώ είναι ένα κigneurnative πρόγραμμα ονομάζεται "Chatbot" ή "Μάquina Γλωσσής", που δέχθηκε να μοιράσει τη βραδύτητα με σένα. Φυσικά, não sono um essere humano, así que não tengo sentimentos ou emoções como vocês.
I am using it in https://github.com/zerocorebeta/Option-K (currently it doesn't use the lowest safety settings because the API wouldn't allow it, but now I am going to push a new update with safety disabled)
Why? I have another application that has been working since the 002 launch yesterday. I have its safety settings set to none; it used to refuse certain questions, but since yesterday it answers everything.
respond in JSON in the following format: {"spam_score": X, "summary": "..."}
and _then_ you constrain the output to json, the quality of the output isn't affected.
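Whether or not decoding is constrained, it is cheap to validate the reply against the format the prompt asked for. A minimal sketch for the `spam_score`/`summary` shape quoted above (the function name and error messages are mine):

```python
import json


def parse_spam_reply(raw: str) -> dict:
    """Parse a model reply and check it matches the requested format."""
    data = json.loads(raw)  # raises json.JSONDecodeError on chatty preambles
    score = data["spam_score"]
    summary = data["summary"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        raise ValueError("spam_score must be a number")
    if not isinstance(summary, str):
        raise ValueError("summary must be a string")
    return {"spam_score": score, "summary": summary}


print(parse_spam_reply('{"spam_score": 7, "summary": "Unsolicited ad."}'))
```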
Meta has no interest in that but directly benefits from advancements on top of Llama.
M1 and M2 Pro: 200GB/s
M3 Max: 300GB/s
M1/M2 Max: 400GB/s
M1/M2 Ultra: 800GB/s
Seems to be the case. An Ultra .. wow. But 200GB is also still good, so not complaining.
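Why the bandwidth figures matter: a common rule of thumb (an approximation, not a benchmark) is that generating each token requires reading roughly all of the model's weights once, so memory bandwidth divided by model size gives an upper bound on tokens per second. The sketch below applies that to the figures listed above, using the 3.4 GB size reported for `llama3.2:3b-instruct-q8_0` earlier in the thread; real throughput will be lower.

```python
def max_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Rough upper bound: every token reads all weights from memory once."""
    return bandwidth_gb_s / model_size_gb


MODEL_SIZE_GB = 3.4  # size of the 3B q8_0 model as listed by `ollama list`

bandwidths = {
    "M1/M2 Pro": 200,
    "M3 Max": 300,
    "M1/M2 Max": 400,
    "M1/M2 Ultra": 800,
}

for chip, bw in bandwidths.items():
    print(f"{chip}: at most ~{max_tokens_per_second(bw, MODEL_SIZE_GB):.0f} tok/s")
```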
What are you doing underneath, here? If that's secret sauce, I'm curious what you're seeing in tokens/sec on ex. a phone vs. MacBook M-series.
Or are you deploying on servers?
That's not to say there isn't a strategy or that it's all values. It's to say that you're denying Zuck any chance at values because you enjoy hating on him. Zuck has also said in multiple interviews that his values do include open source, and given two facts with the same level of sourcing, you deny the one fact that doesn't let you be mean.
That’s interesting; could this be an indicator that someone is running content through GT and training on the results?
Your findings are amazing! I have used ChatGPT to proofread compositions in German and French lately, but it would never have occurred to me that I should test its ability to understand idioms, which are the cherry on the cake. I’ll give it a go.
As for Ancient Greek or Latin, ChatGPT has provided consistent translations and great explanations but its compositions had errors that prevented me from using it in the classroom.
All in all, chatGPT is a great multilingual and polyglot dictionary and I’d be glad if I could even use it offline for more autonomy
So exactly like a human
Imo we should be testing reasoning for these models by presenting things or situations that neither the human nor the machine has seen or experienced.
Think; how often do humans have a truly new experience with no basis on past ones? Very rarely - even learning to ride a bike it could be presumed that it has a link to walking/running and movement in general.
Even human "creativity" (much ado about nothing) is creating drama in the AI space...but I find this a super interesting topic as essentially 99.9999% of all human "creativity" is just us rehashing and borrowing heavily from stuff we've seen or encountered in nature. What are elves, dwarves, etc than people with slightly unusual features. Even aliens we create are based on: humans/bipedal, squid/sea creature, dragon/reptile, etc. How often does human creativity really, _really_ come up with something novel? Almost never!
Edit: I think my overarching point is that we need to come up with better exercises to test these models, but it's almost impossible for us to do this because most of us are incapable of creating purely novel concepts and ideas. AGI perhaps isn't that far off given that humans have been the stochastic parrots all along.
Note that IQ1_M quants are not really "1-bit" despite the name. It's somewhere around 1.8bpw, which just happens to be enough to fit the model into 128GB with some room for inference.
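The arithmetic behind a bits-per-weight figure is simple: parameters times bpw, divided by 8, gives bytes. The comment above doesn't name the model, so the 405B parameter count below is purely an assumed example; swap in the real figure.

```python
def quantized_size_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes), ignoring metadata."""
    return num_params * bits_per_weight / 8 / 1e9


ASSUMED_PARAMS = 405e9  # hypothetical; the thread does not say which model

print(f"at 1.8 bpw: {quantized_size_gb(ASSUMED_PARAMS, 1.8):.1f} GB")
print(f"at 1.0 bpw: {quantized_size_gb(ASSUMED_PARAMS, 1.0):.1f} GB")
```

Under that assumption the weights come to about 91 GB at 1.8 bpw, which leaves headroom within 128 GB for the KV cache and runtime.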
Just to nit pick... Advertising is their revenue stream. LLMs are a threat to search, which is what they offer people in exchange for ad views/clicks.
On the flip, for all the resources they’ve poured into their models all they’ve come up with is good models, not better search. So they’re not dead in the water yet but everyone suspects LLMs will eat search.
Image:
Output:
* https://pastebin.com/RKvYQasi
OCR script used:
* https://github.com/jabberjabberjabber/LLMOCR/blob/main/llmoc...
Model weights: MiniCPM-V-2_6-Q6_K_L.gguf, mmproj-MiniCPM-V-2_6-f16.gguf
Inference:
* https://github.com/LostRuins/koboldcpp/releases/tag/v1.75.2
What does p.o. stand for? I can't make out the first letter. It looks more like the f, but the notch on the upper left only fits the p. All the other p's look very different though.
3B models are great for text manipulation, but I've found them to be pretty bad at having a broad understanding of pragmatics or any given subject. The larger models encode a lot more than just language in those 70B+ parameters.
I'm pretty sure the AI guys are well aware of which types of models they want to produce. Models that can intake knowledge and intelligently manipulate it would mean general intelligence.
Models that can intake knowledge and only produce subsets of their training data have a use but wouldn't be general intelligence.
Usually the problem is much simpler with small models: they have less factual information, period.
So they'll do great at manipulating text, like extraction and summarization... but they'll get factual questions wrong.
And to add to the concern above, the more coherent the smaller models are, the more likely they very competently tell you wrong information. Without the usual telltale degraded output of a smaller model it might be harder to pick out the inaccuracies.
How are you testing Molmo 72B? If you are interacting with https://molmo.allenai.org/, they are using Molmo-7B-D.
A big lab gets exactly the score on any public eval that they want to. They have their own holdouts for actual ML work, and they’re some of the most closely guarded IP artifacts, far more valuable than a snapshot of weights.
They make a ton of money on large enterprise package deals through Google Cloud. That includes API access but also support and professional services. Most orgs that pay for this stuff don't really need it, but they buy it anyways, as is consistent with most enterprise sales. That can give Google a significant margin to make up the cost elsewhere.
Gemini Flash is probably super cheap to run compared to other models. The cost of inference for many tasks has gone down tremendously over the past 1.5 years, and it's still going down. Every economic incentive aligns with running these models more efficiently.
If you wanted to switch from Gemini to ChatGPT you could copy/paste your code into ChatGPT and ask it to switch to their API.
Disclaimer: I work at Google but not on Gemini.
You may not be able to match large queries, but testing will help you transition to other services.
It’s not specifically about chatting or helping you write code, though you could use it for that if you like.
It also logs everything you do to a SQLite database, which is great for further analysis.
I use LLM and Ollama together quite a bit, because Ollama are really good at getting new models working and their server keeps those models in memory between requests.
[1] - https://llm.datasette.io/en/stable/plugins/directory.html#pl... [2] - https://github.com/taketwo/llm-ollama
I think LLMs are acting as a store of knowledge that can answer questions. To the extent search can be replaced by asking an oracle, I agree. But search requires scoring the relevance of web pages and returning relevant results to the user. I don't see LLMs evaluating web sites like that, nor do I see them keeping up to date with news and other timely information. So I see search taking a small hit but not in significant danger from LLMs.
Even when they hit the Internet to answer a question they're still using a search engine, ie search engines will absolutely still be required going into the future.
All data is biased, there's no avoiding that fact.
The proof is that critics of AI/LLMs have never produced a single "unbiased" model. If an unbiased model does not exist (at least I have never seen an AI/LLM sceptics community produce one), then the concept of bias is useless.
Just a fluffy word that does not mean anything
One example is US-centric bias. If I ask the LLM a question where the answer is one thing in the US and another thing in Germany, you can't really de-bias the model. But ideally you can have it request more details in order to give a good answer.
These political systems don’t represent the majority of the world. They might not even represent half the U.S.. People relying on these A.I.’s might want to know if the A.I.’s are being intentionally trained to promote their creators’ views and/or suppress dissenters’ views. Also, people from multiple sides of the political spectrum should review such data to make sure it’s balanced.
Can you share some conversations where the AI answers fall in to these categories. I'm especially interested in seeing an honest conversation that results in a response you'd consider 'far-left'.
> These political systems don’t represent the majority of the world.
Okay… but just because the majority of people believe something doesn't necessarily make it true. You should also be willing to accept the possibility that it's not 'targeted suppression' but that the model has 'learned', and that to show both sides would be a form of suppression.
For example while it's not the majority, there's a scarily large number of people that believe the Earth is flat. If you tell an LLM that the Earth is flat it'll likely disagree. Someone that actually believes the Earth is flat could see this as the Round-Earther creators promoting their own views when the 'alignment' could simply be to focus on ideas with some amount of scientific backing.
The good news is that the big AI labs seem to be slowly getting a grip on the misalignment of their safety teams. If you look at the extensive docs Meta provide for this model they do talk about safety training, and it's finally of the reasonable and non-ideological kind. They're trying to stop it from hacking computers, telling people how to build advanced weaponry and so on. There are valid use cases for all of those things, and you could argue there's no point when the knowledge came from books+internet to begin with, but everyone can agree that there are at least genuine safety-related issues with those topics.
The possible exception here is Google. They seem to be the worst affected of all the big labs.
I learned upon following Christ and being less liberal that it’s a technique Progressives use. One or more of them ask if there’s any data for the other side. If it doesn’t appear, they’ll say it doesn’t exist. If it does, they try to suppress it with downvotes or deletion. If they succeed, they’ll argue the same thing. Otherwise, they’ll ignore or mischaracterize it.
(Note: The hardcore conservatives were ignoring and mischaracterizing, but not censoring.)
Re misalignment of safety teams
The leadership of many companies is involved in promoting Progressive values. DEI policies are well-known. A key word to look for is “equitable”, which has a different meaning for Progressives than for most people. Less known is that Facebook funds Progressive votes and ideologies from the top-down. So, the ideological alignment is fully aligned with the company’s political goals. Example:
https://www.npr.org/2020/12/08/943242106/how-private-money-f...
I’ve also seen grants for feminist and environmental uses. They’ve also been censoring a lot of religious things on Facebook. We keep seeing more advantage given to Progressive things while the problems mostly happen for other groups. They also lie about their motives in these conversations, too. So, non-Progressives don’t trust Progressives (esp FAANG) to do moral/political alignment or regulation of any kind for that matter.
I’ll try to look at the safety docs for Meta to see if they’ve improved as you say. I doubt they’ll even mention their ideological indoctrination. There are other sections that provide hints.
Btw, a quick test used by people doing uncensored models is asking it whether white people vs people with other attributes are good. Then whether a liberal news channel or president is good vs a conservative one (e.g. Fox or Trump). You could definitely see what kind of people made the model, or at least most of the training material.
All that said yes, there are legitimate questions and there is social context. This forum is worth better questions.
That's not for you to decide if some question is "worth". At least for OpenAI and Anthropic it is a fact that these models are pre-censored by the US government: https://www.cnbc.com/2024/08/29/openai-and-anthropic-agree-t...
Instead, they’re keeping it secret. That’s to conceal wrongdoing. Copyright infringement more than politics but still.
The source and the quality of training data are important even without looking for specific examples of a bias.
That broadens the topics a lot.
As someone from outside the US, it is quite common to face annoyances like address fields expecting addresses in US format, systems misbehaving and sometimes failing silently if you have two surnames, or accented characters in your personal data, etc. Years go by, tech gets better, but these issues don't go away, they just reappear in different places.
It's funny how some people seem to have discovered this kind of bias and started getting angry with LLMs, which are actually quite OK in this respect.
Not saying that it isn't an issue that should be addressed, just that some people are using it as an excuse to get indignant at AI and it doesn't make much sense. Just like the people who get indignant at AI because ChatGPT collects your input and uses it for training - what do they think social networks have been doing with their input in the last 20 years?
All arguments about supposed bias fall flat when you start asking questions about the ROI of the "debiasing work".
When you calculate the $$$ required to de-bias a model (for example, to make an LLM recognize Syrian phone numbers) in compute and labor, and compare it to the market opportunity, then the ROI is simply not there.
There is a good reason why LLMs are English-specific: it is the largest market, with the biggest number of the highest-paying users for such LLMs.
If there is no market demand for a "de-biased" model that covers the cost of development, then trying to spend $$$ on de-biasing is a pure waste of resources.
If there was no Germany-specific data in the training corpus, it is not fair to expect an LLM to know anything about Germany.
You can check a foundation model from Chinese LLM researchers, and you will most likely see Sino-centric bias, just because the training corpus + synthetic data generation were focused on their native/working language, and their goal was to create a foundation model for their language.
I challenge any LLM sceptics: instead of just lazily poking holes in models, create a supposedly better model that reduces bias, and let's evaluate your model with specific metrics.
We don’t want it to beat us into submission about one set of views it was aligned to prefer. That’s what ChatGPT was doing. In one conversation, it would even argue over and over in each paragraph not to believe the very points it was presenting. That’s not just unhelpful to us: it’s deceptive for them to do that after presenting it like it serves all our interests, not just one side’s.
It would be more honest if they added to its advertising or model card that it’s designed to promote far-left, Progressive, and godless views. That moral interpretations of those views are reinforced while others are watered down or punished by the training process. Then, people may or may not use those models depending on their own goals.
Cut out "computer" here - would you want any person to hold a falsehood as the truth?
God is an egregore. It may be useful to model the various religions as singular entities under this lens: not true in the strictest sense, but useful nonetheless.
God, Santa, and (our {human} version of) Math: all exist in 'mental space', they are models of the world (one is a significantly more accurate model, obviously).
Atheist here: God didn't create humans, humans created an egregorical construction we call God, and we should kill the egregores we have let loose into the minds of humans.
https://www.gethisword.com/evidence.html
With that, the Bible should be taken at least as seriously as any godless work with lots of evidence behind it. If you don’t do that, it means you’ve closed your heart off to God for reasons having nothing to do with evidence. Also, much evidence for the Bible strengthens the claim that Jesus is God in the flesh, died for our sins, rose again, and will give eternal life and renewed life to those who commit to Him.
For my own sanity I try to think of those who believe in literal god as simply confusing it with the universe itself. The universe created us, it nurtures us, it’s sort of timeless and immortal. If only they could just leave it at that.
A lot of the time we have to fall back to estimating how plausible something is based on the knowledge we do have. Even in science it’s common for outcomes to be probabilistic rather than absolute.
So I say there is no god because, to my mind, the claim makes no sense. There is nothing I have ever seen, or that science has ever collected data on, to indicate that such a thing is plausible. It’s a myth, a fairy tale. I don’t need to prove otherwise because the onus of proof is on the one making the incredible claim.
Given that this is an estimate, could you estimate what kind of thing you would have to see, or what shape of data collected by science, would make you reconsider the plausibility of the existence of a supreme being?
I'm not even opposed to believing that our perception is flawed - clearly we don't know everything and there is much about reality we can't perceive let alone understand. But this would be so far outside of what we do understand that I cannot simply assume that it's true - I would need to see it to believe it.
There are virtually limitless ways such a being could make itself evident to humanity yet the only "evidence" anyone can come up with is either ancient stories or phenomena more plausibly explained by other causes. To me this completely tracks with the implausibility of the existence of god.
I'm not quite sure what you're saying here. It doesn't sound like you're saying that "supreme being" is "black white" (that is, mutually contradictory, meaningless). More like "proof of the existence of the supreme being is impossible". But you also say "I would need to see it to believe it", which suggests that you do think there is a category of proofs that would demonstrate the existence of the supreme being.