It's fair to expect GPT3-level results - not GPT 3.5 and certainly not open-source tiny GPT4 as some might think when they read "rivaling OpenAI".
- University and Universe are similar alphabetically.
- University and College are similar in meaning.
Take embeddings for those three words and `University` will be near `College`, while `Universe` will be further away, because embeddings capture meaning:
University<-->College<-------------->Universe
With old school search you'd need to handle the special case of treating University and College as similar, but embeddings already handle it.
With embeddings you can do math to find how similar two results are, based on how close their vectors are. The closer the embeddings, the closer the meaning.
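To make the "do math" part concrete, here is a minimal sketch using cosine similarity, one common closeness measure. The three-dimensional vectors are made up for illustration; real embeddings come from a model and have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors, NOT real model output.
toy_embeddings = {
    "University": [0.9, 0.8, 0.1],
    "College": [0.8, 0.9, 0.2],
    "Universe": [0.1, 0.2, 0.9],
}

uni_college = cosine_similarity(toy_embeddings["University"], toy_embeddings["College"])
uni_universe = cosine_similarity(toy_embeddings["University"], toy_embeddings["Universe"])

print(f"University vs College:  {uni_college:.3f}")
print(f"University vs Universe: {uni_universe:.3f}")
```

With these toy numbers the University/College pair scores much higher than University/Universe, which is the behaviour described above.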
https://platform.openai.com/docs/guides/embeddings/what-are-...
* https://huggingface.co/jinaai/jina-embeddings-v2-base-en
* https://huggingface.co/jinaai/jina-embeddings-v2-small-en
8192 token input sequence length
768 embedding dimensions
0.27GB model (with 0.07GB model also available)
Tokeniser: BertTokenizer [1], 30528 token vocab [2]
Is an 8K sequence length directly comparable to text-embedding-ada-002 if the vocabulary is much smaller? I seem to remember its tokeniser has a larger vocabulary.
[1] https://huggingface.co/jinaai/jina-embeddings-v2-base-en/blo...
[2] https://huggingface.co/jinaai/jina-embeddings-v2-base-en/blo...
Words that aren't in the vocabulary can still be represented by multiple tokens. Some models can input and output valid UTF-8 at the byte level (rather than needing a unique token for each codepoint). For example RWKV-World.
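A rough sketch of how that multi-token fallback works in BERT-style WordPiece tokenisers: greedy longest-match-first, with "##" marking a piece that continues a word. The tiny vocabulary and the function name are made up for illustration; this is not the actual Jina or Hugging Face implementation.

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first split of one word into subword tokens."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matched at this position
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical toy vocabulary.
toy_vocab = {"token", "##iser", "##s", "embed", "##ding"}

print(wordpiece_tokenize("tokenisers", toy_vocab))
print(wordpiece_tokenize("embeddings", toy_vocab))
```

Neither "tokenisers" nor "embeddings" is in the toy vocabulary, yet both are representable. The cost is that one word uses three tokens, which is why token limits are hard to compare across tokenisers.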
Less is used for qualitative data like “I love him less”. Whereas fewer is used for countable things like “I need fewer tokens.”
As an aside though, I probably wouldn't have taken the time to correct OP, but given HN data is weighted more in LLM training, I don't want the "less" vs "fewer" rule switching up on me, because I failed to give the newest LLM enough accurate data.
"text-ada-001" is an LLM in the GPT3 family, described as "Capable of very simple tasks, usually the fastest model in the GPT-3 series, and lowest cost"
"text-embedding-ada-002" is entirely different - that page describes it as "Our second generation embedding model, text-embedding-ada-002 is designed to replace the previous 16 first-generation embedding models at a fraction of the cost."
[1] https://openai.com/blog/new-and-improved-embedding-model (see "Model improvements")
text-embedding-ada-002 53.3
text-search-davinci-*-001 52.8
text-search-curie-*-001 50.9
text-search-babbage-*-001 50.4
text-search-ada-*-001 49.0
That's not comparing it to the davinci/curie/babbage GPT3 models, it's comparing to the "text-search-*" family. Those were introduced in https://openai.com/blog/introducing-text-and-code-embeddings as the first public release of embedding models from OpenAI.
> We’re releasing three families of embedding models, each tuned to perform well on different functionalities: text similarity, text search, and code search. The models take either text or code as input and return an embedding vector.
It's not at all clear to me if there's any relationship between those and the GPT3 davinci/curie/babbage/ada models.
My guess is that OpenAI's naming convention back then was "davinci is the best one, then curie, then babbage, then ada".
Are you sure these have anything to do with 'text-davinci-003' or 'text-curie-001'?
Will have to agree with everyone here that OpenAI is good at being extremely confusing. It seems like the logic might be something along the lines of the 'text-search' portion being the actual type of the model, while the 'curie-001' / '<name>-<number>' format is just a personalized way of expressing the version of that type of model. And the whole 'GPT<number>' category used to be a sort of family of models, but now they've just switched it to the actual name of the newer gargantuan LLMs. Then, because the 'GPT<number>' models are now a different thing altogether, the newest 'text-embedding' model is just named 'ada-<number>' because it's on that iteration of the 'text-embedding' type of model, adhering to the older principle of naming their models? Not sure, ha. Definitely feels like doing some detective work.
Is it though? I thought the LLM-based embeddings are even more fun for this, as you have many more interesting directions to move in. I.e. not just:
emb("king") - emb("man") + emb("woman") = emb("queen")
But also e.g.:
emb(<insert a couple paragraph long positive book review>) + a*v(sad) + b*v(short) - c*v(positive) = emb(<a single paragraph, negative and depressing review>)
Where a, b, c are some constants to tweak, and v(X) is a vector for quality X, which you can get by embedding a bunch of texts expressing the quality X and averaging them out (or doing some other dimensional reduction trickery).
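A toy sketch of that arithmetic, assuming plain averaging for v(X). The vectors, weights and function names are invented for illustration. Turning the steered vector back into text needs a model that can decode embeddings, which is not covered here.

```python
def mean_vector(vectors):
    """v(X): average the embeddings of several texts expressing quality X."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def steer(embedding, weighted_vectors):
    """Add weight * vector for each pair; use a negative weight to subtract."""
    result = list(embedding)
    for weight, vector in weighted_vectors:
        result = [r + weight * v for r, v in zip(result, vector)]
    return result


# Hypothetical 3-dimensional embeddings, NOT real model output.
v_sad = mean_vector([[0.0, 1.0, 0.0], [0.2, 0.8, 0.0]])
v_positive = mean_vector([[1.0, 0.0, 0.0], [0.8, 0.0, 0.2]])
review = [0.9, 0.1, 0.3]

# emb(review) + a*v(sad) - c*v(positive), with a = c = 0.5
steered = steer(review, [(0.5, v_sad), (-0.5, v_positive)])
print(steered)
```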
I suggested this on HN some time ago, but was only told that I'm confused and the idea is not even wrong. But then, there was a talk at an AI conference recently[0], where the speaker demonstrated exactly this kind of latent space translation of text in a language model.
--
[0] - https://www.youtube.com/watch?v=veShHxQYPzo&t=13980s - "The Hidden Life of Embeddings", by Linus Lee from Notion.
The 8k context window is new, but isn't the 512 token limitation a soft limit anyway? I'm pretty sure I can stuff bigger documents into BGE for example.
Furthermore, I think that most (all?) benchmarks in the MTEB leaderboard deal with very small documents. So there is nothing here that validates how well this model does on larger documents. If anything, I'd pick a higher ranking model because I put little trust in one that only ranks 17th on small documents. Should I expect it to magically get better when the documents get larger?
Plus, you can expect that this model was designed to perform well on the datasets in MTEB while the OpenAI model probably wasn't.
Many also stated that an 8k context embedding will not be very useful in most situations.
When would anyone use this model?
I would guess the Davinci and similar embeddings work better for code than MPNet, and that what you are encoding really matters, not only the context length: what features are actually being extracted by the embedding engine?
I was pretty curious about the context limit. I am not an expert in this area but I always thought the biggest problem was the length of your original text. So typically you might only encode a sentence or a selection of sentences. You could always stuff more in but then you are potentially losing the specificity; I would think that is a function of the dimensionality. This model is 768, are they saying I can stuff in 8k tokens' worth of text and utilize it just as well as I have with other models on a per 1-3 sentence level?
This also opens up another question though: how would that compare to using an LLM to summarize that paper and then just embedding that summary?
In my experience, any text is better embedded using a sliding window of a few dozen words - this is the approximate size of semantic units in a written document in English, although this will differ wildly for different texts and topics.
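For anyone who wants to try the sliding-window idea, a minimal sketch follows. The window of 30 words and stride of 15 are arbitrary assumptions, not recommendations from the comment above; each window would then be embedded separately.

```python
def sliding_windows(text, window_size=30, stride=15):
    """Split text into overlapping windows of at most `window_size` words."""
    words = text.split()
    if len(words) <= window_size:
        return [" ".join(words)] if words else []
    windows = []
    # The upper bound guarantees the final window reaches the last word.
    for start in range(0, len(words) - window_size + stride, stride):
        windows.append(" ".join(words[start:start + window_size]))
    return windows


sample = " ".join(f"w{i}" for i in range(100))
windows = sliding_windows(sample, window_size=30, stride=15)
print(len(windows), "windows")
```

Overlapping windows mean a sentence cut at one boundary still appears whole in a neighbouring window, at the cost of roughly twice as many embeddings to store.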
I can see a sliding window working for semantic search and RAG, but not so much for clustering or finding related documents.
It feels like open-source is closing the gap with "Open"AI which is really exciting, and the acceleration towards parity is faster than the advancements being made on the closed source models. Maybe it's wishful thinking though?
It turns out OpenAI have used the name "Ada" for several very different things, purely because they went through a phase of giving everything Ada/Babbage/Curie/DaVinci names because they liked the A/B/C/D thing to indicate which of their models were largest.
For those unaware, if 512 tokens of context is sufficient for your use case, there are already many options that outperform text-embedding-ada-002 on common benchmarks:
In my experience, OpenAI's embeddings are overspecified and do very poorly with cosine similarity out of the box as they match syntax more than semantic meaning (which is important as that's the metric for RAG). Ideally you'd want cosine similarity in the range of [-1, 1] on a variety of data but in my experience the results are [0.6, 0.8].
Isn't the normal way of using embedding to find relevant text snippets for a RAG prompt? Where is it better to have coarser retrieval?
And not only does it support and embed a variety of languages, it also computes the same coordinates for the same semantics in different languages. I.e. if you embed "russia is a terrorist state" and "россия - страна-террорист", both of these embeddings will have almost the same coordinates.
- 28.5 MB jina-embeddings-v2-small-en (https://huggingface.co/do-me/jina-embeddings-v2-small-en)
- 109 MB jina-embeddings-v2-base-en (https://huggingface.co/do-me/jina-embeddings-v2-base-en)
However, I noticed that the base model performs quite poorly on small text chunks (a few words) while the small version seems to be unaffected. Might this be some kind of side effect of the way they deal with large contexts?
If you want to test, you can head over to SemanticFinder (https://do-me.github.io/SemanticFinder/), go to advanced settings, choose the Jina AI base model (at the very bottom) and run with "Find". You'll see that all other models perform just fine and find "food"-related chunks but the base version doesn't.
Here's how to try it out.
First, install LLM. Use pip or pipx or brew:
brew install llm
Next install the new plugin:
llm install llm-embed-jina
You can confirm the new models are now available to LLM by running:
llm embed-models
You should see a list that includes "jina-embeddings-v2-small-en" and "jina-embeddings-v2-base-en"
To embed a string using the small model, run this:
llm embed -m jina-embeddings-v2-small-en -c 'Hello world'
That will output a JSON array of 512 floating point numbers (see my explainer here for what those are: https://simonwillison.net/2023/Oct/23/embeddings/#what-are-e...)
Embeddings are only really interesting if you store them and use them for comparisons.
Here's how to use the "llm embed-multi" command to create embeddings for the 30 most recent issues in my LLM GitHub repository:
curl 'https://api.github.com/repos/simonw/llm/issues?state=all&filter=all' \
| jq '[.[] | {id: .id, title: .title}]' \
| llm embed-multi -m jina-embeddings-v2-small-en jina-llm-issues - \
--store
This creates a collection called "jina-llm-issues" in a default SQLite database on your machine (the path to that can be found using "llm collections path").
To search for issues in that collection with titles most similar to the term "bug":
llm similar jina-llm-issues -c 'bug'
Or for issues most similar to another existing issue by ID:
llm similar jina-llm-issues 1922688957
Full documentation on what you can do with LLM and embeddings here: https://llm.datasette.io/en/stable/embeddings/index.html
Alternative recipe - this creates embeddings for every single README.md in the current directory and its subdirectories. Run this somewhere with a node_modules folder and you should get a whole lot of interesting stuff:
llm embed-multi jina-readmes \
-m jina-embeddings-v2-small-en \
--files . '**/README.md' --store
Then search them like this:
llm similar jina-readmes -c 'backup tools'
https://huggingface.co/BAAI/bge-large-en-v1.5 FlagEmbedding for example describes itself as covering Chinese and English.
I wonder what would be the best way to use 8k embeddings. It’s a lot of information to keep in a vector, so things like “precision” of the embedding space and its ability to distinguish very similar large documents will be key.
Maybe it can be useful for coarse similarity matching, for example to detect plagiarism?
I believe "text-embedding-ada-002" is entirely unrelated to those old GPT-3 models. It's a recent embedding model (released in December 2022 - https://openai.com/blog/new-and-improved-embedding-model ) which OpenAI claim is their current best available embedding model.
I understand your confusion: OpenAI are notoriously bad at naming things!
Edit: looking at the press release, the improvement over old Ada is ... marginal? And Ada-01 is/was a poor performing model, tbh. I guess I'll have to run some tests, but at first sight it doesn't seem that wow-ey.
So, if there is some information at the bottom which is dependent on something which is at the top, your embedding could be entirely wrong.
I just want to make an embedding between a conversation of me and my friend and simulate talking to them. Is this a hard thing to train to begin with?
If anyone knows or could help me with this, I would be very grateful!
What you are asking for sounds like fine tuning an existing LLM...where the data will be tokenized but the outcomes are different? There are a lot of write-ups on how people have done it. You should especially follow some of the work on Huggingface. To replicate talking to your friend though, you will need a very large dataset to train on, I would think, and it's unclear to me whether you can just fine-tune or would need to train a model from scratch. So a dataset with tens of thousands of examples, and then you need to train it on a GPU.
https://www.anyscale.com/blog/fine-tuning-llama-2-a-comprehe...
This tool helps with the embedding part.
I’ve built a bunch of “chat with your PDFs” bots, do reach out to me at brian.jp if you have any questions.
- chat with documents (pdf, doc etc)
- chat with a website. Like, if I integrate with an ecommerce site, I can ask questions about the website.
What free options do I have, both in the cloud and locally?
Note that embedding models are a different kind of thing from a Large Language Model, so it's not the kind of model you can ask questions.
It's a model which can take text and turn it into an array of floating point numbers, which you can then use to implement things like semantic search and related documents.
More on that here: https://simonwillison.net/2023/Oct/23/embeddings/
I bet you could hack this in.
https://huggingface.co/spaces/mteb/leaderboard
It’s amazing how many new and better ones there are since I last looked a few months ago. Instructor-xl was number 1, now it is number 9, and its size is more than 10x the number 2 ranked!
Things move fast!
Calculating embeddings on larger documents than smaller-window embedding models.
> My (somewhat limited) experience with long context models is they aren't great for RAG.
The only reason they wouldn't be great for RAG is that they aren't great at using information in their context window, which is possible (ISTR that some models have a strong recency bias within the window, for instance) but I don't think is a general problem of long context models.
> Isn't the normal way of using embedding to find relevant text snippets for a RAG prompt?
I would say the usual use is for search and semantic similarity comparisons generally. RAG is itself an application of search, but it's not the only one.
Think of it like skipping the square root step in Euclidean distance. Perfectly valid as long as you don’t want a distance so much as a way to compare distances. And doing so skips the most computationally expensive operation.
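The Euclidean analogy itself is easy to check: ranking by squared distance gives the same order as ranking by true distance, because the square root is monotonic. A small sketch with made-up 2-d points (the names and coordinates are only for illustration):

```python
import math


def squared_distance(a, b):
    """Euclidean distance without the final square root."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def euclidean_distance(a, b):
    """True Euclidean distance."""
    return math.sqrt(squared_distance(a, b))


# Hypothetical points, for illustration only.
query = [0.0, 0.0]
points = {"near": [1.0, 1.0], "mid": [2.0, 0.5], "far": [3.0, 4.0]}

by_squared = sorted(points, key=lambda name: squared_distance(query, points[name]))
by_true = sorted(points, key=lambda name: euclidean_distance(query, points[name]))

print(by_squared)
print(by_squared == by_true)
```

The squared values are not distances you would report to a user, but for picking the nearest neighbours they are just as good.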
I'd much rather know what paragraph to look in than what 25 pages to look in
wish we could create the array of floating points without openai
Great timely turnaround time, good sir.
I've dabbled a bit with elasticsearch dense vectors before and this model should work great for that. Basically, I just need to feed it a lot of content and add the vectors and vector search should work great.
File "/opt/homebrew/Cellar/llm/0.11_1/libexec/lib/python3.12/site-packages/llm/default_plugins/openai_models.py", line 17, in <module>
import yaml
ModuleNotFoundError: No module named 'yaml'
The pyyaml package is correctly listed on the formula page though: https://formulae.brew.sh/formula/llm
$ brew install llm
$ llm
ModuleNotFoundError: No module named 'typing_extensions'
Not sure where to report it.
It looks like that package is correctly listed in the formula: https://github.com/Homebrew/homebrew-core/blob/a0048881ba9a2...
However, even so I would think about the documents themselves and figure out if it is even needed. Let's say we are talking about clustering court proceedings. I'd rather extract the abstract from these documents, then embed and cluster those instead of the whole text.
% python3 --version
Python 3.11.6
% which python3
/opt/homebrew/bin/python3
% brew info python-typing-extensions
==> python-typing-extensions: stable 4.8.0 (bottled)
% which llm
/opt/homebrew/bin/llm
Maybe I am assuming incorrectly, but I think the poor performance you are referring to is the old Ada completion model, where the output is text. That was poor indeed.
https://medium.com/@nils_reimers/openai-gpt-3-text-embedding...
If the new ada model only has marginal improvements, it seems open source is the way to go.
This leaderboard doesn't compare these custom tailored embedding models vs the obvious thing of average pooling layered with any traditional LLM, which is easily implemented using sentence transformers.
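For readers unfamiliar with the term, "average pooling" here means averaging a model's per-token vectors into one sentence vector while skipping padding. A dependency-free sketch with made-up numbers (in practice the token vectors would come from the model's last hidden layer):

```python
def mean_pool(token_vectors, attention_mask):
    """Average token vectors, ignoring positions where the mask is 0."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask excludes every token")
    return [sum(column) / len(kept) for column in zip(*kept)]


# Hypothetical 2-dimensional token vectors; the last row is padding.
token_vectors = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
attention_mask = [1, 1, 0]

sentence_vector = mean_pool(token_vectors, attention_mask)
print(sentence_vector)
```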
It's Retrieval Augmented Generation btw.
To quote:
> The key idea is this: a user asks a question. You search your private documents for content that appears relevant to the question, then paste excerpts of that content into the LLM (respecting its size limit, usually between 3,000 and 6,000 words) along with the original question.
> The LLM can then answer the question based on the additional content you provided.
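The "paste excerpts into the LLM" step from that quote might look roughly like this. It is only a sketch: it assumes the search has already returned excerpts ordered by relevance, and the function name, prompt layout and word budget are all made up for illustration.

```python
def build_rag_prompt(question, excerpts, max_words=3000):
    """Put as many excerpts as fit in the word budget ahead of the question."""
    selected = []
    used = len(question.split())
    for excerpt in excerpts:  # assumed to be ordered most relevant first
        n_words = len(excerpt.split())
        if used + n_words > max_words:
            break  # stop at the first excerpt that would exceed the budget
        selected.append(excerpt)
        used += n_words
    context = "\n\n".join(selected)
    return f"Context:\n{context}\n\nQuestion: {question}"


# Hypothetical excerpts; a tiny budget is used so the cut-off is visible.
excerpts = [" ".join(["alpha"] * 5), " ".join(["beta"] * 5), " ".join(["gamma"] * 5)]
prompt = build_rag_prompt("What is alpha?", excerpts, max_words=14)
print(prompt)
```

Counting words is a crude stand-in for counting tokens; a real implementation would more likely measure the budget with the target model's tokenizer.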
Why? Have links gone out of fashion?
I even linked directly to the relevant section rather than linking to the top of the page.
The paper that coined the term used the hyphen, though I think I prefer it without: https://arxiv.org/abs/2005.11401
Yes.
You wrote far more words than needed to answer the comment, I did it for you instead.
Sure, but then if you do it one page at a time, or one paragraph at a time, you lose a ton of meaning - after all, individual paragraphs aren't independent of each other. And meaning is kind of the whole point of the exercise.
Or put another way, squashing a ton of text loses you some high-frequency information, while chunking cuts off the low-frequency parts. Ideally you'd want to retain both.
I'm not sure how I would do that after chunking.
I use a multi-pronged approach to this based on a special type of summarization. I chunk on sentences using punctuation until they are just over 512 characters, then I embed them. After embedding, I ask a foundation model to summarize (or ask a question about the chunk) and then generate keyterms for it. Those keyterms are stored along with the vector in the database. During search, I use the user's input to do a vector search for matching chunks, then pull their keyterms in. Using those keyterms, I do set operations to find related chunks. I then run a vector search against these related chunks and combine the results with the top matches from the first vector search to assemble new prompt text.
This strategy is based on the idea of a "back of the book index". It is entirely plausible to look for "outliers" in the keyterms and consider throwing those chunks with those keyterms in there to see if it nets us understanding of some "hidden" meaning in the document.
There is also a means to continue doing the "keyterm" extraction trick as the system is used. Keyterms from answers as well as user prompts may be added to the existing index over time, thus helping improve the ability to return low frequency information that may be initially hidden.
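The first step of the approach above (grouping sentences until a chunk is just over 512 characters) could be sketched like this. The regex-based sentence splitter is a simplifying assumption, and the keyterm extraction and set operations are left out because they depend on whichever foundation model and database are in use.

```python
import re


def chunk_sentences(text, min_chars=512):
    """Group sentences into chunks, closing a chunk once it passes min_chars."""
    # Naive split on whitespace that follows ., ! or ?
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks = []
    current = ""
    for sentence in sentences:
        current = f"{current} {sentence}".strip()
        if len(current) > min_chars:
            chunks.append(current)
            current = ""
    if current:
        chunks.append(current)  # trailing chunk may be shorter than min_chars
    return chunks


sample_text = "This is a sentence. " * 60
chunks = chunk_sentences(sample_text)
print([len(chunk) for chunk in chunks])
```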
Should we all do the ad hominem thing? You are actually suggesting that?
I jumped straight from that to OpenAI embeddings. The results were good enough that I didn't spend time investigating other approaches.
Does that mean you'd return other docs if they share just one word?
The idea of tfidf is that it gives you a vector (maybe combined with pca or a random dimensionality reduction) that you can use just like an Ada embedding. But you still need vector search.
Then I take the top ten by score and call those the "related articles".
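To make the TF-IDF discussion concrete, here is a small standard-library sketch of TF-IDF vectors plus a "top k by score" ranking. It is not what either commenter actually ran: the whitespace tokenizer, the cosine scoring and the sample articles are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(docs)
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def related(index, vectors, top_k=10):
    """Indexes of the top_k documents most similar to vectors[index]."""
    scores = [(cosine(vectors[index], v), i) for i, v in enumerate(vectors) if i != index]
    return [i for _, i in sorted(scores, reverse=True)[:top_k]]


# Hypothetical article titles.
articles = [
    "sqlite full text search tips",
    "full text search with sqlite and python",
    "baking sourdough bread at home",
    "python tips for data cleaning",
]
vectors = tfidf_vectors(articles)
print(related(0, vectors, top_k=2))
```

In this toy corpus an article that shares a single word with the query article still gets a non-zero score, just a much lower one than an article sharing several words, so it only shows up when there is room left in the top k.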
A good definition of “truly open” is whether the exact same results can be reproduced by someone with no extra information from only what has been made available. If that is not possible, because the reproduction methodology is closed (a common reason, like in this case) then what has been made available is not truly open.
We can sit here and technically argue whether or not the subject matter violated some arbitrary “open source” definition but it still doesn’t change the fact that it’s not truly open in spirit.
The parallel can be made with model weights being static assets delivered in their completed state.
(I favor the full process being released, especially for scientific reproducibility, but this is another point.)
What if someone gave you a binary and the source code, but not a compiler? Maybe not even a language spec?
Or what if they gave you a binary and the source code and a fully documented language spec, and both of 'em all the way down to the compiler? BUT it only runs on special proprietary silicon? Or maybe even the silicon is fully documented, but producing that silicon is effectively out of reach to all but F100 companies?
It's turtles all the way down...
We already have a definition of open source. I don't see any reason to change it.
It's basically like giving people a binary program and calling it open source because the compiler and runtime used are open source.
Make no mistake, I am super grateful to OSI for their efforts and most of my code out there uses one of their licenses. I just think they are limited by the circumstances. Some things I consider open are not conforming to their licenses and, like here, some things that conform might not be really open.
All sorts of intangibles end up in open source projects. This isn’t a science experiment that needs replication. They’re not trying to prove how they came up with the image/code/model.
Look into Affero GPL. Images are inert static assets. Here we are talking about the back end engine. The fact that neural networks and model weights are non-von-neumann architecture doesn’t negate the fact that they are executable code and not just static assets!
By this logic any freely downloadable executable software (a.k.a. freeware) is also open source, even though they don't disclose all details on how to build it.
If I hand you a beer for free that’s freeware. If I hand you the recipe and instructions to brew the beer that is open source.
We muddy the waters too much lately and call “free to use” things “open source”.
Yeah, but what those "open source" models are is like you handing me a bottle of beer, plus the instructions to make the glass bottle. You're open-sourcing something, just not the part that matters. It's not "open source beer", it's "beer in an open-source bottle". In the same fashion, those models aren't open source - they're closed models inside a tiny open-source inference script.
The model weights in eg TensorFlow are the source code.
It is not a von-Neumann architecture but a gigabyte of model weights is the executable part, no less than a gigabyte of imperative code.
Now, the training of the model is akin to the process of writing the code. In classical imperative languages that code may be such spaghetti code that each part would be intertwined with 40 others, so you can’t just modify something easily.
So the fact that you can’t modify the code is Freedom 2 or whatever. But at least you have Freedom 0 of hosting the model where you want and not getting charged an exorbitant amount for it or getting cut off, or having the model change out from under you via RLHF for political correctness or whatever.
OpenAI has not even met Freedom Zero of the FSF’s or OSI’s definition. But others can.
The model weights aren't source code. They are the binary result of compiling that source code.
The source code is the combination of the training data and configuration of model architecture that runs against it.
The model architecture could be considered the compiler.
If you give me gcc and your C code I can compile the binary myself.
If you give me your training data and code that implements your model architecture, I can run those to compile the model weights myself.
Free(dom Respecting) Software wasn’t just about the source code.
https://www.gnu.org/philosophy/open-source-misses-the-point....
A very slow, very expensive compiler - but it's still taking the source code (the training material and model architecture) and compiling that into a binary executable (the model).
Maybe it helps to think about this at a much smaller scale. There are plenty of interesting machine learning models which can be trained on a laptop in a few seconds (or a few minutes). That process feels very much like a compiler - takes less time to compile than a lot of large C++ projects.
Running on a GPU cluster for a month is the exact same process, just scaled up.
Huge projects like Microsoft Windows take hours to compile and that process often runs on expensive clusters, but it's still considered compilation.
https://time.com/6247678/openai-chatgpt-kenya-workers/
And billion-dollar companies made their money off it:
https://www.forbes.com/sites/kenrickcai/2023/04/11/how-alexa...
That’s the dirty secret of why ChatGPT 4 is better. But they’ll tell you it has to do with chaining ChatGPT 3’s together, more fine tuning etc. They go to these poor countries and recruit people to work on training the AI.
Not to mention all the uncompensated work of humans around the world who put their content up on the Web.
It seems unreasonable to require the training data just to be called open source, given it has similar copyright challenges as game assets.
Of course, this wouldn't make the model reproducible. But that's different from open source.
Imagine if Facebook open-sourced their front-end libraries like React but not the back-end.
Imagine if Twitter or Google didn’t publish its Algorithm for how they rank things to display to different people.
You don’t need to imagine. That’s exactly what’s happening! Would you call them open source because their front end is open source? Could you host your own back end on your choice of computers?
No. That’s why I even started https://qbix.com/platform
A better analogy would be some graphics card drivers which ship a massive proprietary GPU firmware blob, and a small(ish) kernel shim to talk with said blob.
Sometimes though the software alone can be near useless without additional assets that aren't necessarily covered by the code license.
Like Quake, having the engine without the assets is useless if what you wanted was to play Quake the game. Neural nets are another prime example, as you mention. Simulators that rely on measured material property databases for usable results also fall into this category, and so on.
So perhaps what we need is new open source licenses that include the assets needed for the user to be able to reasonably use the program as a whole.