LLMs, RAG, and the missing storage layer for AI (blog.lancedb.com)
The second unstated assumption is that the vector index can accurately identify the top K vectors by cosine similarity, and that's not true either. If you retrieve the top K vectors according to the vector index (instead of computing all the pairwise similarities in advance), that set of K vectors may be missing documents that have a higher cosine similarity than the K'th vector retrieved.
All of this means you'll need to retrieve a multiple of K vectors, figure out some way to re-rank them to exclude the irrelevant ones, and have your own ground truth to measure the index's precision and recall.
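A minimal sketch of that workflow, assuming NumPy and random data. The lossy random projection here is only a stand-in for a real approximate index, and every name (`approx_candidates`, `retrieve`, `oversample`) is made up for illustration. It fetches a multiple of K candidates, re-ranks them with exact cosine similarity, and measures recall against a brute-force ground truth.

```python
import numpy as np

def normalize(x):
    # Scale vectors to unit length so a dot product equals cosine similarity.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def exact_top_k(query, docs, k):
    # Ground truth: brute-force cosine similarity against every document.
    return np.argsort(-(docs @ query))[:k]

def approx_candidates(query, docs, projection, n):
    # Stand-in for an ANN index: search in a lossy low-dimensional projection.
    scores = normalize(docs @ projection) @ normalize(query @ projection)
    return np.argsort(-scores)[:n]

def retrieve(query, docs, projection, k, oversample):
    # Fetch oversample * k candidates, then re-rank them with exact similarity.
    cand = approx_candidates(query, docs, projection, k * oversample)
    exact = docs[cand] @ query
    return cand[np.argsort(-exact)[:k]]

def recall_at_k(retrieved, truth):
    # Fraction of the true top K that the retrieval actually returned.
    return len(set(retrieved.tolist()) & set(truth.tolist())) / len(truth)

rng = np.random.default_rng(0)
docs = normalize(rng.normal(size=(2000, 64)))
queries = normalize(rng.normal(size=(20, 64)))
projection = rng.normal(size=(64, 8))
k = 10

def mean_recall(oversample):
    return float(np.mean([
        recall_at_k(retrieve(q, docs, projection, k, oversample),
                    exact_top_k(q, docs, k))
        for q in queries
    ]))

recall_1x = mean_recall(1)
recall_10x = mean_recall(10)
print(f"recall@{k} at 1x: {recall_1x:.2f}, at 10x: {recall_10x:.2f}")
```

Oversampling can only keep or raise recall here, because the larger candidate set contains the smaller one and the exact re-rank puts any true neighbours it finds at the top.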
> second unstated assumption is that the vector index can accurately identify the top K vectors by cosine similarity, and that's not true either
It's not unstated; it's called ANN (approximate nearest neighbor) for a reason.
Are they? A learned embedding doesn't guarantee this and a positional embedding certainly doesn't. Our latent embeddings don't either unless you are inferring this through the dot product in the attention mechanism. But that too is learned. There are no guarantees that the similarities that they learn are the same things we consider as similarities. High dimensional space is really weird.
And while we're at it, we should mention that methods like t-SNE and UMAP are clustering algorithms not dimensional reduction. Just because they can find ways to cluster the data in a lower dimensional projection (epic mapping) doesn't mean that they are similar in the higher dimensional space. It all depends on the ability to unknot in the higher dimensional space.
It is extremely important to do what the OP is doing and consider the assumptions of the model, data, and measurements. Good results do not necessarily mean good methods. I like to say that you don't need to know math to make a good model, but you do need to know math to know why your model is wrong. Your comment just comes off as dismissive rather than actually countering the claims. There's plenty more assumptions than OP listed too. But their assumptions don't mean the model won't work, it just means what constraints the model is working under. We want to understand the constraints/assumptions if we want to make better models. Large models have advantages because they can have larger latent spaces and that gives them a lot of freedom to unknot data and move them around as they please. But that doesn't mean the methods are efficient.
They are related, and we frequently assume they are close enough that it doesn’t matter, but they are different.
If I'm using vectors for question/answer, then:
"What is a cat"
and
"What is a dog"
Should be more dissimilar than the documents answering either.
If I'm using it for FAQ filtering then they should be more similar.
hence heuristic.
code: https://github.com/jimmc414/document_intelligence/blob/main/... https://github.com/jimmc414/document_intelligence
I've interpreted transformer vector similarity as 'likelihood to be followed by the same thing' which is close to word2vec's 'sum of likelihoods of all words to be replaced by the other set' (kinda), but also very different in some contexts.
Cosine similarity is a useful compromise, and yes, a lot of authors take this for granted. At the end of the day, an LLM product probably won't be evaluated on accuracy but rather on "lift" over an alternative. And the evaluation will be in units of user happiness.
> All of this means you'll need to retrieve a multiple of K vectors, figure out some way to re-rank them to exclude the irrelevant ones, and have your own ground truth to measure the index's precision and recall.
This is usually a Series E problem, not a Series A problem.
- Full SQL support
- Has good tooling around migrations (i.e. dbmate)
- Good support for running in Kubernetes or in the cloud
- Well understood by operations i.e. backups and scaling
- Supports vectors and similarity search.
- Well supported client libraries
So basically Postgres and PgVector.
In conversational AI, providing search results appended to a long-memory context produces "human-like" results.
Less so IMO when I’m on my phone or in front of the computer.
For customer chatbots, it seems that structured data - from an operational database or a feature store adds more value. If the user asks about an order they made or a product they have a question about, you use the user-id (when logged in) to retrieve all info about what the user bought recently - the LLM will figure out what the prompt is referring to.
Reference:
1. Will that query look like this:
SELECT LLM("{user_question}", order_info)
FROM postgres_data.order_table
WHERE user_id = '101';
2. How will a feature store, like Hopsworks, help in this app?
Shameless self-plug: We are building EvaDB [1], a query engine for shipping fast AI-powered apps with SQL. Would love to exchange notes on such apps if you're up for it!
You can train a small llm on your private data to map the user question to tables in your db.
Then just select with a limit (or a time bound). The feature store is just another operational store that could have relevant data for the query.
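A rough sketch of the "select with a limit" idea, assuming an in-memory SQLite database and a hypothetical `orders` table (the schema, column names, and sample rows are all invented for illustration). It pulls the logged-in user's most recent orders and folds them into a prompt; the LLM call itself is left out.

```python
import sqlite3

def recent_orders(conn, user_id, limit=5):
    # Parameterised query: the user id is never interpolated into the SQL text.
    return conn.execute(
        "SELECT order_id, product, status, ordered_at FROM orders "
        "WHERE user_id = ? ORDER BY ordered_at DESC LIMIT ?",
        (user_id, limit),
    ).fetchall()

def build_prompt(question, rows):
    # Put the structured rows ahead of the question so the model can resolve
    # references such as "my toaster".
    lines = [
        f"- order {order_id}: {product} ({status}, {ordered_at})"
        for order_id, product, status, ordered_at in rows
    ]
    context = "\n".join(lines) if lines else "(no recent orders)"
    return (
        f"Recent orders for this customer:\n{context}\n\n"
        f"Customer question: {question}"
    )

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER, user_id TEXT, product TEXT, "
    "status TEXT, ordered_at TEXT)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?, ?)",
    [
        (1, "101", "kettle", "delivered", "2023-07-01"),
        (2, "101", "toaster", "shipped", "2023-07-20"),
        (3, "202", "blender", "delivered", "2023-07-15"),
    ],
)

rows = recent_orders(conn, "101", limit=5)
prompt = build_prompt("Where is my toaster?", rows)
print(prompt)
```

Filtering on `user_id` in the query means other customers' rows never reach the prompt at all.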
I would assume the embedding model isn't trained on code and specific words that are industry/company specific.
It's not explained how vector DB is going to help while incumbents like chatgpt4 can already call functions and do API calls.
It doesn't make AI any less of a black box; that claim is irrelevant and not explained.
There are already ways to fine-tune models without expensive hardware, such as using LoRA to inject small layers trained on customized data, which takes a fraction of the time and resources needed to retrain the model.
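For a sense of why LoRA is cheap, here is a small NumPy sketch of the low-rank update on a single weight matrix. The sizes, rank `r`, and scaling `alpha` are arbitrary assumptions, and no training is performed; it only shows the forward pass and the parameter counts.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 512, 512, 8, 16

W = rng.normal(size=(d_out, d_in))           # frozen pretrained weight
A = rng.normal(scale=0.01, size=(r, d_in))   # trainable, small random init
B = np.zeros((d_out, r))                     # trainable, zero init: update starts at 0

def lora_forward(x):
    # Base output plus the scaled low-rank correction B @ A applied to x.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
full_params = W.size
lora_params = A.size + B.size
print(f"full fine-tune: {full_params} params, LoRA: {lora_params} params")
```

With these sizes only `A` and `B` would be trained, which is 8,192 values against 262,144 in the frozen matrix.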
Everything else may be missing, but not the storage layer.
Surely if you’re posting an article promoting miraculous AI tech you should human edit the article summary so that it’s not really obviously drafted by AI.
Or just use the prompt “tone your writing down and please remember that you’re not writing for a high school student who is impressed by nonsensical hyperbole”. I’ve started using this prompt and it works astonishingly well in the fast evolving landscape of directionless content creation.
I've seen the diagrams in DL papers etc. but I guess everyone invents their own conventions, and the diagrams often don't convey the complete flow of information.
Visualizations are highly context- and usage-dependent anyway. Generally, there is no value in showing fully connected or feed-forward layers in detail outside of teaching materials.
Well, in electrical circuit diagrams it is customary to draw e.g. a signal bus as a single connection, with the number of wires in the bus written next to it (with a little strike-through line). I'm guessing something similar can be done for DL networks.
As a thought-experiment for people who don't understand why you need (for example) regular relational columns alongside vector storage, consider how you would implement RAG for a set of documents where not everyone has permission to view every document. In the pgvector case it's easy - I can add one or more label columns and then when I do my search query filter to only include labels that user has permission to view. Then my vector similarity results will definitely not include anything that violates my access control. Trivial with something like pgvector - basically impossible (afaics) with special-purpose vector stores.
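To make the access-control case concrete, here is a sketch in plain NumPy rather than pgvector itself; the label names and the `search_with_acl` helper are hypothetical. In pgvector the same idea would presumably be a `WHERE` clause on the label column combined with an `ORDER BY` on a distance operator.

```python
import numpy as np

def search_with_acl(query, doc_vectors, doc_labels, allowed_labels, k):
    # Keep only rows whose label the user may see, then rank the survivors
    # by cosine similarity. Hidden rows never enter the ranking at all.
    visible = np.array([label in allowed_labels for label in doc_labels])
    idx = np.flatnonzero(visible)
    if idx.size == 0:
        return []
    docs = doc_vectors[idx]
    sims = (docs @ query) / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))
    order = np.argsort(-sims)[:k]
    return idx[order].tolist()

doc_vectors = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7]])
doc_labels = ["public", "finance", "public", "hr"]
query = np.array([1.0, 0.0])

public_hits = search_with_acl(query, doc_vectors, doc_labels, {"public"}, k=2)
finance_hits = search_with_acl(query, doc_vectors, doc_labels, {"public", "finance"}, k=2)
print(public_hits, finance_hits)
```

Filtering before ranking matters: filtering after a top-K search can return fewer than K rows, or none, when the nearest neighbours are all documents the user cannot see.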
Or think about ranking. Say you want to do RAG over a space where you want to prioritise the most recent results, not just pure similarity. Or prioritise on a set of other features somehow (source credibility whatever). Easy to do if you have relational columns, no bueno if you just have a vector store.
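One possible way to blend recency into the ranking, sketched under assumed values: the half-life, the weight, and the candidate tuples are all invented, and a relational store would compute the same expression in its `ORDER BY`.

```python
def blended_score(similarity, age_days, half_life_days=30.0, recency_weight=0.3):
    # Exponential decay: a document loses half its recency score every half-life.
    recency = 0.5 ** (age_days / half_life_days)
    return (1.0 - recency_weight) * similarity + recency_weight * recency

def rank(candidates, **kwargs):
    # candidates: list of (doc_id, cosine_similarity, age_in_days)
    return sorted(
        candidates,
        key=lambda c: blended_score(c[1], c[2], **kwargs),
        reverse=True,
    )

candidates = [
    ("old_exact", 0.90, 365),
    ("fresh_close", 0.85, 2),
    ("fresh_weak", 0.40, 1),
]
ranked = [doc_id for doc_id, _, _ in rank(candidates)]
print(ranked)
```

The year-old document has the highest raw similarity but drops below the two-day-old one, while a fresh but weakly related document still ranks last.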
And that's not to mention the obvious things around ACID, availability, recovery, replication, etc.
Maybe someone could pitch in. Is knowledge really a graph (for your problem domain), or is that just some bullshit people made up when they still thought AI could be captured mathematically? It feels to me now knowledge is much more like the way vector embeddings work, it's in a cloud where things are related to each other in an analog or statistical way, not a discrete way.
But, perhaps for similar reasons, vector embeddings haven't been super useful to me in building RAG agents yet. Knowledge is either relevant or it's not, and at least for me if it's relevant it has the keywords or tags I need, and just a straight up SQL query brings it in.
The Python and TS SDKs are designed to support drop-in replacements for the bits of LangChain that don’t scale, but nothing stops you accessing Postgres directly.
Disclosure: I’m the primary author.
Can you? You've personally done this? Deployed it to production at some kind of non trivial scale and it's working well? I'm not aware of any "small llm" that approaches the quality of gpt-3.5.
They did it with a custom language model. I really want to give this a try with llama2 embeddings but haven't had the bandwidth yet (and llama2's embedding vectors are inconveniently huge, but that's a different problem).
Are there any good sources to learn more about that?
1. Ask an LLM to return a list of questions answered by the document
2. Store the embeddings of the questions along with a document ID
3. On user query, get the embedding of the user query
4. KNN cosine similarity search the user embedding vs. the corpus of question embeddings
5. Return the highest ranked documents
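The five steps above can be sketched as follows. Everything here is a toy stand-in: `embed` is a bag-of-words counter rather than a real embedding model, and the `doc_questions` dictionary plays the role of the LLM output from step 1.

```python
import numpy as np

_vocab = {}

def embed(text, dim=256):
    # Toy stand-in for an embedding model: word counts, scaled to unit length.
    vec = np.zeros(dim)
    for word in text.lower().split():
        token = word.strip("?.,!")
        slot = _vocab.setdefault(token, len(_vocab) % dim)
        vec[slot] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

def build_index(doc_questions):
    # Steps 1-2: doc_questions maps doc_id -> questions the document answers.
    ids, vectors = [], []
    for doc_id, questions in doc_questions.items():
        for question in questions:
            ids.append(doc_id)
            vectors.append(embed(question))
    return ids, np.vstack(vectors)

def search(query, ids, vectors, k=1):
    # Steps 3-5: embed the query, score it against every stored question,
    # and return the distinct documents behind the best-scoring questions.
    scores = vectors @ embed(query)
    ranked = []
    for i in np.argsort(-scores):
        if ids[i] not in ranked:
            ranked.append(ids[i])
        if len(ranked) == k:
            break
    return ranked

doc_questions = {
    "cats.md": ["What is a cat?", "How long do cats live?"],
    "dogs.md": ["What is a dog?", "How often should a dog be walked?"],
}
ids, vectors = build_index(doc_questions)
result = search("what is a dog", ids, vectors, k=1)
print(result)
```

Because the query is compared with questions rather than with document text, both sides of the similarity search have the same shape.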
You can tweak this approach depending on your use case, so that in step 1 you generate embeddings that are more similar to the types of things you want returned in step 5. If you want the answer to "What is a cat" to be similar to "What is a dog," you'd prompt/finetune the LLM in step 1 to generate broad questions that would encompass both; if you want them to be very different, you'd do the opposite and avoid generalities.
https://www.sbert.net/examples/domain_adaptation/README.html https://arxiv.org/abs/2112.07577
I'm able to pull messy results directly from internet sources and re-rank on the fly with a quantized e5 model small enough to fit in a serverless function.
You don't need a vector database to do all this stuff; the people who profit from others using vector databases are the ones hyping them up the most.
If by "quantized e5 model small enough to fit in a serverless function" you mean e5-small-v2, FYI it actually underperforms just calling OpenAI for embeddings (text-embedding-ada-002) on the HuggingFace MTEB benchmarks. And that definitely doesn't negate using a doc2query-style approach to preprocess the documents before running them through the pretrained embedding model if you're comparing e.g. questions to answers, rather than raw document-to-document similarity. (Of course a custom trained model will be more efficient! In fact, the original doc2query paper in 2019 used a custom trained model for step 1, as did many enhancements on it e.g. doc-t5-query. What's neat is that with the advent of really good pretrained LLMs, you can get results approximating that without training your own models in like ~5mins of work.)
Considering the LLM is still doing the final pass, and the latency from the LLM is based on output length, I find the UX to be significantly improved just doing reranking in-process.
I think there's been a bit of whiplash, where people went from gatekeeping "hard ML" to "I can shove this all at a REST API", but there's a golden path lying in between for use cases where UX matters.
I even fall back to old school NLP (like ML-less, glorified wordlist POS taggers) for LLM tasks and end up with significantly improved performance for almost 0 additional effort