LLMs, RAG, and the missing storage layer for AI (blog.lancedb.com)

151 points by yurisagalov 2 years ago | 61 comments

panarky 2 years ago |

The first unstated assumption is that similar vectors are relevant documents, and for many use cases that's just not true. Cosine similarity != relevance. So if your pipeline pulls 2 or 4 or 12 document chunks into the LLM's context, and half or more of them aren't relevant, does this make the LLM's response more or less relevant?

The second unstated assumption is that the vector index can accurately identify the top K vectors by cosine similarity, and that's not true either. If you retrieve the top K vectors according to the vector index (instead of computing all the pairwise similarities in advance), that set of K vectors will be missing documents that have a higher cosine similarity than that of the K'th vector retrieved.

All of this means you'll need to retrieve a multiple of K vectors, figure out some way to re-rank them to exclude the irrelevant ones, and have your own ground truth to measure the index's precision and recall.
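A minimal sketch of the workflow described above, assuming nothing about any particular vector database: compute a brute-force ground truth, measure how much of it the index returned, and re-rank an over-fetched candidate list. The document ids, the toy 2-D vectors, the `min_score` cutoff, and the use of cosine as a stand-in relevance scorer are all illustrative assumptions.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def exact_top_k(query: Vector, docs: dict[str, Vector], k: int) -> list[str]:
    """Brute-force ground truth: score every document, keep the best k."""
    ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
    return ranked[:k]


def recall_at_k(retrieved: list[str], truth: list[str]) -> float:
    """Fraction of the true top-k that the index actually returned."""
    if not truth:
        return 1.0
    return len(set(retrieved) & set(truth)) / len(truth)


def rerank(candidates: list[str], score: Callable[[str], float],
           k: int, min_score: float) -> list[str]:
    """Score an over-fetched candidate list, drop weak hits, keep the best k."""
    scored = [(score(c), c) for c in candidates]
    kept = sorted((p for p in scored if p[0] >= min_score),
                  key=lambda p: p[0], reverse=True)
    return [c for _, c in kept[:k]]


# Toy data (hypothetical): four documents in a 2-D embedding space.
docs = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [0.7, 0.7]}
query = [1.0, 0.0]

truth = exact_top_k(query, docs, k=2)
index_result = ["a", "d"]  # pretend an approximate index returned this
print("recall@2:", recall_at_k(index_result, truth))

# Over-fetch 2*K candidates, then re-rank. Cosine stands in for a real
# relevance model (e.g. a cross-encoder), which is an assumption here.
top = rerank(["a", "d", "c", "b"], lambda c: cosine(query, docs[c]),
             k=2, min_score=0.5)
print("re-ranked:", top)
```

In practice the ground truth would be computed offline on a sample of queries, since scoring every document per query is exactly the cost the index exists to avoid.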

brigadier132 2 years ago | |

The vectors are literally constructed so that cosine similarity is semantic similarity.

> second unstated assumption is that the vector index can accurately identify the top K vectors by cosine similarity, and that's not true either

It's not unstated; it's called ANN (approximate nearest neighbor) for a reason.

godelski 2 years ago | | |

> The vectors are literally constructed so that cosine similarity is semantic similarity.

Are they? A learned embedding doesn't guarantee this and a positional embedding certainly doesn't. Our latent embeddings don't either unless you are inferring this through the dot product in the attention mechanism. But that too is learned. There are no guarantees that the similarities that they learn are the same things we consider as similarities. High dimensional space is really weird.

And while we're at it, we should mention that methods like t-SNE and UMAP are clustering algorithms not dimensional reduction. Just because they can find ways to cluster the data in a lower dimensional projection (epic mapping) doesn't mean that they are similar in the higher dimensional space. It all depends on the ability to unknot in the higher dimensional space.

It is extremely important to do what the OP is doing and consider the assumptions of the model, data, and measurements. Good results do not necessarily mean good methods. I like to say that you don't need to know math to make a good model, but you do need to know math to know why your model is wrong. Your comment just comes off as dismissive rather than actually countering the claims. There are plenty more assumptions than the OP listed, too. But those assumptions don't mean the model won't work; they just describe the constraints the model is working under. We want to understand the constraints and assumptions if we want to make better models. Large models have advantages because they can have larger latent spaces, and that gives them a lot of freedom to unknot data and move it around as they please. But that doesn't mean the methods are efficient.

spott 2 years ago | | |

To be fair… semantic similarity isn’t the same as relevance either.

They are related, and we frequently assume they are close enough that it doesn’t matter, but they are different.

BoorishBears 2 years ago | | |

This is kind of a moot argument, semantic similarity is higher dimensionality than cosine similarity can capture.

If I'm using vectors for question/answer, then:

"What is a cat"

and

"What is a dog"

Should be more dissimilar than the documents answering either.

If I'm using it for FAQ filtering then they should be more similar.

sgt101 2 years ago | | |

yes - but calculating the cosine similarity for all the candidates is prohibitively expensive.

hence heuristic.
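To make the "heuristic" concrete, here is a toy sketch of one approximate technique, random-hyperplane hashing (a form of locality-sensitive hashing). It is not how any particular vector database works; production indexes more commonly use structures such as HNSW or IVF. The dimensions, plane count, and seeds are arbitrary assumptions.

```python
import random
from collections import defaultdict


def make_hyperplanes(dim: int, n_planes: int, seed: int = 0) -> list[list[float]]:
    """Draw random hyperplanes through the origin."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_planes)]


def signature(vec: list[float], planes: list[list[float]]) -> tuple[int, ...]:
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(
        1 if sum(v * p for v, p in zip(vec, plane)) >= 0 else 0
        for plane in planes
    )


def build_index(docs: dict[str, list[float]], planes: list[list[float]]):
    """Group document ids by signature so a query only scans one bucket."""
    buckets = defaultdict(list)
    for doc_id, vec in docs.items():
        buckets[signature(vec, planes)].append(doc_id)
    return buckets


def candidates(query: list[float], buckets, planes) -> list[str]:
    """Only documents sharing the query's bucket are scored afterwards."""
    return buckets.get(signature(query, planes), [])


rng = random.Random(42)
docs = {f"doc{i}": [rng.gauss(0.0, 1.0) for _ in range(8)] for i in range(200)}
planes = make_hyperplanes(dim=8, n_planes=6)
buckets = build_index(docs, planes)

found = candidates(docs["doc3"], buckets, planes)
print(len(found), "of", len(docs), "vectors would be scored")
```

A true neighbor that falls just across one hyperplane lands in a different bucket and is never scored, which is the kind of miss the parent comment describes.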

Jimmc414 2 years ago | |

Switching to Word2Vec embeddings led to a substantial improvement in my cosine similarity evaluations for text similarity, but granted I was looking for actual similarity, not relevance. I tried many different methods and had lots of mediocre results initially.

code: https://github.com/jimmc414/document_intelligence/blob/main/... https://github.com/jimmc414/document_intelligence

Nowado 2 years ago | | |

Interesting, do you happen to have some quantitative results on this/additional insights/etc?

I've interpreted transformer vector similarity as 'likelihood to be followed by the same thing' which is close to word2vec's 'sum of likelihoods of all words to be replaced by the other set' (kinda), but also very different in some contexts.

sandGorgon 2 years ago | | |

this is very interesting. you had better results here than the openai ada02 and other embeddings like bge ?

bugglebeetle 2 years ago | | |

As opposed to sentencebert or what?

NhanH 2 years ago | |

Could you please explain your 2nd paragraph a bit? I couldn’t quite understand either the problem statement or the reasoning itself.

choppaface 2 years ago | |

"Cosine similarity != relevance" In all ML search products, there's a tradeoff between precision and recall, and moreover there's almost never any "gold" data that ensures the "correctness" of surfaced results. I mean, Bing and Google have both invested millions of dollars in labeling web pages and even evaluating search results, but those labels can become useless as your set of documents change.

Cosine similarity is a useful compromise, and yes, a lot of authors take this for granted. At the end of the day, an LLM product probably won't be evaluated on accuracy but rather on "lift" over an alternative. And the evaluation will be in units of user happiness.

> All of this means you'll need to retrieve a multiple of K vectors, figure out some way to re-rank them to exclude the irrelevant ones, and have your own ground truth to measure the index's precision and recall.

This is usually a Series E problem, not a Series A problem.

saliagato 2 years ago | |

Azure Cognitive Search takes care of all of this by combining semantic search with other layers of traditional search methods.

ianpurton 2 years ago |

As an architect working on LLM applications I have these criteria for a database.

- Full SQL support

- Has good tooling around migrations (i.e. dbmate)

- Good support for running in Kubernetes or in the cloud

- Well understood by operations i.e. backups and scaling

- Supports vectors and similarity search.

- Well supported client libraries

So basically Postgres and PgVector.
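For readers who have not used pgvector, a rough sketch of what similarity search looks like there. The table name `chunks`, its columns, and the 3-dimensional embedding are hypothetical; the `%s` placeholders assume a psycopg-style driver, which is not imported here so the snippet runs without a database. `<=>` is pgvector's cosine-distance operator.

```python
# Hypothetical schema: table and column names are illustrative only.
SETUP_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS chunks (
    id bigserial PRIMARY KEY,
    body text NOT NULL,
    embedding vector(3)  -- tiny dimension, for illustration
);
"""

# <=> is cosine distance in pgvector: smaller means more similar.
SEARCH_SQL = """
SELECT id, body, embedding <=> %s::vector AS cosine_distance
FROM chunks
ORDER BY embedding <=> %s::vector
LIMIT %s;
"""


def to_vector_literal(vec) -> str:
    """Format a Python sequence in pgvector's text form, e.g. '[1.0,2.0]'."""
    return "[" + ",".join(str(float(x)) for x in vec) + "]"


def search_params(query_vec, k: int) -> tuple:
    """Parameters for SEARCH_SQL: the vector appears twice, then the limit."""
    literal = to_vector_literal(query_vec)
    return (literal, literal, k)


print(SEARCH_SQL.strip())
print(search_params([0, 1, 0], 5))
```

With a real connection these would be passed to the driver's `execute` call, and the same query can be combined with ordinary `WHERE` clauses and joins.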

freedmand 2 years ago |

I don’t fully understand the fascination with retrieval augmented generation. The retrieval part is already really good and computationally inexpensive — why not just pass the semantic search results to the user in a pleasant interface and allow them to synthesize their own response? Reading a generated paragraph that obscures the full sourcing seems like a practice that’s been popularized to justify using the shiny new tech, but is the generated part what users actually want? (Not to mention there is no bulletproof way to prevent hallucinations, lies, and prompt injection even with retrieval context.)

sdenton4 2 years ago | |

On the modeling side, it's compelling to separate the memory from the linguistic skills. Vector search is hella fast and can be very good. So you can off load the memorization part of the problem, and let the language model focus on the language. This should allow better performance with much smaller models.

nottheengineer 2 years ago | |

I really like using LLMs to learn stuff because they can explain anything at the exact level I need. Hallucination is a big problem with that and RAG pretty much solves it. If I give chatGPT a good stackoverflow post and tell it to dumb it down for me, it does very well. RAG just automates that process with the added benefit of not letting the LLM decide which information to retrieve, which should greatly reduce the chance of accidentally biasing the model with your prompt.
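A small sketch of the "automated" version of pasting a good post into the chat: retrieved chunks are numbered and placed ahead of the question so the answer can point back to them. The function name, the instruction wording, and the example source are assumptions, and no model is called; wording like this does not by itself guarantee the model stays within the sources.

```python
def build_rag_prompt(question: str, chunks: list[tuple[str, str]]) -> str:
    """Assemble a prompt from (source, text) pairs chosen by a retriever."""
    lines = [
        "Answer using only the numbered sources below.",
        "Cite sources like [1]. If they do not contain the answer, say so.",
        "",
    ]
    for i, (source, text) in enumerate(chunks, start=1):
        lines.append(f"[{i}] ({source}) {text}")
    lines += ["", f"Question: {question}"]
    return "\n".join(lines)


# Hypothetical retrieved chunk; in a real pipeline this comes from the index.
chunks = [
    ("stackoverflow.example/q/1",
     "A Python list comprehension builds a list from an iterable in one expression."),
]
prompt = build_rag_prompt("What is a list comprehension? Keep it simple.", chunks)
print(prompt)
```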

matchagaucho 2 years ago | |

In a strict "one question / one response" search, raw semantic search results are a great solution, and they consume far fewer tokens.

In conversational AI, providing search results appended to a long-memory context produces "human-like" results.

jorgemf 2 years ago | |

The main reason is that you might not want the raw information but some reasoning on top of it. An LLM draws not only on the context but on all the information it was trained on. For example, when a math student asks a question, they don't want the raw theorems but some reasoning with them, and current LLMs can do that. They will sometimes make mistakes because of hallucinations, but for questions that aren't very difficult they usually give you the right answer. That helps a lot when you are not an expert in the domain, and it is why GPT4 is a great tool for students: it helps you understand the basics as if you had a teacher with you.

zawaideh 2 years ago | |

Sometimes what I want is to ask Google/Alexa/Siri a question and get a summary response along with the source. I think that would be a good application of the above.

Less so IMO when I’m on my phone or in front of the computer.

jamesblonde 2 years ago |

It's not clear to me that only a vector DB should be used for RAG. Vector DBs give you stochastic responses.

For customer chatbots, it seems that structured data (from an operational database or a feature store) adds more value. If the user asks about an order they made or a product they have a question about, you use the user-id (when logged in) to retrieve all info about what the user bought recently, and the LLM will figure out what the prompt is referring to.

Reference:

https://www.hopsworks.ai/dictionary/retrieval-augmented-llm
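A sketch of the structured-retrieval idea, using an in-memory SQLite database as a stand-in for the operational store. The `orders` schema, the sample rows, and the prompt wording are all made up for illustration. The lookup is an exact, deterministic query keyed on the logged-in user's id, and the model is left to work out which row the question refers to.

```python
import sqlite3


def recent_orders(conn: sqlite3.Connection, user_id: int, limit: int = 5):
    """Deterministic lookup: the user's most recent orders, newest first."""
    return conn.execute(
        "SELECT order_id, product, status, ordered_at FROM orders "
        "WHERE user_id = ? ORDER BY ordered_at DESC LIMIT ?",
        (user_id, limit),
    ).fetchall()


def build_prompt(question: str, rows) -> str:
    """Put the retrieved rows in front of the customer's question."""
    facts = "\n".join(
        f"- order {oid}: {product}, status {status}, placed {ts}"
        for oid, product, status, ts in rows
    )
    return f"Customer's recent orders:\n{facts}\n\nCustomer question: {question}"


# Hypothetical schema and data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER, user_id INTEGER, "
    "product TEXT, status TEXT, ordered_at TEXT)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?, ?)",
    [
        (1, 101, "kettle", "delivered", "2023-08-01"),
        (2, 101, "toaster", "shipped", "2023-08-20"),
        (3, 202, "lamp", "delivered", "2023-08-15"),
    ],
)

rows = recent_orders(conn, user_id=101)
prompt = build_prompt("Where is my toaster?", rows)
print(prompt)
```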

jarulraj 2 years ago | |

Thanks for sharing that observation on customer chatbots.

1. Will that query look like this:

  SELECT LLM("{user_question}", order_info)
  FROM postgres_data.order_table
  WHERE user_id = '101';

2. How will a feature store, like Hopsworks, help in this app?

Shameless self-plug: We are building EvaDB [1], a query engine for shipping fast AI-powered apps with SQL. Would love to exchange notes on such apps if you're up for it!

[1] https://github.com/georgia-tech-db/evadb

jamesblonde 2 years ago | | |

Why would your projection be this - SELECT LLM("{user_question}", ?

You can train a small LLM on your private data to map the user question to tables in your db.

Then just select with a limit (or a time bound). The feature store is just another operational store that could have relevant data for the query.

J_Shelby_J 2 years ago | |

And for technical documentation or code I'm unclear how well semantic search works for CEQ.

I would assume the embedding model isn't trained on code and specific words that are industry/company specific.

Charon77 2 years ago |

A lot of things mentioned are too handwaved and not explained well.

It's not explained how a vector DB is going to help when incumbents like chatgpt4 can already call functions and make API calls.

It doesn't make AI less of a black box; it's irrelevant and not explained.

There are already ways to fine-tune models without expensive hardware, such as using LoRA to inject small layers trained on customized data, which trains in a fraction of the time and resources needed to retrain the model.
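Some arithmetic behind the LoRA claim, as a rough sketch: LoRA freezes a weight matrix W of shape d_out x d_in and trains only a low-rank update B·A, where B is d_out x r and A is r x d_in. The 4096 x 4096 layer and rank 8 below are illustrative numbers, not taken from any specific model.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Parameters updated when fine-tuning the whole weight matrix."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in the low-rank factors B (d_out x r) and A (r x d_in)."""
    return rank * (d_out + d_in)


# Illustrative sizes (assumptions): one square projection layer, rank 8.
d_out, d_in, rank = 4096, 4096, 8
full = full_params(d_out, d_in)
lora = lora_params(d_out, d_in, rank)
print(f"full: {full}, LoRA: {lora}, fraction trained: {lora / full:.4f}")
```

Fewer trainable parameters means less optimizer state and gradient memory, which is where most of the hardware savings come from; the frozen base weights still have to be loaded for the forward pass.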

antupis 2 years ago | |

There are lots of things you don’t want to leak, e.g. customer-specific data. For those cases vectors are great.

juxtaposicion 2 years ago |

We use Lance extensively at my startup. This blog post (previously on HN) details nicely why: https://thedataquarry.com/posts/vector-db-4/ but essentially it’s because Lance is “just a file” in the same way SQLite is “just a file”, which makes it embedded and serverless and straightforward to use locally or in a deployment.

zwaps 2 years ago |

I find it quite comical to speak of a "missing storage layer" during your own self-promotion, considering that the market for vector databases is literally overflowing right now.

Everything else may be missing, but not the storage layer.

saaaaaam 2 years ago |

Does ChatGPT always start articles with “in the rapidly evolving landscape of X”?

Surely if you’re posting an article promoting miraculous AI tech you should human edit the article summary so that it’s not really obviously drafted by AI.

Or just use the prompt “tone your writing down and please remember that you’re not writing for a high school student who is impressed by nonsensical hyperbole”. I’ve started using this prompt and it works astonishingly well in the fast evolving landscape of directionless content creation.

amelius 2 years ago |

Unrelated question: is there a standard way for writing down neural network diagrams? I'm thinking of how it is done in electrical circuit schematics, which capture all relevant information in a single diagram, in a (mostly) standardized way.

I've seen the diagrams in DL papers etc. but I guess everyone invents their own conventions, and the diagrams often don't convey the complete flow of information.

gillesjacobs 2 years ago | |

There are conventions, and most frameworks have libraries to export diagrams to LaTeX or images (e.g., TorchViz).

Visualizations are highly context and usage dependent anyway. Generally, there is no value in showing fully connected or feed forward layers in detail outside of teaching materials.

amelius 2 years ago | | |

> Generally, there is no value in showing fully connected or feed forward layers in detail outside of teaching materials.

Well, in electrical circuit diagrams it is customary to draw e.g. a signal bus as a single connection, with the number of wires in the bus written next to it (with a little strike-through line). I'm guessing something similar can be done for DL networks.

eth0pal 2 years ago |

Shameless self promotion

dr_dshiv 2 years ago |

404