Ran 5k queries on 50k documents to understand the file vs. vector RAG debate

2 points by gdad 149 days ago | 0 comments


I was curious about all the noise around file-based RAG as opposed to vector RAG, so I benchmarked Tantivy (keyword search) against Chroma (vector search) to quantify the trade-offs in modern RAG pipelines. I used 5 datasets: CodeXGlue, MS MARCO, SQuAD, HotpotQA, and SciQ.

- Indexing/embedding was 76x slower for vectors (on the order of seconds vs. milliseconds). Query latency was 11x slower for vectors.

- On SciQ, keyword search outperformed vectors by 32% (MRR). Terms like "mitochondria" work as exact keys, not as loose semantics. Vectors tended to drift toward semantically similar but factually incorrect answers.

- In HotpotQA, I noticed a trend where vectors find the "answer" document but miss the "bridge" document because it isn't semantically similar to the prompt. Finding the right document is not the same as having enough context to prove the answer.
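To make the HotpotQA point concrete, here is a minimal sketch of two ways to score one multi-hop query: whether any supporting document shows up in the top k, versus whether all of them do. The function names and document IDs are made up for illustration and are not from my benchmark code.

```python
from typing import Sequence, Set


def answer_hit(retrieved: Sequence[str], supporting: Set[str], k: int) -> bool:
    """True if at least one supporting document is in the top-k results."""
    return any(doc_id in supporting for doc_id in retrieved[:k])


def full_support_hit(retrieved: Sequence[str], supporting: Set[str], k: int) -> bool:
    """True only if every supporting document is in the top-k results."""
    if not supporting:
        return False  # nothing to find, so do not count it as a hit
    return supporting.issubset(retrieved[:k])


# Hypothetical multi-hop query that needs both an "answer" and a "bridge" document.
supporting_docs = {"answer_doc", "bridge_doc"}
vector_top3 = ["answer_doc", "distractor_1", "distractor_2"]

found_any = answer_hit(vector_top3, supporting_docs, k=3)
found_all = full_support_hit(vector_top3, supporting_docs, k=3)
print(found_any, found_all)
```

A retriever can look fine on the first measure and still fail the second, which is the gap between finding the right document and having enough context to prove the answer.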

The Data (MRR):

[table rows missing]
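For anyone unfamiliar with the metric, MRR (mean reciprocal rank) averages 1/rank of the first relevant result over all queries, with 0 for queries where nothing relevant is returned. A minimal sketch of the standard definition follows; the helper names and toy document IDs are illustrative, not taken from the benchmark.

```python
from typing import Iterable, Sequence, Set, Tuple


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1/rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: Iterable[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average reciprocal rank over (retrieved_list, relevant_set) pairs."""
    runs = list(runs)
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ret, rel) for ret, rel in runs) / len(runs)


# Toy example: first query hits at rank 1, second at rank 2, third misses.
toy_runs = [
    (["d1", "d2"], {"d1"}),
    (["d3", "d4"], {"d4"}),
    (["d5"], {"d9"}),
]
toy_mrr = mean_reciprocal_rank(toy_runs)  # (1 + 0.5 + 0) / 3
print(toy_mrr)
```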

Curious to learn if others have similar observations or views.
