Vector indexing all of Wikipedia on a laptop

Vector indexing all of Wikipedia on a laptop (foojay.io)

513 points by tjake 2 years ago | 140 comments

> JVector, the library that powers DataStax Astra vector search, now supports indexing larger-than-memory datasets by performing construction-related searches with compressed vectors. This means that the edge lists need to fit in memory, but the uncompressed vectors do not, which gives us enough headroom to index Wikipedia-en on a laptop.

It's interesting to note that JVector accomplishes this differently than how DiskANN described doing it. My understanding (based on the links below, but I didn't read the full diff in #244) is that JVector will incrementally compress the vectors it is using to construct the index; whereas DiskANN described partitioning the vectors into subsets small enough that indexes can be built in-memory using uncompressed vectors, building those indexes independently, and then merging the results into one larger index.

OP, have you done any quality comparisons between an index built with JVector using the PQ approach (small RAM machine) vs. an index built with JVector using the raw vectors during construction (big RAM machine)? I'd be curious to understand what this technique's impact is on the final search results.

I'd also be interested to know if any other vector stores support building indexes in limited memory using the partition-then-merge approach described by DiskANN.

Finally, it's been a while since I looked at this stuff, so if I mis-wrote or mis-understood please correct me!

- DiskANN: https://dl.acm.org/doi/10.5555/3454287.3455520

- Anisotropic Vector Quantization (PQ Compression): https://arxiv.org/abs/1908.10396

- JVector/#168: How to support building larger-than-memory indexes https://github.com/jbellis/jvector/issues/168

- JVector/#244: Build indexes using compressed vectors https://github.com/jbellis/jvector/pull/244

pbadams 2 years ago | |

It's not mentioned in the original paper, but DiskANN also supports PQ at build-time via `--build_PQ_bytes`, though it's a tradeoff with the graph quality as you mention.

One interesting property in benchmarking is that the distance comparison implementations for full-dim vectors can often be more efficient than those for PQ-compressed vectors (straight-line SIMD execution vs table lookups), so on some systems cluster-and-merge is relatively competitive in terms of build performance.
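
To make the "straight-line SIMD execution vs table lookups" contrast concrete, here is a minimal pure-Python sketch of the two distance paths. It is not JVector's or DiskANN's code: the codebooks are random rather than trained with k-means, and every name in it (`encode`, `build_lookup_table`, `pq_distance`) is made up for illustration.

```python
import random

def squared_l2(a, b):
    """Full-dimension squared Euclidean distance: one straight pass over the floats."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def encode(vector, codebooks):
    """Replace each subvector with the index of its nearest centroid."""
    sub_dim = len(vector) // len(codebooks)
    codes = []
    for m, centroids in enumerate(codebooks):
        sub = vector[m * sub_dim:(m + 1) * sub_dim]
        nearest = min(range(len(centroids)),
                      key=lambda k: squared_l2(sub, centroids[k]))
        codes.append(nearest)
    return codes

def build_lookup_table(query, codebooks):
    """Precompute, per subspace, the distance from the query to every centroid."""
    sub_dim = len(query) // len(codebooks)
    return [
        [squared_l2(query[m * sub_dim:(m + 1) * sub_dim], c) for c in centroids]
        for m, centroids in enumerate(codebooks)
    ]

def pq_distance(table, codes):
    """Approximate distance: one table lookup per subspace, no float math on the vector."""
    return sum(table[m][code] for m, code in enumerate(codes))

# Toy setup (hypothetical sizes): 8-dim vectors, 4 subspaces, 16 centroids each.
# Real PQ trains the centroids; random ones keep the sketch short.
rng = random.Random(0)
DIM, SUBSPACES, CENTROIDS = 8, 4, 16
codebooks = [
    [[rng.uniform(-1, 1) for _ in range(DIM // SUBSPACES)] for _ in range(CENTROIDS)]
    for _ in range(SUBSPACES)
]
query = [rng.uniform(-1, 1) for _ in range(DIM)]
vector = [rng.uniform(-1, 1) for _ in range(DIM)]

codes = encode(vector, codebooks)            # 4 small integers instead of 8 floats
table = build_lookup_table(query, codebooks)
exact = squared_l2(query, vector)            # contiguous arithmetic, SIMD-friendly
approx = pq_distance(table, codes)           # data-dependent lookups, harder to vectorize
```

The compressed path touches far less memory per vector, but each step is an indexed read that depends on the code value, which is the part that tends to vectorize poorly.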

jbellis 2 years ago | |

That's correct!

I've tested the build-with-compression approach used here with all the datasets in JVector's Bench [1] and there's near zero loss in accuracy.

I suspect that the reason the DiskANN authors used the approach they did is that in 2019 Deep1B was about the only very large public dataset around, and since the vectors themselves are small your edge lists end up dominating your memory usage. So they came up with a clever solution, at the cost of making construction 2.5x as expensive. (Educated guess: 2x is from adding each vector to multiple partitions and the extra 50% to merge the results.)

So JVector is just keeping edge lists in memory today. When that becomes a bottleneck we may need to do something similar to DiskANN but I'm hoping we can do better because it's frankly a little inelegant.

[1] https://github.com/jbellis/jvector/blob/main/jvector-example...

gfourfour 2 years ago |

Maybe I’m missing something but I’ve created vector embeddings for all of English Wikipedia about a dozen times and it costs maybe $10 of compute on Colab, not $5000

burgerrito 2 years ago |

I made a side project that uses Wikipedia recently too, and found out that there are database dumps available for download: https://en.wikipedia.org/wiki/Wikipedia:Database_download

isoprophlex 2 years ago |

$5000?! I indexed all of HN for ... $50 I think. And that's tens of millions of posts.

xandrius 2 years ago | |

To be fair Wikipedia has over 60 million pages and this is for 300+ languages. But yeah, the value shows that they might not be using the cheapest service out there.

criddell 2 years ago | |

How are you using that index?

isoprophlex 2 years ago | | |

https://www.searchhacker.news/

A tool that (hopefully) surfaces interesting HN discussion threads; I wanted an excuse to investigate (hybrid) full text and vector search at a substantial scale beyond toy datasets.

Sadly (well not really) I changed jobs soon after building the first version. Life caught up and I never got around to adding more features and polishing up the frontend (e.g. the broken back button).

Ideas for new features are very welcome :)

tjake 2 years ago |

You can demo this here: https://jvectordemo.com:8443/

GH Project: https://github.com/jbellis/jvector

jbellis 2 years ago | |

[article author]

The source to build and serve the index is at https://github.com/jbellis/coherepedia-jvector

xandrius 2 years ago | |

Internal server error :(

jbellis 2 years ago | | |

Oops, HN maxed out the free Cohere API key it was using. Fixed.

bytearray 2 years ago | | |

Same.

HammadB 2 years ago |

"The obstacle is that until now, off-the-shelf vector databases could not index a dataset larger than memory, because both the full-resolution vectors and the index (edge list) needed to be kept in memory during index construction. Larger datasets could be split into segments, but this means that at query time they need to search each segment separately, then combine the results, turning an O(log N) search per segment into O(N) overall."

How is a log N search over S segments O(N)?

jbellis 2 years ago | |

I was trying to make the point that the dominant factor becomes linear instead of logarithmic, but more accurately it's O(S log N) = O(N log N) because S is proportional to N.

HammadB 2 years ago | | |

Sure, I see. I think this is an area where complexity analysis doesn’t lead to useful information.

To be more correct it’s O(N/C log C) where C is the capacity of a segment. In this case you can ignore 1/C and log C as constant. So now sure, you actually just have O(N). But this is not super useful as it says that a segmented hnsw approach and brute force approach are the same - when this is really not the case in practice.

Also O(N log N) > O(N) so I’m not sure why we would ever do anything with segmentation according to that analysis if it were correct.
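
A rough way to see the numbers behind this sub-thread is to count "comparisons" with all constants dropped. This is only a sketch of the O(N/C log C) model above: the 41M figure comes from elsewhere in the thread, the segment capacity `C` is a hypothetical value, and real graph-index costs depend on degree and beam width that are ignored here.

```python
import math

def single_index_cost(n):
    """One graph index over everything: roughly log N steps per query."""
    return math.log2(n)

def segmented_cost(n, capacity):
    """Search every segment of fixed capacity C: (N / C) * log C."""
    segments = math.ceil(n / capacity)
    return segments * math.log2(capacity)

def brute_force_cost(n):
    """Compare the query against every vector."""
    return n

N = 41_000_000   # vector count mentioned in the thread
C = 1_000_000    # hypothetical segment capacity, chosen only for illustration

single = single_index_cost(N)
segmented = segmented_cost(N, C)
brute = brute_force_cost(N)

# Doubling N doubles the number of segments, so the segmented cost doubles too.
segmented_doubled = segmented_cost(2 * N, C)
```

Under these assumptions the segmented cost grows linearly in N, like brute force, but with a much smaller constant (log C / C per vector), which is why the two behave so differently in practice even though both are O(N).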

thfuran 2 years ago | |

Doesn't doubling N double S?

jl6 2 years ago |

The source files appear to include pages from all namespaces, which is good, because a lot of the value of Wikipedia articles is held in the talk page discussions, and these sometimes get stripped from projects that use Wikipedia dumps.

worldsayshi 2 years ago | |

I'm curious what the main value you see in the talk pages? I almost never look at them myself.

jl6 2 years ago | | |

They’re not so interesting for mundane topics, but for anything remotely controversial, they are essential for understanding what perspectives aren’t included in the article.

hot_gril 2 years ago |

How many dimensions are in the original vectors? Something in the millions?

jbellis 2 years ago | |

1024 per vector x 41M vectors
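
For scale, a back-of-the-envelope size for the uncompressed vectors, assuming each component is stored as a 4-byte float32 (the storage type is an assumption, not something stated in the thread):

```python
DIMENSIONS = 1024          # per vector, from the comment above
VECTORS = 41_000_000       # vector count, from the comment above
BYTES_PER_COMPONENT = 4    # assumption: float32 components

raw_bytes = DIMENSIONS * VECTORS * BYTES_PER_COMPONENT
raw_gb = raw_bytes / 1e9       # decimal gigabytes
raw_gib = raw_bytes / 2**30    # binary gibibytes

print(f"{raw_bytes:,} bytes = {raw_gb:.1f} GB = {raw_gib:.1f} GiB")
```

Under that assumption the raw vectors alone are well beyond laptop RAM, which is why building with compressed vectors matters here.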

hot_gril 2 years ago | | |

1024-dim vectors would fit into pgvector in Postgres, which can do cosine similarity indexing and doesn't require everything to fit into memory. Wonder how the performance of that would compare to this.

noufalibrahim 2 years ago |

What are the good solutions in this space? Vector databases I mean. Mostly for semantic search across various texts.

I have a few projects I'd like to work on. For typical web projects, I have a "go to" stack and I'd like to add something sensible for vector based search to that.

StrauXX 2 years ago | |

In my experience it's usually easiest to use a vector store extension for an off-the-shelf database like Postgres (pgvector is nice). That way you don't have to manage another rapidly changing service, and you can easily combine queries on the vectors with regular columns, join them, and so on.

jbellis 2 years ago | |

JVector (the index used in TFA) is available as a service with a friendly API from DataStax. https://www.datastax.com/products/datastax-astra

[article author, I work on JVector and Astra]

riku_iki 2 years ago | | |

Could you tell how scalable JVector is? How many vectors it can handle, like millions, billions, hundreds of billions?

noufalibrahim 2 years ago | | |

Nice. I wanted to try something out on a machine before moving to hosted solutions.

localhost 2 years ago |

This is a giant dataset of 536GB of embeddings. I wonder how much compression is possible by training or fine-tuning a transformer model directly using these embeddings, i.e., no tokenization/decoding steps? Could a 7B or 14B model "memorize" Wikipedia?

anonymousDan 2 years ago |

How do embeddings created by state of the art open source models compare to the free embeddings mentioned in the article? Would they actually cost 5k to create given a reasonable local GPU setup?

lilatree 2 years ago |

“… turning an O(log N) search per segment into O(N) overall.”

Can someone explain why?

StrangeDoctor 2 years ago | |

When it’s all in memory you get to amortize the cost of the initial load. Or just pay it when it’s not part of the hot path. When it’s segmented, you’re doing that because memory is full and you need to read in all the segments you don’t have. That’ll completely overwhelm the log n of the search you still get

jbellis 2 years ago | | |

I was trying to make the point that the dominant factor becomes linear instead of logarithmic, but more accurately it's O(S log N) = O(N log N) because S (number of segments) is proportional to N (number of vectors).

opdahl 2 years ago |

Would be interesting if you could try implementing the Cohere Reranker into this. Should be fairly easy, and could lead to quite a bit of performance gain.

arnaudsm 2 years ago |

In expert topics, is vector search finally competitive with BM25-like algorithms? Or do we still need to mix the two together?

jbellis 2 years ago | |

ColBERT gives you best of both worlds.

https://arxiv.org/abs/2004.12832

https://thenewstack.io/overcoming-the-limits-of-rag-with-col...

Khelavaster 2 years ago |

This is how Microsoft powered its academic paper search in 2016, before rolling it into Bing in 2020!

Mathnerd314 2 years ago |

> Enough RAM to run a JVM with 36GB of heap space

Are there laptops like that? Maybe an upgraded MacBook, but I have been looking for Windows/Linux laptops and they generally top out at 32GB. I checked Lenovo's website and everything with 64GB and up is not called a laptop but a "mobile workstation".

cbolton 2 years ago | |

You can configure a Lenovo Z13 Gen 2 with 64GB for little extra money (and choose between Windows, Ubuntu, Fedora, or no OS preinstalled).

_zoltan_ 2 years ago | |

You can buy an M3 Max with 128GB memory.

issafram 2 years ago |

Would a docker container help running it on Windows?

jbellis 2 years ago | |

Technically it does run on Windows; you just can't build the entire dataset without adding the sharding code mentioned. Set divisor=100 in config.properties and it will happily build an index over 1% of the dataset.

cosmojg 2 years ago | |

Just use WSL. Or dual boot.

traverseda 2 years ago |

>Disable swap before building the index. Linux will aggressively try to cache the index being constructed to the point of swapping out parts of the JVM heap, which is obviously counterproductive. In my test, building with swap enabled was almost twice as slow as with it off.

This is an indication to me that something has gone very wrong in your code base.

m3kw9 2 years ago |

He should have asked HN on the cheapest way to embed Wikipedia before starting

jbellis 2 years ago | |

I'm baffled that so many people fixate on the estimated cost and miss the fact that it's a public dataset. As in, free.

syllogistic 2 years ago | | |

It's in the first sentence of the article too :)

m3kw9 2 years ago | | |

Getting the embeddings ain’t free

Atotalnoob 2 years ago |

Why is the author listing himself as datastax cto?

He isn’t according to Wikipedia, my friend who works there, and their company website. https://www.datastax.com/our-people

That’s kind of weird

jbellis 2 years ago | |

I guess I'm kind of a CTO emeritus now -- I mostly write code, by choice. https://github.com/jbellis

lemarchr 2 years ago | |

See https://www.datastax.com/our-people/jonathan-ellis

Wikipedia lists them as a founder. Perhaps their author bio is outdated, or Wikipedia is. Not sure about your friend.

Atotalnoob 2 years ago | | |

They were definitely a founder, but they are not the current cto

metadat 2 years ago | |

What are you talking about? The datastax site lists it:

> SANTA CLARA, Calif. – September 28, 2020 – DataStax today announced that DataStax Co-Founder and CTO Jonathan Ellis will deliver a keynote address at ApacheCon @Home 2020

https://www.datastax.com/press-release/datastax-co-founder-a....

As an aside, I'm an ApacheCon presenter but there was no press release about the hot excitement of my involvement. Maybe next time :)

Atotalnoob 2 years ago | | |

That’s from 2020. They aren’t the CTO of DataStax currently.