Vector indexing all of Wikipedia on a laptop (foojay.io)
It's interesting to note that JVector accomplishes this differently than how DiskANN described doing it. My understanding (based on the links below, but I didn't read the full diff in #244) is that JVector will incrementally compress the vectors it is using to construct the index; whereas DiskANN described partitioning the vectors into subsets small enough that indexes can be built in-memory using uncompressed vectors, building those indexes independently, and then merging the results into one larger index.
OP, have you done any quality comparisons between an index built with JVector using the PQ approach (small RAM machine) vs. an index built with JVector using the raw vectors during construction (big RAM machine)? I'd be curious to understand what this technique's impact is on the final search results.
I'd also be interested to know if any other vector stores support building indexes in limited memory using the partition-then-merge approach described by DiskANN.
Finally, it's been a while since I looked at this stuff, so if I mis-wrote or mis-understood please correct me!
- DiskANN: https://dl.acm.org/doi/10.5555/3454287.3455520
- Anisotropic Vector Quantization (PQ Compression): https://arxiv.org/abs/1908.10396
- JVector/#168: How to support building larger-than-memory indexes https://github.com/jbellis/jvector/issues/168
- JVector/#244: Build indexes using compressed vectors https://github.com/jbellis/jvector/pull/244
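To make the partition-then-merge idea concrete, here is a toy sketch as I understand it from the DiskANN description: assign each vector to its nearest few partition centers, build a graph per shard with only that shard in memory, then union the edge lists. The brute-force kNN graph is a stand-in for a real graph index, and the names (`build_partitioned_graph`, `overlap`, `degree`) are made up for illustration; this is not DiskANN's or JVector's code.

```python
import math


def nearest(point, candidates, k):
    """Return indices of the k candidates closest to point (Euclidean)."""
    order = sorted(range(len(candidates)),
                   key=lambda i: math.dist(point, candidates[i]))
    return order[:k]


def build_partitioned_graph(vectors, centers, overlap=2, degree=3):
    """Toy partition-then-merge build (hypothetical helper).

    Each vector goes into the shards of its `overlap` nearest centers, a
    brute-force kNN graph is built per shard, and the per-shard edge lists
    are merged by taking their union.
    """
    shards = [[] for _ in centers]
    for vid, vec in enumerate(vectors):
        for c in nearest(vec, centers, overlap):
            shards[c].append(vid)

    merged = {vid: set() for vid in range(len(vectors))}
    for members in shards:
        # Only this shard's vectors need to be resident at this point.
        for vid in members:
            others = [m for m in members if m != vid]
            local = [vectors[m] for m in others]
            for j in nearest(vectors[vid], local, degree):
                merged[vid].add(others[j])
    return shards, merged
```

Because every vector lands in `overlap` shards, the per-shard work is roughly `overlap` times the single-pass work, before any merge cost.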
One interesting property in benchmarking is that the distance comparison implementations for full-dim vectors can often be more efficient than those for PQ-compressed vectors (straight-line SIMD execution vs table lookups), so on some systems cluster-and-merge is relatively competitive in terms of build performance.
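A rough sketch of why the two distance paths look so different. The full-dimension distance is one straight loop over the components, while the PQ distance is one table lookup per subspace after precomputing query-to-centroid tables. This is plain product quantization with hand-written codebooks, purely for illustration; the function names are made up and it is not JVector's implementation.

```python
import math


def split(vec, m):
    """Split vec into m equal-length subvectors."""
    step = len(vec) // m
    return [vec[i * step:(i + 1) * step] for i in range(m)]


def encode(vec, codebooks):
    """Replace each subvector with the index of its nearest centroid."""
    return [
        min(range(len(book)), key=lambda c: math.dist(sub, book[c]))
        for sub, book in zip(split(vec, len(codebooks)), codebooks)
    ]


def distance_tables(query, codebooks):
    """Precompute squared distances from each query subvector to every centroid."""
    return [
        [sum((q - c) ** 2 for q, c in zip(sub, centroid)) for centroid in book]
        for sub, book in zip(split(query, len(codebooks)), codebooks)
    ]


def pq_distance(code, tables):
    """Approximate squared distance: one table lookup per subspace."""
    return sum(table[idx] for table, idx in zip(tables, code))


def exact_distance(a, b):
    """Full-dimension squared distance: a straight loop over the components."""
    return sum((x - y) ** 2 for x, y in zip(a, b))
```

The exact loop maps naturally onto SIMD lanes; the lookup path is data-dependent indexing, which is the part that tends to be harder to vectorize.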
I've tested the build-with-compression approach used here with all the datasets in JVector's Bench [1] and there's near zero loss in accuracy.
I suspect that the reason the DiskANN authors used the approach they did is that in 2019 Deep1B was about the only very large public dataset around, and since the vectors themselves are small your edge lists end up dominating your memory usage. So they came up with a clever solution, at the cost of making construction 2.5x as expensive. (Educated guess: 2x is from adding each vector to multiple partitions and the extra 50% to merge the results.)
So JVector is just keeping edge lists in memory today. When that becomes a bottleneck we may need to do something similar to DiskANN but I'm hoping we can do better because it's frankly a little inelegant.
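The point about edge lists dominating is easy to check with back-of-the-envelope arithmetic. A minimal sketch, where every parameter is an assumption for illustration: 4-byte neighbor ids, a fixed maximum degree, and an optional per-vector PQ code size. None of these numbers come from JVector or the DiskANN paper.

```python
def build_memory_bytes(num_vectors, dims, max_degree,
                       bytes_per_component=4, bytes_per_neighbor_id=4,
                       pq_bytes=None):
    """Estimate memory for vectors and edge lists during a graph build.

    If pq_bytes is given, vectors are counted at their compressed size;
    otherwise at dims * bytes_per_component. All sizes are illustrative.
    """
    per_vector = pq_bytes if pq_bytes is not None else dims * bytes_per_component
    return {
        "vectors": num_vectors * per_vector,
        "edges": num_vectors * max_degree * bytes_per_neighbor_id,
    }


# Hypothetical billion-vector build with small vectors:
raw = build_memory_bytes(1_000_000_000, dims=96, max_degree=64)
compressed = build_memory_bytes(1_000_000_000, dims=96, max_degree=64, pq_bytes=32)
```

With these made-up parameters the compressed vectors take 32 GB while the edge lists still take 256 GB, so compressing the vectors alone does not get a billion-scale build into laptop memory.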
[1] https://github.com/jbellis/jvector/blob/main/jvector-example...
A tool that (hopefully) surfaces interesting HN discussion threads; I wanted an excuse to investigate (hybrid) full text and vector search at a substantial scale beyond toy datasets.
Sadly (well not really) I changed jobs soon after building the first version. Life caught up and I never got around to adding more features and polishing up the frontend (e.g. the broken back button).
Ideas for new features are very welcome :)
GH Project: https://github.com/jbellis/jvector
The source to build and serve the index are at https://github.com/jbellis/coherepedia-jvector
How is a log N search over S segments O(N)?
To be more correct it’s O(N/C log C) where C is the capacity of a segment. In this case you can ignore 1/C and log C as constants. So sure, you actually just have O(N). But this is not super useful, as it says that a segmented HNSW approach and a brute-force approach are the same, when this is really not the case in practice.
Also O(N log N) > O(N) so I’m not sure why we would ever do anything with segmentation according to that analysis if it were correct.
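A quick cost model makes the O(N/C log C) point easier to see. This is only a sketch: it counts idealized comparisons, treats every segment as full, and assumes a graph search costs log2 of the segment size, which real HNSW only approximates.

```python
import math


def segmented_cost(n, capacity):
    """Idealized comparisons to search every segment of a segmented index."""
    segments = math.ceil(n / capacity)
    return segments * math.log2(capacity)


def single_index_cost(n):
    """Idealized comparisons to search one index over all n vectors."""
    return math.log2(n)


def brute_force_cost(n):
    """Comparisons for a linear scan."""
    return n


n, capacity = 2 ** 20, 2 ** 10
costs = (single_index_cost(n), segmented_cost(n, capacity), brute_force_cost(n))
```

Doubling `n` doubles the segmented cost, so it is linear in N, but the constant is log C / C rather than 1, which is why it still beats a brute-force scan by a wide margin in practice.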
I have a few projects I'd like to work on. For typical web projects, I have a "go to" stack and I'd like to add something sensible for vector based search to that.
[article author, I work on JVector and Astra]
Can someone explain why?
https://arxiv.org/abs/2004.12832
https://thenewstack.io/overcoming-the-limits-of-rag-with-col...
Are there laptops like that? Maybe an upgraded MacBook, but I have been looking for Windows/Linux laptops and they generally top out at 32GB. I checked Lenovo's website and everything with 64GB and up is not called a laptop but a "mobile workstation".
This is an indication to me that something has gone very wrong in your code base.
He isn’t, according to Wikipedia, my friend who works there, and their company website. https://www.datastax.com/our-people
That’s kind of weird
Wikipedia lists them as a founder. Perhaps their author bio is outdated, or Wikipedia is. Not sure about your friend.
> SANTA CLARA, Calif. – September 28, 2020 – DataStax today announced that DataStax Co-Founder and CTO Jonathan Ellis will deliver a keynote address at ApacheCon @Home 2020
https://www.datastax.com/press-release/datastax-co-founder-a....
As an aside, I'm an ApacheCon presenter but there was no press release about the hot excitement of my involvement. Maybe next time :)
It's insane to me that someone, this early in the gold rush, would be mining in someone else's mine, so to speak
As a first step they are using PQ anyways. It seems natural to just assume all English docs have the same centroid and search that subspace with hnswlib.
Are there some benchmarks available that compare it with the openai model?
I'm not sure on what planet all of these people here live that they have success with Linux swap. It's been broken for me forever and the first thing I do is disable it everywhere.
echo y >/sys/kernel/mm/lru_gen/enabled
TBH this was sloppy on my part. I tested multiple runs of the index build and early on kswapd was super busy. I assumed Linux was just caching recently read parts of the source dataset, but it's also possible it was something external to the index build since it's my daily driver machine. After I turned off swap I had no issues and didn't look into it harder.
Edit: see for instance https://insights.oetiker.ch/linux/fadvise.html
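Going by the fadvise in that link, one way to keep a big sequential read from crowding the page cache is to tell the kernel you are done with each chunk. A minimal sketch using Python's `os.posix_fadvise`; the helper name `read_without_caching` and the chunk size are my own, and whether this would have helped the index build discussed here is an untested assumption.

```python
import os


def read_without_caching(path, chunk_size=1 << 20):
    """Read a file sequentially, hinting the kernel to drop each chunk afterwards.

    Returns the number of bytes read. The advice is only a hint; on
    platforms without posix_fadvise the file is read normally.
    """
    total = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        offset = 0
        while True:
            chunk = os.read(fd, chunk_size)
            if not chunk:
                break
            total += len(chunk)  # a real build would process the chunk here
            if hasattr(os, "posix_fadvise"):
                try:
                    os.posix_fadvise(fd, offset, len(chunk),
                                     os.POSIX_FADV_DONTNEED)
                except OSError:
                    pass  # the hint is best-effort
            offset += len(chunk)
    finally:
        os.close(fd)
    return total
```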
I'll let my personal laptop swap, though. Especially if my wife is also logged in and has tons of idle stuff open.
I suspect they just need to pass -Xmx to the JVM to avoid this.
If you're referring to full GC you can configure how often that happens and by default it doesn't just wait until memory is nearly full.
There has to be more interesting things to discuss than this.
P.s. I think you meant 2020.
I haven't seen any news that indicates this has changed, but by all means give it a try!
What's your alternative when you can't build an index larger than C?
I don't know how Linux does this in particular, but intuitively swapping can make sense if part of your allocated RAM isn't being accessed often and the disk is. The kernel isn't going to know for sure of course, and it seems that in my case it guessed wrong.
It is much better to swap out anonymous memory that isn't being used than to flush file data that is being used. Or, another way of looking at it: if you run out of memory, you will thrash with or without swap enabled. The difference is that with swap the kernel can evict the most rarely accessed memory, whether file-backed or anonymous. Without swap the kernel is forced to consider only file-backed memory, even if it is being used more frequently than the anonymous memory.
In theory this would be an efficiency boost but the performance math can be tricky.
This is a coarse way to tackle that
Otherwise, JNA is probably the easiest way, and how Cassandra does it.
https://docs.oracle.com/en/java/javase/21/core/calling-c-lib...
There are a lot of related sources of similarity, but they’re slightly different. And I have no idea what Cohere is doing. Additionally, it’s not clear to me how queries can and should be embedded. Queries are typically much shorter than their associated documents, so they typically need to be trained jointly.
Selling “embeddings as a service” is a bit like selling hashing as a service. There are a lot of different hash functions. Cryptographic hashes, locality sensitive hashes, hashes for checksum, etc.
Are there other semantic search systems? What happened to the entire field of Information Retrieval - is vector search the only method? Are all the stemming, linguistic analysis, all that - all obsoleted by vectors?
Or is it purely because vector search is quick? That's just an engineering problem. I'm not convinced it's the only method here. Happy to be corrected!
My sense is that you can currently break the whole thing down into two groups: the proverbial grownups in the room are typically building pipelines that are still doing it basically how the top-performing systems did in the '90s, with a souped up keyword and metadata search engine for the initial pass and an embedding model for catching some stuff it misses and/or result ranking. This isn't how most general-purpose search engines work, but it's likely how the ones you don't particularly mind using work. Web search, for example.
And then there's the proverbial internet comments section, which wants to skip past all the boring labor-intensive oldschool stuff, and instead just begin and end with approximate nearest neighbors search using an off-the-shelf embedding model. The primary advantage to this approach - and I should admit here that I've tried it myself - is that you can bodge it together over a weekend and have the blog post up by Monday.
I guess what I'm getting at is, the people producing content on the Internet and the people producing effective software aren't necessarily the same people. I mean, heck, look at me, I'm only here to type this comment because I'm slacking off at work today.
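For the first kind of pipeline, the shape is roughly "cheap keyword pass, then rerank the survivors by embedding similarity". A toy sketch under heavy assumptions: the keyword pass is a simple term-overlap count, the embeddings are hand-written 2-D vectors, and `hybrid_search` is a made-up name. It only covers the reranking half, not using embeddings to catch what the keyword pass misses.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_search(query_terms, query_vec, docs, first_pass_size=10, top_k=3):
    """docs is a list of (doc_id, text, embedding) tuples.

    First pass: keep documents containing at least one query term, ordered
    by how many distinct terms they contain. Second pass: rerank those
    candidates by cosine similarity to the query embedding.
    """
    scored = []
    for doc_id, text, emb in docs:
        words = set(text.lower().split())
        hits = sum(1 for term in query_terms if term.lower() in words)
        if hits:
            scored.append((hits, doc_id, emb))
    scored.sort(key=lambda row: row[0], reverse=True)
    candidates = scored[:first_pass_size]
    reranked = sorted(candidates, key=lambda row: cosine(query_vec, row[2]),
                      reverse=True)
    return [doc_id for _, doc_id, _ in reranked[:top_k]]
```

Note that a document with no keyword hit never reaches the reranker, however close its embedding is. That is the trade-off the ANN-only approach avoids and the hybrid approach accepts.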
1: https://www.oreilly.com/radar/what-we-learned-from-a-year-of...
Not a semantic search but stemming + BM25 often works surprisingly well and is a fast and cheap baseline.
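For anyone who hasn't seen it, the baseline fits in a page. A minimal sketch of BM25 scoring with a deliberately crude suffix-stripping stemmer standing in for a real one such as Porter; `crude_stem`, the whitespace tokenizer, and the defaults `k1=1.5`, `b=0.75` are illustrative choices, not a recommendation.

```python
import math
from collections import Counter


def crude_stem(word):
    """Strip a few common English suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def tokenize(text):
    """Lowercase, split on whitespace, and stem each token."""
    return [crude_stem(word) for word in text.lower().split()]


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Return one BM25 score per document for the given query string."""
    docs = [tokenize(d) for d in documents]
    avg_len = sum(len(d) for d in docs) / len(docs)
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            n = doc_freq[term]
            idf = math.log(1 + (len(docs) - n + 0.5) / (n + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

Thanks to the stemming, a query like "vector indexes" matches documents that say "indexing vectors" or "vector index", and the length normalization favors the shorter of two otherwise equal matches.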
There is more information here, though: https://cohere.com/blog/introducing-embed-v3
and there are some standard hash functions in the lib, which cover 98% of use cases. I think the same is true of embeddings: you can train some foundational multitask model, and the embeddings will work for a variety of tasks too.
I have no association with Cohere, but their docs clearly say that their embeddings were trained so that two similar vectors have similar "semantic meaning". Which is still pretty vague, but it's at least clear what their goals were.
> Selling “embeddings as a service” is a bit like selling hashing as a service.
Coincidentally, Cohere also aggressively advertises that they want you to fine-tune and co-develop custom models (with their proprietary services).
That said, first guess, if you do want to evaluate Cohere embeddings for a commercial application, using this dataset could be a decent basis for a lower-cost spike.
Would Digital Ocean or Hetzner meet your needs?
Wikipedia has a lot of tables so I was wondering if content-aware sentence chunking would be good enough for Wikipedia.
What I wonder though is - we've been a year and a half into the LLM craze and we still don't see a really good information processing system for them. Yes, there's chatbots, some that let you throw in images and PDFs.
But what we need is more like a ground-up rethink of these UIs. We need to invent the "desktop" of LLMs.
But the keys here, I think, are that
a) the LLMs are only part of the solution. A chat interface is immature and not enough.
b) external information is brought in by the user, and augmented by a universe of knowledge given by the provider
c) being overly general is probably a trap. Yes, LLMs can talk about everything - but why not solve a concrete vertical?
Semantic search helps with a part of this, but is just one component.
You can even see some of this play out a bit over the course of the web's nearly 30 year history. 20 years ago, informational websites tended to be brief, highly structured, and minimally chatty. Nowadays, people produce walls of text that you have to dig through to find the actual content. Why the change? Search engine optimization. Which I'd argue is an example of essentially the same folks who give us AI basically dragging us back to a world where natural language dominates. Not because it's actually better for anyone, but because it's what they can more easily build a one-size-fits-all algorithm around.
But we clearly have an ouroboros situation. If publishers lose views, they lose money and the ability to craft good information. Less new info to incorporate into LLMs.
LLM training over the internet corpus has really been a massive heist. Pulling the wool over publishers' eyes, undercutting their business, hoarding the information.
But it's really unavoidable at this point. Everything has been democratized: compute on cloud platforms, data via Common Crawl, OSS algorithms and tool-kits. No one can put a stop to this, and there's powerful economic incentives to actually get some benefit out of the hundreds of billions that have been poured in already.