Quickwit 0.8: Indexing and Search at Petabyte Scale (quickwit.io)
I also decided to use Tantivy (the Rust library powering, and written by, Quickwit) for my own bookmarking search tool by embedding it in Elixir, and the API and docs have been quite pleasant to work with. Hats off to the team, looking forward to what's coming next!
[1] https://github.com/signoz/signoz [2] https://signoz.io/blog/logs-performance-benchmark/
https://github.com/openobserve/openobserve/blob/v0.7.0/.env.... is some "onoz" for me, but just recently someone submitted https://github.com/aenix-io/etcd-operator to the CNCF sandbox so maybe things have gotten better around keeping that PoS alive
mind elaborating? we built loki for some pretty massive scale but I've always tried to make it work at super small scale too. what went wrong?
Here is a postgres extension that uses it to provide full text search
PS: it’s tantivy!!!
It's very healthy to take maximum bandwidth limits into consideration when reasoning about performance. For instance, for temporal stores, the bottlenecks you see are due to RAM latency and memory parallelism, because of the write-allocate. The load/store uarch can actually retire way more data from SIMD registers.
So there's already some headroom for CPU-bound tasks. For instance, 11MB/s is very slow for a baseline JIT compiler. But if your particular problem demands arbitrary random access that regularly exceeds L3, maybe that speed is justified.
The largest work we do is building an inverted index. Oversimplified, it is equivalent to this:

    import json
    from collections import defaultdict

    inverted_index = defaultdict(list)
    for doc_id, doc_json in enumerate(doc_jsons):
        doc = json.loads(doc_json)
        for field, field_text in doc.items():
            # tokenize() is a placeholder for any tokenizer, e.g. field_text.split()
            for position, token in enumerate(tokenize(field_text)):
                inverted_index[token].append((doc_id, position))
    serialize_in_compressed_way_that_allows_lookup(inverted_index)

You can implement it in a couple of hours in the language of your choice to get a proper baseline.
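For anyone who wants to try that baseline, here is a minimal runnable sketch of the loop above. It makes assumptions that are not from the comment: lowercasing plus whitespace splitting stands in for a real tokenizer, the `build_inverted_index` and `search` helpers are hypothetical names, and the compressed serialization step is left out entirely.

```python
import json
from collections import defaultdict


def build_inverted_index(doc_jsons):
    """Map each token to a list of (doc_id, position) postings."""
    inverted_index = defaultdict(list)
    for doc_id, doc_json in enumerate(doc_jsons):
        doc = json.loads(doc_json)
        for field, field_text in doc.items():
            # Assumption: lowercase + whitespace split replaces a real tokenizer.
            tokens = str(field_text).lower().split()
            for position, token in enumerate(tokens):
                inverted_index[token].append((doc_id, position))
    return inverted_index


def search(inverted_index, token):
    """Return the sorted ids of documents containing the token."""
    postings = inverted_index.get(token.lower(), [])
    return sorted({doc_id for doc_id, _ in postings})


docs = [
    '{"msg": "disk full on node1"}',
    '{"msg": "node2 restarted"}',
    '{"msg": "disk replaced on node2"}',
]
index = build_inverted_index(docs)
print(search(index, "disk"))
```

Timing `build_inverted_index` over a few hundred MB of your own logs gives a rough MB/s figure to compare against, keeping in mind that this sketch skips compression and writing to disk.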
I am sure we can still improve our indexing throughput... but I have never seen any search engine indexing as fast as tantivy.
If someone knows a project I should know of, I'd be genuinely keen on learning from it.
After all, it does not matter much if a log search query answers in 300ms or 1s. However, there are use cases where a few GB just does not cut it.
The tale saying that you can always prune your dataset using timestamp and tags is simply not always valid.
It is possible to scan NVMe at multiple GB/s. Scans can run in parallel across multiple disks and over compressed data (10 GB of logs ~ 1 GB to scan), and data can be segmented and prefaced with Bloom filters to quickly check whether a segment is worth scanning.
Assuming 3 GB/s per SSD, 10 SSDs, and 10x compression as you suggested, a query to find a string in the text would take 10000 / 3 / 10 / 10 ≈ 33 seconds.
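The same back-of-the-envelope estimate, written out so the inputs can be changed. This sketch assumes the 10000 in the formula is the raw dataset size in GB (the comment does not state the unit), and `scan_seconds` is a hypothetical helper name.

```python
def scan_seconds(raw_gb, ssd_gb_per_s, n_ssds, compression_ratio):
    """Time to scan the whole dataset once, reading all SSDs in parallel."""
    compressed_gb = raw_gb / compression_ratio
    total_bandwidth = ssd_gb_per_s * n_ssds
    return compressed_gb / total_bandwidth


# Assumption: 10000 GB of raw logs, 3 GB/s per SSD, 10 SSDs, 10x compression.
t = scan_seconds(10_000, 3, 10, 10)
print(f"{t:.1f} s")
```

Scan time grows linearly with `raw_gb` here, so a dataset 100x larger takes 100x longer on the same hardware.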
With an index, you can easily get it 100x faster, and that factor gets larger as your data grows.
In general it's just that O(log(n)) wins over O(n) when n gets large.
I didn't take your Bloom filter idea into consideration, as it is not immediately obvious how a Bloom filter can support all the filter operations that an index can. Also, the index gives you the exact position of the match, whereas the Bloom filter only tells you whether a match may exist, so you can still end up with a large read amplification factor: a scan of the whole segment vs. direct random access.
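To make the existence-vs-position point concrete, here is a toy Bloom filter over one segment. Everything in it is illustrative rather than taken from any real engine: the `BloomFilter` class, its sizes, and the `scan_segment` fallback are hypothetical.

```python
import hashlib


class BloomFilter:
    def __init__(self, n_bits=1024, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = 0  # a Python int used as a bitset

    def _positions(self, item):
        # Derive n_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.n_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


def scan_segment(segment, token):
    """After a Bloom hit, every line must still be read to locate matches."""
    return [i for i, line in enumerate(segment) if token in line.split()]


segment = ["error disk full", "user login ok", "error timeout"]
bloom = BloomFilter()
for line in segment:
    for token in line.split():
        bloom.add(token)

if bloom.might_contain("error"):
    print(scan_segment(segment, "error"))
```

The filter can only skip segments that definitely lack the token. Once it says "maybe", the whole segment is read, while an inverted index would point straight at the matching documents and positions.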