Sparkey – Key/value storage by Spotify(github.com) Sparkey is a simple constant key/value storage library. It is mostly suited for read heavy systems with infrequent large bulk inserts. |
Sparkey – Key/value storage by Spotify(github.com) Sparkey is a simple constant key/value storage library. It is mostly suited for read heavy systems with infrequent large bulk inserts. |
I always like it when a company focus on their real problems in job interviews and manage to avoid the brain teaser trap. That way you can have a feeling about the job that you are going to work on there, and see if you really like both work and people.
It is really closely aligned to what our core service is (distributing and streaming files) and is a great chance to talk with the interviewee and figure out where their strengths are.
Totally encourage other people to interview this way. It's what I've done at the past few companies I've been at and really worked excellently — just think of a problem you're working or have worked on, and distill it into an interview problem.
[1]: https://github.com/StefanKarpinski/bam [2]: http://cmph.sourceforge.net/
That said, I intend to publish some sort of performance comparison code / results. The downside with me doing it is that: 1) I know the sparkey code much better than I know level-db or any other solution, so the tuning parameters will probably be suboptimal for the other solutions. 2) I will only focus on our specific usecase (write large bulks, do lots of random reads), which may seem a bit unfair to the most general solutions.
The sparkey usage is fairly optimized, but I just randomly put something together for the level-db, so consider the results extremely biased.
How does your test compare to http://symas.com/mdb/microbench/ ? If you're going to try to talk about numbers, talk about them in a meaningful context. Right now you're just handwaving.
On monday I added some slightly more proper benchmark code, you can find it on https://github.com/spotify/sparkey/blob/master/src/bench.c
I didn't add the level-db code to this benchmark however, since I 1) didn't want to manage that dependency 2) didn't know how to write optimized code for usage of it.
I'm using very small records, a couple of bytes of key and value. The insert order is strictly increasing (key_0, key_1, ...), though that doesn't really matter for sparkey since it uses a hash for lookup instead of ordered lists or trees.
As for the symas mdb microbench, I only looked at it briefly but it seems like it's not actually reading the value it's fetching, only doing the lookup of where it actually is, is that correct?
"MDB's zero-memcpy reads mean its read rate is essentially independent of the size of the data items being fetched; it is only affected by the total number of keys in the database."
Doing a lookup and not using the values seems like a very unrealistic usecase.
Here's the part of the benchmark I'm referring to: for (int i = 0; i < reads_; i++) { const int k = rand_.Next() % reads_; key.mv_size = snprintf(ckey, sizeof(ckey), "%016d", k); mdb_cursor_get(cursor, &key, &data, MDB_SET); FinishedSingleOp(); }
leveldb on the other hand, supports concurrent writes and provides features to handle data consistency and cheap gradual reindexing.
A quick glance over LevelDB's features gives me the impression that its bulk-write performance would not be sufficient.
What problem are you solving?
If existing solutions existed what hurdles did you face with them and how did you overcome them with your custom solution?
How do you compare from a performance view? (granted they still need to do this, but at least put in a section about it)
Not only that but it provides lightning-fast conjunctive normal form queries, a.k.a logical combinations of primitive keys. Plus it has Python / Erlang bindings.
The command line argument processing is also quite haphazardly done, it's not like it using getopt or whatever that poses compatibility issues. Is writing and packaging with a Makefile that difficult?
I think it's a miracle they produced something they feel comfortable sharing with the world. If you write a database in house, and the tool chain and the argument processing are the only things done haphazardly, then hats off to you :)
I wrote something like this to optimize disk seeks heavily by returning a reference of 8 byte and keeping a hashtable in memory. A mostly-append only records store that allowing mutations of same key and by rounding size of blobs by power of 2. Written to optimize storage layer for Membase.
If you're interested in the internals of LevelDB, I strongly recommend watching this talk http://www.youtube.com/watch?v=vo88IdglU_8 (slides here https://speakerdeck.com/basho/optimizing-leveldb-for-perform...)
http://symas.com/mdb/memcache/ and http://symas.com/mdb/hyperdex/ give results where the records are transmitted across the network.
In any case, I think the easiest way to get a fair benchmark is to at least iterate over the value, possibly also compare it. If that time turns out to be significant (perhaps even dominant) compared to the actual lookup time, then further optimization of the actual storage layer is pretty meaningless.
This was run on my Dell M4400 laptop, Intel Q9300 2.53GHz quadcore CPU, 8GB RAM. The maximum DB size is around 4GB so this is a purely in-memory test. Your hash lookup is faster than the B+tree, but with compression you lose the advantage.
I am not sure why you changed the key format to "key_%09d" - is that an optimization for lmdb, to make sure the insertion order is the same as the internal tree ordering? If so, why is that needed for the benchmark?
I noticed that the wall time and cpu time for the sparkey 100M benchmarks were a bit disjoint, it would seem that your OS was evicting many pages or was stalling on disk writes. The Sparkey files were slightly larger than 4 GB while lmdb was slightly smaller, but I am not sure that really explains it on an 8 GB machine.
I am not sure I agree about the non-linear creation time difference, the benchmarks indicate that both sparkey and lmdb are non-linear. The sparkey creation throughput went from 1206357.25 to 1109604.25 (-8.0%) while lmdb's went from 2137678.50 to 2033329.88 (-4.8%)
Regarding the lookup performance "dropping off a cliff", I think that is related to the large difference in wall time vs cpu time, which indicates a lot of page cache misses.
lmdb seems really interesting for large data sets, but I think it's optimized for different use cases. I'd be curious to see how it behaves with more randomized keys and insertion order. I didn't think of doing that in the benchmark since sparkey isn't really affected by it, but it makes sense for when benchmarking a b-tree implementation.
Sparkey is optimized for our use case where we mlock the entire index file to guarantee cache hits, and possibly also mlock the log file, depending on how large it is.
The way you append stuff to sparkey (first fill up a log, then build a hash table as a finalization) is really useful when you need to use lots of memory while building and can't affort random seek file operations, and in the end when most of the work is done and your memory is free again, finalize the database. Of course, you could do the same thing with lmdb, first writing a log and then converting that into a lmdb file.
Thanks for taking the time to adapt the benchmark code to lmdb, it's been very interesting.
Still don't understand what happened to sparkey at 100M. The same thing happens using snappy, and the compressed filesize is much smaller than LMDB's, so it can't be pagecache exhaustion.
Also suspicious of the actual time measurements. Both of these programs are single-threaded so there's no way the CPU time measurement should be greater than the wall-clock time. I may take a run at using getrusage and gettimeofday instead, these clock_gettime results look flaky.
Also, I think it would be more interesting to see a comparison with lmdb using random writes instead of sequential.
As for the cpu time measurement, the wallclock is very inprecise, so it could be some small quantum larger than cpu time, but it should never be more than the system specific wall clock quantum.
Sparkey's lookup performance drops off a cliff at 100M elements. This doesn't seem to be related to raw size because it occurs regardless of compression. LMDB's performance degrades logarithmically, as expected of an O(logN) algorithm.
Hashing is inherently cache-unfriendly, and hashes are inherently wasteful - hash tables only perform well when they're mostly empty. They're completely hopeless when scaling to large datasets.