B+ Trees and why I love them, part 1

B+ Trees and why I love them, part 1(ayende.com)

97 points by rayvega 12 years ago | 31 comments

gtrubetskoy 12 years ago |

I too "discovered" B-Trees (and B+Trees) recently and wrote about it: http://grisha.org/blog/2013/05/11/relational-database-on-top...

I even implemented a key/value backed SQL database (using Thredis, which is Redis and SQLite) http://grisha.org/blog/2013/05/29/sqlite-db-stored-in-a-redi...

SQLite3 has an awesome C implementation with copious comments if you really want to understand them, BTW.

The most fascinating thing about B-Trees is that the entire data structure is stored in equally sized blocks (pages), and getting a page by number is a very easy operation in block storage, which is what our RAM and hard drives are (even though back then they were tapes and punch cards).

Few people realize that it's not just DB's that use B-Trees, our filesystems are also largely B-Tree based.

It's remarkable how few today's "developers" understand how they work or even know what they are, yet they are so fundamental to everything computers do.

Without B-Trees our present world of computing wouldn't exist as we know it. It could be the coolest data structure ever :)

Flow 12 years ago | |

I've read your first link and I don't get why you talk about "A relational database needs to store rows in order". AFAIK you have to add an "order by" clause to ensure ordering of any kind.

Relational algebra is probably not dependent of ordering for the common operations. Of course, the actual implementation of a relational database might need ordering of keys to be reasonably efficient. Is that what you meant?

Millennium 12 years ago | | |

The question is "what in order" means. The DB needs to store rows in some order that makes sense to it, but that might or might not match whatever order a user might want.

That's where ORDER BY comes in: presenting the data in some order that makes sense to the user. That might be completely different from what's stored in the DB. But the DB needs its own order, in order to be able to access things in a reasonably efficient manner.

gtrubetskoy 12 years ago | | |

It needs to store rows in order to support indexes, so that a row could be found without having to scan the whole table, and that's exactly what B-Trees let you do, and also add or remove rows without having to rewrite the whole table, but only affected blocks.

koverstreet 12 years ago |

It should be noted that textbook B+ trees, like those described, are nowhere near state of the art performance wise. It's possible to do orders of magnitude better.

elbee 12 years ago | |

The B-tree implementations used in a lot of databases have tweaks, but they are surprisingly similar to the textbook descriptions. In general, B-trees are actually a very useful data structure when: a) Your data is much larger than main memory (e.g. terabytes/petabytes of data and gigabytes of RAM). b) You want to support random insert/update/delete, seek, and next/previous key. c) There has to be a reasonable constraint on storage space (i.e. data density/compaction). In those cases the log(n) performance of a b-tree is extremely hard to improve on.

Alternate data structures a useful if: 1. The ratio of ram/disk is more favourable, in which case using an improved in-memory data structure can be more efficient. 2. Not all operations have to be supported. In particular eliminating the next/previous key operations allows something like a hashtable to be used.

koverstreet 12 years ago | | |

There's two reasons a textbook b+ tree is never going to perform very well:

* Binary searches suck

The memory accesses of a binary search look pretty much like complete random access, so prefetching is useless. Once your b+ tree is big enough that most of it won't fit in L2, you've got a tight inner loop that's waiting on DRAM at every single iteration.

This one is at least more or less solvable without changing the rest of the design, but then if you also want to solve the other main issue it gets harder.

* With small btree nodes, the overhead of looking up and locking btree nodes (even assuming they're all cached in memory) kills you.

Trouble is if you make your btree nodes big enough to fix that overhead, they're now so big that rewriting them when you insert a single key is just ridiculous.

So where you end up is to get anywhere near the highest possible performance you're forced into log structured btree nodes. Which actually has its advantages for solving the first problem, but now what you've got is really a hybrid b+ tree/compacting data structure.

Source: bcache author.

leif 12 years ago | | |

Actually, you can beat B-trees pretty handily across the board in exactly the scenario you described (a, b, c).

The log(N) performance of a B-tree is not just extremely hard to improve on (for searches), it's impossible to improve on. The lower bound for searching (in the DAM model) is log(N)/log(B), and B-trees meet that.

But B-trees are also log(N)/log(B) for insertions, which is, it turns out, pretty damn slow. There's an optimal trade-off curve between insertions and queries (the fastest data structure for insertions is to just log them, but that forces all queries to read all the data), and B-trees are on it, but there are many more interesting points on that curve.

There is a family of data structures that matches B-trees' performance on queries while blowing it completely out of the water for insertions. The COLA and Cache-Oblivious Streaming B-tree are where you'll find them in academia, and the implementation I work on has a very marketing-flavored name: the Fractal Tree. All that means is that we took the theory and started implementing it and came up with enough innovations in the implementation that it sort of needed a new name, but it's spiritually the same concept.

I've written a fair amount about this that I'll link to at the end, but here's a brief description so you don't think I'm making this all up.

Basically what we do is we take a B-tree and, on all the internal nodes, we stick large buffers that accumulate messages. Want to insert something? Just stick it in the root's buffer, you don't need to do any I/O to find the proper leaf node it needs to go in (yet). If the buffer's full, flush it down by taking all its messages, sorting them between the root's children, and putting messages in the buffers in the children. This flushing can cascade as you'd imagine, and splits and merges work about the same as in a B-tree.

So now what's the analysis?

Well, the tree has the same shape as a B-tree, so searches have to look at the same log(N) number of nodes (which in practice is almost always just 1 for the leaf node after cache hits on the internal nodes, both for B-trees and for Fractal Trees, but the asymptotic analysis also tells you they have the same query cost).

For insertions though, let's walk through the process. The tree has height log(N), which means that, for an insertion to get "fully inserted", meaning it reaches the leaf node and won't be flushed any more, it has to get flushed down log(N) times. And what's the cost to flush a buffer full of messages? That's just O(1) I/Os, for the parent and children (and in practice it's actually only 2 I/Os because you can just flush to a single child, not all of them). But a buffer flush does work for O(B) messages at once, so the amortized cost to flush a single message down one level is just O(1/B). Do that log(N) times, and the insertion cost is O(log(N)/B), which is in practice around 100x less expensive than the B-tree's O(log(N)/log(B)) insertion cost.

On top of this, while for B-trees you want small leaf nodes because you're going to be reading and writing them all the time, for Fractal Trees, since the goal is to get a lot done with each I/O, you actually want large leaves, on the order of a few megabytes each. This has two nice effects:

1) While range queries on a B-tree can slow down as the tree ages and the leaves start to get randomly placed on disk, range queries on a Fractal Tree stay fast because each time you do a disk seek, you get to read and report a few megabytes' worth of data. This basically solves the "B-tree fragmentation" problem that makes database users run optimize table, or vacuum, or reIndex() or compact() operations like madmen.

2) Compression algorithms (like zlib, our default) can compress large blocks of data much more effectively than they can compress small blocks. So InnoDB, which has small blocks like most B-trees, if you turn on compression, apart from eating CPU as it tries and fails and re-tries to compress your data to fit it into its block size, it'll only get at most about 4x compression. In contrast, TokuDB (our MySQL storage engine using Fractal Trees [4]) routinely gets 10-20x compression without breaking a sweat.

I have some blog posts about this [1] and [2], and our benchmarks page is [3]. We also have a version of MongoDB in which we've replaced all the storage code with Fractal Trees, we call it TokuMX [5]. Don't mind the marketing haze, it's all serious tech under the hood.

[1]: http://www.tokutek.com/2011/09/write-optimization-myths-comp...

[2]: http://www.tokutek.com/2011/10/write-optimization-myths-comp...

[3]: http://www.tokutek.com/resources/benchmarks/

[4]: http://www.tokutek.com/products/tokudb-for-mysql/

[5]: http://www.tokutek.com/products/tokumx-for-mongodb/

awda 12 years ago | |

Please, tell me more.

rodrigoavie 12 years ago |

This is very awesome. Please consider writing about other tree data structures in the future :)

Specially, sofistications on AVL and Red-Black trees