Siberite: A Simple LevelDB-Backed Message Queue in Go (github.com)

77 points by Bogdanovich 10 years ago | 35 comments

eis 10 years ago

Why would you choose an LSM-tree-based storage mechanism for a message queue?

The only reason I can come up with is that it's a ready-to-use library you can just plug in, which gives OK performance and some handy features because you can use the KV store for other things. But it doesn't scale well, and backups with LevelDB are not really easy either (close the DB, copy all files).

Message queues, when they are ordered (at least at the local node/queue level), usually just need some kind of append-only log file. You don't do random reads or writes into the middle of the queue; you only modify the head and tail.

InfluxDB, which despite being a time series DB has write patterns similar to a message queue, learned this the hard way: they first tried an LSM tree database (LevelDB), then switched to a B+tree (BoltDB/LMDB), but that also doesn't scale once the DB gets big and the tree gets deep. They kindly did a nice writeup of their journey: https://influxdb.com/docs/v0.9/concepts/storage_engine.html

Why not keep it simple and use append-only files without complex structure and management?

Check out Kafka for a better storage format for message queues of this kind.

PS: every message queue should first clearly explain what guarantees it provides.

rakoo 10 years ago

An LSM tree is actually a good idea if you think about it.

The R/W patterns for a message queue are simple:

- Messages are key/value

- key is an autoincrementing id

- Writes are at the end, Reads are from the beginning

- Once a message is processed, it's deleted

So in practice this means that the items are written in an append-only fashion, get merged in bigger chunks, and then get progressively deleted. So at higher levels you don't see the huge latencies due to compaction because all records are deleted. Knowing that keys are only incrementing could also lead to a simple optimization: the compaction phase can be a simple concatenation of files.

So you get an append-only system that progressively removes older entries as they are deleted, without resorting to mad science hackery [1]. Why didn't it work for InfluxDB? All I can guess is that individual entries for each series are all mixed together (InfluxDB wants to be able to manage many series with many tags) and older entries are not deleted as frantically, so you get the latencies we all know with compaction and unpredictable reads.

Now, this is purely theoretical, and of course further experimentation is needed to make sure this is correct, but LSM is in my opinion a correct pattern here.

[1] https://gist.github.com/CAFxX/571a1558db9a7b393579

hyc_symas 10 years ago

A queue is the correct pattern for a queue. A tree, of any form, offers no advantage.

The InfluxDB experience is definitely illuminating. Their problems with LMDB were mainly due to misuse of the API. https://disqus.com/home/discussion/influxdb/benchmarking_lev...

For batched sequential writes, there is no other DB anywhere near as fast as LMDB http://symas.com/mdb/microbench/ (Section E, Batched Writes)

But even so - the reason LMDB can do this so quickly is because for batched sequential writes it cheats - it's just performing Appends, there's no complicated tree construction/balancing/splitting of any kind going on.

If you know that your workload will only be producer/consumer, with sequentially generated data that is sequentially consumed, it's a stupid waste of time to mess with any other structure than a pure linear queue. (Or a circular queue, when you know the upper bounds of how much data is outstanding.)
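
For the bounded case, a circular queue can be sketched in a few lines of Go. This is an illustrative in-memory version under the assumption that the maximum number of outstanding messages is known up front; the `Ring` type is a hypothetical name, and a durable variant would map the same indices onto a preallocated file.

```go
package main

import "fmt"

// Ring is a fixed-capacity circular queue: Push writes at the tail,
// Pop reads from the head, and both indices wrap around the buffer.
type Ring struct {
	buf  [][]byte
	head int // index of the oldest message
	size int // number of messages currently stored
}

func NewRing(capacity int) *Ring {
	return &Ring{buf: make([][]byte, capacity)}
}

// Push stores msg at the tail. It reports false when the ring is full.
func (r *Ring) Push(msg []byte) bool {
	if r.size == len(r.buf) {
		return false
	}
	r.buf[(r.head+r.size)%len(r.buf)] = msg
	r.size++
	return true
}

// Pop removes and returns the oldest message. It reports false when empty.
func (r *Ring) Pop() ([]byte, bool) {
	if r.size == 0 {
		return nil, false
	}
	msg := r.buf[r.head]
	r.buf[r.head] = nil // drop the reference so it can be collected
	r.head = (r.head + 1) % len(r.buf)
	r.size--
	return msg, true
}

func main() {
	r := NewRing(2)
	r.Push([]byte("a"))
	r.Push([]byte("b"))
	fmt.Println("push on full ring accepted:", r.Push([]byte("c")))
	msg, _ := r.Pop()
	fmt.Println(string(msg))
}
```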

As for your initial statement - no, an LSM tree is not a correct pattern here. If your consumers are actually running as fast as (or faster than) your producer then it should never flush from Level0/memory to Level1/disk. In that case all you've got is an in-memory queue that evaporates on a system crash.

If your consumers are running slower, that means data is accumulating in the DB, which means you will have compaction delays. And the compaction delays will only get longer over time, as more and more levels need to be merged. (Remember that merge operations are O(N). Then remember that there are N of them to do. O(N^2) is a horrible algorithmic complexity.) LSM is never a correct pattern.

biot 10 years ago

And you can find a fantastic list of questions about queue guarantees/properties here: https://news.ycombinator.com/item?id=8709146

Bogdanovich 10 years ago

Yes, goleveldb was chosen because it's a ready-to-use library with decent write and read performance and no external non-Go dependencies. It can also be used to store multiple consumers' offsets in the future.

Regarding guarantees: with simple 'get work_queue' reads it provides at-most-once delivery. With two-phase reliable reads ('get work_queue/open', 'get work_queue/close') it provides at-least-once delivery. Note, though, that the message is kept in memory on the server during a reliable read and will be lost if you SIGKILL siberite. On SIGTERM and SIGINT, siberite will gracefully abort the read and save the message.
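
The difference between the two read modes can be sketched with an in-memory model in Go. Everything here (the `Queue` type, `Open`, `Close`, `Abort`, the client ids) is a hypothetical illustration of the semantics described above, not Siberite's actual code or wire protocol.

```go
package main

import "fmt"

// Queue is an in-memory stand-in used only to illustrate the two read modes.
type Queue struct {
	items    []string          // pending messages, head first
	inflight map[string]string // client id -> message opened but not yet confirmed
}

func NewQueue() *Queue {
	return &Queue{inflight: make(map[string]string)}
}

func (q *Queue) Put(msg string) { q.items = append(q.items, msg) }

// Get removes and returns the head immediately. If the client crashes after
// this call the message is gone: at-most-once delivery.
func (q *Queue) Get() (string, bool) {
	if len(q.items) == 0 {
		return "", false
	}
	msg := q.items[0]
	q.items = q.items[1:]
	return msg, true
}

// Open hands the head to a client but remembers it until Close or Abort.
func (q *Queue) Open(client string) (string, bool) {
	if msg, ok := q.inflight[client]; ok {
		return msg, true // client already holds an unconfirmed message
	}
	msg, ok := q.Get()
	if ok {
		q.inflight[client] = msg
	}
	return msg, ok
}

// Close confirms the client's open message, so it is never redelivered.
func (q *Queue) Close(client string) { delete(q.inflight, client) }

// Abort (for example on disconnect) puts the unconfirmed message back at the
// head, so another client may see it again: at-least-once delivery.
func (q *Queue) Abort(client string) {
	if msg, ok := q.inflight[client]; ok {
		delete(q.inflight, client)
		q.items = append([]string{msg}, q.items...)
	}
}

func main() {
	q := NewQueue()
	q.Put("job-1")
	q.Put("job-2")

	msg, _ := q.Open("worker-a")
	fmt.Println("worker-a opened", msg)
	q.Abort("worker-a") // worker-a disconnects without confirming

	msg, _ = q.Open("worker-b")
	fmt.Println("worker-b opened", msg)
	q.Close("worker-b")
}
```

Because the `inflight` map lives only in memory in this sketch, a hard kill of the process loses any opened message, which mirrors the SIGKILL caveat.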

dwenzek 10 years ago

I'm puzzled by your mention of consumer offsets.

Either Siberite is a queue system whose purpose is to dispatch each message to one and only one consumer for further processing, and which requires consumers to acknowledge fully processed messages;

or Siberite is a journal system (in the spirit of Kafka) whose purpose is to replay the full log to any consumer asking for it, and which offers consumers a watermark mechanism to keep track of their progress.

In the former case, the queue system is responsible for deciding what to do when an acknowledgement is missing or late (choosing between "at least once" and "at most once" delivery). In the latter case, the consumers are responsible for maintaining an atomic view of message consumption and message processing (for instance, using a transaction to persist an offset together with a state).

biokoda 10 years ago

Why is it the queue's responsibility to store consumer offsets? The consumer is the only side that knows how far along its processing is. Why would the queue store this data when all the consumer has to do is tell it: send me events for topic X from point P forward?
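
That interaction could look roughly like this in Go. It is an in-memory sketch with hypothetical names (`Log`, `ReadFrom`, `Consumer`); the only thing the log side offers is "give me everything in topic X from position P".

```go
package main

import "fmt"

// Log keeps every message per topic and never tracks who has read what.
type Log struct {
	topics map[string][]string
}

func NewLog() *Log { return &Log{topics: make(map[string][]string)} }

func (l *Log) Append(topic, msg string) {
	l.topics[topic] = append(l.topics[topic], msg)
}

// ReadFrom returns all messages in topic starting at position from.
func (l *Log) ReadFrom(topic string, from int) []string {
	msgs := l.topics[topic]
	if from < 0 || from >= len(msgs) {
		return nil
	}
	return msgs[from:]
}

// Consumer owns its offset; the log knows nothing about it.
type Consumer struct {
	Topic  string
	Offset int
}

// Poll fetches everything past the consumer's offset and advances it.
func (c *Consumer) Poll(l *Log) []string {
	msgs := l.ReadFrom(c.Topic, c.Offset)
	c.Offset += len(msgs)
	return msgs
}

func main() {
	l := NewLog()
	l.Append("events", "e0")
	l.Append("events", "e1")

	fast := &Consumer{Topic: "events"}
	fmt.Println(fast.Poll(l))

	l.Append("events", "e2")
	late := &Consumer{Topic: "events", Offset: 1} // joins later, replays from 1
	fmt.Println(fast.Poll(l), late.Poll(l))
}
```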

krat0sprakhar 10 years ago

This couldn't have come at a better time - I was actually looking for a durable message queue written in Go. Is there any way to read more about the architecture of this system? I find systems like these quite fascinating, but going through the code can sometimes be very time-consuming. It would be awesome if more projects had a writeup as detailed as cockroachdb's [0]!

Aside: there used to be a site some time back that distributed compiled binaries of Go code for all platforms. Is it still up, by any chance?

[0] - https://github.com/cockroachdb/cockroach#architecture

justinsaccount 10 years ago

http://nsq.io/

http://nsq.io/overview/internals.html

The service you are thinking of might be https://github.com/ddollar/godist

You can see it in use in the download links on https://github.com/ddollar/forego.

Bogdanovich 10 years ago

It's really simple. Each queue is a separate leveldb database on disk. Messages are stored as key/value using incremental ids. Head and tail of the queue are kept in memory and get initialized on startup via db scan.
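
A rough sketch of that layout in Go, with a plain map standing in for the ordered LevelDB database so the example runs without dependencies. The `KV` and `Queue` types, the 8-byte big-endian key encoding, and the `Open` scan are assumptions for illustration, not Siberite's actual code.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"sort"
)

// KV is a stand-in for an ordered key/value store such as LevelDB.
type KV map[string][]byte

// key encodes an id as 8 big-endian bytes, so that bytewise key order
// (the order an LSM store iterates in) matches numeric id order.
func key(id uint64) string {
	var b [8]byte
	binary.BigEndian.PutUint64(b[:], id)
	return string(b[:])
}

type Queue struct {
	db   KV
	head uint64 // id of the oldest stored message
	tail uint64 // id the next message will get
}

// Open recovers head and tail by scanning the keys once on startup.
func Open(db KV) *Queue {
	q := &Queue{db: db}
	keys := make([]string, 0, len(db))
	for k := range db {
		keys = append(keys, k)
	}
	sort.Strings(keys) // a real ordered store would seek to first and last
	if len(keys) > 0 {
		q.head = binary.BigEndian.Uint64([]byte(keys[0]))
		q.tail = binary.BigEndian.Uint64([]byte(keys[len(keys)-1])) + 1
	}
	return q
}

func (q *Queue) Push(msg []byte) {
	q.db[key(q.tail)] = msg
	q.tail++
}

func (q *Queue) Pop() ([]byte, bool) {
	if q.head == q.tail {
		return nil, false
	}
	k := key(q.head)
	msg := q.db[k]
	delete(q.db, k) // consumed messages are deleted from the store
	q.head++
	return msg, true
}

func main() {
	db := KV{}
	q := Open(db)
	q.Push([]byte("a"))
	q.Push([]byte("b"))
	q.Pop()

	restarted := Open(db) // simulate a restart: head and tail come from the scan
	fmt.Println(restarted.head, restarted.tail)
}
```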

dave_ops 10 years ago

Also, you have to be paying one hell of a compaction penalty if this isn't a grow-only dataset. By ordering your keys you're at least minimizing the overhead of compaction on write by utilizing the happy-path for how LevelDB moves data out of the write buffer and into the SSTs.

But deletes are going to have a big impact still, and (working from my failing memory of LevelDB internals) I think might actually be the pathologically sad case.

dave_ops 10 years ago

Why don't you just store the head and tail as K/V entries? You have a durable K/V store at your disposal.
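
A minimal sketch of that suggestion in Go, again with a map standing in for the store. The `meta/head` and `meta/tail` key names and the `msg/` prefix are made up for this example; with a real store, the message write and the counter update would presumably go into one atomic batch.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// KV is a stand-in for a durable key/value store.
type KV map[string][]byte

func getUint64(db KV, k string) uint64 {
	v, ok := db[k]
	if !ok {
		return 0 // counters start at zero on a fresh database
	}
	return binary.BigEndian.Uint64(v)
}

func putUint64(db KV, k string, v uint64) {
	b := make([]byte, 8)
	binary.BigEndian.PutUint64(b, v)
	db[k] = b
}

// Queue keeps no state of its own: head and tail live in the store.
type Queue struct{ db KV }

func msgKey(id uint64) string { return fmt.Sprintf("msg/%020d", id) }

func (q Queue) Push(msg []byte) {
	tail := getUint64(q.db, "meta/tail")
	q.db[msgKey(tail)] = msg
	putUint64(q.db, "meta/tail", tail+1)
}

func (q Queue) Pop() ([]byte, bool) {
	head := getUint64(q.db, "meta/head")
	if head == getUint64(q.db, "meta/tail") {
		return nil, false
	}
	msg := q.db[msgKey(head)]
	delete(q.db, msgKey(head))
	putUint64(q.db, "meta/head", head+1)
	return msg, true
}

func main() {
	db := KV{}
	Queue{db}.Push([]byte("a"))
	Queue{db}.Pop()

	// A "restart" needs no scan: the counters are read straight from the
	// store, and ids keep increasing even though the queue was drained.
	Queue{db}.Push([]byte("b"))
	fmt.Println(getUint64(db, "meta/head"), getUint64(db, "meta/tail"))
}
```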

xrstf 10 years ago

Sounds interesting. For my use cases, which require few (< 10) messages/sec and no clustering, would I gain anything by using Siberite over Beanstalk?

Bogdanovich 10 years ago

You can have large queue sizes (larger than RAM) and siberite will still consume only a small amount of resident memory. You basically don't need a separate server with a decent amount of memory for it. You also benefit from two-phase reliable fetch: if your client gets disconnected without confirming a message, the message will be served to another client (very convenient if you use Amazon spot instances for your workers).

eis 10 years ago

Note that this also means that messages can be delivered more than once and/or that the clients need to remember the messages that they processed. In some setups that can be a showstopper.
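
One common way for a client to cope with redelivery is to remember which messages it has already handled. The Go sketch below assumes every message carries a unique id that the client can see, which is an assumption for illustration and not something the thread says Siberite provides; the names are hypothetical, and a real client would need to persist the set and bound its size.

```go
package main

import "fmt"

// Message assumes a unique ID is available to the client (hypothetical).
type Message struct {
	ID   uint64
	Body string
}

// Consumer remembers handled ids so a redelivered message is skipped.
type Consumer struct {
	seen map[uint64]bool
	Done []string // bodies processed, in order
}

func NewConsumer() *Consumer {
	return &Consumer{seen: make(map[uint64]bool)}
}

// Handle processes m at most once. It reports false for a duplicate.
func (c *Consumer) Handle(m Message) bool {
	if c.seen[m.ID] {
		return false
	}
	c.seen[m.ID] = true
	c.Done = append(c.Done, m.Body)
	return true
}

func main() {
	c := NewConsumer()
	deliveries := []Message{{1, "charge card"}, {2, "send email"}, {1, "charge card"}}
	for _, m := range deliveries {
		fmt.Println(m.ID, "processed:", c.Handle(m))
	}
}
```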

clumsysmurf 10 years ago

Can you describe how the queue was represented as key/value?

Bogdanovich 10 years ago

Yes, as id/value with an autoincrementing key. Head and tail ids are kept in memory and get initialized on startup via a leveldb database scan.