The article mentions that with AOF persistence there is a problem with fsync. I'll try to go into more detail here.
Basically when using Redis AOF you can select among three levels of fsync: 'fsync always', which calls fsync after every command, before returning the OK message to the client. This gives bank-like assurance that data was written to disk, but it is very, very slow. Not what most users want.
Then there is 'fsync everysec' that just calls fsync every second. This is what most users want. And finally 'fsync never' that will let the OS decide when to flush things to disk. With the Linux default config, writing buffers to disk can be delayed up to 30 seconds.
So with fsync never, there are no problems, everything will be super fast, but durability is not great.
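To make the three levels concrete, here is a minimal sketch of an append-only log with a selectable fsync policy. It is not Redis's implementation: `AppendLog`, `policy`, and `fsync_calls` are names invented for illustration, and fsync is called inline rather than from a separate thread to keep the example short.

```python
import os
import time


class AppendLog:
    """Toy append-only log; 'policy' mirrors the three levels described above."""

    def __init__(self, path, policy="everysec"):
        if policy not in ("always", "everysec", "never"):
            raise ValueError("unknown fsync policy: %r" % policy)
        self.policy = policy
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        self.last_fsync = time.monotonic()
        self.fsync_calls = 0  # hypothetical counter, only here to observe behaviour

    def append(self, record: bytes):
        os.write(self.fd, record)  # data reaches the OS buffers, not yet the disk
        now = time.monotonic()
        if self.policy == "always":
            self._fsync(now)  # durable before we would reply to the client
        elif self.policy == "everysec" and now - self.last_fsync >= 1.0:
            self._fsync(now)  # at most about one second of writes at risk
        # "never": the OS decides when to flush

    def _fsync(self, now):
        os.fsync(self.fd)
        self.last_fsync = now
        self.fsync_calls += 1

    def close(self):
        os.close(self.fd)
```

With "always" every append pays for a disk flush, with "everysec" at most one append per second does, and with "never" none do.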
With fsync everysec, there is the problem that from time to time we need to fsync. Guess what? Even if we fsync in a different thread, write(2) will block anyway.
Usually this does not happen, as the disk is mostly idle. But once you start compacting the log with the BGREWRITEAOF command, the disk I/O increases as there is a Redis child trying to perform the compaction, so the fsync() will start to be slow.
How to fix that? For now we introduced in Redis 2.2 an option that will not fsync the AOF file while writing IF there is a compaction in progress.
In the future we'll try to find ways, even Linux-specific ones, to fsync without blocking. We just want to tell the kernel: please flush the current buffers, but while you are doing so, new writes should go into the write buffer, so don't delay new writes if the fsync in progress is not yet completed. This way we can just fsync every second in another thread.
Another option is to write+fsync the AOF log in a different process, talking with the main process via a pipe. Guess what? The current setup at Bump is somewhat doing this already with the master->slave setup. But there should be no need to do this.
So surely things will improve.
About diskstore, I think this is better suited for a different use case, that is: big data, much bigger than RAM, but mostly reads, and a need to restart the server without loading everything in memory. So I think Bump is already using Redis in the best way, we just need to improve the fsync() process.
> With fsync everysec, there is the problem that from time to time we need to fsync. Guess what? Even if we fsync in a different thread, write(2) will block anyway
Yep, but this could be avoided if a thread was devoted to all I/O incl. write() (and then line-level buffering really would be possible as well). Communication with this thread would be on a thread-safe queue--the main thread would never block on disk I/O, and only two threads would mean mutex contention for the queue lock would be low. This would be one solution, correct? This is a variation of your "two processes + pipe" suggestion.
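A rough sketch of that idea, assuming a single writer thread fed by a thread-safe queue. `BackgroundWriter` and its methods are hypothetical names, not anything in Redis; the point is only that `append()` never touches the disk from the calling thread.

```python
import os
import queue
import threading


class BackgroundWriter:
    """All write()/fsync() calls happen on one dedicated thread."""

    _STOP = None  # sentinel telling the writer thread to exit

    def __init__(self, path):
        self._queue = queue.Queue()  # thread-safe, unbounded
        self._fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def append(self, record: bytes):
        # Called from the main thread: only enqueues, so it cannot block on disk I/O.
        self._queue.put(record)

    def _run(self):
        while True:
            record = self._queue.get()
            if record is self._STOP:
                break
            os.write(self._fd, record)
            if self._queue.empty():
                os.fsync(self._fd)  # flush once the backlog is drained
        os.fsync(self._fd)

    def close(self):
        self._queue.put(self._STOP)
        self._thread.join()
        os.close(self._fd)
```

The trade-off is that the queue is unbounded: if the disk stalls for long, records pile up in memory instead of blocking the main thread, and anything still queued is lost on a crash.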
> How to fix that? For now we introduced in Redis 2.2 an option that will not fsync the AOF file while writing IF there is a compaction in progress.
Well, we enabled that.. but, we found that it's still a problem in a couple of circumstances:
1. Something other than the AOF recompaction makes the disk busy. Like, say, even a moderate amount of disk activity by another process.
2. Redis's own logging to stdout, if redirected to a file, itself can cause the redis main thread to block if stdout is being flushed onto a busy disk.
Basically, if any I/O which may hit a disk (AOF record/flush or even logging) is being done on the single epoll-driven thread redis uses to process incoming requests, the system must make very good guarantees that those I/O calls will not block. We have found these guarantees practically impossible to make on a very busy master, so we've given up on having the master do AOF work altogether.
Exactly, the logging could well be moved to a thread for better performance, thanks for the hint!
About the other scenarios where fsync will perform poorly, indeed every other I/O is going to be a problem.
I guess "all the AOF business in a different thread" is probably the most sensible approach to follow, unless there is a syscall (even a Linux-specific one) that is able to force a commit of old data without blocking new writes.
I can say, empirically, that none of the many, many challenges we've had building and scaling Bump have been related to Redis's capabilities as a messaging bus. So "good enough" wins again.
I've often thought it would be useful to have a redis equivalent of MongoDB's capped collections, specifically to make things like recent activity logs easier to implement. At the moment you can simulate it with an rpush followed by an ltrim, but it would be nice if using two commands wasn't necessary.
Sending LPUSH+LTRIM in a pipeline is the same as having a special command for this. But having a special command for this, and for other use cases, makes Redis somewhat less general. What I mean is that if we consider every added feature a cost (complexity cost, not development cost), why not instead add a feature that allows for a use case currently not covered?
Btw there is an interesting pattern so that you actually rarely need to send the LTRIM. Imagine this: you want a list to save the user timeline, and you are interested only in the latest 100 messages. So for every entry you can LPUSH+LTRIM. But after all you can just LTRIM 10% of the time. Your list will fluctuate in length between roughly 100 and 110, but as you access things using LRANGE the additional elements will not create any problem. So the cost of the LTRIM, while already very very small, can be made 90% smaller with this simple trick.
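A small sketch of the occasional-trim pattern. To keep it runnable without a server, `FakeList` is an in-memory stand-in that mimics only the LPUSH, LTRIM and LRANGE behaviour needed here (non-negative indexes only); `push_capped`, `max_len` and `trim_probability` are hypothetical names.

```python
import random


class FakeList:
    """In-memory stand-in for a Redis list, for illustration only."""

    def __init__(self):
        self.items = []

    def lpush(self, value):
        self.items.insert(0, value)  # newest element at the head

    def ltrim(self, start, stop):
        # Like LTRIM, keep the inclusive range start..stop (non-negative indexes).
        self.items = self.items[start:stop + 1]

    def lrange(self, start, stop):
        # Like LRANGE, return the inclusive range start..stop.
        return self.items[start:stop + 1]


def push_capped(lst, value, max_len=100, trim_probability=0.1, rng=random):
    """LPUSH every time, but LTRIM only on a random fraction of the pushes."""
    lst.lpush(value)
    if rng.random() < trim_probability:
        lst.ltrim(0, max_len - 1)
```

Readers always ask for `lrange(0, 99)`, so the extra tail elements are never seen. With a random trim the list length is bounded only on average; trimming on every tenth push instead would give a hard bound of 110.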
I'd be interested in hearing if they tried to use Scribe for the same task and found it wanting, or if there's some other story.
To answer your question, simply put, no one here had heard about Scribe.
I do have mongodb replicated across two other machines, but could you briefly shed light on what the problems between redis and mongo on a single box were?
Well, mainly because Scribe was purpose-built to do log aggregation on a large scale, and has nice features to prevent data loss in the event of network and node failure. It's also pretty well-tested at this point, given its origins and community. Check the wiki to which I linked.
I didn't mean my comment to imply anything negative. I was just trying to point out to the parent comment that there's now a better option than rolling a custom log aggregator on top of Redis. That may not have been true when you started your system. Mea culpa.
I'd rather spend a day compiling libraries than spend a week re-writing a piece of basic infrastructure.
With Redis in an entry level Linux box you can process 100k messages per second per core. I'm not sure if current AMQP systems can handle this amount of messages with commodity hardware.
Another reason why Redis can be a good approach I think is that it is a simple system: simple to understand, simple to deploy, very stable. There is support for Pub/Sub. It also supports many other data types, so for instance, want a priority queue? Do it with a sorted set. Want some special semantics for your message passing? Possibly you can do it with BRPOPLPUSH. And so forth.
A common case of this flexibility is shown by RPOPLPUSH (without the "B", so the non-blocking version). Using this command with the same source and destination list will "rotate" the list, providing a different element to every consumer, but the data will remain on the list. At the same time producers can push against the list. This has interesting real world applications when things must be processed again and again (fetching RSS feeds, for instance).
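The rotation can be illustrated with a plain in-memory deque standing in for the Redis list. The `rpoplpush` function below is a local imitation of the command's semantics, not a client call, and the feed names are made up.

```python
from collections import deque


def rpoplpush(source: deque, destination: deque):
    """Imitate RPOPLPUSH: pop the tail of source, push it on the head of destination."""
    if not source:
        return None  # empty source list: nothing to move
    item = source.pop()
    destination.appendleft(item)
    return item


# Same list as source and destination: each call hands out the next element
# and puts it back at the head, so nothing is ever removed.
feeds = deque(["feed-a", "feed-b", "feed-c"])
handed_out = [rpoplpush(feeds, feeds) for _ in range(4)]

# A producer can still push new work while consumers rotate.
feeds.appendleft("feed-d")
```

Each consumer call gets a different element until the list wraps around, and a newly pushed element is reached after the elements queued ahead of it have been handed out.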
So Redis is a pretty flexible tool for messaging. I think there are surely great use cases for AMQP, but there are big overlaps with Redis, and also use cases where Redis is definitely the better alternative.
AMQP does have extra capabilities and is a good messaging system and has advanced routing features, but you need to learn what exchanges, queues and bindings are and how they relate before it is useful.
I've used rabbit in production on multiple systems and it is still running on some of those. But I have switched to redis for most of my messaging needs because of the built in data structures and persistence. It makes it a much more versatile server and it is much easier to admin and much more stable. Rabbit is too easy to push over when you run it out of memory.
But I had to chime in and refute your 100k messages per core on rabbit. 20k maybe with the java client, more like 5-7k with a ruby or python client.
I can still get 80k/sec with a ruby client on redis.
The two servers are very different, redis is a swiss army knife of data structures over the network, that is why it is so useful. AMQP and rabbit are targeted more at enterprise messaging and integration where raw speed doesn't matter as much as complex hierarchies of brokers and middleware.
Until it grinds to a halt for no obvious reason</snark>
We've dropped RabbitMQ on a project a while ago not for performance but for stability and opacity reasons.
I went into detail about the issues in a post a few months ago so I'll skip that here. My take-home is that unless you have complex routing requirements that can only be realized with AMQP then you should think long and hard if you really want the aircraft carrier of a component that is RabbitMQ in your dependency list.
Redis/Resque, beanstalk and Celery cover the overwhelming range of use-cases just fine and are complete no-brainers to operate in contrast to wrestling with AMQP topologies, flakey client libraries and an opaque (to most) erlang blackbox.
https://github.com/jamwt/diesel/blame/e360313d3950a952110b1d...