Redis at Bump: many roles, best practices, and lessons learned (devblog.bu.mp)

94 points by jmintz 15 years ago | 31 comments

antirez 15 years ago |

Thank you for writing this article. As a way to show my appreciation I want to focus on the bad side of the matter.

The article mentions that with AOF persistence there is a problem with fsync. I'll try to go into further detail here.

Basically when using Redis AOF you can select among three levels of fsync. The first is 'fsync always', which calls fsync after every command, before returning the OK message to the client. This gives bank-like assurance that data was written to disk, but it is very, very slow. Not what most users want.

Then there is 'fsync everysec', which just calls fsync every second. This is what most users want. And finally 'fsync never', which lets the OS decide when to flush things to disk. With the default Linux configuration, writing buffers to disk can be delayed by up to 30 seconds.

So with fsync never, there are no problems: everything will be super fast, but durability is not great.
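The three policies can be sketched as a toy append-only log. This is only an illustration under stated assumptions, not Redis code: the class `AppendLog`, the policy strings, and the file names are hypothetical, and the "everysec" branch checks the clock on each append instead of using a timer or a background thread.

```python
import os
import tempfile
import time


class AppendLog:
    """Toy append-only log with three fsync policies (illustrative only)."""

    def __init__(self, path, policy="everysec"):
        if policy not in ("always", "everysec", "never"):
            raise ValueError(f"unknown policy: {policy}")
        self.policy = policy
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        self.last_sync = time.monotonic()
        self.fsync_calls = 0

    def append(self, record: bytes) -> None:
        # write(2) only hands the data to the OS page cache.
        os.write(self.fd, record + b"\n")
        if self.policy == "always":
            # Durable before we would acknowledge the client, but slow.
            self._sync()
        elif self.policy == "everysec":
            # At most about one second of writes can be lost.
            if time.monotonic() - self.last_sync >= 1.0:
                self._sync()
        # "never": the OS decides when dirty buffers reach the disk.

    def _sync(self) -> None:
        os.fsync(self.fd)
        self.last_sync = time.monotonic()
        self.fsync_calls += 1

    def close(self) -> None:
        os.close(self.fd)


def demo():
    """Append three records under each policy and count the fsync calls."""
    counts = {}
    with tempfile.TemporaryDirectory() as tmp:
        for policy in ("always", "everysec", "never"):
            log = AppendLog(os.path.join(tmp, policy + ".aof"), policy)
            for i in range(3):
                log.append(b"SET key%d value" % i)
            counts[policy] = log.fsync_calls
            log.close()
    return counts
```

Calling `demo()` shows the trade-off in miniature: "always" pays one fsync per record, while "never" pays none and relies entirely on the OS.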

With fsync everysec, there is the problem that from time to time we need to fsync. Guess what? Even if we fsync in a different thread, write(2) will block anyway.

Usually this does not happen, as the disk is mostly idle. But once you start compacting the log with the BGREWRITEAOF command, disk I/O increases because a Redis child process is performing the compaction, so fsync() starts to be slow.

How to fix that? For now we introduced in Redis 2.2 an option that will not fsync the AOF file while writing IF there is a compaction in progress.

In the future we'll try to find ways, even Linux-specific ones, to fsync without blocking. We just want to tell the kernel: please flush the current buffers, but while you are doing so, new writes should go into the write buffer, so don't delay new writes if the fsync in progress is not yet completed. This way we can just fsync every second in another thread.

Another option is to write+fsync the AOF log in a different process, talking with the main process via a pipe. Guess what? The current setup at Bump is somewhat doing this already with the master->slave setup. But there should be no need to do this.

So surely things will improve.

About diskstore, I think it is better suited for a different use case, that is: big data, much bigger than RAM, but mostly reads, and a need to restart the server without loading everything into memory. So I think Bump is already using Redis in the best way; we just need to improve the fsync() process.

jamwt 15 years ago | |

Hi Antirez, thanks again for Redis. Despite our few problems with it, it rocks. A few comments:

> With fsync everysec, there is the problem that from time to time we need to fsync. Guess what? Even if we fsync in a different thread, write(2) will block anyway.

Yep, but this could be avoided if a thread was devoted to all I/O incl. write() (and then line-level buffering really would be possible as well). Communication with this thread would be on a thread-safe queue--the main thread would never block on disk I/O, and only two threads would mean mutex contention for the queue lock would be low. This would be one solution, correct? This is a variation of your "two processes + pipe" suggestion.
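A minimal sketch of that idea, assuming a single writer thread fed by a thread-safe queue. The class name `LogWriterThread` and the batching behaviour are hypothetical choices for illustration, not how Redis implements AOF.

```python
import os
import queue
import tempfile
import threading


class LogWriterThread:
    """Owns all disk I/O; the caller's thread only enqueues records."""

    def __init__(self, path):
        self._queue = queue.Queue()
        self._fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def append(self, record: bytes) -> None:
        # Called from the main (event loop) thread: never touches the disk.
        self._queue.put(record)

    def _run(self) -> None:
        running = True
        while running:
            batch = [self._queue.get()]  # block until there is work
            while True:  # then drain whatever else is already queued
                try:
                    batch.append(self._queue.get_nowait())
                except queue.Empty:
                    break
            if None in batch:  # sentinel: stop after flushing this batch
                running = False
                batch = batch[: batch.index(None)]
            if batch:
                # Both calls may block on a busy disk, but only this thread waits.
                os.write(self._fd, b"".join(r + b"\n" for r in batch))
                os.fsync(self._fd)

    def close(self) -> None:
        self._queue.put(None)
        self._thread.join()
        os.close(self._fd)
```

One design caveat: the queue here is unbounded, so if the disk stays slower than the incoming write rate, memory grows. A real implementation would need a bound or some form of back-pressure.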

> How to fix that? For now we introduced in Redis 2.2 an option that will not fsync the AOF file while writing IF there is a compaction in progress.

Well, we enabled that.. but, we found that it's still a problem in a couple of circumstances:

1. Something other than the AOF recompaction makes the disk busy. Like, say, even a moderate amount of disk activity by another process.

2. Redis's own logging to stdout, if redirected to a file, itself can cause the redis main thread to block if stdout is being flushed onto a busy disk.

Basically, if any I/O which may hit a disk (AOF record/flush or even logging) is being done on the single epoll-driven thread redis uses to process incoming requests, the system must make very good guarantees that those I/O calls will not block. We have found these guarantees practically impossible to make on a very busy master, so we've given up on having the master do AOF work altogether.

antirez 15 years ago | | |

Thanks for the in-depth reply,

Exactly, the logging could well be done in a separate thread for better performance, thanks for the hint!

About the other scenarios where fsync will perform poorly, indeed every other I/O is going to be a problem.

I guess "all the AOF business in a different thread" is probably the most sensible approach to follow, unless there is a syscall (even a Linux-specific one) that is able to avoid blocking and just force the commit of old data.

rbranson 15 years ago | |

You could also do AOF with group commit where every N milliseconds you do an fsync and only ack write commands after the fsync completes. I hacked group commit for Redis:

https://github.com/rbranson/redis
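As a rough sketch of what group commit means here (not the code in the linked repository, which is not reproduced): writes are appended immediately, a background thread calls fsync every few milliseconds, and each write is acknowledged only after an fsync that covers it. The class `GroupCommitLog` and its `interval_ms` parameter are hypothetical.

```python
import os
import tempfile
import threading


class GroupCommitLog:
    """Append now, fsync every interval_ms, ack each write after its fsync."""

    def __init__(self, path, interval_ms=5):
        self._fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        self._interval = interval_ms / 1000.0
        self._lock = threading.Lock()
        self._pending = []  # acks waiting for the next fsync
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._commit_loop, daemon=True)
        self._thread.start()

    def write(self, record: bytes) -> threading.Event:
        """Append a record; the returned event is set once it is on disk."""
        done = threading.Event()
        with self._lock:
            os.write(self._fd, record + b"\n")
            self._pending.append(done)
        return done

    def _commit_loop(self) -> None:
        # Event.wait returns False on timeout, True once stop is requested.
        while not self._stop.wait(self._interval):
            self._commit()
        self._commit()  # final commit on shutdown

    def _commit(self) -> None:
        with self._lock:
            batch, self._pending = self._pending, []
        if batch:
            # Every record in the batch was written before the swap above,
            # so this single fsync covers all of them.
            os.fsync(self._fd)
            for done in batch:
                done.set()

    def close(self) -> None:
        self._stop.set()
        self._thread.join()
        os.close(self._fd)
```

The cost of one fsync is shared by every write that arrived during the interval, so latency per write rises by at most roughly the interval plus the fsync time, while throughput is no longer limited to one fsync per command.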

rbranson 15 years ago |

I don't get what is so difficult about AMQP. These are clearly talented programmers, so what gives? Even if you're not a bank, simple features like message timeouts can make your infrastructure tremendously more resilient.

jamwt 15 years ago | |

I'd chalk it up to the general benefits of eschewing needless complexity.

I can say, empirically, none of the many, many challenges we've had building and scaling Bump have been related to Redis's capabilities as a messaging bus. So "good enough" wins again.

rbranson 15 years ago | | |

FWIW, I'd encourage you to still take a deeper look at AMQP, just because it includes features you may not know you need. While I can't pretend to know anything about your scaling challenges or the intimate details of how messaging is utilized, I can say that the lessons of deep experience with messaging are baked into AMQP. You may have to implement some of these features in the future (or perhaps you already have). I know that I gave AMQP the cold shoulder for a while, only to finally come around and find out it solved many of the frustrations we were facing.

bdr 15 years ago | |

Agreed. Also, the ability to scale out easily, assuming you're using RabbitMQ.

LiveTheDream 15 years ago |

In January, Bump reported allocating a whopping 700GB of RAM for redis[1].

[1] http://devblog.bu.mp/haskell-at-bump

ahuibers 15 years ago | |

We (Bump) have 12 redis machines now with 72 or 96GB each. 6 masters and 6 slaves. The slaves are hot spares and persist to disk, per the blog.

rkalla 15 years ago | | |

Given the inherently small footprint of Redis, your data sets are HUGE. Looking forward to reading the Mongo article when it is ready and how it is performing.

simonw 15 years ago |

I really like the idea of pushing log messages in to a redis list and then flushing them out to disk with another process.

I've often thought it would be useful to have a redis equivalent of MongoDB's capped collections, specifically to make things like recent activity logs easier to implement. At the moment you can simulate it with an rpush followed by an ltrim, but it would be nice if using two commands wasn't necessary.
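A small sketch of the log-through-a-list idea, assuming a client object that exposes `rpush` and `lpop` the way redis-py does. The `FakeListClient` stand-in, the key name `app:log`, and the helper names are hypothetical and exist only so the example runs without a server. With a real redis-py client, values come back as bytes unless it is created with `decode_responses=True`.

```python
import io


class FakeListClient:
    """In-memory stand-in for the two Redis list commands used below."""

    def __init__(self):
        self._lists = {}

    def rpush(self, name, *values):
        items = self._lists.setdefault(name, [])
        items.extend(values)
        return len(items)

    def lpop(self, name):
        items = self._lists.get(name)
        if not items:
            return None
        return items.pop(0)


def log_message(client, message, key="app:log"):
    """Producer side: a cheap in-memory push, no disk I/O in the app."""
    client.rpush(key, message)


def drain_log(client, out, key="app:log", max_items=1000):
    """Consumer side: a separate process pops messages and writes them out."""
    written = 0
    while written < max_items:
        message = client.lpop(key)
        if message is None:  # list is empty
            break
        out.write(message + "\n")
        written += 1
    return written


def demo():
    client = FakeListClient()
    for line in ("request started", "request finished"):
        log_message(client, line)
    out = io.StringIO()
    count = drain_log(client, out)
    return count, out.getvalue()
```

Pushing on the right and popping on the left keeps messages in arrival order, and the slow file writes happen only in the draining process.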

antirez 15 years ago | |

Hello Simon,

sending LPUSH+LTRIM in a pipeline is the same as having a special command for this. But having a special command for this, and for other use cases, makes Redis somewhat less general. What I mean is that if we consider every added feature a cost (complexity cost, not development cost), why not instead add a feature that allows for a use case currently not covered?

Btw there is an interesting pattern so you actually rarely need to send the LTRIM. Imagine this: you want a list to save the user timeline, and you are interested only in the latest 100 messages. So for every entry you can LPUSH+LTRIM. But after all you can just LTRIM 10% of the time. Your list will fluctuate in length between 100 and 110, but as you access things using LRANGE the additional elements will not create any problem. So the cost of the LTRIM, while already very very small, can be made 90% smaller with this simple trick.
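The occasional-trim pattern can be sketched as follows, again with a hypothetical in-memory `FakeListClient` that mimics the `lpush`, `ltrim`, `lrange`, and `llen` calls of redis-py so the example runs without a server. The helper names, the key, and the use of a random draw to decide when to trim are assumptions. With a random draw the list averages about 10 extra entries rather than being strictly capped at 110.

```python
import random


class FakeListClient:
    """In-memory stand-in for the Redis list commands used below."""

    def __init__(self):
        self._lists = {}

    def lpush(self, name, *values):
        items = self._lists.setdefault(name, [])
        for value in values:
            items.insert(0, value)  # newest element goes to the head
        return len(items)

    def ltrim(self, name, start, stop):
        # Keep only the elements in [start, stop], both inclusive.
        items = self._lists.get(name, [])
        self._lists[name] = items[start : stop + 1]

    def lrange(self, name, start, stop):
        return self._lists.get(name, [])[start : stop + 1]

    def llen(self, name):
        return len(self._lists.get(name, []))


def push_capped(client, key, value, cap=100, trim_probability=0.1, rng=random):
    """LPUSH every time, but LTRIM only on a fraction of the calls."""
    client.lpush(key, value)
    if rng.random() < trim_probability:
        client.ltrim(key, 0, cap - 1)


def latest(client, key, count=100):
    """Readers always ask for the first `count` items, so extras are harmless."""
    return client.lrange(key, 0, count - 1)
```

Because readers use LRANGE with a fixed window, the result is the same whether or not the most recent push happened to trigger a trim.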

conorh 15 years ago | | |

I use this exact pattern at Boxcar to keep some lists short and it works great for us.

timr 15 years ago | |

I don't know...that part sounded like a hacked-up, half-implementation of scribe: https://github.com/facebook/scribe/wiki

I'd be interested in hearing if they tried to use Scribe for the same task and found it wanting, or if there's some other story.

jamwt 15 years ago | | |

Could you say more about why Bump's implementation of network-based queued logging is "hacked-up" while facebook's (by implication) isn't?

To answer your question, simply put, no one here had heard about Scribe.

jubos 15 years ago | | |

scribe is a very powerful logging tool, but it also comes with its dependency costs. Compiling boost, thrift, fb303, and all the scribe logging libraries as well. If you are already a thrift shop, it can make a lot of sense, but otherwise, there is a lot of legwork to get it up and running.

ladon86 15 years ago |

OK, so I'm running mongodb on the same machine as redis.

I do have mongodb replicated across two other machines, but could you briefly shed light on what the problems between redis and mongo on a single box were?

wmoss 15 years ago | |

The quick answer is that MongoDB mmaps its entire data set, so if you've got more data than ram (likely) the OS is going to constantly have all excess ram allocated to Mongo. This becomes an issue because Redis (very reasonably) assumes that malloc will return quickly; however, if the OS decides it's going to give Redis a dirty page, that malloc call just became disk bound.

sfphotoarts 15 years ago |

I was curious about the Redis sets used for social graph storage and using intersection of sets to find nodes in common. Would anyone have an idea about the complexity of this as both the number of nodes in the graph and the number of edges each node has grow?

delano 15 years ago | |

You can find out: write a script to populate sets in Redis and run some commands. You'll likely find it's fast enough for your needs.
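For a rough feel before measuring: the Redis command reference lists SINTER as O(N*M) in the worst case, where N is the cardinality of the smallest set and M is the number of sets. So the cost tracks how many edges the less-connected node has, not the total number of nodes in the graph. The sketch below shows the same idea in plain Python; the `graph` dictionary and the function name are hypothetical.

```python
def common_neighbors(graph, a, b):
    """Intersect two adjacency sets by scanning the smaller one.

    Each membership test on the larger set is O(1) on average, so the
    work is proportional to the size of the smaller set.
    """
    small, large = sorted((graph.get(a, set()), graph.get(b, set())), key=len)
    return {node for node in small if node in large}


# Hypothetical social graph: user -> set of friends.
graph = {
    "alice": {"bob", "carol", "dave"},
    "bob": {"alice", "carol"},
    "carol": {"alice", "bob"},
}
```

Here `common_neighbors(graph, "alice", "bob")` scans only bob's two friends, no matter how many users exist in total.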

moe 15 years ago |

Just a note about logging: It seems you're making it harder than it needs to be. Syslog supports remote-logging.
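As an illustration of remote syslog from application code, here is a minimal sketch using Python's standard `logging.handlers.SysLogHandler`, which sends over UDP by default when given a host and port. The function name, logger name, host, and port are placeholder assumptions, and nothing here reflects how Bump's logging was actually set up.

```python
import logging
import logging.handlers


def make_remote_logger(name, host="localhost", port=514):
    """Return a logger that forwards records to a syslog daemon over UDP."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    # address=(host, port) selects a network socket; UDP is the default.
    handler = logging.handlers.SysLogHandler(address=(host, port))
    handler.setFormatter(logging.Formatter("%(name)s: %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger
```

UDP delivery is fire-and-forget, so the application does not wait on the log server, but messages can be dropped if the network or the daemon is overloaded.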