Open-sourcing a 10x reduction in Apache Cassandra tail latency (engineering.instagram.com)

408 points by mikeyk 8 years ago | 164 comments

the8472 8 years ago |

> The graph shows that a Cassandra server instance could spend 2.5% of runtime on garbage collections instead of serving client requests. The GC overhead obviously had a big impact on our P99 latency

No, this is not obvious. If you have a fully concurrent GC then spending 25 out of 1000 CPU cycles on memory management does not "obviously" have an impact on your 99th percentile latency. It would primarily impact your throughput (by 2.5%), just like any other thing consuming CPU cycles.

> We defined a metric called GC stall percentage to measure the percentage of time a Cassandra server was doing stop-the-world GC (Young Gen GC) and could not serve client requests.

Again, this metric doesn't tell you anything if you don't know how long each of the pauses are. If they are at the limit infinitesimally small then you are again only measuring the impact on throughput, not latency.

Certainly, GCs with long STW pauses do impact latency, but then you need to measure histograms of absolute pause times, not averages of ratios relative to application time. That's just a silly metric.
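To make the distinction concrete, here is a minimal sketch that computes both the ratio metric and the absolute pause distribution. The two pause logs are hypothetical, made up for illustration rather than taken from the article, and the helper names are assumptions of this sketch. Both logs give the same 2.5% stall percentage over a 10-second window, yet their worst pauses differ by a factor of 250.

```python
import math

def stall_percentage(pauses_ms, window_ms):
    """Share of the observation window spent in stop-the-world pauses."""
    return 100.0 * sum(pauses_ms) / window_ms

def percentile(values, pct):
    """Nearest-rank percentile (pct in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical pause logs over the same 10-second window.
WINDOW_MS = 10_000
many_short = [1] * 250   # 250 pauses of 1 ms each
one_long = [250]         # a single 250 ms pause

for name, log in (("many_short", many_short), ("one_long", one_long)):
    print(name,
          f"stall={stall_percentage(log, WINDOW_MS):.1f}%",
          f"p99_pause={percentile(log, 99)}ms",
          f"max_pause={max(log)}ms")
```

The ratio alone cannot tell these two logs apart; the pause percentiles can.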

And neither does the article mention which JVM or GC they're using. Absent further information they might have gotten their 10x improvement relative to some especially poor choice of JVM and GC.

_ivvf 8 years ago | |

You clearly didn't read the post very closely. They said 2.5% of CPU cycles were spent on stop-the-world young generation collections, not on the sum total of all memory management. That means that 2.5% of the time the app is entirely stalled on just these collections. Given that stop-the-world pauses are never evenly distributed throughout time, it should be very much expected that this much GC stalling would affect p99 latencies.

It's pretty much accepted everywhere that GCs perform terribly for databases. Modern GCs are great at handling small, very short-lived memory allocations, and that's about it. Just about any other workload and manual memory management ends up being a much better use of your time than GC tuning.

b4lancesh33t 8 years ago | | |

> Given that stop-the-world pauses are never evenly distributed throughout time

That is not a given. And, even distribution is only part of the equation. If they are sufficiently short, then even being somewhat unevenly distributed should not have much of an impact on latency. For example, if the max length of a pause were 1ms, and 99p latency were 15ms, you'd have to be fairly unlucky to see a 33% increase in latency99 due to GC. That would entail 5 of 25 pauses happening during a 20ms period in a 1s window.
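One rough way to put a number on "fairly unlucky" is a Monte Carlo sketch. It assumes, purely for illustration, that the 25 pause start times in each second are independent and uniformly distributed, and it ignores the 1ms pause length itself. Real collectors trigger on allocation pressure, so this is not a model of any actual JVM; the function names are likewise made up here.

```python
import random

def has_cluster(times, k, window):
    """True if some interval of length `window` contains at least k of the sorted times."""
    return any(times[i + k - 1] - times[i] <= window
               for i in range(len(times) - k + 1))

def cluster_probability(trials=20_000, pauses=25, span_ms=1000.0,
                        k=5, window_ms=20.0, seed=0):
    """Estimate how often k of `pauses` uniform pause starts land within window_ms."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        times = sorted(rng.uniform(0.0, span_ms) for _ in range(pauses))
        hits += has_cluster(times, k, window_ms)
    return hits / trials

print(f"estimated probability per 1s window: {cluster_probability():.3f}")
```

Note that this estimates how often such a cluster exists anywhere in a given second. Only the requests in flight during that cluster would see the extra 5ms, so the share of affected requests is smaller still.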

(This idea is not purely hypothetical. For example, Go's GC has very low STW periods.)

> It's pretty much accepted everywhere

Eh. Apparently everyone thinks C is the best language for cryptography and other secure but not particularly perf sensitive code. Go figure. Sometimes the wisdom of the masses is not wisdom. Best not to appeal to it during argumentation.

nvarsj 8 years ago | | |

So why do people keep building latency sensitive things in the JVM? And then they manage to get hugely popular?

Cassandra is a constant struggle with the GC. I’d guess the cost of running it is at least an order of magnitude greater compared to if it had been implemented in c++ or something more sensible.

foolfoolz 8 years ago | |

classic hacker news comment.

this thing you built and open sourced, has gotten you real measurable results? allow me to list the many ways you’re probably wrong and doing it incorrectly

discoursism 8 years ago | | |

Measurable results are all well and good, but it can be helpful to know how the baseline was established. Measurable results aren't "portable" without a well-established baseline.

fdeliege 8 years ago | |

Both code and benchmark are open sourced. We'd love to hear how it performs for you.

viraptor 8 years ago | | |

This is a valid criticism of the methodology / explanation. It's not about the results. You can agree with the positive results (and they're great! - you've done awesome work and clearly show an improvement) and still say the explanation how/why they were achieved is not great.

teacpde 8 years ago | |

> If you have a fully concurrent GC then spending 25 out of 1000 CPU cycles on memory management does not "obviously" have an impact on your 99th percentile latency.

I'm trying to understand the meaning. Is it saying the latency caused by GC is applied to all requests, not just the ones that observe 99th percentile latency?

dtparr 8 years ago | | |

It's saying that whether it affects latency or just throughput depends on how those pauses are distributed in absolute terms, not just the ratio. There's a big difference in 99th percentile latency between a 1ms pause every 40ms and a 10 second pause every 400 seconds, but they both work out to 2.5% by the ratio metric.

So yes, at the `infinitesimally small` end, time would be 'stolen' evenly from all request threads and would not be a contributing factor to the 99th percentile.
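A small simulation can illustrate the point. The sketch below uses pause schedules chosen here so that both stall exactly 2.5% of the time (1ms every 40ms, and 10s every 400s), and it assumes a toy model: each request needs 15ms of run time, arrives at a uniformly random moment, and is delayed only by pauses that overlap it, with no queueing. All times are integer microseconds to avoid floating-point drift.

```python
import random

def latency_us(arrival, service, period, pause):
    """Latency of a request needing `service` us of run time when the process
    is stalled during [k*period, k*period + pause) for every integer k."""
    t, remaining = arrival, service
    while True:
        phase = t % period
        if phase < pause:        # arrived (or resumed) inside a pause
            t += pause - phase
            continue
        run = period - phase     # run time available before the next pause
        if remaining <= run:
            return t + remaining - arrival
        t += run
        remaining -= run

def p99(values):
    """Nearest-rank 99th percentile."""
    ordered = sorted(values)
    return ordered[(99 * len(ordered) + 99) // 100 - 1]

def simulate(period, pause, service=15_000, requests=100_000, seed=1):
    rng = random.Random(seed)
    horizon = 10 * period
    return p99([latency_us(rng.randrange(horizon), service, period, pause)
                for _ in range(requests)])

short = simulate(period=40_000, pause=1_000)            # 1 ms every 40 ms
long_ = simulate(period=400_000_000, pause=10_000_000)  # 10 s every 400 s
print(f"p99 with 1 ms pauses: {short / 1000:.1f} ms")
print(f"p99 with 10 s pauses: {long_ / 1000:.1f} ms")
```

In this model the short-pause schedule can add at most 1ms to any request, while the long-pause schedule stalls roughly 2.5% of requests for up to 10 seconds, which is more than enough to dominate the 99th percentile.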

the8472 8 years ago | | |

No, that would be an incremental GC working in very small time slices.

A concurrent GC spends CPU cycles on different cores to do its work, which means it will not cause latency outliers in the threads processing the requests. They are still CPU cycles you don't have to serve other requests, hence they still affect throughput.

That is a simplified explanation of course, there are a lot of caveats.

In my original post I was mostly speaking about the measurement though, since they are measuring throughput when they are concerned about latency. Those are somewhat related, but depending on circumstances only weakly so.

dikanggu 8 years ago |

We do want to contribute our work back to the Cassandra upstream, instead of keeping it as a fork. So that more users from C* community can benefit from the improvements. The pluggable storage engine is an ambitious project (https://issues.apache.org/jira/browse/CASSANDRA-13474). Any help will be appreciated!

russellspitzer 8 years ago | |

Saw you talking about this on the Distributed Data Show

https://academy.datastax.com/content/distributed-data-show-e...

gfosco 8 years ago |

RocksDB is used all over Facebook, powers the entire social graph. Great storage engine that pairs well with multiple DBMS: MySQL, Mongo, Cassandra... We'll be at Percona Live 2018 in April, giving several talks, and are looking forward to hanging out and talking with users in our lounge area. We're working hard to support our open source community as well! https://github.com/facebook/rocksdb

openasocket 8 years ago |

I'm not an expert on these things, but it seems to me if you're implementing a database in Java you wouldn't want to keep your data on the JVM Heap, as this seems to indicate. My understanding is that in most applications (like servers) the average object lives for a very short period of time, and most GC implementations are built from that idea. But, in a database, especially an in-memory database, the majority of the objects are going to live for a very long time. That makes the mark phase of GC a lot more expensive, puts more pressure on the generations, etc.

Is my guess here correct, or are there things I'm missing or mistaken on?

haglin 8 years ago |

"To reduce the GC impact from the storage engine, we considered different approaches and ultimately decided to develop a C++ storage engine to replace existing ones."

I wonder how the numbers would have looked with the new low latency GC for Hotspot (ZGC). https://wiki.openjdk.java.net/display/zgc/Main

Early results from SPECjbb2015 are impressive. https://youtu.be/tShc0dyFtgw?t=5m1s

tibbetts 8 years ago | |

Yes, also Azul Zing. Really anytime someone says they have a problem with GC and suggests spending a million dollars of engineer time building a new system, they should consider Zing first. It works and is a way more efficient way of spending money to fix GC latency problems.

ADefenestrator 8 years ago | | |

For a small to medium sized shop, sure. For someplace with thousands or tens of thousands of nodes, the new system ends up cheaper in the long run.

majidazimi 8 years ago | | |

Because GC-related issues don't go from a "problem" state to a "solved" state. It is just a never-ending stream of issues that the team needs to resolve, especially in the database realm, where metrics are hugely workload dependent.

itronitron 8 years ago | |

yes, a comparison across multiple JVMs would be nice

Thaxll 8 years ago |

Weird, did they try to use https://www.scylladb.com/?

tschellenbach 8 years ago |

For Stream's feed tech we also moved from Cassandra to an in-house solution on top of RocksDB. It's been a massive performance and maintenance improvement. This StackShare explains how Stream's stack works. It's based on Go, RocksDB and Raft: https://stackshare.io/stream/stream-and-go-news-feeds-for-ov...

3uclid 8 years ago |

Unrelated: as a CS undergrad, I read this article and was immediately inspired. This is definitely the type of work I want to be doing when I graduate (infrastructure engineering). But my next thought was: where do I start?!

Any advice?

en4bz 8 years ago | |

CMU Database Group Lectures: https://www.youtube.com/channel/UCHnBsf2rH-K7pn09rb3qvkA

therealdrag0 8 years ago | |

I'd say no matter what kind of job you get, you can put 10% of your time into similar problems. Even simple CRUD apps can have interesting problems like this. In my experience every project has instances of engineers shooting themselves in the foot, or unforeseen problems cropping up. If you have a bit of self-motivation you can dig into them and learn a lot and improve things. I do this and find it very satisfying.

jjirsa 8 years ago | |

Happy to help you get started working on Cassandra. http://cassandra.apache.org/doc/latest/development/patches.h... has some basic entry pointers. There’s also a dev mailing list that’s reasonably active.

ddorian43 8 years ago | |

Still in school ? (don't understand different <type>grad). See: GSOC Seastar Framework https://summerofcode.withgoogle.com/organizations/6190282903...

3uclid 8 years ago | | |

Yeah, still in school (3rd year). I have intern experience, but it seems like these types of positions are way too advanced for me at the moment. Just unsure how to progress...

StreamBright 8 years ago |

In a similar situation we just adjusted the GC and started to use G1GC, which resulted in similar numbers.

coryfoo 8 years ago | |

I bet that didn't take N engineers 12 months to build out, either

threeseed 8 years ago | | |

Cassandra uses G1GC by default.

If it was as simple as tweaking a few GC settings to get 10x improvement pretty sure Datastax would've done it by now.

StreamBright 8 years ago | | |

2 engineers, 2 weeks because we had to evaluate every change we made with production traffic.

fdeliege 8 years ago |

Join our meetup to chat with some of the developers: https://www.meetup.com/Apache-Cassandra-Bay-Area/events/2483...

jjirsa 8 years ago | |

So sad I’m not in town that week

en4bz 8 years ago |

Has anyone tried running Cassandra on Azul Zing[1]? The slowdown here is, not surprisingly, related to GC pauses, which Azul has eliminated in Zing.

[1] https://www.azul.com/products/zing/

rbranson 8 years ago | |

The licensing cost of Zing generally makes this a bad trade-off. It's much cheaper to purchase more hardware. Zing is targeted at vertically scaling very large JVM heaps, where it's valuable to have massive amounts of data on a single, big machine.

nitsanw 8 years ago | | |

As an ex-Azul employee I can say there's a good number of Azul clients using a Zing+Cassandra setup, so the price point is right for some people at the very least. Zing licence cost has also changed in recent years (3.5k per server last I looked, and that is before you haggle some bulk deal) so not sure if your impression is calibrated to that new price point.

jjirsa 8 years ago | |

Have friends who have used it, they report that it works reasonably well. Especially in p99.

truth_seeker 8 years ago | | |

By what factor/magnitude was p99 improved? Any idea?

spockz 8 years ago | |

Actually, it appears that is one of the premises[1] they sell Zing on.

[1]: https://www.azul.com/solutions/cassandra/

adrianratnapala 8 years ago |

As a Java scoffer trying to be fair-minded, I resisted the urge to joke that "it was the GC, stupid" and assumed that a big project like Cassandra had somehow worked around the GC latency problems.

But, what? It turns out the article is really about replacing Java with C++.

cestith 8 years ago | |

It's about using something in one language for its features and only porting the critical sections to C++ via a clean API. This is the sort of advice we've been giving people for decades. Choose the language for what you want to build, measure and profile performance if necessary, find the bottleneck on the hot path, decouple that from the bulk of the code, and reach to a lower level for performance only in that clearly defined section.

They managed to generalize one application that meets their feature needs to be a front end to another existing application with fewer features but better performance as a back end. They're optimizing their hot path by decoupling it from the rest of the application and handing off to C++ code they didn't even have to write. Adding pluggable storage engines to Cassandra means that if they make the API smooth enough they can have engines in C, C++, Erlang, Go, Rust, ML, or whatever in the future without changing their front end. That's a big win even beyond this tail latency issue.

majidazimi 8 years ago | | |

Well, other than storage engine, the next big part of a database software is the query planner/optimizer which Cassandra doesn't have (due to simple KV nature of it). So there isn't much remaining. In a long term plan, rewrite them all and you have single code base and you'll benefit from mighty C++ in other components of the database. And there is still room for more optimizations: SIMD, ...

The GC problem is not limited to C*. This shit(virtual machine) is hitting the whole Hadoop stack: HDFS, Hive, Spark, Flink, Pig...

An immense number of tickets in any fairly large cluster are related in some way to GC and JVM behavior.

cmrdporcupine 8 years ago |

I remember using quite early versions of Cassandra back in an ad-tech startup I was at back in 2009 or 2010, spending unfortunate amounts of time fighting the JVM GC and trying to tune things so it behaved responsibly. It was a real problem then and I know a lot of work went into fixing GC behaviour. Then I stopped using Cassandra for work, but it's unfortunate this is still an issue?

What I took out of that is that I really feel like something like Cassandra is better suited to implementation in a language like C++ or Rust. And I believe others have since come along and done this.

I really liked the gossip-based federation in Cassandra though.

ADefenestrator 8 years ago | |

It's still an issue, but a lot less of one. 3.0 is a big improvement in terms of GC behavior. Haven't tested 3.11.x yet, but it looks in theory like a decent improvement in terms of rounding off corner cases and adding instrumentation.

estebank 8 years ago | |

It sounds like you might be interested in TiKV.

https://github.com/pingcap/tikv

cmrdporcupine 8 years ago | | |

Thanks.

Since coming to Google I don't get the opportunity to compare/evaluate/deploy tools like this anymore. Smarter people than me make choices like that :-)

bfrog 8 years ago |

Meanwhile scylladb looks like a better option for numerous reasons

yazr 8 years ago |

Or just try and benchmark Azul VM with pause-less GCs ?!

(I have used Azul in low-latency production environments. It has pros and cons but it certainly beats re-writing the storage layer... )

truth_seeker 8 years ago | |

Curious to know the cons of using it, except being commercial.

manigandham 8 years ago | | |

> except being commercial

That's the biggest, especially for when it's a Facebook company. Otherwise it works well but can be pricey.

The JVM is getting a new fully concurrent collector though called Shenandoah: https://www.google.com/search?q=shenandoah+gc

yazr 8 years ago | | |

Needs a stronger machine to be effective (more cores & more memory)

Minor configuration issues (we had a very complex environment, custom kernel, weird network stuff, JNIs)

jjirsa 8 years ago |

Nicely done! Looking forward to the pluggable storage engine.

pas 8 years ago | |

The JIRA tickets don't really shine with much hope :/

https://issues.apache.org/jira/browse/CASSANDRA-13474 [2 comments from 2017 Apr]

https://issues.apache.org/jira/browse/CASSANDRA-13475 [~100 comments, but the last one is from 2017 Nov, by the InstaG engineer]

And the Rocksandra fork is already ~3500 commits behind master, so upstreaming this will be interesting.

Oh, and the Rocksandra fork is already kind of abandoned - no commits since 2017 Dec. (which probably means this is not actually the code that runs under Instagram.)

dikanggu 8 years ago | | |

This is the rocksandra branch, https://github.com/Instagram/cassandra/tree/rocks_3.0, we develop it on top of Cassandra 3.0. It's the code we are running on our production servers.

jjirsa 8 years ago | | |

I'm a committer, I'm familiar with the JIRA ticket.

kiril-me 8 years ago | |

It would be great. But I don't think it could happen. The pluggable storage engine would greatly increase the cognitive complexity of the code.

jjirsa 8 years ago | | |

Of course it could happen. Pluggable engine has a lot of positives - not only does it enable features like this, it also helps modularize the codebase making it more testable, and there are other people who will develop storage engines for their own use case over time (look at the evolution of - for example - MySQL storage backends for examples of this).

rbranson 8 years ago |

Did you all find that there were changes to the Java heap/GC configuration that would make tuning this setup different? I imagine if most everything that "sticks" is moved off heap, the GC could be tuned more heavily for young gen throughput vs trying to balance it with long-lived objects.

dikanggu 8 years ago | |

Yeah, for Rocksandra, we are able to use much smaller heap size, and most of the objects are recycled during the young gen GC.

agnivade 8 years ago |

> We also observed that the GC stalls on that cluster dropped from 2.5% to 0.3%, which was a 10X reduction!

Umm .. shouldn't the stalls go to 0, because now you have moved to C++ ? Or is this the time it takes for the manual garbage collection to occur ?

steeve 8 years ago |

Why not use ScyllaDB ? (Serious)

cnlwsu 8 years ago | |

Answered a bit before but they have a team that knows c* well. Cassandra is proven to handle petabytes at scale in production systems.

xuanyue 8 years ago |

Is there any trade-off in replacing the LSM tree-based storage engine with the RocksDB storage engine?

irfansharif 8 years ago | |

RocksDB is also an LSM structured KV store.

welder 8 years ago |

Great, now can you fix the Python Cassandra Driver to work in a multi-threaded application environment without the connection pooling bugs and default synchronous app-blocking (vs lazy-init) connection setup?

https://github.com/datastax/python-driver
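For readers unfamiliar with the lazy-init pattern mentioned above, here is a generic sketch of a thread-safe lazy holder. It is not part of the DataStax driver and does not address the pooling bugs; `LazySession` and `fake_connect` are hypothetical names used only for this illustration. With the real driver, the factory passed in would presumably be something like `lambda: Cluster(hosts).connect()`.

```python
import threading

class LazySession:
    """Defers an expensive connect() until the first call to get()."""

    def __init__(self, connect):
        self._connect = connect      # zero-argument factory supplied by the caller
        self._session = None
        self._lock = threading.Lock()

    def get(self):
        if self._session is None:            # fast path once initialised
            with self._lock:
                if self._session is None:    # re-check while holding the lock
                    self._session = self._connect()
        return self._session

# Stand-in factory for illustration only; it records how often it is called.
calls = []
def fake_connect():
    calls.append(1)
    return object()

lazy = LazySession(fake_connect)
threads = [threading.Thread(target=lazy.get) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("connect() calls:", len(calls))
```

Constructing `LazySession` does no network work, so application startup is not blocked; the first thread that actually needs the session pays the connection cost, and the lock ensures the factory runs only once.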

ismail 8 years ago |

So question:

Any thoughts on replacing HDFS + Yarn + Hive + HBase with GlusterFS + Kubernetes + Cassandra?

ddorian43 8 years ago | |

Hbase is sync+globally sorted, while cassandra is not, so probably not.

alsadi 8 years ago |

Can we add lz4 to the blend to reduce disk IO?