Open-sourcing a 10x reduction in Apache Cassandra tail latency(engineering.instagram.com) |
Open-sourcing a 10x reduction in Apache Cassandra tail latency(engineering.instagram.com) |
No, this is not obvious. If you have a fully concurrent GC then spending 25 out of 1000 CPU cycles on memory management does not "obviously" have an impact on your 99th percentile latency. It would primarily impact your throughput (by 2.5%), just like any other thing consuming CPU cycles.
> We defined a metric called GC stall percentage to measure the percentage of time a Cassandra server was doing stop-the-world GC (Young Gen GC) and could not serve client requests.
Again, this metric doesn't tell you anything if you don't know how long each of the pauses are. If they are at the limit infinitesimally small then you are again only measuring the impact on throughput, not latency.
Certainly, GCs with long STW pauses do impact latency, but then you need to measure histograms of absolute pause times, not averages of ratios relative to application time. That's just a silly metric.
And neither does the article mention which JVM or GC they're using. Absent further information they might have gotten their 10x improvement relative to some especially poor choice of JVM and GC.
It's pretty much accepted everywhere that GCs perform terribly for databases. Modern GCs are great at handling small, very short-lived memory allocations, and that's about it. Just about any other workload and manual memory management ends up being a much better use of your time than GC tuning.
That is not a given. And, even distribution is only part of the equation. If they are sufficiently short, then even being somewhat unevenly distributed should not have much of an impact on latency. For example, if the max length of a pause were 1ms, and 99p latency were 15ms, you'd have to be fairly unlucky to see a 33% increase in latency99 due to GC. That would entail 5 of 25 pauses happening during a 20ms period in a 1s window.
(This idea is not purely hypothetical. For example, Go's GC has very low STW periods.)
> It's pretty much accepted everywhere
Eh. Apparently everyone thinks C is the best language for cryptography and other secure but not particularly perf sensitive code. Go figure. Sometimes the wisdom of the masses is not wisdom. Best not to appeal to it during argumentation.
Cassandra is a constant struggle with the GC. I’d guess the cost of running it is at least an order of magnitude greater compared to if it had been implemented in c++ or something more sensible.
this thing you built and open sourced, has gotten you real measurable results? allow me to list the many ways you’re probably wrong and doing it incorrectly
I try to understand the meaning. Is it saying the latency caused be GC is applied to all requests, not just the ones that observe 99th percentile latency?
So yes, at the `infinitesimally small` end, time would be 'stolen' evenly from all request threads and would not be a contributing factor to the 99th percentile.
A concurrent GC spends CPU cycles on different cores to do its work, which means it will not cause latency outliers in the threads processing the requests. They are still CPU cycles you don't have to serve other requests, hence they still affect throughput.
That is a simplified explanation of course, there are a lot of caveats.
In my original post I was mostly speaking about the measurement though, since they are measuring throughput when they are concerned about latency, those are somewhat related but depending on circumstances only weakly so.
https://academy.datastax.com/content/distributed-data-show-e...
Is my guess here correct, or are there things I'm missing or mistaken on?
I wonder how the numbers would have looked with the new low latency GC for Hotspot (ZGC). https://wiki.openjdk.java.net/display/zgc/Main
Early results from SPECjbb2015 are impressive. https://youtu.be/tShc0dyFtgw?t=5m1s
Any advice?
If it was as simple as tweaking a few GC settings to get 10x improvement pretty sure Datastax would've done it by now.
But, what? It turns out the article is really about replacing Java with C++.
They managed to generalize one application that meets their feature needs to be a front end to another existing application with fewer features but better performance as a back end. They're optimizing their hot path by decoupling it from the rest of the application and handing off to C++ code they didn't even have to write. Adding pluggable storage engines to Cassandra means that if they make the API smooth enough they can have engines in C, C++, Erlang, Go, Rust, ML, or whatever in the future without changing their front end. That's a big win even beyond this tail latency issue.
The GC problem is not limited to C*. This shit(virtual machine) is hitting the whole Hadoop stack: HDFS, Hive, Spark, Flink, Pig...
Immense number of tickets in any fairly large cluster is related somewhat to GC and JVM behavior.
What I took out of that is that I really feel like something like Cassandra is better suited to implementation in a language like C++ or Rust. And I believe others have since come along and done this.
I really liked the gossip-based federation in Cassandra though.
Since coming to Google I don't get the opportunity to compare/evaluate/deploy tools like this anymore. Smarter people than me make choices like that :-)
(I have used Azul in low-latency production environments. It has pros and cons but it certainly beats re-writing the storage layer... )
That's the biggest, especially for when it's a Facebook company. Otherwise it works well but can be pricey.
The JVM is getting a new fully concurrent collector though called Shenandoah: https://www.google.com/search?q=shenandoah+gc
Minor configuration issues (we had a very complex environment, custom kernel, weird network stuff, JNIs)
https://issues.apache.org/jira/browse/CASSANDRA-13474 [2 comments from 2017 Apr] https://issues.apache.org/jira/browse/CASSANDRA-13475 [~100 comments, but the last one is from 2017 Nov, by the InstaG engineer]
And the Rocksandra fork is already ~3500 commits behind master, so upstreaming this will be interesting.
Oh, and the Rocksandra fork is already kind of abandoned - no commits since 2017 Dec. (which probably means this is not actually the code that runs under Instagram.)
Umm .. shouldn't the stalls go to 0, because now you have moved to C++ ? Or is this the time it takes for the manual garbage collection to occur ?
Any thoughts on replacing HDFS + Yarn + Hive + HBASE with GulsterFS + Kubernetes + Cassandra
??
This hybrid approach gives the benefit of a managed runtime and safety of GC for most of your code, but allows the performance of raw pointers/malloc for key code paths.
Some examples of this pattern on the JVM:
- The Neo4j Page Cache, Muninn, https://github.com/neo4j/neo4j/blob/3.4/community/io/src/mai...
- The Netty projects implementation of jemalloc for the JVM: https://github.com/netty/netty/blob/4.1/buffer/src/main/java...
Everything is relative, I guess.
I wonder how long before we see a similar result from ElasticSearch. (Only other huge JVM based store I can think of).
We still have "other" things on-heap. The biggest contributor to GC pain tends to be the number of objects allocated on the read path, so this patch works around that by pushing much of that logic to rocksdb.
There are certainly other things you can do in the code itself that would also help - one of the biggest contributors to garbage is the column index. CASSANDRA-9754 fixes much of that (jira is inactive, but the development work on it is ongoing).
When you already know Cassandra, and you already know RocksDB, and you already have an engineering team, it makes far more sense to combine the two things you know how to use at scale than to try to use some new thing NOBODY has run at scale.
Outbrain uses ScyllaDB in production at scale across multiple data centers. Not sure if it's Instagram scale, but still enough to prove it's reliability and performance.
https://www.outbrain.com/techblog/2016/08/scylladb-poc-not-s...
Scylla is AGPL for the OSS version though so testing it out would not be an option without getting a commercial license first.
Huh? The AGPL is not a non-commercial-use-only license.
If you have proprietary software that you would like to combine with AGPL code (i.e., not interact with as a service) and is available to the general public over the Internet, and you want keep your code proprietary, sure, you may not want to use the AGPL. But you could say the same thing about proprietary software you want to combine with GPL code and sell to the general public.
If you're either using the software through it's existing defined public interfaces, or you're okay releasing anything you modify or link into the software, the AGPL (and the GPL) are fine. Lots of people distribute proprietary products that include GPL code, like Chromebooks, Android phones, routers, GitHub Enterprise, etc. We figured out years ago that the Linux kernel is not just a non-commercial product. Why are we having the same misconceptions about the AGPL?
The issue isn't so much Java the language, as it is being aware of the GC, and developing with it in mind.
I don't believe one would need a commercial license just to test a product in any way? They are not making that part of any product at that point, so no concerns here.
Noone claims that a product using the MySql driver is a derivative work of the MySql server?
Edit: Of course, IANAL...
- how many experienced scylladb devops are there globally that we can hire?
Those questions asked at BigTechCo before it adopts somebody elses tech.
FB already operates RocksDb and Cassandra so there's way less technical, career, financial risk for just hacking the two together with some aggressive refactoring.
Makes sense ? Another project you may look at is https://github.com/phaistos-networks/Trinity which is a library and should be simpler than a whole framework/db (trinity is like lucene/rocksdb compared to full-db cassandra/scylladb).
1. https://wiki.openjdk.java.net/display/shenandoah/Main 2. https://wiki.openjdk.java.net/display/zgc/Main
And upstream Cassandra is already 11 minor releases away. Won't that become a problem with something as fundamental/low-level as pluggable storage engines?
That said: the IG folks certainly know how to tune JVMs. There are IG (and former IG, I saw rbranson post) folks in this thread that know how to tune the collectors, so assume that the 10x they see is beyond what you'd get from simple tuning.
A permanent memory block on the JVM heap can't use pointers to refer to it, since GC moves objects around. And even though those blocks will never be collected, they make up additional work for the GC to track.
This is typically handled by https://en.wikipedia.org/wiki/Write_barrier#In_Garbage_colle...
"The GNU General Public License permits making a modified version and letting the public access it on a server without ever releasing its source code to the public.
The GNU Affero General Public License is designed specifically to ensure that, in such cases, the modified source code becomes available to the community. It requires the operator of a network server to provide the source code of the modified version running there to the users of that server. Therefore, public use of a modified version, on a publicly accessible server, gives the public access to the source code of the modified version."
But for testing, I don't see any impediment.
There are some meaningful changes between 3.0 and 3.11 (notably a compressed chunk cache for storing some intermediate data blocks and a significant change to the way the column index is deserialized) that do help tail latencies, and there's certainly quite a bit more low hanging fruit, but the biggest contributor to p99 latencies is the GC collections, and the read path still contributes the most JVM garbage, so this is still probably a meaningful improvement over 3.11.
Spending 2.5% of cycles on GC doesn't mean those cycles are perfectly distributed. The distribution of GC work onto cores is bound to have some cores doing more work than others, which would (I would think) manifest itself as some requests that land on an unlucky core getting more latency. Isn't it totally expected that this would cause a P99 latency spike?
And you could also use more CPU cores than request workers, that way you will always have spare core capacity and thus your latency will not be directly impacted. That is if you really really value latency more than throughput.
Again, my main point is that throughput and latency are not the same thing. There is some relation in so far that you cannot fulfill latency promises if your throughput is insufficient and your queues start filling up. But below the saturation point it's a lot more complicated, especially in parallel systems with bursty arrivals.
That makes sense, how about GC for single-threaded languages, e.g. Nodejs?
They will write c++ in java eventually. Depending on how much performance you REALLY need.
The same for elasticseach, if you want performance you need to do the same thing scylladb did to cassandra (per-core-sharding, skip filesystem across cores etc)
There are blog posts speeding lucene by 2x+ by changing some stuff to c/c++. There are libraries (trinity) claiming 2x+ performance .
There is google-engineer saying "bigtable is 3x faster than hbase" that I've read.
I think it's DataStax. Databricks is the company behind Spark.
That's definitely not what I was suggesting in my comment.
On the flip side, recognizing when you are not efficiently using your time to solve a problem with the current approach and deciding to switch to another one is what I would consider a vital part of good software engineering, and also one of the hardest things to become good at without a lot of experience.
I'm not sure that's the case here, but I don't doubt it's possible they could easily have wasted more time testing and tweaking Java than it took to implement this solution. Whether that's how it would have (or possible even did to a degree) played out is an open question.
//Edit: removed 2nd part, which wasn't really that important anyway...
So you actually can't release a permissively licensed client for an AGPL server. I mean, they did, clearly, but the AGPL itself would seem to make that inconsistent.
But then none of this has ever been litigated and both the AGPL and GPL themselves are very confusingly worded so shrug.
So you must offer the source of the database everyone who connects to the database over the network, under the AGPL. But if you deliver a web app, not a database-as-a-service, your users don't connect to the database. And since this database uses the Cassandra protocol, I'd say your web app isn't a derived work of the database in any way.
Of course, that last part is the sticky bit. But if applications using database servers via a well defined protocol are judged to be derived works, we might have other problems - hence the reference to MySql in my first post.
MongoDB muddied the waters here by deciding to interpret the AGPL differently, but I wouldn’t risk your business on it.
At the end of the day, it's not worth risking yourself (or your company) when the owners of the library claims a software license works a certain way and you disagree. Sure you might be right and you might even prevail in court, but the potential legal fees usually aren't worth the trouble.
I ran into this issue when I was selecting a library to generate PDFs for my internship over the summer: https://itextpdf.com/AGPL
Requiring GPL/AGPL software as a dependency even if you don’t link to it, but instead talk to it over the network does not mean you haven’t developed a derivative work in terms of the letter and spirit of the license. This is in-part why I steer 100% clear of MongoDB, there’s nothing stopping them from changing their view on the license and deciding to pursue legal action against people who use it in non-AGPL compatible manners down the road.
The wording is ambiguous and as far as I know there have been no court cases yet that have yet to define what constitutes a connection between the end user and whether transitive connections count. If it's ambiguous to a software developer then corporate lawyers are definitely going to say no.
[1] https://news.ycombinator.com/item?id=16523858
EDIT: I realize that AGPL is valid for commercial use but since its terms are so onerous, especially once the lawyers get involved, it effectively makes the AGPL unusable in a larger corporation.
Fake data, non-userfacing servers, sure.
If I'm trying a products evaluation license, you can be sure I've tried literally every other option under the sun first, including investigating the possibility rolling my own if situationally appropriate. No form of development is slower than the kind where I have to wait for a company in another timezone to give me permission to use their software, so it's always last on my options list unless the company has frankly amazing reviews that pique my curiosity.
I don't like the AGPL because it's unclear on this exact sort of thing, but it does seem to me like the obvious reading of "all users interacting with it remotely through a computer network" does not encompass the connection between Instagram end users and their internal Cassandra.
And, in any case, they released sources for the thing they came up with - which is all that the AGPL requires. If they're okay with doing that, they can definitely use the AGPL for production commercial software.
So if you're latency sensitive then all of your code needs to be aggressive at avoiding object creation. All of your code becomes part of "the fast path", even if it's in a different thread.
Or you isolate your fast path in a different process or a non-GC'd runtime, the later being the approach taken here by Instagram.
[1] http://15721.courses.cs.cmu.edu/spring2018/papers/02-inmemor...
Well, Java has the advantage of being platform (and to a certain degree, runtime) independent, plus a robust set of best practices and ecosystem when it comes to modules and library handling, which is pretty hard to get done right for C/C++ projects.
What is the benefit of that? Who on earth runs a DB written in Java on windows? Any useful server software will end up using platform native features, be it SQL server, MySQL, HBase, ...
I guess the weird case is that when I'm using the Instagram app, I wouldn't say I'm personally interacting with even the Instagram front-end servers (the way I am in a browser), I'm just interacting with the app which happens to use the servers. And that does sound like not what the license authors would like.
In my opinion it's a colossal waste of resources. Classic example of using the wrong tool for the job.
If you were starting a database-focused company from the beginning than choosing C++ is a better decision, which is exactly what ScyllaDB has done with their cassandra clone. Along with general algorithm and decision improvements, it'll provide 10-100x the same performance at lower latency on the same server.
We're also starting to see more projects written in Go now, which is still a managed runtime but usually better at handling these kinds of low-level systems.
.NET Core on Linux is very fast and there are some great developments around fast low-level (yet managed) managed memory manipulation that can lead to some very fast software.
Something like Java makes implementation easier, but operation more difficult and costly.
There still are lots of Windows-only shops.