Broadly speaking, this is the correct style of architecture for a database engine on modern hardware. It is vastly more efficient in terms of throughput than the more traditional architectures common in open source. It also lends itself to elegant, compact implementations. I've been using similar architectures for several years now.
While I have not benchmarked their particular implementation, my first-hand experience is that these types of implementations are always at least 10x faster on the same hardware than a nominally equivalent open source database engine, so the performance claim is completely believable. One of my longstanding criticisms of open source data infrastructure has always been the very poor operational efficiency at a basic architectural level; many closed source companies have made a good business arbitraging the gross efficiency differences.
At some point several years ago, a few people noticed that if you attack the problem of scalable distribution within a single server the same way you would in large distributed systems (e.g. shared nothing architectures) that you could realize huge performance increases on a single machine. The caveat is that the software architectures look unorthodox.
The general model looks like this:
- one process per core, each locked to a single core
- use locked local RAM only (effectively limiting NUMA)
- direct dedicated network queue (bypass kernel)
- direct storage I/O (bypass kernel)
If you do it right, you minimize the amount of silicon that is shared between processes which has surprisingly large performance benefits. Linux has facilities that make this relatively straightforward too.
As a consequence, adjacent cores on the same CPU have only marginally more interaction with each other than cores on different machines entirely. Treating a single server as a distributed cluster of 1-core machines, and writing the software in such a way that the operating system behavior reflects that model to the extent possible, is a great architecture for extreme performance but you rarely see it outside of closed source software.
As a corollary, garbage-collected languages do not work for this at all.
"The Scylla design, right, is based on a modern shared-nothing approach. Scylla runs multiple engines, one per core, each with its own memory, CPU and multi-queue NIC."
Now my question is how portable Scylla be in terms of NIC vendors?
Interesting. From: http://www.scylladb.com/technology/architecture/
Just to pull an example from memory, the ubiquitous Intel 82599 10GbE NIC silicon has up to 128 TX and RX queues in hardware. IIRC, these are bundled in pairs for direct access in virtualized environments, so in principle you could have 64 virtual cores each with their own dedicated physical hardware queue. This is almost certainly what they were talking about. That is the whole point of this feature in Ethernet silicon; it gives cores (virtual or physical) dedicate network hardware off a single NIC.
Personal pet-peeve of mine. Using "TPS" or "Transactions/sec" to measure something that is in no way transactional. Maybe ops/sec, reads/sec, updates/sec, or something...
Has it been through Jepsen yet?
All the external facing things for scylla is the same as Cassandra. That includes all the ring stuff and all network protocols.
So you should expect similar cluster behavior.
i would expect nothing.
If theire numbers were astounding with a 10, 100, 1000 node cluster, they would have published numbers with such set-ups. I call shenanigans on a report that is purposely out of line with the expected use case.
There is nothing commodity about a server with 128GB RAM.
When you introduce other nodes, you get chatter and network traffic....
Modifications to the database software must be shared, yes, but your client application is outside the reach of the AGPL and can remain proprietary.
Same trend, by the way, is in Android development.
One of the latest benchmarks I've seen is "Comparison of Programming Languages in Economics" [1] for code without any IO just number crunching, has a 1.91 to 2.69 speedup of using C++ compared to Java. So any code involving IO is going to be slower.
Replacing bad Java code with excellent machine aligned C++ a 10x speedup is possible.
[1] https://github.com/jesusfv/Comparison-Programming-Languages-...
Java has a ton of overhead that C++ doesn't. Each object has metadata which results in more "cold data" in the cache. Each object is a heap allocation (unless you're lucky enough to hit the escape analysis optimization), which again leads to less cache locality because things are distributed around memory. Then there's the garbage collector. Then bounds checking.
a) IO is such a large portion of the problem b) Hypertable isn't just way, way faster.
2. You seem to have missed the "bad Java" part, and my reply to Mechanical Sympathy.
> 1 DB server (Cassandra / Scylla)
The whole point of Cassandra is to run a cluster of servers to handle load at scale with minimal friction instead of having to buy a big single machine or spend all your time/money trying to run a clustered RDBMS. This test doesn't measure the correct thing.
Except I can launch higher than that on EC2, so that's not fact.
"Cassandra lightweight transactions are not even close to correct. Depending on throughput, they may drop anywhere from 1-5% of acknowledged writes–and this doesn’t even require a network partition to demonstrate. It’s just a broken implementation of Paxos. In addition to the deadlock bug, these Jepsen tests revealed #6012 (Cassandra may accept multiple proposals for a single Paxos round) and #6013 (unnecessarily high false negative probabilities)."
That's four bugs independent of the conflict resolution issue.
Their Java throughput is about 70% of their C++11 throughput, and that's with a pretty synthetic benchmark where there is not any logic behind those messages. Once you add in some real logic there, it gets even thinner.
They aren't doing user space networking, but that actually ought to allow Java to do even better.
Uh-huh... that's all pretty common for databases. Cassandra would fit that description.
> Never blocks on IO or page faults because all IO bypasses the kernel.
That just seems nonsensical. Sometimes, you are waiting for IO. That's just reality. It is conceivable you bypass the kernel for I/O, but that creates a lot of complexity and limitations. Near as I can tell though, they do talk to the kernel for IO.
Scylla never touches the page cache. All IO in seastar is direct IO and then scylla caches everything itself. We always know when the disk access is going to happen. The OS paging mechanism does not do a thing.
As for waiting for IO, of course IO does not complete immediately. But you can either block and wait for it, as Cassandra does (it doesn't even have the option not to in the case of the mmaped regions) or you can do something fully async like seastar that guarantees you never block waiting for IO.
By the way, I think you're replying to one of the devs of Scylla.
I'm very interested in a cluster benchmark therefor, say 10 servers, as Cassandra claims to scale very well. With a cluster IO has a higher performance impact than one server with local RAID IO.
That's why they benchmarked this workload on a 4x SSD RAID configuration :). Given that i/o bandwidth and throughput continues to increase, processor frequency isn't, and core counts are going up, it's prudent to design a system that can take advantage of this.
I'm sure there is a way to set up IO subsystems so that Cassandra becomes a huge bottleneck, but that's a pretty specialized context.
So for some abstractions such as column and other analytics databases, a sharded architecture works very well, but if you need isolated online transactions, strong consistency etc., then sharding no longer cuts it. Instead, you should rely on algorithms that minimize coherence traffic while maintaining a consistent shared-memory abstraction. In fact, there are algorithms that provide the exact same linear-scalability as sharding with an added small latency constant (not a constant factor but a constant addition) while providing much richer abstractions.
Similarly, your statement about "garbage-collected languages" is misplaced. First, there's that abstraction issue again -- if you need a consistent shared memory, then a good GC can be extremely effective. Second, while it is true that GCs don't contribute much to a sharded infrastructure (and may harm its performance), GCs and "GCed languages" are two different things. For example, in Java you don't have to use the GC. In fact, it is common practice for high-performance, sharded-architecture Java code to use manually-managed memory.
And yes, you can write high performance Java, but for whatever reasons the Cassandra codebase isn't an example of that. They just did a big storage engine rewrite and the result is slower.
And if that happens to be exactly all you need then that's great! :)
> And yes, you can write high performance Java, but for whatever reasons the Cassandra codebase isn't an example of that.
I don't know anything about the Cassandra codebase, but one thing I'm often asked is if you're not going to use the GC in Java, why write Java at all. The answer is that the very tight core code like a shard event loop turns out to be a rather small part of the total codebase. There's more code dedicated to support functions (such as monitoring and management) than the core, and relying on the GC for that makes writing all that code much more convenient and doesn't affect your performance.
The architecture at the link looks unusual compared to open source databases but it is actually a common architecture pattern in closed source databases with significant benefits, particularly when it comes to performance. There is a lot of what I would call "guild knowledge" in advanced database engine design, much like with HPC, things the small number of experts all seem to know but no one ever writes down.
It is a path dependency problem. Most open source databases were someone's first serious attempt at designing a database, a project that turned into a product. This is an excellent way to learn but it would be unrealistic to expect a thoroughly expert design for a piece of software of such complexity on the first (or second, or third) go at it. The atypical quality of PostgreSQL is a testament to the importance of having an experienced designer involved in the architecture.
Some researchers made a big splash at OSDI by doing this securely:
http://people.inf.ethz.ch/troscoe/pubs/peter-arrakis-osdi14....
As a naive person about this...
Is possible to reap some of the benefits from this kind of architecture in a managed environment (like .NET)?
If a try to implement, for example a sqlite-like/kdb+, memory only database in .NET, how far is possible to go?
Or how avoid some common traps?
I have no idea how to do any of these things. What are the system/api calls to lock a process to a kernel? How do you bypass kernel IO?
And to be fair, some people are pretty productive in modern C++. It's a shame the JNI isn't better so you can have the best of both worlds.
(for the record, Quasar is #1 on my list of libraries to try if I go back to Java)
My point is that if you're providing a shared-memory abstraction to your user (like arbitrary transactions) -- even at a very high level -- then your architecture isn't "shared-nothing", period. Somewhere in your stack there's an implementation of a shared-memory abstraction. And if you decide to call anything that doesn't use shared memory at CPU/OS level "shared nothing", then that's an arbitrary and rather senseless distinction, because even at the CPU/OS level, shared memory is implemented on top of message-passing. So the cost of a shared abstraction is incurred when it's provided to the user, and is completely independent of how it's implemented. The only way to avoid it is to restrict the programming model and not provide the abstraction. If doing that is fine for the user -- great, but there's no way to have this abstraction without paying for it.
And JNI is better now[1] (I've used JNR to implement FUSE filesystems in pure Java). JNR will serve as the basis for "JNI 2.0" -- Project Panama[2]. And thanks!