A Newbie’s Guide to Cassandra (blog.insightdatascience.com)
It took a lot of experimentation to get right, but once I did, scaling started to mean smaller drives and more nodes. That meant a more expensive cluster, in which I was largely paying for CPUs to repair and garbage-collect data.
Other than ops, however, Cassandra is a great tool and does everything it says on the box.
Certainly something we can do better - how would you break it up? Adding a key to dump an individual partition to json?
This transition has been causing confusion for at least 5 years now, and it appears people are still using the old terminology! https://www.datastax.com/dev/blog/thrift-to-cql3
"Total Newbie" apparently means well-versed in database paradigms and terminology.
It's partitioned: Cassandra is a clustered database that will automatically route data to the right nodes. It does this by partitioning a token ring among members of the cluster. If you need more capacity, you add nodes and they claim more of the "token ring".
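To make the token ring idea concrete, here is a toy sketch in Python. It is not Cassandra's implementation: the `TokenRing` class, the method names, and the MD5-based `token_for` hash are all made up for illustration (Cassandra's default partitioner uses Murmur3). It only shows the routing rule: a key hashes to a token, and the first node token at or after it (wrapping around) owns the key.

```python
import bisect
import hashlib


def token_for(key: str) -> int:
    """Hash a partition key to a token (MD5 here is a stand-in, not Cassandra's hash)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class TokenRing:
    """Hypothetical toy ring: each node claims one or more tokens."""

    def __init__(self):
        self._tokens = []   # sorted list of claimed tokens
        self._owners = {}   # token -> node name

    def add_node(self, node: str, tokens):
        # A new node claims tokens, taking over part of its neighbours' ranges.
        for token in tokens:
            bisect.insort(self._tokens, token)
            self._owners[token] = node

    def owner_of_token(self, token: int) -> str:
        # The owner is the first claimed token >= the given token,
        # wrapping around to the start of the ring if there is none.
        index = bisect.bisect_left(self._tokens, token)
        if index == len(self._tokens):
            index = 0
        return self._owners[self._tokens[index]]

    def owner_of_key(self, key: str) -> str:
        return self.owner_of_token(token_for(key))


ring = TokenRing()
ring.add_node("node-a", [100])
ring.add_node("node-b", [200])
before = ring.owner_of_token(120)   # owned by node-b before the new node joins
ring.add_node("node-c", [150])      # adding capacity: node-c claims part of the ring
after = ring.owner_of_token(120)    # now owned by node-c
```

Only the keys in the range the new node claimed change owners, which is why adding nodes doesn't require reshuffling the whole data set.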
The row store: Cassandra groups data within partitions (see above) which determines which hosts get the data. Within each partition, Cassandra sorts the CQL rows based on your schema. If you had a table of "employees", you could have them partitioned by last initial, and then clustered by last name - all of the employees with last name starting with "J" would be on the same machines, and then they'd be sorted on disk "Ja...", "Je...", etc
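The "employees" example can be mimicked in a few lines of Python. This is only a mental model, not how Cassandra stores data: `EmployeesTable` and its methods are hypothetical names, the partition key is assumed to be the last initial, and rows inside a partition are kept sorted by last name.

```python
import bisect
from collections import defaultdict


class EmployeesTable:
    """Toy model: partition by last initial, cluster (sort) by last name."""

    def __init__(self):
        # partition key -> rows kept in sorted order
        self._partitions = defaultdict(list)

    def insert(self, last_name: str, first_name: str) -> None:
        partition_key = last_name[0].upper()
        # insort keeps the partition ordered by (last_name, first_name)
        bisect.insort(self._partitions[partition_key], (last_name, first_name))

    def read_partition(self, initial: str):
        # Reading one partition returns rows already in clustering order.
        return list(self._partitions.get(initial.upper(), []))


table = EmployeesTable()
table.insert("Jones", "Carol")
table.insert("Jackson", "Alice")
table.insert("Smith", "Dave")
table.insert("Jensen", "Bob")
j_rows = table.read_partition("J")
```

Reading the "J" partition touches one group of rows and gets them back in sorted order for free; asking for "everyone named Bob" would mean scanning every partition, which is the kind of query this layout is not built for.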
Searching YouTube for Cassandra summit talks is probably second
There was a push to do some better docs on the ASF website but it's just manpower that is currently spending time writing code instead - we have no real full time doc writers that focus on the open source product. Maybe some day someone will volunteer (and if you want to volunteer, I'll commit the docs for you - the site has a how to contribute guide, but honestly I'll take GitHub PRs if they're nontrivial even though it's an annoying workflow for our non-GitHub master).
The lesson here is to think long and hard about how you are going to access your data before switching to a database like Cassandra. This will help you decide if Cassandra is the right database to fit your use-cases. If so, be sure to model your data appropriately.
In this case, based on how the company wants to query the data, they would have been better off with PostgreSQL.
That's literally every Cassandra database I've ever encountered in the wild.
If you use Cassandra, you WILL need to duplicate data across tables for lookups. Don't use Cassandra if you can't stomach that fact (and the disk bills that come with it).
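A rough sketch of what "duplicate data across tables for lookups" means in practice, using plain Python dicts as stand-ins for tables. The `UserStore` class and the table names `users_by_id` and `users_by_email` are hypothetical; the point is only that each access pattern gets its own copy of the row, written at insert time.

```python
class UserStore:
    """Toy model of query-per-table denormalisation."""

    def __init__(self):
        # One "table" per lookup pattern, each holding a full copy of the row.
        self.users_by_id = {}
        self.users_by_email = {}

    def add_user(self, user_id: int, email: str, name: str) -> None:
        row = {"user_id": user_id, "email": email, "name": name}
        # Write the same data twice: once per way we intend to look it up.
        self.users_by_id[user_id] = dict(row)
        self.users_by_email[email] = dict(row)

    def get_by_id(self, user_id: int):
        return self.users_by_id.get(user_id)

    def get_by_email(self, email: str):
        return self.users_by_email.get(email)


store = UserStore()
store.add_user(1, "ada@example.com", "Ada")
by_id = store.get_by_id(1)
by_email = store.get_by_email("ada@example.com")
```

Both lookups are a single key fetch, but the row now lives on disk twice, and every new access pattern means another copy to write and keep in sync.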
Use Cassandra when you're going to need to grow your database cluster often and don't have tooling to handle resharding
Use Cassandra when you do millions of simple queries (per second), not a handful of complex JOINs
I've used Cassandra at 3 different employers now, and I can't imagine using anything else for many use cases, but there will always be some where it's the wrong choice.
In all other cases you'll probably be better off with Postgres, MySQL or similar.
https://academy.datastax.com/courses
Once you make it past the videos trying to sell you on NoSQL, they are incredibly informative.
I think that Cassandra is best thought of as a fancy K/V store that lets extra data ride along with query results. Don't think of rows/columns at first, it will just screw you up in your modeling. Also keep in mind that the cost for very fast queries is a lot of extra time spent figuring out how to model new data access patterns in the future.
Anywhoo, one huge file is fine, what's not fine is having one huge json object -- streaming parsers might be ubiquitous in the XML world, but definitely not in json land. Something simple like small json documents separated by newlines would work.
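For what it's worth, the "small json documents separated by newlines" format is easy to produce and consume with a stock parser. A minimal Python sketch, assuming each partition can be represented as a dict; `dump_partitions` and `iter_partitions` are hypothetical helper names, not part of any Cassandra tool:

```python
import io
import json


def dump_partitions(partitions, out) -> None:
    """Write one compact JSON document per line."""
    for partition in partitions:
        # json.dumps escapes newlines inside strings, so each document stays on one line.
        out.write(json.dumps(partition, sort_keys=True))
        out.write("\n")


def iter_partitions(src):
    """Yield one parsed document at a time, never holding the whole file in memory."""
    for line in src:
        line = line.strip()
        if line:
            yield json.loads(line)


partitions = [
    {"key": "J", "rows": [{"last_name": "Jackson"}, {"last_name": "Jones"}]},
    {"key": "S", "rows": [{"last_name": "Smith"}]},
]
buffer = io.StringIO()
dump_partitions(partitions, buffer)
buffer.seek(0)
round_tripped = list(iter_partitions(buffer))
```

The reader only ever parses one line's worth of JSON, so the file can be arbitrarily large without needing a streaming parser.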