A Newbie’s Guide to Cassandra

A Newbie’s Guide to Cassandra (blog.insightdatascience.com)

119 points by ddrum001 8 years ago | 24 comments

nemothekid 8 years ago |

One thing that Cassandra doesn't have a good story for, and that intro guides continue to gloss over, is the ops situation. I've recently moved some of our largest Cassandra tables to BigTable for this reason. The compaction / repair / garbage collection death cycle is probably the most difficult thing to manage, and in the past 3 years of using Cassandra, managing it has gotten worse. Tools have been deprecated (like OpsCenter) and new features can exacerbate the problem. There is still no reliable way to detect when repairs have finished, and if you have a large enough table, repairs can take a week to finish. Combine that with the fact that if a table is that large, then it probably has a high write volume, meaning it has a lot of compactions as well. So you have repairs and compactions going on which thrash the heap, and now you also have a GC tuning problem.

It took a lot of experimentation to get right, but once I did, scaling started to mean smaller drives and more nodes, which meant a more expensive cluster, where I was largely paying for CPUs to repair and garbage collect data.

Other than ops, however, Cassandra is a great tool and does everything it says on the box.

Boxxed 8 years ago | |

Completely mirrors my experience with Cassandra. I think they'd have a real contender on their hands if operating a Cassandra cluster didn't basically take a full-time engineer. Its backup story is absolutely abysmal, and tooling is atrocious -- during a support incident a DataStax guy suggested I dump a table with sstable2json (or something like that) which generated a 100GB JSON file. When I pointed out that basically nothing could consume it because it was one 100GB hash object, he said "Yeah, I guess no one ever uses this stuff."

jjirsa 8 years ago | | |

As a long-time Cassandra user: people use sstable2json all the time, but most people don't have 100GB sstables (or 20GB sstables that make 100GB of JSON).

Certainly something we can do better - how would you break it up? Adding a key to dump an individual partition to json?
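
One way to picture the "break it up" idea is a dump that writes one JSON object per partition per line, plus a variant that dumps a single partition by key. This is only a sketch under stated assumptions: `dump_partitions`, `dump_one_partition`, and the `(key, rows)` input shape are hypothetical and are not part of sstable2json or any Cassandra tool.

```python
import io
import json


def dump_partitions(partitions, out):
    """Write one JSON object per line, one line per partition.

    `partitions` is any iterable of (partition_key, rows) pairs; it is a
    hypothetical stand-in for whatever an SSTable reader would yield.
    """
    count = 0
    for key, rows in partitions:
        out.write(json.dumps({"key": key, "rows": rows}) + "\n")
        count += 1
    return count


def dump_one_partition(partitions, wanted_key, out):
    """Write only the partition matching `wanted_key`; return True if found."""
    for key, rows in partitions:
        if key == wanted_key:
            out.write(json.dumps({"key": key, "rows": rows}) + "\n")
            return True
    return False


# Toy data standing in for a table's partitions.
sample = [
    ("J", [{"last_name": "Jackson"}, {"last_name": "Jensen"}]),
    ("S", [{"last_name": "Smith"}]),
]

buf = io.StringIO()
written = dump_partitions(sample, buf)
lines = buf.getvalue().splitlines()
```

Because each line is a complete JSON document, a consumer can stream the file line by line instead of parsing one giant object.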

schmichael 8 years ago |

This article does a massive disservice by using the pre-CQL Column Family and Row terminology. While it's the Cassandra data modelling I'm the most accustomed to personally, it causes endless confusion for users who find themselves in the CQL documentation trying to understand how it all maps to Primary Keys, Partition Keys, static columns, etc.

This transition has been causing confusion for at least 5 years now, and it appears people are still using the old terminology! https://www.datastax.com/dev/blog/thrift-to-cql3

pfarnsworth 8 years ago | |

As a recent user of Cassandra, I found exactly this to be a huge problem. Any type of googling would return too many different terms, and the relevance or context were completely missing so I was confused for a while, until I realized that the terms changed. The unintended consequence of such a quick change in terminology is that it makes for a very hard experience for newbies.

jjirsa 8 years ago | | |

It's not really that quick of a change - it's been in flight for something like 4 years? Maybe 5? And thrift is still supported until 4.0, so you can use the old style for quite some time.

mi100hael 8 years ago |

> Cassandra’s data model is a partitioned row store with tunable consistency where each row is an instance of a column family that follows the same schema

"Total Newbie" apparently means well-versed in database paradigms and terminology.

jjirsa 8 years ago | |

It's not you, it's the author. That's a horrible way to describe it.

It's partitioned: Cassandra is a clustered database that will automatically route data to the right nodes. It does this by partitioning a token ring among members of the cluster. If you need more capacity, you add nodes and they claim more of the "token ring".
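
A toy model of that token ring, as a rough sketch: each node owns the range of tokens up to its own token, and a key goes to the first node whose token is at or past the key's token. The 32-bit ring, the MD5-based `token_for`, and the `TokenRing` class are assumptions made for illustration; Cassandra's real partitioners and virtual nodes differ in detail.

```python
import bisect
import hashlib

RING_SIZE = 2 ** 32  # toy token space; an assumption for this sketch


def token_for(key):
    """Map a partition key to a token on the ring (MD5 is illustrative only)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


class TokenRing:
    def __init__(self):
        self._tokens = []   # sorted list of node tokens
        self._owners = {}   # token -> node name

    def add_node(self, name, token):
        """Place a node on the ring at the given token."""
        bisect.insort(self._tokens, token)
        self._owners[token] = name

    def owner(self, key):
        """Return the node at the first token >= the key's token, wrapping around."""
        t = token_for(key)
        i = bisect.bisect_left(self._tokens, t)
        if i == len(self._tokens):
            i = 0  # past the last node: wrap to the start of the ring
        return self._owners[self._tokens[i]]


ring = TokenRing()
ring.add_node("a", RING_SIZE // 2)
ring.add_node("b", RING_SIZE - 1)
```

Adding a third node, say `ring.add_node("c", RING_SIZE // 4)`, only takes over keys whose tokens fall in the slice it claims (previously owned by "a"); every other key stays where it was.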

It's a row store: Cassandra groups data into partitions (see above), and the partition determines which hosts get the data. Within each partition, Cassandra sorts the CQL rows based on your schema. If you had a table of "employees", you could have them partitioned by last initial, and then clustered by last name - all of the employees with last name starting with "J" would be on the same machines, and then they'd be sorted on disk "Ja...", "Je...", etc.
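
The employees example can be sketched as a small in-memory model: the last initial picks the partition, and rows inside a partition are kept sorted by last name so a range of adjacent names is one contiguous slice. `EmployeesTable` and its methods are hypothetical names for this illustration, not a Cassandra API.

```python
import bisect
from collections import defaultdict


class EmployeesTable:
    """Toy model: partition key = last initial, clustering key = last name."""

    def __init__(self):
        self._partitions = defaultdict(list)  # initial -> sorted last names

    def insert(self, last_name):
        partition = self._partitions[last_name[0].upper()]
        bisect.insort(partition, last_name)  # keep rows sorted in the partition

    def partition(self, initial):
        """All rows in one partition, already in clustering order."""
        return list(self._partitions.get(initial, []))

    def slice(self, initial, start, end):
        """Rows in one partition with start <= last_name < end."""
        rows = self._partitions.get(initial, [])
        lo = bisect.bisect_left(rows, start)
        hi = bisect.bisect_left(rows, end)
        return rows[lo:hi]


table = EmployeesTable()
for name in ["Jones", "Smith", "Jackson", "Jensen", "Johnson"]:
    table.insert(name)
```

Here `table.partition("J")` returns the four "J" names in sorted order, and `table.slice("J", "Je", "Jo")` returns just `["Jensen"]` without touching any other partition.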

theflork 8 years ago | |

Agreed. If anyone has recommended resources for someone coming from the SQL world and wanting to learn more about databases in the Cassandra and HBase space, please share!

jjirsa 8 years ago | | |

DataStax Academy is probably the best free source.

Searching YouTube for Cassandra Summit talks is probably second.

There was a push to do some better docs on the ASF website, but the manpower is currently spent writing code instead - we have no real full-time doc writers who focus on the open source product. Maybe some day someone will volunteer (and if you want to volunteer, I'll commit the docs for you - the site has a how-to-contribute guide, but honestly I'll take GitHub PRs if they're nontrivial even though it's an annoying workflow for our non-GitHub master).

errantmind 8 years ago |

These days I work with Cassandra on a daily basis. The company I am contracting with switched to Cassandra a while back for their primary data store. A few poor decisions later and they were spending tens of thousands of dollars a month running Cassandra in Azure. The cost was high because they modeled and queried their data like they were still using a SQL database, which was incredibly inefficient.

The lesson here is to think long and hard about how you are going to access your data before switching to a database like Cassandra. This will help you decide if Cassandra is the right database to fit your use-cases. If so, be sure to model your data appropriately.

In this case, based on how the company wants to query the data, they would have been better off with PostgreSQL.

mi100hael 8 years ago | |

> The cost was high because they modeled and queried their data like they were still using a SQL database which was incredibly inefficient.

That's literally every Cassandra database I've ever encountered in the wild.

If you use Cassandra, you WILL need to duplicate data across tables for lookups. Don't use Cassandra if you can't stomach that fact (and the disk bills that come with it).
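
As a rough sketch of what that duplication looks like, here is a toy store that keeps one "table" per lookup pattern and writes every row to each of them. `UserStore` and the field names are made up for illustration; a real schema would use separate Cassandra tables keyed by the column each query filters on.

```python
class UserStore:
    """Toy model of query-first modeling: one 'table' per lookup pattern."""

    def __init__(self):
        self.users_by_id = {}     # lookup: user_id -> user
        self.users_by_email = {}  # lookup: email -> user (a full duplicate copy)

    def add_user(self, user_id, email, name):
        row = {"user_id": user_id, "email": email, "name": name}
        # Every write goes to each table that serves a query.
        self.users_by_id[user_id] = dict(row)
        self.users_by_email[email] = dict(row)

    def by_id(self, user_id):
        return self.users_by_id.get(user_id)

    def by_email(self, email):
        return self.users_by_email.get(email)


store = UserStore()
store.add_user(1, "ann@example.com", "Ann")
```

Both lookups are single-key reads, at the price of storing the row twice and keeping the copies in sync on every write.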

bulldoa 8 years ago | |

Any recommendations or resources to read up on when to use Cassandra and how to design the schema?

jjirsa 8 years ago | | |

Use Cassandra when you need real-time HA across datacenters without having to manually fail over.

Use Cassandra when you're going to need to grow your database cluster often and don't have tooling to handle resharding.

Use Cassandra when you do millions of simple queries (per second), not a handful of complex JOINs.

I've used Cassandra at 3 different employers now, and I can't imagine using anything else for many use cases, but there will always be some where it's the wrong choice.

sheeshkebab 8 years ago | | |

When you need a key-value store that can replicate across multiple data centers (or AWS regions) easily, mostly consistently, and with low latency, in multi-master setups.

In all other cases you'll probably be better off with Postgres, MySQL or similar.

trashcan 8 years ago | | |

Check out their training:

https://academy.datastax.com/courses

Once you make it past the videos trying to sell you on NoSQL, they are incredibly informative.

bkeroack 8 years ago |

CQL is the best and worst thing about Cassandra. The pro is that it is obviously very similar to SQL, so it's easy to understand; the con is that C* is nothing like an RDBMS, so you can easily be fooled into doing dumb/inefficient things with the nice CQL syntax.

I think that Cassandra is best thought of as a fancy K/V store that lets extra data ride along with query results. Don't think of rows/columns at first, it will just screw you up in your modeling. Also keep in mind that the cost for very fast queries is a lot of extra time spent figuring out how to model new data access patterns in the future.

Boxxed 8 years ago | |

I'd say it's absolutely the worst. For about ten seconds it seems like it's a nice higher level abstraction, but then you realize how that abstraction is hiding exactly what you care about: how your data maps to the underlying storage. Making it look like SQL was also a huge mistake because it gives the impression of a certain level of expressivity that it by no means has.

jjirsa 8 years ago | |

The clustering provides a ton of speed improvements if your data model can keep partition sizes small. Especially if you use spinning disks, Cassandra's storage engine will make reading adjacent rows in a partition nearly free (even more so with modern OS behavior like readahead).