Migrating from MongoDB to Cassandra

Migrating from MongoDB to Cassandra (fullcontact.com)

114 points by jeremiahjordan 12 years ago | 63 comments

DigitalSea 12 years ago |

When I read posts like this all but confirming MongoDB isn't the great product 10Gen make it out to be, I wonder how the heck MongoDB are still even relevant and then I remind myself of the fact that 10Gen have one of the best marketing and sales teams in the game at the moment.

While MongoDB has improved greatly over previous versions, I can't help but feel if 10Gen put as much effort into improving their product as they do selling it, Mongo would be a force to be reckoned with!

MongoDB is good at some things, but I think most people that try and fail with it fall into one of two camps: 10Gen sold them into it or they bought into the hype without assessing project requirements and ensuring MongoDB was a sensible choice.

rogerbinns 12 years ago | |

There is one thing MongoDB does spectacularly well - you can feed in arbitrary JSON and get the same JSON back out. (No need to define schemas or play any kind of system/db administrator.) Even the queries have the same "shape" as JSON, so no need for another arbitrary query language.
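To make the "queries have the same shape as JSON" point concrete, here is a minimal in-memory sketch of query-by-example matching. It is not MongoDB's query engine: the `matches` helper and the sample song records are hypothetical, and the recursive handling of nested fields is a simplification of how real sub-document matching behaves.

```python
# Hypothetical in-memory illustration; this is not MongoDB's query engine.
def matches(document, query):
    """Return True if every field in `query` equals the same field in `document`.

    Nested dicts in the query are matched recursively, which is a
    simplification of real sub-document matching.
    """
    for key, expected in query.items():
        if key not in document:
            return False
        actual = document[key]
        if isinstance(expected, dict) and isinstance(actual, dict):
            if not matches(actual, expected):
                return False
        elif actual != expected:
            return False
    return True


# Made-up records standing in for a collection of JSON documents.
songs = [
    {"title": "Song A", "artist": {"name": "X", "country": "UK"}, "year": 1999},
    {"title": "Song B", "artist": {"name": "Y", "country": "US"}, "year": 2004},
]

# The query is itself a JSON-shaped document, not a separate query language.
query = {"artist": {"country": "UK"}}
hits = [s["title"] for s in songs if matches(s, query)]
print(hits)
```

The stored records and the query are the same kind of object, which is the convenience being described here.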

It will eventually bite you, and bite you hard. But you'll be well into the millions of records before that happens. Developers and products below that number will have very smooth sailing. And some live there permanently. One project I worked on years ago involved a music catalogue. Did you know there are only about 20 million songs?

The main problem is things get very painful as you get bigger, especially for writes. A doubling of write activity can lead to calamitous drops in performance. This is especially bizarre as the data model means they can easily have multiple concurrent writers. Heck having a lock per 2GB data file would quickly help with concurrency.

They have this same "single" approach in other places. For example building an index is single threaded. I did a restore the other day and then had to wait 8 days while it rebuilt indexes. One cpu was pegged but everything else was idle!

It also consumes huge amounts of space - at least double the size of the same data in JSON. There are known fixes: https://jira.mongodb.org/browse/SERVER-164 https://jira.mongodb.org/browse/SERVER-863 (note how popular they are and how many years they have been open!)

I wish they would focus on making better use of the resources available - it should be possible to max out cpu, RAM and I/O.

We've ended up in the same situation as the article, figuring out where to migrate to with Cassandra being the front runner.

leif 12 years ago | | |

It's telling that those jira tickets are both in the top five most voted on open server tickets and are from 2010 and before. (I know you know this roger, but) TokuMX completely resolves both of them.

yeukhon 12 years ago | | |

Yes. It does bite you quickly. As you add more models you start to duplicate a lot more information, and by that time you'd think relational makes sense, but you have to continue to use MongoDB. Your options are either to embed or to reference. And still, there is no JOIN in Mongo, so you iterate over many collections and combine the results within your application code.
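A rough sketch of what combining in application code can look like, assuming two collections linked by reference. The `users` and `orders` lists and the `join_orders_to_users` helper are hypothetical stand-ins; plain lists of dicts are used instead of a real driver.

```python
from collections import defaultdict

# Hypothetical data standing in for two collections linked by reference.
users = [
    {"_id": 1, "name": "alice"},
    {"_id": 2, "name": "bob"},
]
orders = [
    {"_id": 10, "user_id": 1, "total": 25},
    {"_id": 11, "user_id": 1, "total": 40},
    {"_id": 12, "user_id": 2, "total": 5},
]


def join_orders_to_users(users, orders):
    """Hash join: index orders by user_id, then attach them to each user."""
    by_user = defaultdict(list)
    for order in orders:
        by_user[order["user_id"]].append(order)
    # .get() avoids creating empty entries for users without orders.
    return [{**user, "orders": by_user.get(user["_id"], [])} for user in users]


joined = join_orders_to_users(users, orders)
for row in joined:
    print(row["name"], sum(o["total"] for o in row["orders"]))
```

This is the work a JOIN would do inside the database; here every referenced collection has to be fetched and indexed by the application first.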

I think as PostgreSQL continues to improve its JSON data type, people will look at SQL again, even if they just need to get a basic model working. Because in the end, working with constraints can help. Well, either side will bite you but one has to weigh... and sure, that's a tough question.

leif 12 years ago | |

I agree. We're putting a lot of effort into this at Tokutek with TokuMX: http://www.tokutek.com/2014/02/introducing-tokumx-1-4-major-...

Personally, I think the company formerly known as 10gen is doing a lot of good work, particularly with the aggregation framework, but it's orthogonal to the improvements we've made with the storage subsystem.

chaostheory 12 years ago | |

Like any other NoSQL datastore, things like MongoDB aren't as general purpose as a RDBMS no matter how the marketing and sales teams make them out to be. Cassandra is no different.

People need to read the documentation (say what you will about mongo but 10Gen docs are pretty damn good) more thoroughly before blindly implementing stuff on NoSQL and complaining. (I've been guilty of this in the past.)

threeseed 12 years ago | | |

For me anyone that uses the term "NoSQL" to group completely different databases simply doesn't understand what they are talking about.

Cassandra and MongoDB are completely fine as general purpose databases. With CQL3 Cassandra is just as easy to use as any RDBMS with the advantage of being infinitely more scalable and easier to manage.

rdtsc 12 years ago | |

> 10Gen have one of the best marketing and sales teams in the game at the moment.

No doubt. I have 2 MongoDB mugs even though I vowed to never touch a product produced by them again (it was after the whole "we'll throw your data write requests over the wall and pray" fiasco).

bdcravens 12 years ago | | |

I have one from an event they did here. What's telling is that I'm in Houston - far outside the SF/SV magic bubble.

digitalzombie 12 years ago | | |

I got one from SCALE x10. Southern California Linux Expo

CptCodeMonkey 12 years ago | |

Knowing some of the FC people first hand, MongoDB did actually serve them fairly well for a substantial amount of time. Until they started hitting max limits, it didn't really make sense to move to c.

Going straight to C or something like it would have been almost cargo-cultish ( eg If we build industrial strength, we will get industrial levels of traffic ).

jbellis 12 years ago | | |

It sounds like you're assuming that productivity and power are opposites. Four years ago or even two, Cassandra was much harder to develop against than MongoDB, but that's not the case today.

CptCodeMonkey 12 years ago | | |

Just as a historical note, I meant to type C Star ( C* ) but HN's formatting changed c* into C. C* is shorthand for Cassandra.

threeseed 12 years ago | | |

Exactly. Can you imagine all of the Java/C developers telling Ruby developers that they are idiots and don't know anything about programming ? Simply because they choose a technology that is designed for developer productivity at the expense of scalability.

Because that's what seems to happen in every database discussion.

wobbleblob 12 years ago | |

I'm not disagreeing with you, but this is not what the article is about:

"MongoDB was not a mistake. It let us iterate rapidly and scaled reasonably well. Our mistake was when we decided to put off fixing MongoDB’s deployment, instead vertically scaling (...) By the time we had cycles to spend, it was too late to shard effectively. It would have been terribly painful and unacceptably slowed our cluster for days."

It seems they weren't sharding their data. The advantage of the popular NoSQL databases like MongoDB is that they allow easier horizontal scaling than general purpose RDBMS (though this is debatable - Postgres and Oracle allow you to make the same trade-offs as NoSQL databases, they just don't force you to).

When you read the rest of the article, they explain that when they had to make a painful transition anyway, they chose Cassandra, with a different set of advantages and disadvantages that suited their needs.

MongoDB is still relevant because it is well supported. The product is well documented, excellent tooling is available, it is widely adopted, so that when you have a question, you can often google the answer.

It has an elegant query language; especially the built-in aggregation framework is far more convenient than having to write map reduce functions for every query. It is easy to deploy and use. All these things make it a product that is pleasant to use from the point of view of a developer or DBA. I think you underestimate the importance of non-technical advantages and disadvantages.

I am just not convinced it is the best database because I just don't see a use case for a 'general purpose' NoSQL database. For general purpose storage, RDBMS like Postgres and Oracle are great. They support sharding if you really want it, and even allow indexing of unstructured data these days. They don't force you to use joins and transactions if you don't need them, but at least they support these features when you do.

tootie 12 years ago | |

As someone who works extensively in "Enterprise IT" it's the unifying factor in most successful middleware companies. I'm working with a CMS that has barely updated a single core feature in almost 10 years, consistently releases with glaring bugs and rakes in millions in license fees. How? Sales team always show a polished demo and don't give out dev licenses for evaluation.

jb007 12 years ago | |

Mongodb has marketed the database as a general purpose one when in reality it doesn't even come close to one. And the case is the same for all other NoSQL systems. No More general purpose database? The developers at http://www.amisalabs.com are tackling today's database problems.

ddorian43 12 years ago | | |

you know they have been tackling without a release for many months now

brown9-2 12 years ago | |

From TFA:

We were a young startup and made a few crucial mistakes. MongoDB was not a mistake. It let us iterate rapidly and scaled reasonably well.

A good solution for X scale might not be so good for 10X scale. But by the same token, a 10X scale solution for a 1X scale problem isn't a good idea either.

dopamean 12 years ago | | |

This is what I haven't understood about criticism of MongoDB. Don't some people have projects that don't need to scale to an incredible degree?

roycehaynes 12 years ago | |

I think part of the lesson learned from reading the post is MongoDB is best when you start sharding from the get-go, especially if you know you're going to have pounds of data.

mason55 12 years ago | | |

Rebalancing shards in MongoDB sucks, especially if you have any kind of traffic. Which means that even if you start out with just a few shards you either need to keep up with growing your number of shards so that none of them are ever more than ~30% full or else you're in for a painful reshard experience.

At least this is what happened to me and my encounters with MongoDB (nee 10gen) were unsuccessful in speeding up our resharding.
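For a sense of what the ~30% rule costs, a back-of-the-envelope sketch. The shard capacity and data sizes are made-up numbers, `shards_needed` is a hypothetical helper, and it assumes data spreads perfectly evenly, which real clusters rarely manage.

```python
import math


def shards_needed(data_gb, shard_capacity_gb, max_fill=0.30):
    """Smallest shard count that keeps every shard at or below max_fill,
    assuming data is spread evenly across shards."""
    usable_per_shard = shard_capacity_gb * max_fill
    return math.ceil(data_gb / usable_per_shard)


# Hypothetical 1000 GB shards with a growing data set.
for data_gb in (500, 1000, 2000):
    print(data_gb, "GB ->", shards_needed(data_gb, shard_capacity_gb=1000), "shards")
```

Staying under 30% full means provisioning a bit more than three times the raw capacity the data needs.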

jchrisa 12 years ago |

Viber, one of the largest over the top messaging apps, recently shared their conversion from Mongo to Couchbase. They ended up requiring less than half of the original servers, and better performance.

If you want to see a video of their engineer telling the story, it's available here: http://www.couchbase.com/presentations/couchbase-tlv-2014-co...

prottmann 12 years ago | |

Like always: "Use the right tool for the job". I do not think this was MongoDB's (10gen's) fault; they (Viber) chose the wrong database type for their needs.

pessimizer 12 years ago | | |

That's really easy to say, so it's important to show your specific reasoning.

krenoten 12 years ago | |

Usually people who get burned by hypedb think twice before making the same mistake again.

rco8786 12 years ago |

read first half of article

get spammed

leave immediately

jbeja 12 years ago | |

OMG, it made me jump a little, seriously ;).

brightsize 12 years ago | |

Same here.

Xorlev 12 years ago | | |

Author here, sorry about that.

It's not supposed to display on engineering posts but something must have changed/been broken recently. We know you guys aren't interested in marketing content.

Again, sorry about that and thanks for bringing it up.

monkey26 12 years ago |

This article caught my interest as I've been reading into Cassandra. But some previous research had me thinking that Cassandra works best with under a TB/node. Is SQL still better when you have really large nodes (16-32TB) and only really want to scale out for more storage?

I'm currently humming along happily with Postgres, but some of the distributed features, and availability of Cassandra look really nice.

jbellis 12 years ago | |

Cassandra 2.0 can handle 5TB per node easily, 10TB with some care. Best to scale out, not up.

That said, if someone else has already made the hardware choice for you, you can always run multiple C* nodes on a single machine. I know several production clusters that fit this description.
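A rough sizing sketch for the 5TB-per-node figure. The replication factor of 3 is an assumption that is not stated anywhere in this thread, `nodes_needed` is a hypothetical helper, and the estimate ignores compaction headroom and compression.

```python
import math


def nodes_needed(raw_tb, per_node_tb=5, replication_factor=3):
    """Nodes required to hold raw_tb of data when every row is stored
    replication_factor times and each node holds at most per_node_tb."""
    return math.ceil(raw_tb * replication_factor / per_node_tb)


# The 16TB and 32TB figures come from the question above; the rest is assumed.
for raw_tb in (16, 32):
    print(raw_tb, "TB raw ->", nodes_needed(raw_tb), "nodes")
```

With a different replication factor or per-node target the count changes proportionally, so treat these numbers as a starting point only.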

olavgg 12 years ago | | |

PostgreSQL can handle petabytes easily. However if you need to query a petabyte of data, then you need to rethink your solution. PrestoDB + Hive + Hadoop may be what you need.

krenoten 12 years ago | |

It's much more about desired usage patterns than amount of storage. Cassandra and RDBMS's differ quite a lot in how you replicate, consistency guarantees, performant read patterns, performant write patterns, how you handle recovery, etc... If you intend to bring anything to scale it helps to understand the strengths and weaknesses of the underlying architecture.

cnlwsu 12 years ago | |

We run at about 1TB a node and it works well (high write load things like metrics and telemetry data). But we also use SQL server where appropriate (i.e. transactional account stuff).

I am a fan of using the right tool for the right job providing you have the team to support it.

fiatmoney 12 years ago |

Sounds like the intended use case for ElasticSearch.

"Given some input piece of data {email, phone, Twitter, Facebook}, find other data related to that query and produce a merged document of that data"

Xorlev 12 years ago | |

I will say ElasticSearch features heavily in our infrastructure elsewhere, but for the Person API product, it's purely a primary-key lookup.

My coworker wrote a bit about how the search functionality of our offering works here: http://www.fullcontact.com/blog/sherlock_search_engine_that_...

That might make more sense why we do PK lookups.

brown9-2 12 years ago |

Is DynamoDB never a serious option for people in situations like this and already heavily on AWS?

Xorlev 12 years ago | |

One of the key detractors for us was the 64KB limit.

"The total size of an item, including attribute names and attribute values, cannot exceed 64KB."

While we don't often have values over 64KB, it's possible. We didn't want to have to store profiles separate from their metadata.
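A rough pre-check against that limit could look like the sketch below. It only handles flat string attributes and counts UTF-8 bytes of names plus values, following the sentence quoted above; it is not DynamoDB's exact size accounting, and `approx_item_size` and `fits` are hypothetical helpers.

```python
LIMIT_BYTES = 64 * 1024  # the 64KB limit quoted above


def approx_item_size(item):
    """Rough size: UTF-8 bytes of every attribute name plus every string value.
    Only flat string attributes are handled (a simplifying assumption)."""
    return sum(len(k.encode("utf-8")) + len(v.encode("utf-8")) for k, v in item.items())


def fits(item):
    """True if the approximate size is within the limit."""
    return approx_item_size(item) <= LIMIT_BYTES


# Made-up profiles: one small, one with an oversized field.
small = {"email": "a@example.com", "bio": "x" * 1000}
large = {"email": "a@example.com", "bio": "x" * 70000}
print(fits(small), fits(large))
```

Anything that fails a check like this would have to be split or stored elsewhere, which is the separation of profiles from metadata we wanted to avoid.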

lynchdt 12 years ago |

"To buy us time, we ‘sharded’ our MongoDB cluster. At the application layer. We had two MongoDB clusters of hi1.4xlarges, sent all new writes to the new cluster, and read from both..."

I'm curious about this. Why were you doing the sharding manually in your application layer? Picking a MongoDB shard key - something like the id of the user record - would produce some fairly consistent write-load distribution across clusters. Regardless - it seems like write-load was a problem for you, yet you sent all the write load to the new cluster - why not split it?
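For readers following along, the application-layer scheme in the quote might look roughly like this sketch. Plain dicts stand in for the two clusters, the class and method names are hypothetical, and checking the new cluster before the old one is an assumption; the article only says they read from both.

```python
class AppLayerShardedStore:
    """Writes go only to the new cluster; reads check new first, then old.
    Plain dicts stand in for the two clusters (hypothetical stand-ins)."""

    def __init__(self):
        self.old_cluster = {}
        self.new_cluster = {}

    def write(self, key, value):
        # All new writes land on the new cluster, so the old one stops growing.
        self.new_cluster[key] = value

    def read(self, key):
        # Assumed order: prefer the new cluster so a rewritten key
        # shadows its stale copy in the old cluster.
        if key in self.new_cluster:
            return self.new_cluster[key]
        return self.old_cluster.get(key)


store = AppLayerShardedStore()
store.old_cluster["user:1"] = {"name": "old profile"}   # pre-existing data
store.write("user:2", {"name": "new profile"})          # brand-new key
store.write("user:1", {"name": "updated profile"})      # rewrite of an old key
print(store.read("user:1"), store.read("user:2"), store.read("user:3"))
```

Note that this only caps storage growth on the old cluster; it does nothing to spread write load, which is what the question above is getting at.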

Xorlev 12 years ago | |

As explained, it was a stop-gap solution for data storage only, we did not have a problem with write load on SSDs.

We were at the point that MongoDB sharding was just as difficult to deploy as moving to Cassandra, which better fit our goals of availability. MongoDB sharding isn't instant by any means for existing clusters.

talas9 12 years ago |

Yet another shining example of throwing money and time away to work within AWS constraints when bare metal and openstack (1) would have solved it cheaper (2) and arguably faster.

1 (if you insist on cloud provisioning instances, even though it makes little sense if the resources are as strictly dedicated as they are in this case)

2 (VASTLY, over time -- these guys are pissing money away at AWS and I hope their investors know it)

Xorlev 12 years ago | |

We understand that AWS comes at a premium, however we find the opportunity cost of losing the agility we have on AWS at this stage of our organization more expensive than the delta in cost between moving on to our own hardware and AWS.

Our organization is acutely aware of our costs and still strives to minimize them. Our move to Cassandra saved 79% over continuing to run our reserved SSD nodes & backup replica.

bfrog 12 years ago | |

AWS in general is a waste of time and money for most standard web hosting requirements, I think.

mcot2 12 years ago |

TokuMX would solve all of these issues. 2TB goes a long way in TokuMX with lzma compression.