Migrating from MongoDB to Cassandra(fullcontact.com) |
Migrating from MongoDB to Cassandra(fullcontact.com) |
While MongoDB has improved greatly over previous versions, I can't help but feel if 10Gen put as much effort into improving their product as they do selling it, Mongo would be a force to be reckoned with!
MongoDB is good at some things, but I think most people that try and fail with it fall into one of two camps: 10Gen sold them into it or they bought into the hype without assessing project requirements and ensuring MongoDB was a sensible choice.
It will eventually bite you, and bite you hard. But you'll be well into the millions of records before that happens. Developers and products below that number will have very smooth sailing. And some live there permanently. One project I worked on years ago involved a music catalogue. Did you know there are only about 20 million songs?
The main problem is things get very painful as you get bigger, especially for writes. A doubling of write activity can lead to calamitous drops in performance. This is especially bizarre as the data model means they can easily have multiple concurrent writers. Heck having a lock per 2GB data file would quickly help with concurrency.
They have this same "single" approach in other places. For example building an index is single threaded. I did a restore the other day and then had to wait 8 days while it rebuilt indexes. One cpu was pegged but everything else was idle!
It also consumes huge amounts of space - at least double as the same data in JSON. There are known fixes https://jira.mongodb.org/browse/SERVER-164 https://jira.mongodb.org/browse/SERVER-863 (note how popular they are and how many years they have been open!)
I wish they would focus on making better use of the resources available - it should be possible to max out cpu, RAM and I/O.
We've ended up in the same situation as the article, figuring out where to migrate to with Cassandra being the front runner.
I think as PostgreSQL continue to improve its JSON data type people will look at SQL again even if they need to a basic model working. Because at the end working with constraints can help. Well, either side will bite you but one has to weigh...and sure that's a tough question.
Personally, I think the company formerly known as 10gen is doing a lot of good work, particularly with the aggregation framework, but it's orthogonal to the improvements we've made with the storage subsystem.
People need to read the documentation (say what you will about mongo but 10Gen docs are pretty damn good) more thoroughly before blindly implementing stuff on NoSQL and complaining. (I've been guilty of this in the past.)
Cassandra and MongoDB are completely fine as general purpose databases. With CQL3 Cassandra is just as easy to use as any RDBMS with the advantage of being infinitely more scalable and easier to manage.
No doubt. I have 2 MongoDB mugs even though I vowed to never touch a product produced by them again (it was, after the whole -- we'll throw your data write requests over the wall and pray fiasco).
Going straight to C or something like it would have been almost cargo-cultish ( eg If we build industrial strength, we will get industrial levels of traffic ).
Because that what seems to happen for every database discussion.
"MongoDB was not a mistake. It let us iterate rapidly and scaled reasonably well. Our mistake was when we decided to put off fixing MongoDB’s deployment, instead vertically scaling (...) By the time we had cycles to spend, it was too late to shard effectively It would have been terribly painful and unacceptably slowed our cluster for days."
It seems they weren't sharding their data. The advantage of the popular NoSQL databases like MongoDB is that they allow easier horizontal scaling than general purpose RDMS (though this is debatable - Postgres and Oracle allow you to make the same trade offs as NoSQL databases, they just don't force you to)
When you read the rest of the article, they explain that when they had to make a painful transition anyway, they chose Cassandra, to a different set of advantages and disadvantages that suited their needs.
MongoDB is still relevant because it is well supported. The product is well documented, excellent tooling is available, it is widely adopted, so that when you have a question, you can often google the answer.
It has an elegant query language; especially the built in aggregation framework is far more convenient than having to write map reduce functions for every query. It is easy to deploy and use. All these things make it a product that is pleasant to use from the point of view of a developer or DBA. I think you underestimate the importance of non-technical advantages and disadvantages
I am just not convinced it is the best database because I just don't see a use case for a 'general purpose' NoSQL database. For general purpose storage, RDMS like Postgres and Oracle are great. They support sharding if you really want it, and even allow indexing of unstructured data these days. They don't force you to use joins and transactions if you don't need them, but at least they support these features when you do.
We were a young startup and made a few crucial mistakes. MongoDB was not a mistake. It let us iterate rapidly and scaled reasonably well.
A good solution for X scale might not be so good for 10X scale. But by the same token, a 10X scale solution for a 1X scale problem isn't a good idea either.
At least this is what happened to me and my encounters with MongoDB (nee 10gen) were unsuccessful in speeding up our resharding.
If you want to see a video of their engineer telling the story, it's available here: http://www.couchbase.com/presentations/couchbase-tlv-2014-co...
get spammed
leave immediately
It's not supposed to display on engineering posts but something must have changed/been broken recently. We know you guys aren't interested in marketing content.
Again, sorry about that and thanks for bringing it up.
I'm currently humming along happily with Postgres, but some of the distributed features, and availability of Cassandra look really nice.
That said, if someone else has already made the hardware choice for you, you can always run multiple C* nodes on a single machine. I know several production clusters that fit this description.
I am a fan of using the right tool for the right job providing you have the team to support it.
"Given some input piece of data {email, phone, Twitter, Facebook}, find other data related to that query and produce a merged document of that data"
My coworker wrote a bit about how the search functionality of our offering works here: http://www.fullcontact.com/blog/sherlock_search_engine_that_...
That might make more sense why we do PK lookups.
"The total size of an item, including attribute names and attribute values, cannot exceed 64KB."
While we don't often have values over 64KB, it's possible. We didn't want to have to store profiles separate from their metadata.
I'm curious about this. Why were you doing the sharding manually in your application layer? Picking a MongoDB shard key - something like the id of the user record - would produce some fairly consistent write-load distribution across clusters. Regardless - it seems like write-load was a problem for you, yet you sent all the write load to the new cluster - why not split it?
We were at the point that MongoDB sharding was just as difficult to deploy as moving to Cassandra, which better fit our goals of availability. MongoDB sharding isn't instant by any means for existing clusters.
1 (if you insist on cloud provisioning instances, even though it makes little sense if the resources are as strictly dedicated as they are in this case)
2 (VASTLY, over time -- these guys are pissing money away at AWS and I hope their investors know it)
Our organization is acutely aware of our costs and still strives to minimize them. Our move to Cassandra saved 79% over continuing to run our reserved SSD nodes & backup replica.
Unfortunately this seems to happen all the time.
iOS has a standard way to jump to the top (tap top of screen), there is no need to interfere with the content.
Write concurrecy: yes, TokuMX does not have a database level reader/writer lock.
Index Building: yes, fractal trees can write data much more efficiently, so if index building is a problem, I bet TokuMX solves it.
Practically reducing file size: to be honest, I am not sure because due to our great compression, this has not been a general issue for our users. Our reindex command could reduce file size, but I cannot point to examples.
One of our big goals is to address storage issues MongoDB has.
This cropped up in the "Don't use Mongo" FUD from a few years back.
Eliot from MongoDb responded:
https://news.ycombinator.com/item?id=3202959
In an ideal world, you'd monitor and plan your capacity proactively - I don't think there's really any magic button for - "My system is at capacity - horizontally scale it now, with no downtime!".
Whenever I hear this, I feel that the person saying it either doesn't have enough experience or that every problem to them looks like a nail for their particular flavor of NoSQL. That's like saying a street bike is a general purpose bicycle. I'm sure someone out there can make it work on an outdoor offroad trail, but is it ideal?
Cassandra is really awesome at writes but reads are really expensive. Atomicity at the row level only (unless you want that 30% performance hit) and eventually consistency make this not great at real time. CQL3 is an improvement but it's still not as good as ANSI SQL. I still have to do a lot of extra work.
With Mongodb it does really well with tree based data. The second you try to do join queries outside of that doc tree... major performance hit. (Maybe this has changed knowing the speed mongo evolves.) When data is larger than RAM... major performance hit... not so great for something like logging.
Using an RDBMS for everything isn't great, but the pain you suffer is a lot less especially since these days more and more them have some sort of federation or cluster solution. That said, some NoSQL datastores are better for certain applications than an RDBMS, just not for all of them.
Cassandra also exploits SSDs quite well by avoiding write-amplification problems, given that all write IO is sequential only.
I'm a huge fan of both traditional RDBMSes and alternative data stores. The most important rule for touring data stores is to be sure not to bring it into work when it's not appropriate and to evaluate the hell out of any solution before you write a single line of code.
Most of the time, PostgreSQL will do 99.9% of what you need and then some.
This is getting old, and is just plain untrue. Reads are more expensive than the extremely cheap writes, but they really aren't expensive.
> Atomicity at the row level only
Partition level, really, in modern terminology (apologies if that's what you meant).
It has improved a lot over time but it isn't a myth due to the fact that it's not good out of the box. You have to spend some time with configuration and mulling over the schema to mitigate it. It's something you don't have to really think about as much on other data stores.
As a developer, using C for all data in any complex application is going to be harder up front than using a RDBMS. From what I have heard and learned about C, though, it is less painful to use C for a highly performant/scalable database than trying to scale most or all RDBMSs. From what I have read, Mongo is also hard to scale.