Failing with MongoDB (blog.schmichael.com)
Reason #1: Devs aren't Ops.
Reason #2: Devs need something new on their resume.
Reason #3: A certain type of dev reads blogs, gets excited, skips the scientific mumbo-jumbo, and takes the blogs as _the_ source of truth.
Reason #4: It's easy to bootstrap (schemaless, etc) your weekend project. Dealing with a DB is apparently tedious for devs.
I'm sure others can add more...
Let me feel your love HN-ers ;)
What I hear you saying, unfortunately, is: it's worse for ops, so no one should use it.
Those who do not study the history of databases are doomed to repeat it. Soon we'll add back row-level write locks, transaction logging, schemas, multiple indexes and one day they wake up with MongoSQL.
It's really easy to work with. This is why people keep using it.
How come a DB loses data so frequently and still calls itself web-scale? It just breaks when you need scale!
Auto-sharding is also super unreliable: we tried it once and it failed, and now we use a lib that does application-level sharding. We are also considering moving to other databases that at least understand that not losing data is the first and most important job of a DB.
Someone summarized the issues with MongoDB at http://pastebin.com/raw.php?i=FD3xe6Jt , and we experienced most of the problems it describes. So, just a reminder for anyone who wants to build a serious product on MongoDB: read it. It's not FUD; it's so true that I wish I had read it a year ago, so we wouldn't have to move so much legacy data to a new database now.
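For anyone wondering what application-level sharding means in practice, here is a minimal sketch. It is not the lib mentioned above; the host names and the shard_for helper are made up for illustration. The idea is simply that the application, not the database, hashes each key to decide which host owns it.

```python
import hashlib

# Hypothetical shard list; in a real deployment these would be separate database hosts.
SHARDS = ["db0.example.internal", "db1.example.internal", "db2.example.internal"]


def shard_for(key: str, shards=SHARDS) -> str:
    """Pick a shard by hashing the key, so the same key always maps to the same host."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[index]


if __name__ == "__main__":
    for user_id in ("user:1", "user:2", "user:42"):
        print(user_id, "->", shard_for(user_id))
```

Plain modulo hashing like this remaps most keys whenever the number of shards changes, which is one reason real sharding libraries tend to use something more elaborate.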
There's no other project I know of, which provides: schemaless json documents, indexing on any part of them, server-side mapreduce, lots of connectors for different languages, atomic updates on part of the document. If there is one and it's better than mongo, I'd switch any moment.
These absolutely were failures.
The author listed several instances in which the database became unavailable, the vendor-supplied client drivers refused to communicate with it, or both. Some of these scenarios included the primary database daemon crashing, secondaries failing to return from a "repairing" to an "online" state after a failure (and unable to serve operations in the cluster), and configuration servers failing to propagate shard config to the rest of the cluster -- which required taking down the entire database cluster to repair.
Each of the issues described above would result in extended application downtime (or at best highly degraded availability), the full attention of an operations team, and potential lost revenue. The data loss concern is also unnerving. In a rapidly-moving distributed system, it can be difficult to pin down and identify the root cause of data loss. However, many techniques such as implementing counters at the application level and periodically sanity-checking them against the database can at minimum indicate that data is missing or corrupted. The issues described do not appear to be related to a journal or lack thereof.
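The application-level counter idea can be sketched roughly like this. CountingStore and InMemoryBackend are hypothetical names, and the in-memory backend only stands in for a real database client; the point is that the application keeps its own tally of acknowledged writes and periodically compares it with what the database reports.

```python
class InMemoryBackend:
    """Stand-in for a real database client (hypothetical; only insert and count)."""

    def __init__(self):
        self.docs = []

    def insert(self, doc):
        self.docs.append(doc)

    def count(self):
        return len(self.docs)


class CountingStore:
    """Wraps a backend and keeps an independent count of acknowledged inserts."""

    def __init__(self, backend):
        self.backend = backend
        self.expected = 0

    def insert(self, doc):
        self.backend.insert(doc)
        self.expected += 1  # only incremented after the backend call returns

    def missing(self):
        """Number of documents the backend appears to have lost (0 means the counts agree)."""
        return self.expected - self.backend.count()
```

A nonzero result from missing() does not say which documents were lost or why, but it at least flags that something went wrong.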
Further, the fact that the database's throughput is limited to utilizing a single core of a 16-way box due to a global write lock demonstrates that even when ample IO throughput is available, writes will be stuck contending for the global lock, while all reads are blocked. Being forced to run multiple instances of the daemon behind a sharding service on the same box to achieve any reasonable level of concurrency is embarrassing.
On the "1GB / small dataset" point, keep in mind that Mongo does not permit compactions and read/write operations to occur concurrently. As documents are inserted, updated, and deleted, what may be 1GB of data will grow without bound in size, past 10GB, 16GB, 32GB, and so on until it is compacted in a write-heavy scenario. Unfortunately, compaction also requires that nodes be taken out of service. Even with small datasets, the fact that they will continue to grow without bound in write/update/delete-heavy scenarios until the node is taken out of service to be compacted further compromises the availability of the system.
What's unfortunate is that many of these issues aren't simply "bugs" that can be fixed with a JIRA ticket, a patch, and a couple rounds of code review -- instead, they reach to the core of the engine itself. Even with small datasets, there are very good reasons to pause and carefully consider whether or not your application and operations team can tolerate these tradeoffs.
Oh, this mystery is a failure all right, and even the most charitable interpretation would call it a misfeature.
MongoDB is flaky. CouchDB is a maintainability nightmare, so I hear.
Riak? Cassandra? Or does everything else have some other equally huge down-side?
For every "X sucks" article, there is a "Y is awesome" one.
In the NoSQL world the only way to choose is around the problems they solve... They are each specializing and optimizing for certain niches. Mongo is the most MySQL-esque, but doesn't do things that Redis, Couch or Cassandra do that you may need.
There is no clear winner (fortunately or unfortunately, depending on what you were hoping for).
Yet we've also seen in the past that shedding such a reputation is not strictly required to be popular. And marketing budgets do matter.
Perhaps because both of your premises are wrong? I've used Mongo for over a year now with ~1000 writes/sec and haven't seen any of these problems. I'm not saying they don't exist (some are confirmed bugs that have been fixed), but they're not nearly as prevalent as your 'Do you still beat your wife?'-style question implies.
Then, as you use it, the system optimizes itself (or makes suggestions) based on actual access patterns. A subset of objects could be a formal, indexed table? Have it happen automatically or offer the SQL as a suggestion.
Replication of any kind won't help you with a high write load as secondaries have to apply the same number of writes as primaries.
CouchDB is much better (you're as likely to lose data as with Postgres), but is potentially less efficient (no BSON).
My focus in starting Citrusleaf wasn't features, it was operational dependability. I had worked at companies who had to take their system offline when they had the greatest exposure - like getting massive load from the Yahoo front page (back in the day). Citrusleaf focuses on monitoring, integration with monitoring software, operations. We call ourselves a real-time database because we've focused on predictable performance (and very high performance).
We don't have as many features as mongo. You can't do a javascript/json long running batch job. We'll get to features.
The global R/W lock does limit mongo. Absolutely. Our testing shows a nearly 10x difference in performance between Mongo and Citrusleaf on writes. Frankly, if you're still doing 1,000 tps, you should probably stick with a decent MySQL implementation.
Here's a performance analysis we did: http://bit.ly/rRlq9V
This theory that "mongo is designed to run on in-memory data sets" is, frankly, terrible --- simply because mongo doesn't give you the control to keep you in memory. You don't know when you're going to spill out of memory. There's no way to "timeout" a page cache IO. There's no asynchronous interface for page IO. For all of these reasons - and our internal testing showing page IO is 5x slower than aio; the reason all professional databases use aio and raw devices - we coded Citrusleaf using normal multithreaded io strategies.
With Citrusleaf, we do it differently, and that difference is huge. We keep our indexes in memory. Our indexes are the most efficient anywhere - more objects, fea. You configure Citrusleaf with the amount of memory you want to use, and apply policies when you start flowing out of memory. Like not taking writes. Like expiring the least-recently-used data.
That's an example of our focus on operations. If your application use pattern changes, you can't have your database go down, or go so slowly as to be nearly unusable.
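To make the two out-of-memory policies mentioned above concrete (refuse writes, or expire the least-recently-used data), here is a toy sketch. This is not Citrusleaf code; BoundedStore and its policy names are invented for illustration, and it counts items rather than bytes to keep things short.

```python
from collections import OrderedDict


class BoundedStore:
    """Toy key-value store with a fixed capacity and a configurable overflow policy."""

    def __init__(self, max_items, policy="evict_lru"):
        if policy not in ("evict_lru", "reject_writes"):
            raise ValueError("unknown policy: %r" % policy)
        self.max_items = max_items
        self.policy = policy
        self.items = OrderedDict()  # ordered from least to most recently used

    def get(self, key):
        value = self.items[key]
        self.items.move_to_end(key)  # mark as most recently used
        return value

    def put(self, key, value):
        """Store a value; return False if the write was refused."""
        if key in self.items:
            self.items[key] = value
            self.items.move_to_end(key)
            return True
        if len(self.items) >= self.max_items:
            if self.policy == "reject_writes":
                return False
            self.items.popitem(last=False)  # drop the least recently used entry
        self.items[key] = value
        return True
```

Either way the store degrades in a predictable manner when it fills up instead of silently spilling to disk.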
Again, take my comments with a grain of salt, but with Citrusleaf you'll have better uptime, fewer servers, a far less complex installation. Sure, it's not free, but talk to us and we'll find a way to make it work for your project.
I'm a little surprised to see all of the MongoDB hate in this thread.
There seems to be quite a bit of misinformation out there: lots of folks seem focused on the global R/W lock and how it must lead to lousy performance.
In practice, the global R/W isn't optimal -- but it's really not a big deal.
First, MongoDB is designed to be run on a machine with sufficient primary memory to hold the working set. In this case, writes finish extremely quickly and therefore lock contention is quite low. Optimizing for this data pattern is a fundamental design decision.
Second, long running operations (i.e., just before a pageout) cause the MongoDB kernel to yield. This prevents slow operations from screwing the pooch, so to speak. Not perfect, but smooths over many problematic cases.
Third, the MongoDB developer community is EXTREMELY passionate about the project. Fine-grained locking and concurrency are areas of active development. The allegation that features or patches are withheld from the broader community is total bunk; the team at 10gen is dedicated, community-focused, and honest. Take a look at the Google Group, JIRA, or disqus if you don't believe me: "free" tickets and questions get resolved very, very quickly.
Other criticisms of MongoDB concerning in-place updates and durability are worth looking at a bit more closely. MongoDB is designed to scale very well for applications where a single master (and/or sharding) makes sense. Thus, the "idiomatic" way of achieving durability in MongoDB is through replication -- journaling comes at a cost that can, in a properly replicated environment, be safely factored out. This is merely a design decision.
Next, in-place updates allow for extremely fast writes provided a correctly designed schema and an aversion to document-growing updates (i.e., $push). If you meet these requirements-- or select an appropriate padding factor-- you'll enjoy high performance without having to garbage collect old versions of data or store more data than you need. Again, this is a design decision.
Finally, it is worth stressing the convenience and flexibility of a schemaless document-oriented datastore. Migrations are greatly simplified and generic models (i.e., product or profile) no longer require a zillion joins. In many regards, working with a schemaless store is a lot like working with an interpreted language: you don't have to mess with "compilation" and you enjoy a bit more flexibility (though you'll need to be more careful at runtime). It's worth noting that MongoDB provides support for dynamic querying of this schemaless data -- you're free to ask whatever you like, indices be damned. Many other schemaless stores do not provide this functionality.
Regardless of the above, if you're looking to scale writes and can tolerate data conflicts (due to outages or network partitions), you might be better served by Cassandra, CouchDB, or another master-master/NoSQL/fill-in-the-blank datastore. It's really up to the developer to select the right tool for the job and to use that tool the way it's designed to be used.
I've written a bit more than I intended to but I hope that what I've said has added to the discussion. MongoDB is a neat piece of software that's really useful for a particular set of applications. Does it always work perfectly? No. Is it the best for everything? Not at all. Do the developers care? You better believe they do.
I'm not sure it's a competitor at all. RavenDB is a CouchDB clone for .Net that requires a commercial license for proprietary software.
From this article, sounds like their data is pretty seriously relational.
Mongodb has been pushing the ops side of their product, but I can agree it has failings there. To me the advantage is the querying and the json style documents.
Mongo, on paper, should be an ideal candidate for this job; but, due to complications with the locking model and with its inability to do online compactions, it's failing.
I had to model data with umpteen crazy relationships so we went with Mongodb. We did not have the high update issue or any locking issues. If one has a few large tables with fixed columns that can easily define the data, then relational DBs probably make more sense. But to your point, 10gen will not tell you that and the hype doesn't tell you that either.
If you can do both of those things, it is awesome.
Excited to see that DB get more and more traction.
Also there is no single, central steward and authority on couch. All of this stymies traction and confidence even though the tech is great.
Other than that, I actually think these solutions have been stabilizing exactly because of what you say: innovation is slowing down/stalling.
1-3 years ago the cool thing to do was store data different ways, now that we have all these solutions that people are ready to use in production, they are demanding more and more secure/safe functionality from them.
In the last year Redis added the append log and flushing to disk. CouchDB rewrote the replication code in the last release and has always had a wonderfully redundant and safe file mutation model (you can copy the DB file while in use and still get a safe snapshot), and MongoDB has been responding aggressively to crashes and corruption since 1.7 after the whole single-server durability fiasco around 1.5/6 that had everyone up in arms.
These data stores are really brilliant pieces of code with some wonderful deployments to prove their worth.
There is still work to be done, sure, but I am not aware of glaring deficiencies in these systems like I used to be a year or more ago where you could point at "Oh, the XYZ bug might get you" -- that just doesn't seem to be happening anymore.
I don't know a whole hell of a lot about Cassandra (I am one of the few humans that still doesn't grok the data model easily) but I remember data recovery bugs from a year ago in the issue tracker that all got knocked out to the point that 1.x is looking like a really awesome release for them.
At this point, I think it just depends on what you need.
I've not messed with Cassandra in production so I haven't seen such a thing yet.
Alternatively "I understand Cassandra is a Trojan. Can anyone confirm this?"
Yes. In an organization like Postgresql, someone's contribution is measured by how much they contribute to the code. In an organization like Mysql, someone's contribution is measured by how much they contribute to the bottom line.
SQLite. FreeBSD. OpenBSD. PostgreSQL. Python (there are some, but Ruby took the thunder).
My time is finite. Ops time is finite. Obviously you decided to dick around with mine and Ops. How bout I send you to the QA department to write automation and software tools so you don't dick around with production code?
You can write with any language and any storage systems you'd like there.
I'm sorry you work with incompetent people. Sounds like you're in a cubicle farm somewhere. While you're in a meeting swinging your seniority around, I'll be over here shipping products faster than your team.
Second, I'm not using Django.... and what's wrong if I do?
Third. I respect people around me. In return, they respect each other, so we don't throw around the word "incompetent" or think that we're better than anybody else.
No pirates. No ninjas. No rockstars. No racers (dhh?) as well. Just grown-ups doing their job with a bit of love, passion, and respect. All balanced.
Fourth. I have no cubicle. I work in an open space and I love it. I don't need my special office (I had one a few years ago and it sucks).
Fifth. My project manager attends meetings and delivers mostly good news to us. He's the best PM I've been with (so far). If we have a meeting, that's usually when shit has hit the fan and we need to have an honest conversation. Other than that, e-mails are sufficient.
Sixth. Keep throwing the love...we need more
So yes, I think dynamic typing can lead to debugging nightmares. It just so happens that other factors often make up for this.....
* but then Linux distros get network autoconfiguration and suddenly it's obvious that it was the right solution all along.
If MongoDB is popular, it's because it has some compelling features/arguments (I'm not sure of which ones exactly). But classical RDBMS seem to be trying to close the gap.
Prototypes are another story. You'd optimize against mental overhead.
Most modern languages have migration utilities (Flyway for Java, Rails migrations for Ruby, Python should have their de-facto migration for Django by now or else they fail hard, and JS... well.. let's wait until Node.js users decide to use an RDBMS).
Python should have their de-facto migration for Django by now or else they fail hard
Yep: http://south.aeracode.org/
A lot of other stuff can be stored in an RDBMS but isn't really optimal for it. Ideally, hierarchical directory servers (LDAP) don't run directly off a relational db.
So there are places for other forms of stores, from BDB to XML, but these cannot and should not replace RDBMS's for most critical tasks.
(Also there is room for real improvement in certain areas of relational constraints in RDBMS's today, but NoSQL moves the wrong direction there.)
have to disagree with this. By forcing yourself to work with a data layer so abstracted that you can't even reference whether you're dealing with a JSON document or a set of twelve joinable tables, you're going to write the most tortured and inefficient application. Non trivial applications require leaky abstractions.
This is a conception of data that is more true in theory than it is in practice. In practice, if you want to query your data efficiently, you'll need to worry about how it's stored. You'll have to worry about the failure cases.
Of course, it is definitely application dependent. If you're just writing a Wordpress-replacement, you can probably choose whatever data store you want and just write an abstraction layer on top of it (especially if you don't care about performance). On the other hand, if you're looking at querying and indexing terabytes (or more) of data, you'll have to work very closely with your data store to extract maximum performance.
I can sort of sympathize with this a little. I used to use MySQL for schema prototyping and then move stable stuff to PostgreSQL back when PostgreSQL lacked an alter table drop column capability.
However today, this is less of a factor. Good database engineering is engineering. It's a math intensive discipline. Today I work often with intelligent database design approaches, while trying to allow for agility in higher levels of the app.
Don't get me wrong, NoSQL is great for some things. However it is NEVER a replacement for a good RDBMS where this is needed.
How does something like MongoDB actually help with this, though? Certainly a lack of a schema lets you be more nimble in changing it, but you still have to write code to handle whatever schema you decide rather than letting a battle tested RDBMS handle it. I think NoSQLs have their uses but not forcing correctness on your data as a feature is not one of them. But I also believe in static typing.
Are you saying that Rails schema migrations can only solve 10% of your migration needs? That kinda sucks, bro.
Rails is a bad example here as their primitive migration-system still forces the developer to write those stupid migrations by hand.
Django/South just auto-generates them which removes a huge chore from the daily development workflow.
I don't care if it is a startup or not. Come up with a very simple idea, draw the models in an ER diagram, implement that stuff.
It's very hard to imagine that tomorrow suddenly all relationships need to be changed. Even if that is the case, scrap your Repository/Entity model and start from the beginning.
Nothing can help you much if the fundamentals are wrong.
Unfortunately, when happy users get used to the culture of free, good-quality software, they start to develop a sense of entitlement (instead of donating). If the software doesn't provide exactly what they want, they start to swear and whine instead of being... calm and helpful.
One key thing is, I think, to advertise services and ways of making money. IOW, giving people the option to get new features, etc. is an important thing.
There is a lot of solid FOSS out there: PostgreSQL, BSD, Linux, CUPS, and more all come to mind. These often are less sexy than heavily marketed, inferior counterparts. But these all also have solid business models attached.
I highly recommend that folks who start open source software look around at business models surrounding the better open source products and see what they can do to capitalize on that.
I'd MUCH rather be paid to produce new features than fix bugs.
And in the above Linux networking example, this response is analogous to the person who says "But I wrote my own scripts to fix my ethernet configuration, I don't see what the big deal is!"
Edit: That came off a bit harsh. I just don't think the fact that it's not a hassle for us means it's not a hassle for other people. And I don't think that mongo's easy configuration means that it should be used over postgres in all cases, just that postgres should take notice that people like mongo's easy configuration and step up their game in that department.
service postgresql initdb
service postgresql start
sudo -u postgres psql .....
postgres=#.....
If all you are doing is writing code, that's sufficient on an rpm based platform. On Debian, use apt-get instead. On Windows use the 1 click installer.
PostgreSQL comes with a default user and a default database, so the criticism on this thread is a bit..... incorrect.
Now it is true you have to set up a system user if you are compiling PostgreSQL from source. However, that's really optional in most cases unless the code you are writing is, well, a patch to PostgreSQL.......
I use no script whatsoever and the complication . For a dev environment, I go like this:
$ sudo su - postgres
$ createuser -s nakkiel
And again, no further operation is required until you install a new OS on your machine.
I really think we're missing the point anyway. The complicated things, and Mongo's advantage, only really come later when it's time to create a database and tables and columns and indexes.
Postgres installs with a default superuser ("postgres" on ubuntu) and a default database (also "postgres"), so that's not the real problem.
Installing software via a package system is trivial and required for any system, so that can't be the issue.
The package distribution invariably chooses a default location for your data and initializes it, so that requires no additional effort at all.
You have to start and stop the service, but the package distribution should make that trivial, as well ("service postgresql start|stop" on ubuntu). And again, I don't see any difference here.
So the only possible area I see for problems is connecting your first time. This is somewhat of an issue for any network service, because you need to prevent anyone with your IP from connecting as superuser. The way ubuntu solves this is by allowing local connections to postgres if the system username matches the database username. So, you have to "su" to the user "postgres", and then do "psql postgres". Now you're in as superuser.
The default "postgres" superuser doesn't have a password (default passwords are bad) and only users with a password can connect over the network. But, you can add a password (which then allows that user to connect over a network), or create new users. If the new username matches a system username, that user can connect locally. If you gave the new user a password, they can connect over a network.
Do you see any fat in the above process that can be streamlined without some horrible side-effect (like allowing anyone with your IP to connect as superuser)? I'm serious here -- if you do see room for improvement, I really, really, really, want to hear what the sticking point is so that it can be fixed.
I tried building a small side project with Postgres about 8 months ago (after not really doing rdbms stuff for 18 months before) and was amazed at how inflexible it felt, and how much frustration used to seem normal.
I write up a schema, send it to everyone else on the team, get feedback. If users are involved, get it from them too.... take in all the feedback, write up a new draft, wash, rinse, repeat until running out of shampoo (i.e. feedback).....
Once things are pretty stable, do a prototype, address any oversights, do the real thing.
It's not rocket science.
If it prototypes well, then further refine it with ER diagrams for future maintenance.
Why is planning everything without validation better than above?
That doesn't mean spending months planning. It does mean doing your best to plan over a few days, then prototype, review, and start implementing. If things change, you now have a clearer idea of the issues and can better address them.
The worst thing you can do is go into development both blind and without important tools you need to make sure that requirements are met--- tools like check constraints, referential integrity, and the like.
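As a small illustration of what check constraints and referential integrity buy you, here is a sketch using Python's built-in sqlite3 module. SQLite is swapped in only so the example is self-contained; the table and column names are made up, and the same DDL ideas carry over to PostgreSQL and other RDBMSs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves foreign key enforcement off by default
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    """
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        quantity INTEGER NOT NULL CHECK (quantity > 0)
    )
    """
)
conn.execute("INSERT INTO customers (id, name) VALUES (1, 'alice')")


def try_insert(customer_id, quantity):
    """Attempt to insert an order; return False if the database rejects it."""
    try:
        conn.execute(
            "INSERT INTO orders (customer_id, quantity) VALUES (?, ?)",
            (customer_id, quantity),
        )
        return True
    except sqlite3.IntegrityError:
        return False


print(try_insert(1, 3))   # valid order
print(try_insert(1, 0))   # violates the CHECK constraint
print(try_insert(99, 1))  # references a customer that does not exist
```

The bad rows never make it into the table, no matter which application or script tried to write them.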
large changes? throw away and start over; tools like Rails can help you get up and running really quickly.
I rarely see people go back and fix the mess.
Maintenance takes up about 50% of all IT budgets [1]. Most individual pieces of software will cost 2-6 times more to maintain than to develop (considering the average life cycle of an in-production software product to be 2-4 years).
Data migration is a massive problem for any organization with data sets at any scale. RDBMS, in general, has gotten in the way of those migrations. People aren't looking at NoSQL just because they cannot sit still but instead are looking to find a better experience with handling data.
I'm not sure if NoSQL is the right answer to that but let's give it a chance and see what happens when people are migrating MongoDB data in 3-4 years.
[1] http://www.zdnet.com/blog/btl/technology-budgets-2010-mainte...
Anyway, people who call themselves ops or devs ought to be able to do that sort of three-line operation.
As I said before, it happens once in your product's dev/op lifetime.