Evernote blog: WhySQL? (blog.evernote.com)

150 points by grifaton 14 years ago | 73 comments

mapleoin 14 years ago |

Does anyone have a link to a decent comparison between MySQL and PostgreSQL? I'm really wondering why so many people use MySQL, even though it supports a lot fewer SQL features than PostgreSQL.

sheff 14 years ago | |

EnterpriseDB (who sell a commercialized version of Postgres) have a couple of MySQL vs Postgres white papers on their site, but hidden behind a registration wall: http://www.enterprisedb.com/resources-community/whitepapers-...

Robert Haas, a Postgres committer, occasionally blogs about comparisons: http://rhaas.blogspot.com/search/label/mysql

There's also http://www.wikivs.com/wiki/MySQL_vs_PostgreSQL

Historically, MySQL has been more widely available on low-end web hosting plans, so it's what a lot of people first use when they start using databases, and a lot of web apps, such as WordPress, support it exclusively.

Until a year or two ago, only MySQL had built-in (if occasionally fragile) replication, which made it popular for that reason alone. Postgres now has robust replication, with new features coming down the pipeline soon: http://www.depesz.com/2011/07/26/waiting-for-9-2-cascading-s....

I prefer Postgres, but oddly enough, under Oracle there have been some interesting features added to MySQL, which is good for both.

riffraff 14 years ago | | |

Postgres is also not (yet, hopefully?) available in Amazon's RDS, probably again due to the replication issue.

markokocic 14 years ago | |

Pure inertia. Over time, a lot of people were using the LAMP stack, and they continued to use MySQL for familiarity reasons.

Also, MySQL got a commercial entity behind it in its early days, which promoted it a lot. In addition, it worked on all platforms, including Windows, while Postgres has only been there for the last couple of years.

newhouseb 14 years ago | |

I posted this elsewhere in the thread, but I ran some benchmarks on schema changes between the two on 5-million-row tables (when deciding to switch), and PostgreSQL completely clobbered MySQL performance-wise: https://gist.github.com/1620133. As we recently switched to PostgreSQL, one gotcha is that PostgreSQL uses a separate process instead of a thread for each connection, so it's slower and more memory-intensive to establish new connections. That makes connection pooling (application-side, or through a pooler such as PgBouncer) incredibly important.
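To illustrate the pooling idea, here is a minimal sketch of an application-side pool in Python. It is not how PgBouncer or any particular driver works; the `ConnectionPool` class is hypothetical, and `sqlite3` in-memory connections stand in for expensive PostgreSQL connections. The point is only that connections are opened once and then reused.

```python
import queue
import sqlite3
from contextlib import contextmanager


class ConnectionPool:
    """Hypothetical fixed-size pool: open connections once, then reuse them."""

    def __init__(self, connect, size):
        self._idle = queue.Queue(maxsize=size)
        self.opened = 0  # how many real connections were ever established
        for _ in range(size):
            self._idle.put(connect())
            self.opened += 1

    @contextmanager
    def connection(self, timeout=5.0):
        # Blocks until a connection is free; raises queue.Empty on timeout.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            # Hand the connection back instead of closing it.
            self._idle.put(conn)


# sqlite3 is only a stand-in for a database with costly connection setup.
pool = ConnectionPool(lambda: sqlite3.connect(":memory:"), size=2)

results = []
for i in range(10):
    with pool.connection() as conn:
        results.append(conn.execute("SELECT ?", (i,)).fetchone()[0])
```

Ten queries run here over only two established connections, which is the saving the comment is after when each new connection costs a process fork.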

morganpyne 14 years ago | |

Simple reliable replication has been a huge differentiator for a long time; enough so to put up with a lot of the other faults of MySQL. Have not revisited Postgres replication in a long time but I have seen that it has been worked on. Anyone with recent experience in both care to explain how the replication of both stacks up in recent versions?

moe 14 years ago | | |

"Simple reliable replication"

I cringe every time I read that. MySQL replication is many things, but it is not reliable (as anyone who has used it at scale will confirm).

I think the only reason this myth prevails is because hardly anyone ever actually verifies if their master/slave are in sync. A table checksum can be a real eye-opener here, especially on a deployment that's been running for a while and undergone schema changes, restarts, network splits, etc.
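A rough sketch of what such a check does, in Python. MySQL has its own `CHECKSUM TABLE` statement; the code below does not use it. Instead it assumes two `sqlite3` in-memory databases as stand-ins for a master and its replicas, and the helper names (`table_checksum`, `make_db`) and the `notes` table are made up for illustration.

```python
import hashlib
import sqlite3


def table_checksum(conn, table, key="id"):
    """Hash every row of `table`, read in primary-key order."""
    digest = hashlib.sha256()
    # Identifiers are interpolated directly: only pass trusted names here.
    for row in conn.execute(f"SELECT * FROM {table} ORDER BY {key}"):
        digest.update(repr(row).encode("utf-8"))
    return digest.hexdigest()


def make_db(rows):
    """Build a toy database holding the given (id, body) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY, body TEXT)")
    conn.executemany("INSERT INTO notes VALUES (?, ?)", rows)
    conn.commit()
    return conn


master = make_db([(1, "buy milk"), (2, "call mom")])
replica_ok = make_db([(1, "buy milk"), (2, "call mom")])
# One silently diverged row, as might happen after a bad replication event.
replica_drifted = make_db([(1, "buy milk"), (2, "call mum")])

in_sync = table_checksum(master, "notes") == table_checksum(replica_ok, "notes")
drifted = table_checksum(master, "notes") != table_checksum(replica_drifted, "notes")
```

On a live system the comparison would also have to account for replication lag, since a replica that is merely behind will differ too.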

obtu 14 years ago | | |

<rant>Simple, but not reliable. I've seen admins enable statement-based replication without understanding it, and trash the db. Which is generally my gripe with MySQL: it has some popular features that only work if you don't look at them too closely; starting with support for the SQL standard.</rant>

PostgreSQL's built-in replication is pretty easy to set up[1] and provides a writable master, and a cascade of slaves. Slaves can be synchronous or asynchronous, and the synchronicity can be turned off per transaction.

[1] http://www.depesz.com/2011/01/24/waiting-for-9-1-pg_baseback...

ArbitraryLimits 14 years ago | |

In addition to the other factors listed here, Postgres's default configuration was tuned to a dramatically underpowered machine for many years. Yes, that meant that the occasional user who did have that kind of machine saw acceptable performance out of the box, but the other 9/10 systems burned a lot of the DBA's time tuning the system. I think that's a big reason why Postgres has a reputation for difficulty in some quarters.

dgregd 14 years ago | |

For a long time there was no PostgreSQL for Windows. So Windows developers had to use MySQL.

MySQL is good enough, so there is no need to migrate applications for a few additional features.

huggyface 14 years ago | |

MySQL was very, very easy to get started with, getting to a cruise quickly. PostgreSQL offered more of a curve.

In the Windows world the same is true of SQL Server -- the setup, connectivity, and basic usage are so incredibly easy that they made it the first choice of many teams.

This seems incredible -- that products are chosen on such an irrelevant-in-the-long-term basis -- however it has proven true across almost all of the computing market, even targeting highly skilled developers. PHP has few competitive merits, yet it was the default option for many because it was so easy to make something basic in.

There's a lesson in that.

untog 14 years ago | | |

"This seems incredible -- that products are chosen on such an irrelevant-in-the-long-term basis"

I don't think there is anything too incredible in that. If you want to throw together an idea quickly, get it out there and test the response, then use whatever technology gets the job done quickest. You can always change later.

Why waste huge amounts of time setting up a technically perfect database for a product it turns out no-one wants?

j_col 14 years ago |

Very interesting to see them bucking the trend. I love the closing line:

"But we’re relatively satisfied with sharded MySQL storage for Evernote user account metadata, even though that’s not going to win any style points from the cool kids."

Indeed, hipsters beware!

opendomain 14 years ago |

There are other reasons to choose NoSQL. For example, when Craigslist was using MySQL and they had to change their schema, it took MONTHS to roll out the change across all their slaves. You can also have a mixed strategy of using both an RDBMS and NoSQL to achieve consistency while staying flexible to architecture changes. Lastly, have you looked at total overall cost? Setting up a large cluster with MySQL will have a large operational cost, and it may not be partition tolerant, so if the wrong servers go down, the failure may cascade to your whole data store.

dabeeeenster 14 years ago |

It helps that they have a perfectly shardable product I guess.

rmc 14 years ago | |

Yes, if you're Google (where any page can link to any other) or Facebook (where a person can friend any other) you wouldn't have this. But lots of businesses that provide software solutions do usually have something that is very localised.

HarrisonFisk 14 years ago | | |

It's a bit funny that you use Facebook as an example since they use sharded MySQL as their primary data store.

zv 14 years ago |

tldr - it works, we don't care about "being cool"

rmc 14 years ago | |

The true hacker style

fkn 14 years ago |

Can anyone explain the following bit: "They’re cleanly partitioned into 20 million data separate data sets, one per user."

Does it mean they have a database per user? That can't be right, can it?

herge 14 years ago | |

It means that the data from one user has no relations to the data from another user. So most if not all their queries only query the data of one user.

This is really useful for things like sharding, where you can split a database table onto more than one machine, because there will be few queries that will stall fetching data from one machine to another.
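As a sketch of why per-user independence makes sharding easy, here is a toy router in Python. It is not Evernote's actual scheme; `NUM_SHARDS`, `shard_for`, `add_note` and `notes_for` are all hypothetical names, and each dict stands in for a separate database server.

```python
import hashlib

NUM_SHARDS = 4  # hypothetical shard count


def shard_for(user_id, num_shards=NUM_SHARDS):
    """Map a user id to a shard index with a stable hash."""
    digest = hashlib.sha256(str(user_id).encode("ascii")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Each dict stands in for one database server: {user_id: [notes]}.
shards = [dict() for _ in range(NUM_SHARDS)]


def add_note(user_id, text):
    shards[shard_for(user_id)].setdefault(user_id, []).append(text)


def notes_for(user_id):
    # Only one shard is consulted, because a user's data never spans shards.
    return shards[shard_for(user_id)].get(user_id, [])


add_note(42, "buy milk")
add_note(42, "call the bank")
add_note(7, "book flights")
```

Every request names a user, so every request can be answered by exactly one shard. No query has to join across machines.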

driverdan 14 years ago | |

Why not have a database for each user? Evernote's data is partitioned perfectly for that. Notebooks and notes are accessible to one user or are public. There is no sharing notes between users.

mitchellhislop 14 years ago | | |

There is sharing notes between users though - I have several shared notebooks, each holding shared notes.

brown9-2 14 years ago | |

http://blog.evernote.com/tech/2011/05/17/architectural-diges...

swah 14 years ago | |

Their SQL is executed per user, so they only touch around 1/20M the size of the database for any request.

daleharvey 14 years ago | |

Not likely, but they can very easily have a database holding all users with id starting with 'a'

driverdan 14 years ago | | |

That's a really bad method of sharding. Names do not distribute equally over the alphabet.
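A small experiment makes the skew concrete. The username population below is invented and deliberately lopsided (it is not real data), and `shard_by_letter` and `shard_by_hash` are hypothetical helpers; the sketch only compares how the two schemes spread the same names over 26 shards.

```python
import hashlib
from collections import Counter

NUM_SHARDS = 26

# Invented, deliberately skewed population of 10,000 usernames.
letter_counts = {"a": 4000, "j": 3000, "m": 2000, "s": 900, "x": 90, "z": 10}
usernames = [
    f"{letter}user{i}"
    for letter, count in letter_counts.items()
    for i in range(count)
]


def shard_by_letter(name):
    """One shard per initial letter: 'a' -> 0, 'b' -> 1, ..."""
    return ord(name[0]) - ord("a")


def shard_by_hash(name):
    """Shard chosen by a stable hash of the whole name."""
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


by_letter = Counter(shard_by_letter(n) for n in usernames)
by_hash = Counter(shard_by_hash(n) for n in usernames)

print("largest shard, first letter:", max(by_letter.values()))
print("largest shard, hash:        ", max(by_hash.values()))
```

With first-letter sharding the busiest shard holds 4,000 of the 10,000 users and most shards stay empty. The hash spreads the same names far more evenly, since the shard no longer depends on spelling.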

trustfundbaby 14 years ago |

I get where they're coming from, but I do find this to be a little smug :)

See, they haven't run into problems with their setup; MySQL 'just works' for them.

What would be interesting and educational (for me anyway) would be a situation where folks who ran into serious problems with their SQL setup despite doing the 'right things' persevered where conventional wisdom would have them switch to a NoSQL solution.

tldr; Dog bites man article, would love to hear from someone that actually struggled with a SQL solution and soldiered on.

bitdiffusion 14 years ago |

The notebook/note example is weak - in a nosql database you need to design your data structure appropriately to get the level of atomicity you require.

Storing an entire notebook in a single document would be the most obvious. I use postgres all the time and sql is great, but pooh-poohing nosql because it wouldn't work with your relational structure is not the best idea. Also - I have found a hybrid between nosql (mongodb) and sql (postgres) is ideal - who says you need to use a single database?

bni 14 years ago | |

What about when you want to find all notes that were made on a specific day last month (say for a report)?

Traverse all notebook documents and look at each note's date? Good luck with that.
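For comparison, with one row per note the report is a single indexed query. This is a minimal sketch using Python's `sqlite3`; the `notes` table, its columns and the sample rows are assumptions made up for illustration, not Evernote's schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE notes (
        id INTEGER PRIMARY KEY,
        notebook_id INTEGER NOT NULL,
        created TEXT NOT NULL,  -- ISO 8601 date, e.g. '2012-01-05'
        body TEXT
    )
    """
)
# The index lets the database find one day's notes without scanning every row.
conn.execute("CREATE INDEX notes_created ON notes (created)")
conn.executemany(
    "INSERT INTO notes VALUES (?, ?, ?, ?)",
    [
        (1, 10, "2012-01-04", "packing list"),
        (2, 10, "2012-01-05", "meeting minutes"),
        (3, 11, "2012-01-05", "recipe"),
        (4, 12, "2012-01-06", "gift ideas"),
    ],
)

# All notes made on one day, regardless of which notebook holds them.
report = conn.execute(
    "SELECT id, notebook_id FROM notes WHERE created = ? ORDER BY id",
    ("2012-01-05",),
).fetchall()
```

The query cuts across notebooks without touching any notebook as a whole, which is the access pattern the comment is worried about.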

artsrc 14 years ago | | |

Many nosql databases support queries. If you are using one of these then you are in a better place for features like that than you are with heavily sharded SQL.

tommyd 14 years ago | | |

If you were to use MongoDB, wouldn't this just be a case of adding an index to the field of the nested child document (i.e. the note within a notebook) you were interested in and then querying on it? e.g. db.notebooks.ensureIndex({"notes.date": 1});

morganpyne 14 years ago | |

Well, taking a consistent snapshot for backups is easier when it's in a single source. I know there are ways around this (ZFS!) and not everyone needs synced backups to the millisecond but it can complicate backups (or more to the point - restores)

huggyface 14 years ago | |

"Storing an entire notebook in a single document would be the most obvious."

The cost of course being that a change to any note in a notebook yields a save of the entire document. Not a problem in simple cases, but that sort of mass-write-amplification can kill you (talk to Digg about that).
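A back-of-the-envelope sketch of that amplification, in Python. It assumes the store rewrites the whole serialized document on every save, as the comment describes; real document stores may support partial updates. The notebook size, the JSON encoding and the helper names are all made up for illustration.

```python
import json

# Hypothetical notebook: 500 notes of roughly 1 KB each.
notebook = {
    "title": "Recipes",
    "notes": [{"id": i, "body": "x" * 1000} for i in range(500)],
}


def save_as_document(notebook, note_id, new_body):
    """Edit one note, then serialize the whole notebook; return bytes written."""
    for note in notebook["notes"]:
        if note["id"] == note_id:
            note["body"] = new_body
    return len(json.dumps(notebook).encode("utf-8"))


def save_as_row(note_id, new_body):
    """Serialize only the changed note; return bytes written."""
    return len(json.dumps({"id": note_id, "body": new_body}).encode("utf-8"))


new_body = "y" * 1000
doc_bytes = save_as_document(notebook, 3, new_body)
row_bytes = save_as_row(3, new_body)
amplification = doc_bytes / row_bytes

print(f"document save: {doc_bytes} bytes, row save: {row_bytes} bytes")
print(f"amplification: about {amplification:.0f}x")
```

Under these assumptions a one-note edit writes the whole notebook, so the bytes written scale with the number of notes in it rather than with the size of the change.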

"Also - I have found a hybrid between nosql (mongodb) and sql (postgres) is ideal - who says you need to use a single database?"

Simplicity. Coherency. Maintainability. And on. Sure, it might make sense, but if you already have your toes in the "SQL" world, it is usually worthwhile to dunk your whole foot in. Many SQL products also offer the document functionality of MongoDB, for instance. SQL Server, as an example, lets you store XML documents to your heart's content, which you can index and intelligently query upon, etc. Your scheme is boundless, and on and on.

artsrc 14 years ago |

If the replication is asynchronous then SQL databases are not durable. So the most important feature of SQL databases generally isn't one.

oacgnol 14 years ago |

I can imagine that while Evernote has a lot of data to store, it doesn't have the massive amount of concurrent reads that might occur with an equally large web app. Do they publish numbers on read/write usage?

EtienneK 14 years ago |

The biggest news to me was that they are using MySQL.

shingen 14 years ago |

It's amazing how when you focus on proven (but supposedly boring or old) technology that just works, and works very well, you can devote a lot of other resources to the actual product and usability.

Maybe it's the 30 year old in me showing, but I'm sticking with the 'it just works' crowd. Until some other approach provides a staggeringly overwhelming reason to switch. I find scaling up with MySQL to be ridiculously easy, allowing me to focus my time elsewhere. RAM, bandwidth, and fast storage have gotten substantially cheaper in the last few years, making it that much easier and more cost-effective to throw hardware at scaling up. For 99.9% of the Web, those hardware resources are expanding in value much faster than traffic is increasing.

(It's understood other developers find it just as easy to take a different approach)

jbverschoor 14 years ago |

That was the worst "why we still use mysql"-post ever.

4ad 14 years ago |

This article is very weak. They insist a lot on ACID, but ACID and SQL are completely orthogonal concepts. Most NoSQL products are ACID.

Also, the example itself is very weak, as bitdiffusion pointed out.

mapgrep 14 years ago | |

Good point about SQL not having a lock on ACID. CouchDB in particular is very proud of its out-of-the-box ACID compliance.

People sometimes conflate the DB access approach (document vs relational) with the storage approach (transactional vs warehouse). This may be because the NoSQL poster child, MongoDB, at one point defaulted to a non ACID mode of operation. But you can have a relational DB that's not ACID (MySQL 3) and an object DB that is (Couch).

I was surprised so much of the original article focused on ACID as though it were the biggest selling point for an RDBMS. It seems like the biggest win (right now) is the sheer number of things a typical RDBMS does for you -- not just ACID but also data integrity (foreign key constraints), automatic index creation (mostly), and automated schema changes across many records (ALTER TABLE). The cost, of course, is the up-front effort of fitting your data and app to the relational model.

- The new databases/stores have different features and use cases. Many include features RDBMSs do not have.
- Having to use tables/columns for everything can be quite unnatural and tiresome.
- RDBMSs are battle-tested and their pros & cons are well known. But they might also be based on legacy models and truths that simply no longer hold.
- At least for me, using & learning something new is a big motivation booster :)