Announcing MoSQL (stripe.com)
HSTORE can be fully indexed (GiST and GIN). You just have to roll your own object graphs for nesting, if that's what you need to do.
I swear I have typed this exact same comment previously. Deja vu, maybe
Those "Indexes on Expressions" are a really great feature that can also be combined with XML (not just JSON) and any other type. I recommend that everyone have a look at them:
http://www.postgresql.org/docs/9.2/static/indexes-expression...
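To make the idea concrete, here is a minimal sketch of an index on an expression. It uses Python's built-in sqlite3 module as a stand-in so it runs anywhere; the table and index names are made up for illustration. PostgreSQL accepts the same `CREATE INDEX ... ON table (lower(column))` form, though its planner and index types differ.

```python
import sqlite3

# In-memory SQLite database used only as a runnable stand-in for PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT)")

# Index the result of an expression rather than the raw column value.
conn.execute("CREATE INDEX customers_lower_email ON customers (lower(email))")

conn.executemany(
    "INSERT INTO customers (email) VALUES (?)",
    [("Alice@Example.com",), ("bob@example.com",)],
)

# A query that filters on the same expression can be served by that index.
rows = conn.execute(
    "SELECT id FROM customers WHERE lower(email) = ?",
    ("alice@example.com",),
).fetchall()
print(rows)

# Ask the planner how it intends to run the query (output varies by version).
for step in conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT id FROM customers WHERE lower(email) = 'alice@example.com'"
):
    print(step)
```

The same pattern applies to expressions that extract a value from an XML or JSON column, as long as the query uses exactly the indexed expression.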
One of the reasons MongoDB is so popular is that it is a fantastic database for developers. As a Java developer I can deal in my code with sets, hashmaps, and embedded structures and have them map effectively 1-1 to the database. It's akin to an object database, meaning you can focus higher up in the stack.
With the SQL ORMs you can't avoid having to deal with the ER model.
MongoDB et al. are basically built around the assumption that a schema is never worth the complexity. It's a bold claim that contradicts many decades' worth of database research.
And for the record, where I work we use a SQL store, Redis, and MongoDB, each where the use case suits it.
I've always liked the paradigm of doing analysis on "slower" data stores, such as Hadoop+Hive or Vertica if you have the money. Decoupling analysis tools from application tools is both convenient and necessary as your organization and data scale.
SELECT c.email FROM customers c, subscriptions s WHERE c.subscription_id = s.id AND s.status = 'active' AND s.trial_start IS NOT NULL;
(where, of course, the customers and subscriptions tables would be virtual views on your customers and subscriptions)
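For anyone who wants to try that query, here is a small runnable sketch. It uses Python's built-in sqlite3 module with real tables instead of virtual views, and the column types and sample rows are assumptions made up for the example. Note the single quotes around 'active': in standard SQL (and PostgreSQL) double quotes denote an identifier, not a string.

```python
import sqlite3

# In-memory database with invented sample data standing in for the real tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE subscriptions (
    id INTEGER PRIMARY KEY,
    status TEXT,
    trial_start TEXT
);
CREATE TABLE customers (
    id INTEGER PRIMARY KEY,
    email TEXT,
    subscription_id INTEGER REFERENCES subscriptions(id)
);
INSERT INTO subscriptions VALUES
    (1, 'active', '2013-01-01'),
    (2, 'active', NULL),
    (3, 'canceled', '2013-01-15');
INSERT INTO customers VALUES
    (1, 'a@example.com', 1),
    (2, 'b@example.com', 2),
    (3, 'c@example.com', 3);
""")

# The query from the comment: active subscribers who started with a trial.
query = """
SELECT c.email
FROM customers c, subscriptions s
WHERE c.subscription_id = s.id
  AND s.status = 'active'
  AND s.trial_start IS NOT NULL;
"""
emails = [row[0] for row in conn.execute(query)]
print(emails)
```

Only the first customer matches: the second has no trial start and the third is not active.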
This kind of comment shows how little knowledge you have about NoSQL and SQL. It's not SQL vs. NoSQL; it's about using the right technology for the job.
The question is perfectly valid. In many scenarios (not necessarily Stripe's), PostgreSQL is fast enough to do the job. Stop putting people down for legitimate engineering questions.
Try not to be condescending and your point will be better received. "Right technology," as I'm sure you're aware, has as much to do with subjectivity as appropriateness. Familiarity, workflow, ease of use (and did I mention familiarity?) cannot be overstated even when the perceived benefits are considered.
Read: religion.
Some of the people who rail against NoSQL may be deriding it out of a knee-jerk reaction; others, however, are simply frustrated with developers who, as Ted Dziuba would say, "value technological purity over gettin' shit done".
Relational databases were created in the first place to solve these very problems around transactionality and analytics for finance.
This library is a beautiful example of reinventing the wheel, and otherwise creating a patchwork of unnecessary - and ultimately brittle - infrastructure.
https://github.com/10gen-labs/mongo-connector/tree/master/mo...
Seems to be high quality, and supports replica sets.
I'd also like to mention a project I've been contributing to, Mongolike
[My fork is at https://github.com/e1ven/mongolike , once it's merged upstream, that version will be the preferred one ;) ]
It implements mongo-like structures on TOP of Postgres. This has allowed me to support both Mongo and Postgres for a project I'm working on.
I thank them for releasing this.
It's much more effective and efficient to use a SQL query than it is to throw together a huge amount of imperative JavaScript code (that's usually very specific to a single NoSQL database, as well) merely to perform the equivalent query.
It's much safer to use a database that offers true support for transactions and constraints, rather than trying to hack together that functionality in some Ruby or PHP data layer code, or relying on some vague promise of "eventual consistency", for instance.
It's much more maintainable, and leads to higher-quality data, to spend some time thinking about a schema, rather than just arbitrarily throwing data into a schema-less system, and then having to deal with the lack of a schema throughout any application code that's ever written.
Aside from an extremely small and limited handful of situations (Google and Facebook, for instance), relational databases are the best tool for the job.
Honestly. I don't think you could be more misinformed if you tried.
Hint: Google "Big Data".
Look at the old NoSQL systems: InterSystems Caché got a SQL interface, and GT.M (in the PIP framework) also got SQL.
My impression is that MongoDB looks a lot like MUMPS storage, with globals in JSON.
I actually played with mongo_fdw. At this point, it's a really cute hack, and useful for some things, but it doesn't give Postgres enough information and knobs to really let the query planner work effectively, so it ends up being really slow for complex things. I do love the concept, though.
Hey, one way to do that is to use the MongoDB foreign data wrapper, which was also mentioned in some of the earlier threads.
mongo_fdw (https://github.com/citusdata/mongo_fdw) allows you to run SQL on MongoDB on a single node. Citus Data allows you to parallelize your SQL queries across multiple nodes (in this case, multiple MongoDB instances) by just syncing shard metadata. So you would effectively run SQL on a sharded mongo cluster without moving the data anywhere else.
Another idea could be to use MoSQL to neatly replicate each mongo instance to a separate PostgreSQL instance, and then use Citus Data to run distributed SQL queries across the resulting PostgreSQL cluster.
I have read that they announced some features in version 2.x, so is it great now?
PostgreSQL scales surprisingly well for this purpose, and is much nicer for interactive queries than Hadoop/Hive. We use Impala[1] for some larger datasets, but Impala is comparatively new, and it's nice to have something as battle-tested as postgres here.
As for the "why do we need realtime?": In my mind the benefit of a near-realtime replica is not that you actually often need it, but that it means you never have to ask the question of "Was this snapshot refreshed recently enough?", and never end up having to wait several hours for an enormous dump/load operation, when you realize you did need newer data.
[1] http://blog.cloudera.com/blog/2012/10/cloudera-impala-real-t...
I do agree that PostgreSQL would be nicer for interactive queries. Waiting for an M/R job to spin up is a bit of a buzzkill.
With regard to your use cases, what sort of questions have you found yourself answering the most? Do you have analytics applications running off of this?
How was your overall experience with Impala? Did you guys have a fairly new Hive cluster to try it out on, or did you just spin up a new one, since Impala can only read certain file formats (i.e. no custom SerDe)?
Also, are the Hive/Hadoop datasets more for data exploration, while this PostgreSQL solution is for smaller datasets that return in a few seconds and would not perform well in Hive due to the cost of setting up a MapReduce job?
(In full disclosure, I wrote mongo_fdw for PostgreSQL.)
I'm thinking of this as something like polyglot memoization. Pretty cool when you think about it. Frequently need something that is slow in NoSQL, but fast in SQL? Memoize it to your SQL datastore. The alternative has always been to write it to two places. I kind of dig moving this out to the datastore to figure out.
I'm thinking that plenty of people will find this useful.
MongoDB is great for failover and for rapid development or prototyping. SQL is great for reporting or analytics, since you can do all kinds of aggregates and JOINs right in the database.
The edge cases where you can't represent the data perfectly aren't a huge deal for this use case -- because it's a one-way export, you don't have to be able to round-trip the data, and as long as you can export the data you want to run analysis on, it doesn't matter if there's some you can't get.
I consider both HSTORE (key/value) and the current JSON type and record functions to be just intermediate steps to a fuller API [0].
[0]: http://www.postgresql.org/message-id/50EC971C.3040003@dunsla...
Our experience was that mongo_fdw doesn't (yet?) give postgres enough information and knobs to plan JOINs efficiently, which is one of the things we wanted. I got a decent amount of leverage out of using mongo_fdw and then cloning to native tables using SELECT INTO, though :)
I'm sure there are other factors involved for MoSQL, but they are probably outside the scope of this post. I'd love to chat about them offline.
And your whole "use the right tool for the right job" goes without saying. It's others who seem to be obsessed with this "SQL is perfect for everything" delusion.
Unless MongoDB et al are saying "always use MongoDB et al and never an RDBMS", then I'm not sure how you arrived at the conclusion that "the schema is never worth the complexity."
If anything, the appropriate assumption is, "schemas aren't always worth the complexity." When they are, you use an RDBMS. When they aren't, you don't bother with the data integrity constraints.
The bottom line is if you drop in a second data store because you have a few fields in your database that are a pain to model with a schema, you are doing yourself a disservice compared to just doing ALTER COLUMN foo hstore.
My colleague mcfunley wrote an article about this blind spot when people talk about these issues:
> MongoDB et al. are basically built around the assumption that a schema is never worth the complexity. It's a bold claim that contradicts many decades' worth of database research.
You may well argue that if you have N-1 applications using PostgreSQL, and the Nth application could---on its own---justifiably use MongoDB, then it is still appropriate to use PostgreSQL in favor of not adding Yet Another DB Engine.
But that is nothing more than a specific case that is often ignored in the "best tool for the job" mantra. It does not mean that schemas are never worth the complexity of an RDBMS.
All I'm saying is that you can't claim that a recommendation of MongoDB assumes schemas are never worth the complexity; you can only claim that the assumption is that they are sometimes not worth the complexity.
More generally, MongoDB makes no assumption that contradicts "years of DB research."
http://www.postgresql.org/docs/9.2/interactive/external-pl.h...
However, the PostgreSQL documentation doesn't mention JavaScript support anywhere. Are you sure a mature PL/JavaScript binding for PostgreSQL exists? If so, their docs should be updated.
[1] There's also PL/pgSQL, but that's a special-purpose language you won't find outside the database world. I don't recommend learning it unless you have strange requirements that make PL/pgSQL a perfect fit. For normal usage, use PL/Python or PL/Perl. In simple cases, use SQL directly.
I can't vouch for any particular maturity level, but it seems to have active users and it's been around for a few years already.
You could build something like that on top of SQL, but it's nice to have a tool where you don't have to.
In the situations you describe, and when using most NoSQL databases, there's still a schema. It's just stored in the minds of developers, in documentation that's correct and up-to-date, in documentation that is incorrect and outdated, throughout application code, and numerous other places.
Then there's the sensible approach taken by most relational database systems, where the schema is centralized, it is described with some degree of rigor, and it can be more safely modified and managed.
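As a small illustration of what a centralized schema buys you, the sketch below declares the allowed values once, in the table definition, and lets the database reject bad data. It uses Python's built-in sqlite3 module as a stand-in for a full relational database; the table, column, and status values are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The rule lives in one place: the schema, not scattered application code.
conn.execute("""
CREATE TABLE subscriptions (
    id INTEGER PRIMARY KEY,
    status TEXT NOT NULL
        CHECK (status IN ('active', 'trialing', 'canceled'))
)
""")

# A valid row is accepted.
conn.execute("INSERT INTO subscriptions (status) VALUES ('active')")

# A misspelled status is rejected by the database itself.
try:
    conn.execute("INSERT INTO subscriptions (status) VALUES ('actve')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

count = conn.execute("SELECT COUNT(*) FROM subscriptions").fetchone()[0]
print(rejected, count)
```

In a schema-less store, the equivalent check has to be repeated (and kept in sync) in every piece of code that writes the data.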
I imagine that with this tool you need to start being a bit more careful with the flexibility that initially drew you to mongo.
If you need the data in SQL, you can either parse the JSON somehow, or rebuild the SQL table with a MoSQL schema that knows about the new fields.
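A rough sketch of that trade-off, in plain Python: fields named in a mapping become columns, and anything else is kept as a JSON string that you would have to parse later. The mapping, the `to_row` helper, and the `extra_props` column name are all hypothetical and only illustrate the idea; this is not MoSQL's actual code or configuration format.

```python
import json

# Hypothetical mapping from document fields to SQL column names.
SCHEMA = {"_id": "id", "email": "email", "plan": "plan"}


def to_row(doc, schema=SCHEMA):
    """Flatten one document into a row dict, keeping unknown fields as JSON."""
    # Known fields become columns; missing ones become None (SQL NULL).
    row = {column: doc.get(field) for field, column in schema.items()}
    # Fields the mapping does not know about are preserved as a JSON blob.
    extras = {key: value for key, value in doc.items() if key not in schema}
    row["extra_props"] = json.dumps(extras, sort_keys=True)
    return row


# "coupon" was added to the documents after the mapping was written.
row = to_row({"_id": "c1", "email": "a@example.com", "coupon": "SPRING"})
print(row)
```

To query `coupon` as a proper column, you would add it to the mapping and rebuild the table; until then it is only reachable by parsing the JSON.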
A Postgres bouncer + WAL replication achieves a similar result: There is no downtime on failover, but there is a single slave.
Of course, it's understandable why they have a bad impression of SQL; they've only ever used one of the most inept implementations around.
Those who are willing to try one of the more mature and sensible relational database systems usually see quite quickly the value that such systems provide.
If there are production-ready options for biggish data other than NoSQL or high priced commercial analytics dbs, please share...
Also, your comment is rather ambiguous. I certainly hope you're not calling PostgreSQL experimental, because that would be laughable; there are several examples of multi-terabyte databases using it.
SQL is NOT always the right tool for the job.
There are plenty of situations where a Hadoop or a Storm/S4 approach works better. Again, it's about picking the right technology for the task at hand.
Where we use MongoDB, it's not because of speed. PostgreSQL is certainly capable of fast performance. MongoDB is useful for its ability to log freeform data as well as for its replication model. (We use sharded MongoDB in a few places, but mostly use straight replica sets.)
We use MySQL, MongoDB, PostgreSQL, and Impala. They're all useful in different places.
These are the same guys who built hstore, full text search, and GIN and GiST indexes, and I think they are working on a generic regular expression index type right now.
Thanks for the clarification, but this makes it even more obvious your engineering team is introducing needless complexity into your organization.
Postgres can store unstructured data just fine, so you have a 'solution' that uses 3 OLTP stores instead of one.
Making developers productive is an important aspect for choosing a database.
How are you liking Impala? We just dropped 0.5 release yesterday which includes the JDBC driver :D!
Edit: Awesome job on the Ruby client, it's great!
I've been meaning to write a MoSQL equivalent for our Impala data, but at the moment we're doing a more traditional ETL.
I've passed your comment on to Colin, who wrote the Ruby client -- I'm sure he'll appreciate it!
There is absolutely no reason to build a banking system on GT.M, but they did.
Although: GT.M is the only(?) NoSQL that is ACID-compliant.
Yes, there is. PostgreSQL doesn't support multi-master replication, which makes it a terrible choice if you really want to make sure every transaction gets written. I really wonder at what point the people who keep recommending PostgreSQL are going to wake up and realise what is happening in the industry.
People are scaling OUT not UP. Especially startups.
Many startups would be using AWS and it is not inconceivable that you would have Multi-AZ/Multi-Region VPSs. Scaling out != Expensive.
Startups need to scale out because many of them like to deploy on mediocre EC2 instances with the slowest SAN storage ever.
People that keep recommending PostgreSQL are rightfully ignoring this industry.
No. They need to scale out because providers like AWS have outages. And so startups et al. need to deploy in multiple AZs/regions in order to have as close to 100% uptime as possible. You can't do that without a well-considered multi-master-style replication strategy, which PostgreSQL frankly doesn't have.
> People that keep recommending PostgreSQL are rightfully ignoring this industry.
Sure. And soon enough they will be relegated to the dustbins of history. The trends don't lie.
Wah. And you do not even seem to be ironic. Trends always lie, there is always a next thing that will take the opposite direction, in philosophy, in science, and particularly so in computing stuff.
More important questions are how is the data stored, how is it accessible, how can you scale the system, what operational constraints are there, how fast is it, what types of data modeling can be done, what consistency/transaction guarantees does it provide, etc. These are the things that will make developers productive because they will not be putting out fires all the time.
Thanks for the kind words!