Comparing Database Types

521 points by ironcutter 6 years ago | 170 comments

taffer 6 years ago |

> Relational databases get their name from the fact that relationships can be defined between tables.

This is a widespread misconception. Relational databases get their name from relations in the mathematical sense[1], i.e. sets of tuples containing facts. The basic idea of the relational model is that logical predicates can be used to query data flexibly without having to change the underlying data structures.

The basic paper by Codd[2] is really worth reading and describes, among other things, the problems of hierarchical and network databases that the relational model is meant to solve.

[1] https://en.wikipedia.org/wiki/Finitary_relation

[2] https://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf

tlarkworthy 6 years ago | |

What is the relation? The table? (I.e. the tuples describing the rows). Or is it the joins? (In which case the article is correct). The columns?

woolcap 6 years ago | | |

One way to think about it is with a mathematical relation, like 'X > Y'. A relational database relation representing this relation would consist of a header tuple, <X, Y>, and a set of tuples whose values satisfy the relation, such as <10, 2>, <8, 3>, <9, 4>. In more common terms, the rows of this table would contain pairs of numbers in which the value of the X attribute is greater than the value of the Y attribute. This table describes the relation(ship) of certain pairs of numbers.

"Each tuple in a relation represents an n-ary relationship...among a set of n values..., and the full set of tuples in a given relation represents the full set of such relationships that happen to exist at some given time--and, mathematically speaking, that's a relation."[1]

[1] Chris Date, Database in Depth, page 46
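
A minimal Python sketch of the 'X > Y' example above. The variable names are invented for illustration; the point is only that a relation is a header plus a set of tuples that all satisfy the same predicate.

```python
# Hypothetical illustration: a relation as a header plus a set of tuples.
header = ("X", "Y")
body = {(10, 2), (8, 3), (9, 4)}

# Every tuple in the body satisfies the predicate X > Y.
satisfies = all(x > y for x, y in body)

# The body is a set, so adding a tuple that is already present changes nothing.
body_with_duplicate = body | {(10, 2)}
```

Using a set rather than a list mirrors the mathematical definition: there is no row order and no duplicate rows.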

baddox 6 years ago | | |

The relation is essentially all the rows in a given table. In relational algebra, a relation is a set of tuples of a fixed length. Each position in each tuple is associated with some attribute (essentially the name of the column) and each element in a given position is a value of a certain "data domain" (essentially a data type, like "integer").

taffer 6 years ago | | |

The table is the relation. A join is just an operation that combines two relations into a superrelation.

skybrian 6 years ago | | |

The value in a table at a particular time is a relation, but so is the value of a view, the result of a subquery, the result of a select statement, and the value represented by the contents of a CSV file. In some APIs this is called a row set.

It's essentially a table-shaped value. Conceptually it's immutable, and relational algebra is about calculating new values from old ones. A select statement does this too.
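
To make "calculating new values from old ones" concrete, here is a rough Python sketch of a natural join over immutable relation values. `Relation` and `natural_join` are hypothetical names for this example, not part of any library.

```python
from typing import NamedTuple


class Relation(NamedTuple):
    """An immutable, table-shaped value: attribute names plus a set of rows."""
    header: tuple
    body: frozenset


def natural_join(r: Relation, s: Relation) -> Relation:
    """Return a new relation pairing rows that agree on all shared attributes."""
    common = [a for a in r.header if a in s.header]
    extra = [a for a in s.header if a not in r.header]
    rows = set()
    for r_row in r.body:
        r_map = dict(zip(r.header, r_row))
        for s_row in s.body:
            s_map = dict(zip(s.header, s_row))
            if all(r_map[a] == s_map[a] for a in common):
                rows.add(r_row + tuple(s_map[a] for a in extra))
    return Relation(r.header + tuple(extra), frozenset(rows))


employees = Relation(("name", "dept_id"), frozenset({("ann", 1), ("bob", 2)}))
departments = Relation(("dept_id", "dept"), frozenset({(1, "ops"), (2, "dev")}))

# The inputs are left untouched; the join produces a third relation.
joined = natural_join(employees, departments)
```

The nested loop is the simplest possible strategy and is only meant to show the semantics; a real engine would choose a join algorithm based on the data.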

dsego 6 years ago | | |

In a sense it reminds me of sentence structure. The table/relation is like the predicate, and the rows contain all the subjects/objects that the predicate applies to.

ashelmire 6 years ago | | |

A relation is a tuple (H, B) with H, the header, and B, the body, a set of tuples that all have the domain H.

jrapdx3 6 years ago | | |

As a non-mathematician I've always conceived of the relations among data as existing in the queries. In a SQL query the "relation" is specified by matching columns in the tables containing the data we're looking for. Conceptually indexes, etc., are system implementation details.

That's certainly not a rigorous definition but helped me keep my head straight about what I was doing.

nikolasburk 6 years ago | |

Thanks for the hint, we'll update the article! :)

triska 6 years ago | | |

Related to the logical view, it would also be great to include deductive databases:

https://en.wikipedia.org/wiki/Deductive_database

Deductive databases derive logical consequences based on facts and rules.

Datalog and its superset Prolog are notable instances of this idea, and they make the connection between the relational model and predicate logic particularly evident.

Codd's 1979 paper Extending the Database Relational Model to Capture More Meaning contains additional information about this connection. For example, quoting from Section 3 Relationship to Predicate Logic:

"We now describe two distinct ways in which the relational model can be related to predicate logic. Suppose we think of a database initially as a set of formulas in first-order predicate logic. ..."
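
As a rough illustration of "facts and rules", the following Python sketch evaluates two Datalog-style rules by repeating them until no new facts appear. The facts and the function name are made up for the example, and real Datalog engines use more efficient evaluation strategies.

```python
# Facts: parent(X, Y) means X is a parent of Y.
parent = {("alice", "bob"), ("bob", "carol"), ("carol", "dave")}


def derive_ancestors(parent_facts):
    """Naive fixpoint evaluation of two rules:

    ancestor(X, Y) :- parent(X, Y).
    ancestor(X, Z) :- parent(X, Y), ancestor(Y, Z).
    """
    ancestor = set(parent_facts)  # first rule
    while True:
        # Second rule: extend every known ancestor fact by one parent step.
        new = {
            (x, z)
            for (x, y) in parent_facts
            for (y2, z) in ancestor
            if y == y2
        }
        if new <= ancestor:  # nothing new was derived, so we are done
            return ancestor
        ancestor |= new


ancestors = derive_ancestors(parent)
```

Note that the derived facts form a relation just like the stored ones, which is the connection to the relational model mentioned above.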

woolcap 6 years ago |

> Relational databases get their name from the fact that relationships can be defined between tables.

Relational databases get their name from the mathematical concept of a relation, used by the Relational Model, "an approach to managing data using a structure and language consistent with first-order predicate logic, first described in 1969 by English computer scientist Edgar F. Codd, where all data is represented in terms of tuples, grouped into relations." [1][2] (emphasis added)

Recommended Reading: Database In Depth, by Chris Date.

[1] https://en.wikipedia.org/wiki/Relational_model

[2] https://en.wikipedia.org/wiki/Relation_(database)

reilly3000 6 years ago | |

I did a MOOC on relational algebra that made me much more productive in SQL and better appreciate the gravity of what RDBMS really offer. Understanding relational algebra helps demystify the magic of query planners and grok why they both add and reduce latency based on use cases.

edmundsauto 6 years ago | | |

Mind sharing the course? Sounds useful.

pjungwir 6 years ago |

Codd's 1979 paper "Extending the Relational Model" [1] is really interesting, especially the second half. The first half is about nulls and outer joins, and I think that steals everyone's attention. But the second half basically gives a way to turn your RDBMS into a graph database by (among other things) letting you query the system catalog and dynamically construct your query based on the results. This would never work with today's parse-plan-optimize-execute pipelines, but it's a really cool idea, and I've certainly often wished for something like it. I'd love to know if anyone has followed up on these ideas, either in scholarship or in built tools.

[1] https://gertjans.home.xs4all.nl/usenet/microsoft.public.sqls...

einpoklum 6 years ago |

The document completely overlooks Columnar databases, which are focused on analytics and are much faster than (most, not all) general-purpose DBMSes. See:

https://en.wikipedia.org/wiki/Column-oriented_DBMS

and

https://www.slideshare.net/arangodb/introduction-to-column-o...

or get:

http://www.nowpublishers.com/article/Details/DBS-024

Examples:

* MonetDB

* SAP Hana

* Actian Vector (formerly Vectorwise)

* Oracle In-Memory

ralusek 6 years ago | |

A database being columnar is more of an implementation detail regarding the underlying storage mechanism than it is a type of database. It has to do with how the data is physically stored, i.e. by putting values across rows of a single column adjacent to one another on disc/in memory.
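
A small Python sketch of the storage difference described above, using in-memory lists as a stand-in for what would be disk pages in a real system. The sample data is invented.

```python
# Row-oriented layout: all values of one row sit next to each other.
rows = [
    {"id": 1, "city": "Oslo", "amount": 10.0},
    {"id": 2, "city": "Rome", "amount": 25.5},
    {"id": 3, "city": "Oslo", "amount": 4.5},
]

# Column-oriented layout: all values of one column sit next to each other.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# An aggregate over one column only has to read that column's values.
total_amount = sum(columns["amount"])

# Rebuilding a complete row needs one lookup per column.
second_row = {name: values[1] for name, values in columns.items()}
```

The same logical table is behind both layouts; only the access cost of "scan one column" versus "fetch one row" changes.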

kthejoker2 6 years ago | |

Also Druid, HBase, Vertipaq (engine behind PowerBI), Redshift, Azure SQL DW, etc

Columnar compression is a really interesting engineering problem

nicoburns 6 years ago | | |

Also notable is the Postgres column store extension.

Not as fancy / performant as the dedicated column store databases, but it allows you to mix and match row-tables with column-tables, which is pretty nice.

einpoklum 6 years ago | | |

Yes it is, and I've written academically about this. For example: https://is.gd/9wwjjf

shrumm 6 years ago | | |

ClickHouse is another favourite

muydeemer 6 years ago |

The article reminds me of the work of Stonebraker and Hellerstein - What Goes Around Comes Around, which gives a description of how the database world goes in cycles (can be found here: https://people.cs.umass.edu/~yanlei/courses/CS691LL-f06/pape...)

gibsonf1 6 years ago |

The discussion of graph dbs completely misses the semantic RDF graph approach and how greatly it differs from the property graph (which is discussed). What matters is not needing a custom schema for each application, which then cannot communicate with any other app. Instead you use standard ontologies with known relationships and classes, which allows interoperability between systems (Linked Data Platform - Solid).

planck01 6 years ago | |

Do you know of any successful semantic RDF graph databases, I guess with OWL support? Because I personally don't. If not, it is probably rightly too much of an academic niche to be discussed in the article.

kthejoker2 6 years ago | | |

Stardog, MarkLogic, Virtuoso, AllegroGraph, and RDF4J all have commercial applications, but yeah in general semantic RDF is dying on the vine.

gibsonf1 6 years ago | | |

We are using Blazegraph and Neptune in production as well as Allegrograph. With Neptune, we tested scale by putting the entire dbpedia on one 4 core machine with 16G of ram. It handled 2.7 billion statements without any issues (we ran out of time with the test - sure it can handle more)

hmottestad 6 years ago | | |

Stardog is quite successful.

dehrmann 6 years ago | | |

Does Top Quadrant have anything that does this?

AlphaWeaver 6 years ago |

This article comes from the team at Prisma, who are doing some really cool work building "database schema management and design with code" tools. They're working on a new version of their library right now (Prisma 2) and are regularly giving updates to the community and providing test versions.

Most everything they make is open source and really well designed. Would recommend checking it out!

nikolasburk 6 years ago | |

Nikolas from the Prisma team here! Thanks a lot for the endorsement, we're indeed super excited about the current database space and the tooling we see emerging.

For anyone that wants to check out what we're up to, you can find everything you need in this repo: https://github.com/prisma/prisma2

616c 6 years ago | |

I am curious about Prisma2 because I tried to build a server side API with v1 as a novice to graph systems and it became an unwieldy nightmare. Partially my fault for wanting to do it without the SaaS they provide but trying to build with something complicated and Apollo on the frontend with a skilled FE dev got me so confused I put it off.

nikolasburk 6 years ago | | |

Prisma 1 indeed has a couple of quirks that we're currently ironing out with Prisma 2 (or the "Prisma Framework" as we now call it). Would love to hear from you whether the new version actually solves your pain points!

Feel free to reach out to me: burk@prisma.io or @nikolasburk on the Prisma Slack https://slack.prisma.io

SPascareli13 6 years ago | | |

I was looking into Prisma + Apollo, can you expand more on your problems with it? To me it seemed really magical at first, but I don't know how it really works in production.

muydeemer 6 years ago |

Just a quick remark on graph dbs. Titan which is mentioned in the article as an example of a graph db is dead. Its successor is the Janus graph (https://github.com/JanusGraph/janusgraph).

planck01 6 years ago | |

I am surprised Dgraph isn't mentioned as an example. It is the most starred graph db on Github, and I think it is the best one in terms of performance and scalability.

mdaniel 6 years ago | | |

Strange that they have to have such a non-standard license, when they go out of their way to mention Apache 2 several times: https://github.com/dgraph-io/dgraph/blob/master/LICENSE.md

Contrast that with Orient, who also have an Enterprise version, and they just straight-up say "Apache 2, no drama" https://github.com/orientechnologies/orientdb/blob/develop/l...

We had an absolutely miserable experience trying to get Janus to behave rationally, and thus far have had zero drama with Orient; we skipped dgraph because it does not appear to work with Gremlin, meaning one must use vendor-specific APIs to use dgraph.

Their client reminds me of the days before ORM: write a big string literal and send it to the server: https://github.com/dgraph-io/dgraph4j#running-a-query

mrjn 6 years ago | | |

(Dgraph author) Particularly embarrassing because I actually know the founders of Prisma ;-). Amazing folks!

They even included YugaByte, with only 2.8K GitHub stars. Dgraph crossed 11K GitHub stars and is in the top 10 Graph DBs on DB Engine now -- what would it take for us to be in the article, Søren?

Just joking. Nice article! Keep up the good work, guys!

tabtab 6 years ago |

I'd like to see "dynamic relational" implemented. It's conceptually very similar to existing RDBMS and can use SQL (with some minor variations for comparing more explicitly). You don't have to throw away your RDBMS experience and start over.

And you can incrementally "lock it down" so that you get RDBMS-like protections when projects mature. For example, you may add required-field constraints (non-blank) and type constraints (must be parsable as a number, for instance). Thus, it's good for prototyping and gradually migrating to production. It may not be as fast as an RDBMS for large datasets, though. But that's often the price for dynamicness. (A fancy version could allow migrating or integrating tables to/with a static system, but let's walk before we run.)

https://stackoverflow.com/questions/66385/dynamic-database-s...

Some smaller university out there can make a name for themselves by implementing it. I've been kicking around doing it myself, but I'd have to retire first.
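
A loose Python sketch of the "lock it down incrementally" idea. `DynamicTable` and its methods are entirely hypothetical and only model the behaviour described above: rows start schemaless, and required-field or numeric constraints can be added later, as long as the existing rows already satisfy them.

```python
class DynamicTable:
    """Hypothetical schemaless table that can be tightened over time."""

    def __init__(self):
        self.rows = []
        self.required = set()  # columns that must be non-blank
        self.numeric = set()   # columns that must parse as a number

    def _check(self, row):
        for col in self.required:
            if str(row.get(col, "")).strip() == "":
                raise ValueError(f"{col} is required")
        for col in self.numeric:
            if col in row:
                float(str(row[col]))  # raises ValueError if not parsable

    def insert(self, row):
        self._check(row)
        self.rows.append(dict(row))

    def lock_down(self, required=(), numeric=()):
        """Add constraints, but only if every existing row already passes."""
        old = (self.required, self.numeric)
        self.required = self.required | set(required)
        self.numeric = self.numeric | set(numeric)
        try:
            for row in self.rows:
                self._check(row)
        except ValueError:
            self.required, self.numeric = old  # roll back on failure
            raise


table = DynamicTable()
table.insert({"name": "widget", "qty": "3"})
table.insert({"name": "gadget", "color": "red"})  # different columns are fine

table.lock_down(required=["name"], numeric=["qty"])

try:
    table.insert({"name": "gizmo", "qty": "many"})
    rejected = False
except ValueError:
    rejected = True
```

Validating existing rows before accepting a new constraint is one possible design choice; a system could equally well accept the constraint and only enforce it on future writes.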

bryanlarsen 6 years ago |

The description of flat-file database seems too restrictive. In my experience, flat files with fixed record lengths and no delimiters were far more common than variable-length delimited formats like CSV. File sizes were often much larger than computer memory size, so random read & write was necessary.
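
A short Python sketch of why fixed-length records make random read and write straightforward: the byte offset of record N is just N times the record size. The record layout here (an id, a 20-byte name, and a balance) is an assumption for the example, and an in-memory buffer stands in for a file on disk.

```python
import io
import struct

# Assumed layout: 4-byte unsigned id, 20-byte name, 8-byte float balance.
# "<" disables padding, so every record is exactly 32 bytes.
RECORD = struct.Struct("<I20sd")


def write_record(f, index, rec_id, name, balance):
    """Overwrite record number `index` in place."""
    f.seek(index * RECORD.size)
    f.write(RECORD.pack(rec_id, name.encode("ascii"), balance))


def read_record(f, index):
    """Read record number `index` without touching any other record."""
    f.seek(index * RECORD.size)
    rec_id, raw_name, balance = RECORD.unpack(f.read(RECORD.size))
    return rec_id, raw_name.rstrip(b"\x00").decode("ascii"), balance


f = io.BytesIO()  # stand-in for a file opened in binary read/write mode
write_record(f, 0, 1, "ann", 10.0)
write_record(f, 1, 2, "bob", 20.0)
write_record(f, 2, 3, "cy", 30.0)

# Update the middle record in place, then read it back directly.
write_record(f, 1, 2, "bob", 99.0)
second = read_record(f, 1)
```

With a delimited format like CSV the same update would generally require rewriting everything after the changed row, since row lengths vary.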

kps 6 years ago | |

Yes. The origin of the flat-file database is fixed-format unit record equipment¹, predating computers. COBOL is essentially a language designed for processing fixed-format files.

¹ https://en.wikipedia.org/wiki/Unit_record_equipment

imchairmanm 6 years ago | |

That's definitely a good point. I'll try to update the article to reflect that soon. Thanks for the feedback!

rainyMammoth 6 years ago |

What about time series databases that are fairly common nowadays?

manigandham 6 years ago | |

Time-series is more of a specific use case: data that has a primary time component (like sensor metrics). You can store it in any database, although the common ones are usually some sort of key/value or relational with specific features for time-based queries.

Hbase/Bigtable/DynamoDB/Cassandra are key/value. InfluxDB is key/value. Timescale is an extension to Postgres.

jnordwick 6 years ago | | |

The big time TS databases (Sybase, KDB, Informix Datawarehouse) are column-based, not key value or traditional relational row-oriented. The ones you list are all lower-tier trying to shoehorn a time field on another model.

shakkhar 6 years ago | | |

If you could store timeseries data "in any database", kdb wouldn't be a thing. Just go and ask a quant trader to replace his kdb instance with postgres. (Be prepared to be laughed out of the room.)

neop1x 6 years ago | |

Good point, I think it should be mentioned because those DBs are special in that they are optimized for analyzing data changes over time. They store timestamps effectively and often allow filtering and aggregation on tags and in time intervals. It is not common to query an RDBMS for a chart of 10-minute averages of latency, histograms, quantiles, etc., or to do things like downscaling from 1-second intervals to 1-minute intervals. Great examples: Prometheus, Graphite, InfluxDB.
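
A minimal Python sketch of the kind of interval aggregation mentioned above (downscaling, often called downsampling). The function name and sample points are invented; dedicated time-series databases do this natively and far more efficiently.

```python
from collections import defaultdict


def downsample(points, bucket_seconds):
    """Average (timestamp, value) points into fixed-width time buckets."""
    sums = defaultdict(lambda: [0.0, 0])  # bucket start -> [total, count]
    for ts, value in points:
        bucket = ts - ts % bucket_seconds  # start of the bucket containing ts
        sums[bucket][0] += value
        sums[bucket][1] += 1
    return {b: total / count for b, (total, count) in sorted(sums.items())}


# (seconds since start, latency in ms), sampled irregularly
points = [(0, 10.0), (30, 20.0), (60, 30.0), (90, 50.0), (125, 5.0)]
per_minute = downsample(points, 60)
```

Changing `bucket_seconds` to 600 would give the 10-minute averages from the comment.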

nudpiedo 6 years ago |

All of them are in fact graph databases; their designers just didn't realize it and got lost treating the implementation as a category of design, for many reasons specific to the context in which they were created. I think we should think more often as mathematicians and a little bit less as "hackers".

TheMiller 6 years ago | |

I think this is a mischaracterization. The relational model which motivated relational DBMSs is based on predicate logic. Mappings to graphs are obvious, but are not the organizing principle. This was one of the strengths of the relational model, encouraging a more flexible view of the data than graph databases had previously offered. In a complex relational schema, you can discover and work with all kinds of implicit graphs that were not originally intended by the schema design.

edmundsauto 6 years ago | |

If they are all described as graph databases, we lose the usefulness of understanding the differences between them. I think understanding these differences is at least interesting, and possibly useful.

_Understated_ 6 years ago |

I'm curious... what did the author mean by this:

> Legacy database types represent milestones on the path to modern databases. These may still find a foothold in certain specialized environments, but have mostly been replaced by more robust alternatives for production environments.

I didn't notice anything that went into any detail about legacy database types.

Any idea what the author means by a "Legacy Database"?

AlphaWeaver 6 years ago | |

The article gives some examples of these, it seems they're mostly referring to things we wouldn't consider "databases" like flat files.

tempguy9999 6 years ago | |

He actually tells you in the article, straight after (flat file, hierarchical...)

_Understated_ 6 years ago | | |

Aww man. I am a dumbass... I never equated that section of things like Network databases and such as legacy.

Dunno how I missed it :(

*Must read slower...

AtlasBarfed 6 years ago | |

Mainframe days

honkycat 6 years ago |

There is a great chapter in "Designing Data Intensive Applications" about this very subject

bryanrasmussen 6 years ago |

Again a renaming that makes what the article is actually about less clear.

victor106 6 years ago |

Can anyone here point to a resource that gives a comprehensive treatment (use cases, pros and cons etc) of all the types (Nosql, NewSQL, relational, timeseries) of databases being used today?

thekhatribharat 6 years ago |

[Shameless Plug] A summary of the ever-growing NoSQL and NewSQL market: https://medium.com/open-factory/nosql-newsql-a-smorgasboard-...

minitoar 6 years ago |

How would you categorize something like ClickHouse or Interana or Druid? Columnar I guess, but then the description of Column-family in the article doesn't match up with my experience of how those work.

imchairmanm 6 years ago | |

Hello, author here. That's a good question and something I had a hard time sorting out as I worked on this.

I think those fall into a different category confusingly sometimes called column-oriented databases. They're primarily used for analytic-focused tasks and get their name from storing data by column instead of by row (all data in a single column is stored to disk together).

I didn't include those as a separate category here because they're basically relational databases with a different underlying storage strategy to allow for easier column-based aggregation and so forth.

My colleague shared this article [1] with me, which definitely helped inform how I distinguished between the two in my head.

[1] http://dbmsmusings.blogspot.com/2010/03/distinguishing-two-m...

minitoar 6 years ago | | |

That makes sense, they really are just relational databases optimized for certain tasks, with corresponding limitations e.g. they don't support arbitrary joins.

intellix 6 years ago |

Skipped through it looking for an answer but didn't see it: where are unions in Prisma?! Was looking for some big reveal about an underlying choice that enables what everyone is begging for

matthewmueller 6 years ago | |

I've been mapping out union types at Prisma.

Are you in our Slack? I'm @mattmueller at https://prisma.slack.com.

I'd love to chat with you to better understand your use cases, so we can make sure we're designing it for you.

dehrmann 6 years ago |

The column-family databases mentioned (Cassandra, HBase) are both just fancy key-value stores that add semantics for separate tables and cell-level data so you're not rolling it yourself.

neop1x 6 years ago |

No Elastic among the examples while highly popular and nice :( Great article overall, though!

CMCDragonkai 6 years ago |

There's also column-oriented or array databases like MonetDB and Rasdaman.

zbentley 6 years ago | |

And the elephant in the room: Cassandra.

sourcepath 6 years ago |

What happened with Prisma being all about graphql?

nikolasburk 6 years ago | |

GraphQL is a really important use case for Prisma. That is using Prisma as the "data layer" on top of your database when implementing a GraphQL server (e.g. using Apollo Server). However, it's not the only use case since you can effectively use it with any application that needs to access a database (e.g. REST or gRPC APIs). We actually wrote a blog post exactly on this topic: https://www.prisma.io/blog/prisma-and-graphql-mfl5y2r7t49c/

You can also find examples for the various use cases here: https://github.com/prisma/prisma-examples/tree/prisma2

Please let me know if that clarifies it or if you have more questions! :)

galaxyLogic 6 years ago |

What happened to Object-Oriented Databases?

AtlasBarfed 6 years ago | |

Document databases kind of killed them, I would speculate. Since JSON serializes with objects so much better than (ugh) XML, the relational impedance is gone (well, a lot of it).

chrisweekly 6 years ago |

I didn't RTFA, but based on titles alone, isn't "Object Database" missing from the list?

marcosdumay 6 years ago | |

Aren't those a special case of hierarchical databases?

marknadal 6 years ago |

What a lovely article!

It should be emphasized that graph databases can do all other types of databases (relational, document, key/value, etc.) as you can see demonstrated in this article (https://gun.eco/docs/Graph-Guide).

This makes graphs a superior data structure.

If you think about the math, any document is a tree, and tables are a matrix. Both trees and matrices can be represented as graphs. But not all graphs can be represented as a tree or matrix.

This gets even more fun when you get into hypergraphs and bigraphs, which are totally possible with property graph databases where nodes have type!
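
One way to picture the claim that documents and tables can both be expressed as graphs is to flatten each into labelled edges. This Python sketch uses made-up function names and a simple (source, label, target) triple format; it says nothing about how any particular graph database stores data.

```python
def document_to_edges(doc, node="root"):
    """Flatten a nested document (a tree) into (source, label, target) edges."""
    edges = []
    for key, value in doc.items():
        if isinstance(value, dict):
            child = f"{node}.{key}"
            edges.append((node, key, child))
            edges.extend(document_to_edges(value, child))
        else:
            edges.append((node, key, value))
    return edges


def rows_to_edges(table, rows, key):
    """Turn table rows into edges from a per-row node to each column value."""
    return [
        (f"{table}:{row[key]}", column, value)
        for row in rows
        for column, value in row.items()
        if column != key
    ]


doc_edges = document_to_edges({"name": "ann", "address": {"city": "Oslo"}})
row_edges = rows_to_edges("user", [{"id": 1, "name": "ann"}], key="id")
```

Being representable this way does not by itself say anything about query performance, which is the point several replies below pick up on.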

jt2190 6 years ago | |

> This makes graphs a superior data structure.

I'll read this generously and assume you meant to say that graphs are an essential data structure, i.e. we can use a graph to represent the more specific data structures used by various types of databases (e.g. A b-tree is a type of graph)

Whether a graph data store or a more specialized tool (e.g. a relational database, etc.) is superior depends (as I'm sure you agree) on context.

danenania 6 years ago | |

“It should be emphasized that graph databases can do all other types of databases (relational, document, key/value, etc.)”

Not to knock graph dbs, but isn’t the reverse also true?

mumblemumble 6 years ago | | |

Yes. And it may even be the best way to do it. For example, here's a paper where the authors come up with a schema and transpiler for doing a Gremlin-queryable graph DB in PostgreSQL, and find that it outperforms Neo4j and Titan:

https://static.googleusercontent.com/media/research.google.c...

lmkg 6 years ago | | |

One of the key features of graph databases is to select one node, and then recursively 'chase' edges until you find another node matching some criteria. Other database models can have trouble representing chasing an unbounded number of edges. E.g. in the relational model, following an edge to another node is usually represented as a Join operation, and SQL doesn't let you parameterize the number of iterated joins. This is especially true if the thing you want to query is actually the path length.

In an HN thread from a few days ago, someone made the claim that the graph model could be represented by SQL + recursion, and recursive SQL is an extension offered by some databases. But the relational model itself cannot fully represent the graph model.

Without digging too deep, I suspect other database models run into similar problems. E.g. a document store could very easily represent a Directed Acyclic Graph as a document, but when you get into general graphs your document needs to end on a value that is the key to another graph.

This is not to agree with the claim that graph databases are generally superior. I like them, and they're fun, and I think more developers should be aware of them for cases where they apply, but I also don't think they have advantages over relational or document stores when the data is natively table-shaped or DAG-shaped.
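
For reference, a sketch of the "SQL + recursion" approach mentioned above, using the recursive common table expressions that SQLite supports, driven from Python's standard `sqlite3` module. The `edge` table and its contents are invented, and the depth limit is an assumption added so the query also terminates on cyclic graphs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edge (src TEXT, dst TEXT)")
conn.executemany(
    "INSERT INTO edge VALUES (?, ?)",
    [("a", "b"), ("b", "c"), ("c", "d"), ("a", "e")],
)

# Chase edges from a start node; the number of joins is not fixed in advance.
query = """
WITH RECURSIVE reach(node, depth) AS (
    SELECT ?, 0
    UNION
    SELECT edge.dst, reach.depth + 1
    FROM reach JOIN edge ON edge.src = reach.node
    WHERE reach.depth < ?
)
SELECT node, MIN(depth) FROM reach GROUP BY node ORDER BY MIN(depth), node
"""

# Map each reachable node to its shortest path length from "a".
distances = dict(conn.execute(query, ("a", 10)).fetchall())
```

This returns path lengths as ordinary query results, which is the case the comment describes as awkward without the recursive extension.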

ithkuil 6 years ago | | |

A rather long write-up but gives some context about why it's hard to build a graph data model as a layer on top of commodity non-graph databases: https://blog.dgraph.io/post/why-google-needed-graph-serving-...

(Obviously, the underlying storage layer of a graph db will use some sort of simpler storage layer, usually some kind of key value store)

namelosw 6 years ago | | |

Some database designs, like extremely simple key-value databases, cannot efficiently express joins unless they load all the data into memory. The same could be said for column-based databases etc. I guess that's quite a difference.

cbtx 6 years ago | |

There's a difference between being able to do something and doing something well.

kristoff_it 6 years ago |

> To store data, you provide a key and the blob of data you wish to save, for example a JSON object, an image, or plain text. To retrieve data, you provide the key and will then be given the blob of data back. The database does not evaluate the data it is storing and allows limited ways of interacting with it.

Definitely not a good description of Redis, even though they cite it as the first example of a Key-Value DB.
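
A small Python sketch of the distinction being drawn. `OpaqueStore` matches the quoted description (the store never looks inside a value), while `StructuredStore` is a hypothetical store that understands its values, loosely modeled on Redis commands such as INCR and RPUSH. Neither class reflects how Redis is implemented.

```python
import json


class OpaqueStore:
    """Key-value store that never looks inside the blobs it holds."""

    def __init__(self):
        self._data = {}

    def put(self, key, blob):
        self._data[key] = bytes(blob)

    def get(self, key):
        return self._data[key]


class StructuredStore:
    """Hypothetical store that understands counters and lists."""

    def __init__(self):
        self._data = {}

    def incr(self, key):
        self._data[key] = self._data.get(key, 0) + 1
        return self._data[key]

    def push(self, key, item):
        self._data.setdefault(key, []).append(item)
        return len(self._data[key])


# With an opaque store the client must read, decode, modify, and write back.
opaque = OpaqueStore()
opaque.put("visits", json.dumps(1).encode())
count = json.loads(opaque.get("visits")) + 1
opaque.put("visits", json.dumps(count).encode())

# With a structure-aware store the operation happens inside the store.
structured = StructuredStore()
structured.incr("visits")
server_side_count = structured.incr("visits")
queue_length = structured.push("jobs", "send-email")
```

The read-modify-write sequence in the opaque case is also where concurrent clients can overwrite each other's updates, which is one reason server-side operations are useful.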

nailer 6 years ago |

> Relational vs. Document

Tabular vs Document. Having relations is orthogonal to the shape of your data. There are document databases with relations - RethinkDB was pretty popular. Mongo sadly doesn't have them but will probably eventually get them too.

takeda 6 years ago | |

The adjective relational in a relational database comes from mathematical relations, tuples i.e. data in tables.

It's a common misconception that it is from foreign keys.

nailer 6 years ago | | |

That's interesting - it seems to both be backed by and conflict with a lot of https://en.wikipedia.org/wiki/Relational_model but maybe that's wrong.

I'd still avoid the word 'relational' though - obviously many people will assume 'relational' is related to DB relations rather than tuples (assuming you're right about 'relations' meaning tuples, a lot of the Wikipedia contributors are included).

Nican 6 years ago |

I am really tired of articles that talk about the different types of databases. People can make graph databases act like relational databases, and vice versa. Computers, in the end, are just Turing machines. Just pay attention to whether the query you are executing is actually using the optimal approach.

I wish more time would be spent talking about the underlying algorithms that the different query languages use to accomplish the tasks. It is important for developers to understand the execution complexity of queries, and how data is distributed across a cluster.

For example, I am usually surprised when people talk about "web-scale", but they do not understand the difference between a "merge-join" and a "hash-join". Or when people do not realize that a sort requires the whole result set to be materialized and sorted.
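
For readers who have not met the two algorithms, here is a simplified Python sketch of each, operating on lists of dicts. These are textbook versions with invented sample data, not what any particular query engine does internally.

```python
def hash_join(left, right, key):
    """Build a hash table on one input, then probe it with the other."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def merge_join(left, right, key):
    """Sort both inputs on the key, then walk them in lockstep."""
    left = sorted(left, key=lambda row: row[key])
    right = sorted(right, key=lambda row: row[key])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of right rows sharing this key.
            j_end = j
            while j_end < len(right) and right[j_end][key] == lk:
                j_end += 1
            # Pair every left row with this key against that run.
            while i < len(left) and left[i][key] == lk:
                out.extend({**left[i], **r} for r in right[j:j_end])
                i += 1
            j = j_end
    return out


orders = [{"id": 1, "a": "x"}, {"id": 2, "a": "y"}, {"id": 2, "a": "z"}]
items = [{"id": 2, "b": "p"}, {"id": 2, "b": "q"}, {"id": 3, "b": "r"}]

hashed = hash_join(orders, items, "id")
merged = merge_join(orders, items, "id")
```

The hash join needs memory for the hash table but no ordering; the merge join needs sorted inputs, which is cheap when an index already provides them and expensive when a full sort has to be materialized first.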

ALTER TABLE Doc ADD CONSTRAINT check_doc_val CHECK (
    jsonb_typeof(val) = 'object'
    AND val ? 'user_id'
    AND jsonb_typeof(val->'user_id') = 'string'
);
CREATE INDEX doc_user_id ON Doc ((val->>'user_id'));