Comparing Database Types(prisma.io) |
Comparing Database Types(prisma.io) |
It should be emphasized that graph databases can do all other types of databases (relational, document, key/value, etc.) as you can see demonstrated in this article (https://gun.eco/docs/Graph-Guide).
This makes graphs a superior data structure.
If you think about the math, any document is a trie, and tables are a matrix. Both trees and matrices can be represented as graphs. But not all graphs can be represented as a tree or graph.
This gets even more fun when you get into hypergraphs and bigraphs, which are totally possible with property graph databases where nodes have type!
I'll read this generously and assume you meant to say that graphs are an essential data structure, i.e. we can use a graph to represent the more specific data structures used by various types of databases (e.g. A b-tree is a type of graph)
Whether a graph data store or a more specialized tool (e.g. a relational database, etc.) is superior depends (as I'm sure you agree) on context.
Not to knock graph dbs, but isn’t the reverse also true?
https://static.googleusercontent.com/media/research.google.c...
In an HN thread from a few days ago, someone made the claim that the graph model could be represented by SQL + recursion, and recursive SQL is an extension offered by some databases. But the relational model itself cannot fully represent the graph model.
Without digging too deep, I suspect other database models run into similar problems. E.g. a document store could very easily represent a Directed Acyclic Graph as a document, but when you get into general graphs your document needs to end on a value that is the key to another graph.
This is not agree with the claim that graph databases are generally superior. I like them, and they're fun, and I think more developers should be aware of them for cases where they apply, but I also don't think they have advantages over relational or document stores when the data is natively table-shaped or DAG-shaped.
(Obviously, the underlying storage layer of a graph db will use some sort of simpler storage layer, usually some kind of key value store)
Definitely not a good description of Redis, even though they cite it as the first example of a Key-Value DB.
Tabular vs Document. Having relations is orthogonal to the shape of your data. There are document databases with relations - RethinkDB was pretty popular. Mongo sadly doesn't have them but will probably eventually get them too.
It's common misconception that it is from foreign keys.
I'd still avoid the word 'relational' though - obvious many people will assume 'relational' is related to DB relations rather than tuples (assuming you're right about 'relations' meaning tuples, a lot of the wikipedia contributors are included).
I wish more time would be spent talking about the underlying algorithms that the different query languages use to accomplish the tasks. It is important for developers to understand the execution complexity of queries, and how data is distributed across a cluster.
For example, I am usually surprised when people talk about "web-scale", but they do not understand the difference between a "merge-join" and a "hash-join". Or when people do not realize that a sort requires the whole result set to be materialized and sorted.
In fairness, I think "web-scale" generally means the serving path of a website with (say) hundreds of millions of active users. In other words, a heavy OLTP workload. The total query volume is too high for a single-machine DBMS but each operation executed is probably simple. They may not be doing joins at all; many of these websites have gotten away with key/value databases. Where they are joining, most likely at least one side of the join is a small amount of data directly owned by calling end user. (In social products, the other side might be say the users table, to map userids to names, email addresses, profile photos, etc.)
Big joins are more likely to happen in offline paths but likely via something like MapReduce rather than in the database server, and that batch processing framework may use different terminology for similar needs.
In that context, I think it's relatively understandable why someone would be fuzzy on merge-join vs hash-join. There are other skills they might need that are specific to key-value or "NewSQL" databases like Bigtable or Spanner. I wouldn't expect someone who doesn't work on a "web-scale" system to know much about this. These skills aren't simply additive, and "web-scale" isn't necessarily harder, just different.
And then of course there's people who think they have a web-scale website when it's not popular enough that you need to give up on single-machine DBMSs. There's just no hard problem there: not expense of single operations, not overall expense.
This isn't a helpful abstraction. Electricity is just electrons after all, why think about how you're going to wire your house when it's all just atoms? Read a bunch of books on quantum physics, then become an electrician /s
> Just pay attention that the query that you are executing is actually doing the optimal solution.
Isn't that the entire reason for articles like this? eg: If you have data with large amounts of relations a no-sql database probably isn't the right approach.
> I wish more time would be spent talking about the underlying algorithms that the different query languages use to accomplish the tasks.
Absolutely, if just for knowledge's sake, but these are largely not needed if you understand the high level use cases for any given DB. It's cool if you understand B-Trees but not strictly needed to use sql. In fact, many would this this not helpful.
Take a highly nested JSON document and try and implement it in a relational database. You would have an O(1) lookup in MongoDB and a O(n) lookup in MySQL. Or good luck traversing a graph structure in MongoDB when you should've used Neo4J. So in order to have a performant database you do need to "pay attention" and ensure that your access patterns suit the database type.
Also Web Scale is about any approach used at the bigger web companies. And the type of people who use the term are not the same people who would be sitting there optimising SQL queries. So I wouldn't conflate the two.
Your examples are kinda proving yourself wrong though.
> Take a highly nested JSON document and try and implement it in a relational database. You would have an O(1) lookup in MongoDB and a O(n) lookup in MySQL.
In PostgreSQL (which is not MySQL, but it's a relational database) you can create an index for a column which contains JSON and the lookup for a nested field becomes O(1).
> Or good luck traversing a graph structure in MongoDB when you should've used Neo4J.
Facebook implemented the social graph on top of MySQL: https://www.facebook.com/notes/facebook-engineering/tao-the-....
> So in order to have a performant database you do need to "pay attention" and ensure that your access patterns suit the database type.
I think the point of the parent poster was that you don't need to ensure that your access pattern suite the database type, but that it fits well with the features the database provides. If you're looking at the database landscape today talking about a "database type" isn't really saying much at all.
It is not about the database type. It is about the underlying algorithms and data structures.
Some databases are optimized for certain workflows, so for instance they will store only "pointers" to certain data rather than copying them. In addition their clients obfuscate and abstract the storage layer from user.
But you can really store any data in any database, you just might have to do some extra pre/post-processing and multiple the storage requirements.
The constant factors are way more important here. It's a 1000x factor difference depending on how durable you need your data to be (whether you need to write to disk or a quorum of network nodes in multiple regions). That is basically the only thing that mattered in the recent mongo vs postgres benchmarks.
Yes, you can sort of make any database behave like any other database. But you don't want to. You want a power tool that has query planners built in that can efficiently solve the sort of problems you're dealing with.
Everything you said about algorithms and query complexity is part of that foundation, not a replacement for it.
I read this article and learned things I didn't know.
What would you like to have done differently? The author not write the post? Or someone not post it to HN? I don't understand your complaint. If you'd like a different article, write a different article.
> I wish more time would be spent talking about the underlying algorithms
That's precisely the difference. How they store data is fundamental to what kinds of operations are fast. A row-based and columnar database can look rather similar (PostgreSQL/Redshift). But the performance characteristics are far different.
Did the article come up short?
To give some examples, CockroachDB both has column-families and NoSQL characteristics. A column can be specified to be a JSON column, and it can have an inverted index.
Or MemSQL has both row-based and column-based tables, and they use an unorthodox index called "skip lists".
CockroachDB and MemSQL both have different applications and characteristis, but they are just cluttered under "NewSQL", as if it was just was some kind of "SQL but better".
Famous last words :)
Detail, in the end, really matter.
Edit: typo
I recently gave a talk on mm-ADT at ApacheCon: https://www.slideshare.net/slidarko/mmadt-a-multimodel-abstr...
I don't quite get what you mean by "the relational model itself cannot fully represent the graph model". If you can store all the edges of a graph and query them, what can't be done?
Graph Databases, depending on the underlying storage model, can have performance increases over relational databases when the JOINs are what you are interested in. This is because JOINs in SQL are often O(n) or O(log n) (if you index the join) where N is the number of rows in the target table, but following a relationship in a graph might be O(1) if you store an object with pointers to other objects right there. Writes might be more expensive though.
It all comes down to use case
In the case of graphs, if you know the exact shape of your graph and its query patterns ahead of time, you can design the optimal structure in a KV/NoSQL store. But if you need/want the flexibility to do ad hoc queries of the graph in a reasonably performant way, that's where the graph db shines.
I specifically mentioned MongoDB and MySQL. Having a JSON type is basically a mini document database since it uses a different/enhanced query language and is often not relational with the rest of the schema.
I also specifically mentioned MongoDB and Neo4J. Yes you could implement a graph in MySQL but as your paper lists you will need to front it with a Memcache in order to cache the graph traversal lookups. With a graph database you generally don't need such a cache.
With the right design can't you store the graph data across multiple keys and then load pieces of it selectively into memory? I get that a graph db specializes in this pattern and makes it more efficient, easier to query, etc., but that doesn't mean it's the only type of db that can model a graph.
The design of relational databases indicates the database is responsible to do a lot of logic according to the query language. Graph databases usually also include their own query language in order to do the same thing.
For example, from trillions of records there are 100 records with field x equals to value y. And the job is to get all the 100 records instead of trillions of them. The database would take a short query and interpret the predicate logic in the database process, instead of sending the data back to client which usually located in another physical machine.
As an anecdote for one project, while trying to speed up some neo4j queries myself, I decided to model a binary tree structure in the nodes (child/parent relations) and then compared the query times for using the simple cypher queries vs cypher queries with some embedded lib functions that would walk the tree exactly the way I wanted. The times were much faster for something that I hadn't even optimized much code wise.
The test got me thinking that if I could have a way of declaring more info about how the relationships are related, then maybe we could automatically have the db use more appropriate algs for a more appropriate data structure for certain node types. I think that's similar to what's happening here, it's automatically mapping out the simple graph relations to structured relational db tables. I hope in the future we'll be able to provide more input to that as well (or at least I haven't seen something like this yet). Let me annotate my schema to specify my parent/child relationship as a tree, or my word map nodes form a trie but the leaves should be some other type. Why can't we think of a db as having multiple datatypes beyond just a kv-store, or table, or graph?
That it's implemented on top of a relational database seems like a red herring to me. The relational model just defines operations on sets of tuples. A graph is just a particular kind of thing you can construct with sets and tuples.
From there, the query planner and execution engine take over, and an incumbent RDBMS's query planner and execution engine are supported by decades and decades worth of accumulated dark knowledge on how to optimize execution plans and efficiently traverse large datasets in the presence of a hierarchical memory model.
By contrast, Neo4j (to take an example) has a steeper hill to climb. Both in terms of not having had to spend decades trying to compete with Oracle, and in terms of being implemented in a less-than-ideal language for chasing raw performance.
This is a widespread misconception. Relational databases get their name from relations in the mathematical sense[1], i.e. sets of tuples containing facts. The basic idea of the relational model is that logical predicates can be used to query data flexibly without having to change the underlying data structures.
The basic paper by Codd[2] is really worth reading and describes, among other things, the problems of hierarchical and network databases that the relational model is meant to solve.
"Each tuple in a relation represents an n-ary relationship...among a set of n values..., and the full set of tuples in a given relation represents the full set of such relationships that happen to exist at some given time--and, mathematically speaking, that's a relation."[1]
[1] Chris Date, Database in Depth, page 46
It's essentially a table-shaped value. Conceptually it's immutable, and relational algebra is about calculating new values from old ones. A select statement does this too.
That's certainly not a rigorous definition but helped me keep my head straight about what I was doing.
https://en.wikipedia.org/wiki/Deductive_database
Deductive databases derive logical consequences based on facts and rules.
Datalog and its superset Prolog are notable instances of this idea, and they make the connection between the relational model and predicate logic particularly evident.
Codd's 1979 paper Extending the Database Relational Model to Capture More Meaning contains additional information about this connection. For example, quoting from Section 3 Relationship to Predicate Logic:
"We now describe two distinct ways in which the relational model can be related to predicate logic. Suppose we think of a database initially as a set of formulas in first-order predicate logic. ..."
Relational databases get their name from the mathematical concept of a relation, used by the Relational Model, "an approach to managing data using a structure and language consistent with first-order predicate logic, first described in 1969 by English computer scientist Edgar F. Codd, where all data is represented in terms of tuples, grouped into relations." [1][2] (emphasis added)
Recommended Reading: Database In Depth, by Chris Date.
[1] https://gertjans.home.xs4all.nl/usenet/microsoft.public.sqls...
https://en.wikipedia.org/wiki/Column-oriented_DBMS
and
https://www.slideshare.net/arangodb/introduction-to-column-o...
or get:
http://www.nowpublishers.com/article/Details/DBS-024
Examples:
* MonetDB
* SAP Hana
* Actian Vector (formerly Vectorwise)
* Oracle In-Memory
Columnar compression is a really interesting engineering problem
Not as fancy / performant as the dedicated columns store databases, but it allows you mix and match row-tables with column-tables which is pretty nice.
Most everything they make is open source and really well designed. Would recommend checking it out!
For anyone that wants to check out what we're up to, you can find everything you need in this repo: https://github.com/prisma/prisma2
Feel free to reach out to me: burk@prisma.io or @nikolasburk on the Prisma Slack https://slack.prisma.io
Contrast that with Orient, who also have an Enterprise version, and they just straight-up say "Apache 2, no drama" https://github.com/orientechnologies/orientdb/blob/develop/l...
We had an absolutely miserable experience trying to get Janus to behave rationally, and thus far have had zero drama with Orient; we skipped dgraph because it does not appear to work with Gremlin, meaning one must use vendor-specific APIs to use dgraph.
Their client reminds me of the days before ORM: write a big string literal and send it to the server: https://github.com/dgraph-io/dgraph4j#running-a-query
They even included YugaByte, with only 2.8K GitHub stars. Dgraph crossed 11K GitHub stars and is in the top 10 Graph DBs on DB Engine now -- what would it take for us to be in the article, Søren?
Just joking. Nice article! Keep up the good work, guys!
And you can incrementally "lock it down" so that you get RDBMS-like protections when projects mature. For example, you may add required-field constraints (non-blank) and type constraints (must be parsable as a number, for instance). Thus, it's good for prototyping and gradually migrating to production. It may not be as fast as an RDBMS for large datasets, though. But that's often the price for dynamicness. (A fancy version could allow migrating or integrating tables to/with a static system, but let's walk before we run.)
https://stackoverflow.com/questions/66385/dynamic-database-s...
Some smaller university out there can make a name for themselves by implementing it. I've been kicking around doing it myself, but I'd have to retire first.
Hbase/Bigtable/DynamoDB/Cassandra are key/value. InfluxDB is key/value. Timescale is an extension to Postgres.
> Legacy database types represent milestones on the path to modern databases. These may still find a foothold in certain specialized environments, but have mostly been replaced by more robust alternatives for production environments.
I didn't notice anything that went into any detail about legacy database types.
Any idea what the author means by a "Legacy Database"?
Dunno how I missed it :(
*Must read slower...
I think those fall into a different category confusingly sometimes called column-oriented databases. They're primarily used for analytic-focused tasks and get their name from storing data by column instead of by row (all data in a single column is stored to disk together).
I didn't include those as a separate category here because they're basically relational databases with a different underlying storage strategy to allow for easier column-based aggregation and so forth.
My colleague shared this article [1] with me, which definitely helped inform how I distinguished between the two in my head.
[1] http://dbmsmusings.blogspot.com/2010/03/distinguishing-two-m...
Are you in our Slack? I'm @mattmueller at https://prisma.slack.com.
I'd love to chat with you to better understand your use cases, so we can make sure we're designing it for you.
You can also find examples for the various use cases here: https://github.com/prisma/prisma-examples/tree/prisma2
Please let me know if that clarifies it or if you have more questions! :)
https://www.datastax.com/products/datastax-graph
https://venturebeat.com/2015/02/03/datastax-acquires-aureliu...
Create a table with a json column:
CREATE TABLE Doc (
id UUID PRIMARY KEY,
val JSONB NOT NULL
);
Then later it turns out all documents have user_ids so you add a check constraint and an index: ALTER TABLE Doc ADD CONSTRAINT check_doc_val CHECK (
jsonb_typeof(val)='object' AND
val ? 'user_id' AND
jsonb_typeof(val->'user_id')='string'
);
CREATE INDEX doc_user_id ON Doc ((val->>'user_id'));
I think the postgres syntax for this is pretty ugly. And if you also want foreign key constraints you still have to move that part of the json out as a real column (or duplicate it as a column on Doc). I am not sure it's even worth it to have postgres check these constraints (vs just checking them in code).I am also a little worried about performance (maybe prematurely). If that document is large, you will be rewriting the entire json blob each time you modify anything in it. A properly normalized schema can get away with a lot less rewriting?
And since 4.x they have transactions over shards too!
For example given the extensional data ALBUM(TITLE, ARTIST) corresponding to the intention "the albums, each with a title and artist", we can compute "the eponymous albums, each with a title" via EPONYMOUS_ALBUM = ALBUM where (TITLE = ARTIST)
We started with some data for a relation corresponding with a concept, and were able to operate on the data to produce a new relation - data corresponding to a new concept.
To be honest, that describes almost all query languages, including SQL and Cypher. Dgraph's QL is based on GraphQL[1]. When starting Dgraph, we had a choice to go with Gremlin or Cypher. But, we didn't like either choices -- Gremlin tends to be slower to execute, because of how they treat the queries, i.e. graph walks, with iteration over results to generate the next set of queries; something we explicitly avoid [2]. And both of them returned lists of things.
GraphQL, on the other hand, allows a subgraph to be returned. One can go from a subgraph to a list of things (lossy), but can't go the other way around, because you lose relationships (knowledge about how they are connected).
That's what led us to the path of GraphQL. Over time we modified it to what we have now, GraphQL+-. Meanwhile, over these last 3 years, GraphQL has become so much more popular than either Gremlin or Cypher, that now Neo4j is a founding part of the GraphQL Foundation, just like Dgraph is.
Now, Gremlin is still relevant in the graph space. So, we're planning to build Gremlin support. But, before that, we're working on building a native support for official GraphQL spec. That should let one integrate into the fast growing GraphQL ecosystem.
[1]: https://graphql.org/ [2]: https://blog.dgraph.io/post/why-google-needed-graph-serving-...
Head of Product at Dgraph here. Our license is Apache v2 for most of our source code.
Only our enterprise features are licensed with the Dgraph Community License [DCL](https://github.com/dgraph-io/dgraph/blob/master/licenses/DCL...).
We decided to make all of our code available on GitHub regardless of the license (unlike OrientDB if I understand correctly) so anyone can see what the Enterprise Edition really does.
Let us know if this is confusing. If you don't want the proprietary code, you can always only use the open-source version of Dgraph with `make oss`.
I'm also really interested in seeing where GQL (gql.today) is going, adopting a community-wide standard would be a definite yes from us.
In any case, thanks for your feedback. I'd love to hear more on how we can make our product better, feel free to email me at francesc@dgraph.io.
Yes, SQL shines when you can whip up a big multi-table join, with nice group by-s and wheres, havings and whatnots, but that's rare. Though, arguably, ORMs solved most of the crazy string concatenation query crafting craziness.
Once we felt that the schema wasn't changing as much, had too many concerns, etc, we made tables and wrote to them (instead of the JSON column).
You can store timeseries data in Postgres if you want to (and optionally adding extensions like Timescale). You can store it in key/value like Redis or Cassandra. You can store it in bigtable. You can store it in MongoDB. Obviously different scenarios require different solutions.
KDB is a relational database with row and columnstore with features for time series and advanced numerical analytics along with a programming language. KDB is a thing because of those abilities, whether you use it for time-series data or not.
> I'm not sure what your point is.
My point is very simple - there is a category of databases widely accepted as "timeseries database", and they deserve a place in any conversation about "types of databases".
For example, here's Pinterest handling time-series data on Hbase: https://medium.com/pinterest-engineering/pinalyticsdb-a-time...
There's a big difference and muddying the definitions with marketing jargon ends up causing too much confusion in this industry.
More and more products support multiple data models today.
This reduces the number of technologies in your tech stack and allows to combine different access patterns without the need to duplicate and sync data between systems.
It does, because it leads to other types of optimization. LittleTable [0], for example, keeps adjacent data in time domain adjacent in disk. So querying large amount of data that are close to each other is efficient even on slow (spinning) disk. Vertica [1] does column compression which allows it to work with denormalized data (common in analytics workload) efficiently.
In an ideal world, you could have a storage layer sitting below a perfect abstraction; orthogonal to higher levels. In the real world, column-based and row-based are two completely different categories serving very different use-cases.
[0] https://meraki.cisco.com/lib/pdf/trust/lt-paper.pdf [1] http://vldb.org/pvldb/vol5/p1790_andrewlamb_vldb2012.pdf
Stored is an implementation detail. Optimizations are improvements to performance. Neither affects the fundamental data model, which in relational databases is relational algebra over tuple-sets.
Both rules and facts are instances of a single unifying language element, a logical clause, which is also terminology from predicate logic.
This is useful from a conceptual perspective, and also for performance: Many automatic optimizations that deductive database systems perform apply equally to facts and rules. For instance, argument indexing is a notable feature of virtually all Prolog systems. It is similar to indices in relational databases, and can replace a linear scan of the knowledge base with a fast hash-based lookup that is performed in O(1).
Yeah; I'd like to see these general application layers more seamless and united in simple semantic units.
Like a higher level abstraction over code block elements, database record structures, and configuration setting ensambles; simply as phrases (..??) ....
If you really want to use uuid and care about performance you might prefix it with something that's increasing like a date, or perhaps (did not try it) use hash index (need to be PG 10+).
By the way uuidv1 is already prefixed by a timestamp! But unfortunately it doesn't use a sortable version of the time so it doesn't work for clustering the ids into the same page. I think it was really designed for distributed systems where you would want evenly distributed ids anyway.
https://www.youtube.com/watch?v=xrMbzHdPLKM
It ends up being a pitch for Aurora at the end (as with any presentation from AWS folks), but it has tons of useful information for standard Postgres.
I've been in the industry for over 20 years. I even worked on SQL Server Analysis Services for 5 years (up to 256-dimensional 'cubes' in the early 2000s)... and I never thought of joins in this way. Granted, MDX was a beast in its own right.
Cue discussion about self-taught vs college educations. I've got advantages being self-driven learner... but I've definitely missed out in some regards.
The best short description I can give about MDX is that it's the language your pedantic uncle would come up with after falling in love with Ralph Kimball, when all he knew to base it on was SQL.
But the core operations in MDX operate upon hierarchical dimensions and facts that can be aggregated.
SELECT ... FROM ... WHERE
has no inherent relational semantic. It is simply a syntax that has been standardized upon for interacting with relational database systems. It was also, coincidentally, chosen as the syntax for another language, MDX.Funny enough, the successor to SSAS Multidimensional is SSAS Tabular, where the query language is DAX. DAX was designed with an explicit goal of looking like Excel's formula language, but it is in fact a relational language which is semantically very similar to SQL, despite looking nothing like it.
[1] https://www.postgresql.org/docs/current/sql-createsequence.h...
The top TS databases are more than just storage too. You need a query language that can exploit the ordering column-oriented gives you that the row-oriented relational doesn't.
On the lower end (eg, Timescale db) trying to fit a timeseries model on a row-oriented architecture which is a poor fit.
You're talking about relational databases (which is the formal type) designed for large-scale analytics using column-oriented storage and processing and supporting a time-series use-case.
Storage and querying just for time-series specifically is more about product features than the underlying type. For example, here's Pinterest doing the same on HBase: https://medium.com/pinterest-engineering/pinalyticsdb-a-time...