Graph Databases Intrigue Me(blog.jlank.com) |
Graph Databases Intrigue Me(blog.jlank.com) |
The thing is, it took me a little while to realize that the "the semantic web" is a very specific model where providers are be expected to explicitly provide the semantic decoration/meta-data for all their content. http://en.wikipedia.org/wiki/Semantic_Web
I basically don't believe that this particular approach will ever work (ie, the flood-gates won't open and content providers won't suddenly label all their data). I mean, this approach has been the failed-model of hypertext since ... project Xanadu, mid-sixties (a well-tended, fully meaningful store of data).
Instead, Google and other search engines and tools will just get smarter.
We'll find more ways to incidentally get semantic information from the raw data that's out there. But no will have enough incentive to manually provide that much deep-meaning for their data themselves (and anything whose semantic meaning can be automatically processed can be put on the web for someone else to automatically process). The semantic web approach is always going to be behind the curve compared to just putting raw, unstructured data out-there.
The more uses we find for information, the more ability we'll get to extract meaning from it without the data starting out labeled.
Look at what Watson could do.
-- And I am working on a tool that extract implicit information from the process of people interacting with data. Extracting implicit, inferred and deduced relations has much more promise even if it can't rely on explicit semantic labels. This is more or less what Google does also (it's true that so-far, Google's stuff is considered "semantically meaningless" and I know Google bought Metaweb. We'll see what they get from it...)
I actually think something like semantic technologies could be more useful than the tools that drive watson and google for smaller data sets, where machine/statistical learning are less useful, but even then they are an over engineered solution.
All the examples on this page look equivalent to how I'd model them in SQL: http://wiki.neo4j.org/content/Domain_Modeling_Gallery
The best I gather is that graph databases are schemaless (big whoop, more room for error), and that their implementations tend to perform well with transitive closures. I'm not seeing anything that can't be solved with materialized views in an RDBMS.
I'm tempted to think that just represents NoSQL folks coming to realize that relations are actually a good thing. Soon they'll be talking about how wonderful it is to take the Cartesian product of two graphs.
Someone care to enlighten me?
But still, even if I can get all links of a node with one access, it seems to me memory will be too scarce in most cases. Let's say I want to consider all nodes that are three steps away, and every node has 100 outgoing links. That's 100^3 nodes to consider already. Or in the second step, 100^2 nodes, so we would have to access the db 10000 times to get all nodes of the third step.
What I am getting at: it seems to me for really interesting graph computations it will usually still be necessary to create some specialized/compressed model to fit all the data into main memory.
Looking at Watson could do , made me wonder why is the technology world so lagging behind w.r.t interpreting information .If NLP is a solved problem (looks like in Watson's case) are we only pending creating a linkage between the real world entities???. Freebase has linked 20 million but thats not enough . The approach is non profit , "good for the world" kinds . Can there be an incentive for people to provide links between entities . Can we bring up a profit model where people/organizations compete to provide more and better linkages . Or can we extract such links from peoples web activity (search , social networking etc [FB,Twitter]) .. i am getting too many ideas now :D:D ... and ofcourse can we request IBM to donate their NLP technology for the greater good :P :P
For example, I recently wondered about the best representation for deep hierarchical data, and graph databases appear to be a good fit.
Our dataset is very peculiar, we have about 20M nodes and 500M edges in one of our smallest database. We also need fulltext search and complex joins between tables.
I'm wondering if graph databases are mature enough or if they are not quite ready yet for production work. We're a small shop and software development is not our core competency.
Most graph databases use BerkeleyDB under the hood. They are pretty safe to work with.
Query: find all the nodes a distance < D from some reference node.
With SQL92 or earlier, this will require an unbounded number of queries. Proof: A SQL92 query can only traverse a finite number of edges, since it has only a finite number of joins (say K). Consider the graph 1->2->3->...->N with edge weights 1/N. This will require N/K queries to traverse. N is arbitrary.
It's probably doable in SQL99 and various vendor specific extensions (I think it's Turing Complete), but it is unlikely to be fast.
So, as is the normal situation, you can do whatever you want in SQL. But the performance is likely to suck.
(That said, I agree with you that many people are using NoSQL because it's the hip new thing, when they should just use SQL. "Schemaless flexibility" is BS - SQL is very flexible if you use an ORM + migration tool. The only flexibility you lose is the flexibility to screw up the data in application logic.)
WITH RECURSIVE r AS (SELECT parent, child, distance FROM graph UNION ALL SELECT r.parent, graph.child, r.distance + graph.distance AS distance FROM graph, r WHERE graph.parent = r.child) SELECT parent, child FROM r WHERE distance < 3.14159;
Testing your graph with N=1000 on a teeny VPS. 477,897 rows returned in 6.1s.
If we move the WHERE clause inside the sub-SELECT (which unfortunately makes the query slightly more awkward to use), selecting lower distances reduces the compute time. For example, a bound of 0.5 returns 196,238 rows in 2.5s.
What sort of performance gains does a graph database give for this same query?
Or would you propose a different storage mechanism that keeps the property that reparenting entire subtrees is cheap?
(All this in your RDBMS of choice.)
WITH RECURSIVE r AS (SELECT parent, child FROM tree UNION SELECT r.parent, tree.child FROM tree, r WHERE tree.parent = r.child) SELECT parent, child FROM r;
transforms this: (A,B), (B,C), (B,D)
into this: (A,B), (A,C), (A,D), (B,C), (B,D)
Performance wise it's not magical; it quite literally repeatedly scans the database, broadening the search tree until there is no fringe. On a table I have with nearly exactly that structure of about 300 rows and limited nesting depth, the transitive closure takes 4 ms on a crappy VPS.
Of course I'm sure graph databases are somehow caching the results or performing some other optimization; you can achieve the same effect in RDBMSes with materialized views (which just cache the results of this query & keep it updated). Oracle natively supports materialized views; you can hack it easily in PostgreSQL and MySQL (though it should really be built in).
Further, insertion time becomes worst case O(N^2). Consider a graph with two connected components (each having O(N/2) nodes). Once you add an edge that connects them, updating the view will require O(N^2/4) operations - you need to connect every node in the first component to every node in the second.
A graph database is more specialized than SQL, so they can perform a very simple optimization. You don't scan the database, you just follow links. Links are typically stored with the node, so it's O(1) time.
[1] In a connected graph, there is a constant time algorithm - return True. But that won't work if you also want edge distance.
I don't know how your database is implementing things under the hood, so I don't know. I'll just list the major advantages of a graph database:
1) Edges stored with nodes. You can almost certainly read an edge and all it's neighbors with a single disk seek. This is unlikely to be the case for a SQL system.
2) Compute time is always O( portion of graph traversed), not O(graph size), and is always under the control of the user. In SQL, this will vary quite a bit with implementation.
3) You can specify various heuristics for the traversal - depth first, breadth first, etc. This is important if the specific nature of your data will affect the speed of the query.
1) I contest that this is "unlikely" for SQL -- MySQL (unless I'm misremembering) orders data on disk by its primary key, which would be an equivalent optimization. (PostgreSQL does not do this to my knowledge.)
2) In my two queries above, I demonstrate both O(graph) and O(traversed) (in that order). At least in PostgreSQL, this is very much under control of the user as well, since WITH queries are evaluated in a specific order.
3) There is no way to specify such things in SQL. However most SQLs are designed with the idea that the query planner is smarter than you (it almost always is, at least in PostgreSQL), so you shouldn't be trying to specify execution plans anyway.
Overall, I see no reason for graph databases & SQL to be separate. It seems that if graph DBs really only address performance issues as you claim, that one is throwing out the baby with the bathwater by creating a new kind of database rather than simply adding specialized optimizations to e.g. PostgreSQL.
In fact, looking up the edges is multiple seeks. For an edge (A, B), if the primary key is the row itself, edges are only stored in sequence if you are starting at node A.
Graph DB's usually cheat by store edges on both nodes, if possible within the same page as the node.
2) I'll take your word for it.
3) I generally agree that the query planner is smarter, but I'd be very surprised if that is true for complicated recursive queries. In particular, I really doubt that a query planner can exploit the structure of your data to improve things.
I.e., I really doubt the planner will know whether a DFS or BFS is better, or come up with heuristics to select the optimal path to go down (this is heavily dependent on data). Graph databases tend to provide such functionality.
I absolutely agree with you that better graph support for a SQL DB would be ideal.
SQL doesn't scan either -- tables are indexed via either a hash (O(1)) or a tree (O(log edges)). While I do realize the speed advantage of a specialized structure, I feel this could be better implemented as a specialized index/storage mechanism in SQL than as an entirely new database system.