I did something similar recently: a block store for a Rust implementation of IPFS, which models a directed acyclic graph of content-addressed nodes.
https://github.com/actyx/ipfs-sqlite-block-store
I found that performance is pretty decent if you do almost everything inside SQLite using WITH RECURSIVE.
The documentation has some really great examples for WITH RECURSIVE. https://sqlite.org/lang_with.html
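To make the "do almost everything inside SQLite" idea concrete, here is a minimal sketch of a reachability query using WITH RECURSIVE from Python's built-in sqlite3 module. The `edges(parent, child)` table and the `descendants` helper are made up for illustration; this is not the schema of the linked block store.

```python
import sqlite3

# Hypothetical toy schema: one row per directed edge.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE edges (parent TEXT NOT NULL, child TEXT NOT NULL);
    CREATE INDEX edges_parent ON edges (parent);
    INSERT INTO edges VALUES
        ('a', 'b'), ('a', 'c'), ('b', 'd'), ('c', 'd'), ('x', 'y');
""")

def descendants(conn, root):
    """Return root plus every node reachable from it, in one SQL statement."""
    rows = conn.execute(
        """
        WITH RECURSIVE reachable(id) AS (
            SELECT ?                      -- seed row: the starting node
            UNION                         -- UNION drops rows already seen
            SELECT edges.child
            FROM edges JOIN reachable ON edges.parent = reachable.id
        )
        SELECT id FROM reachable ORDER BY id
        """,
        (root,),
    ).fetchall()
    return [row[0] for row in rows]

print(descendants(conn, "a"))
```

The whole traversal is a single round trip from Python to SQLite, which is the main difference from issuing one query per visited node.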
At a moderate overhead you could also return all seen nodes, plus a flag marking them as seen, as part of your intermediate data at each recursive step.
The postgres query optimizer struggles with recursive queries even when well suited to the problem though. Are they actually efficient in sqlite even for trees?
[0] - https://www.sqlite.org/cgi/src/artifact/636024302cde41b2bf0c...
[1] - https://charlesleifer.com/blog/querying-tree-structures-in-s...
https://www.amazon.com/Hierarchies-Smarties-Kaufmann-Managem...
I'm considering doing a js template string implementation for node.. cql`...` type thing with an internal compilation cache.
However, what's lacking from something like this is a detailed bill of the cost. I'd love to see some, any benchmark on a DB with > 10^6 edges to see how it goes. That's the other hand of the equation "just use sqlite and be happy" -- the expectation that performance will actually be reasonable.
It also persists namespace mappings so that e.g. schema:Thing expands to http://schema.org/Thing
The table schema and indices are defined in rdflib_sqlalchemy/tables.py: https://github.com/RDFLib/rdflib-sqlalchemy/blob/develop/rdf...
You can execute SPARQL queries against SQL, but most native triplestores will have a better query plan and/or better performance.
Apache Rya, for example:
> indexes SPO, POS, and OSP.
This is a silly pedantic point to make, but it is not necessarily trivial. E.g. it may be the case that a particular use-case scenario does not require massive efficiency, and has a lot to gain from the simplicity of SQLite. In which case this kind of project is an amazing thing to exist.
And if there is a way to get a valid benchmark comparison against a more traditional "efficient" graph database, then informed decisions can be made.
As a personal anecdote, a friend and I based a graph-based project on neo4j and were very happy ... until it was time to deploy. We then realised the installations involved were highly complex, rarely supported on traditional webhosts, and costs involved for adopting 'formal' commercial solutions were highly prohibitive. Had we known about this project at the time we would have definitely used it instead (at least as a proof of concept; you can always switch to a more efficient database later if you really have to)
My latest API+multiple frontends application uses Neo4j as the only database and we deployed with Docker (compose) with great success. With the config in git we were able to do the traditional test-new-versions-on-a-branch-before-deploy and everything is solid.
I would benchmark the tasks "traversal", "aggregation" and "shortest past" for a 10k to 10M node graph. Anything under 10k would be good enough with most techs and over 10M need to consider more tasks (writes, backup, the precise fields queried can become their particular problems at larger scale).
The GitHub link implements "traversal" in Python instead of pure SQLite. I suspect it will be around 10x slower than it could be with the same tech stack, because it queries once per node from Python to SQLite. Shortest path is not implemented and would be too slow to be useful in an interactive environment. "Aggregation" is also not implemented, but it would perform admirably, because SQL is good at that.
Traditional relational OLTP databases such as Postgres are already faster than dedicated graph databases for certain graph related tasks, according to this benchmark: https://www.arangodb.com/2018/02/nosql-performance-benchmark...
It is indeed quite common that relational databases outperform graph databases on certain graph processing problems such as subgraph queries (a.k.a. graph pattern matching). There are two key reasons for this: (1) most graph pattern matching operations can be formulated using relational operations such as natural joins, antijoins, and outer joins; and (2) relational databases have been around longer and have well-optimized operators.
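As a small illustration of point (1), here is a sketch of a graph pattern (a directed triangle) written as plain self-joins in SQLite. The `edges(src, dst)` table and the sample data are assumptions for the example, not taken from any of the systems mentioned.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE edges (
        src INTEGER NOT NULL,
        dst INTEGER NOT NULL,
        PRIMARY KEY (src, dst)
    );
    -- One directed triangle 1 -> 2 -> 3 -> 1, plus a tail 3 -> 4 -> 5.
    INSERT INTO edges VALUES (1, 2), (2, 3), (3, 1), (3, 4), (4, 5);
""")

# Pattern (a)->(b)->(c)->(a) expressed as a three-way self-join.
# The WHERE clause keeps only the rotation that starts at the smallest
# node id, so each triangle is reported once.
triangles = conn.execute("""
    SELECT e1.src, e2.src, e3.src
    FROM edges AS e1
    JOIN edges AS e2 ON e2.src = e1.dst
    JOIN edges AS e3 ON e3.src = e2.dst AND e3.dst = e1.src
    WHERE e1.src < e2.src AND e1.src < e3.src
""").fetchall()

print(triangles)
```

A fixed-size pattern like this needs no recursion at all, so the relational engine can plan it like any other multi-way join.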
A lot of the value that graph databases provide lies in their query languages which (for most systems) allow formulating path queries using a nice syntax (unlike SQL's WITH RECURSIVE which many people find difficult to read and write). Their property graph data model supports a schema-optional approach, which makes them better suited for storing semi-structured data. They also "provide efficient programmatic access to the graph, allowing one to write arbitrary algorithms against them if needed" [1].
With all that said, graph databases could be much faster on subgraph queries than relational databases and there are recent research results on the topic (worst-case optimal joins, A+ indexes, etc.). But these are not available in any production system yet.
shortest path typo, right?
SQLite is used a lot on the edge (mobile apps, ...), so it sounds like this project provides a graph database for the very same use case (I probably won't run Neo4j on mobile).
It’s a bad analogy, but SQLite to Postgres is like AMD vs Intel x86 CPUs, whereas a graph database is ARM. Can it be emulated? Yes. Is there a far greater potential for slowdown? Yes.
In the graph space you have Gremlin, Cypher, GQL and many other proprietary query engines (which also looks to be the case here).
Without that accessibility this feels a bit like pickling a NetworkX object.
https://github.com/schinckel/ulid-postgres/blob/master/ulid....
Of course many orders of magnitude slower than keeping it all in in-memory maps and doing the traversal there, but fast enough to not be a limiting factor.
Traversing a medium depth DAG with a million nodes to find orphaned nodes takes less than a second on average hardware.
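For readers wondering what such an orphan scan could look like, here is a rough sketch under assumed table names (`nodes`, `edges`, `roots`). It is not the actual schema or query of the block store linked above, just the general shape: compute everything reachable from the roots with a recursive CTE, then select the rest.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical schema: every node, its outgoing edges, and the set of
    -- roots that keep their descendants alive.
    CREATE TABLE nodes (id INTEGER PRIMARY KEY);
    CREATE TABLE edges (parent INTEGER NOT NULL, child INTEGER NOT NULL);
    CREATE TABLE roots (id INTEGER PRIMARY KEY);
    CREATE INDEX edges_parent ON edges (parent);

    INSERT INTO nodes VALUES (1), (2), (3), (4), (5), (6);
    INSERT INTO edges VALUES (1, 2), (1, 3), (2, 4), (3, 4), (5, 6);
    INSERT INTO roots VALUES (1);
""")

# Everything reachable from a root is "live"; all other nodes are orphans.
orphans = [row[0] for row in conn.execute("""
    WITH RECURSIVE live(id) AS (
        SELECT id FROM roots
        UNION
        SELECT edges.child
        FROM edges JOIN live ON edges.parent = live.id
    )
    SELECT id FROM nodes
    WHERE id NOT IN (SELECT id FROM live)
    ORDER BY id
""")]

print(orphans)
```

Here nodes 5 and 6 are connected to each other but not to any root, so they come back as orphans.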
One thing to be aware of is that SQLite has lots of tuning options, and they are all set to very conservative values by default.
E.g. the default synchronous mode is FULL, which means that it will flush all the way to disk after each write transaction. The default cache size is tiny.
With a bit of tuning you can get quite decent performance out of SQLite while still having full ACID guarantees, or very good performance for cases where you can compromise on the ACID stuff.
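A sketch of what "a bit of tuning" might look like from Python, assuming the knobs in question are the `journal_mode`, `synchronous` and `cache_size` pragmas. The specific values are illustrative, not recommendations, and the file path is a throwaway temp file.

```python
import os
import sqlite3
import tempfile

# WAL needs a real file, so use a scratch database on disk.
db_path = os.path.join(tempfile.mkdtemp(), "graph.db")
conn = sqlite3.connect(db_path)

# Write-ahead logging: readers no longer block the writer.
journal_mode = conn.execute("PRAGMA journal_mode = WAL").fetchone()[0]

# NORMAL syncs less often than the default FULL. In WAL mode the database
# stays consistent, but the most recent commits may be lost on power failure.
conn.execute("PRAGMA synchronous = NORMAL")

# A negative cache_size is a size in KiB (here roughly 64 MiB) rather than
# a number of pages.
conn.execute("PRAGMA cache_size = -65536")

synchronous = conn.execute("PRAGMA synchronous").fetchone()[0]
cache_size = conn.execute("PRAGMA cache_size").fetchone()[0]
print(journal_mode, synchronous, cache_size)
conn.close()
```

Note that `journal_mode = WAL` is stored in the database file, while `synchronous` and `cache_size` apply per connection and have to be set again each time the database is opened.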
I have not yet found a situation where nosql databases like leveldb offer an orders of magnitude advantage over SQLite, and SQLite is so much more powerful and robust...
Unless you have an abnormally high edge count that sounds super slow to me. Even accounting for metadata overhead and disk page slop you're only reading and processing tens of megabytes, and every algorithm in sight is linear. I'd be surprised if you couldn't get a 2-5x speedup by reading the whole table to RAM in your favorite compiled/jitted language and just traversing it there.
> I have not yet found a situation where nosql databases like leveldb offer an orders of magnitude advantage over SQLite, and SQLite is so much more powerful and robust...
I have no skin in that game, but would some of the nosql solutions not perform significantly better under heavily concurrent insertions and the other workloads they were designed for?
I'm not sure if this is possible in SQLite, as far as I know the WITH clause is limited to SELECT statements.
> Are they actually efficient in sqlite even for trees?
Recursive common table expressions work by adding returned rows to a queue and then performing the recursive select statement independently on each row in the queue until it's empty.
You can use WITH RECURSIVE to traverse a tree by adding the root node to the queue and recursively visiting adjacent rows until the queue is empty. This works correctly and quickly because trees have only a single path between nodes. If you try the same query on a DAG, though, it will return every path to a given node, and you then have to perform a GROUP BY outside of the recursive query to find the shortest path. In the worst case, if you have a graph with many paths between nodes, this method is exponentially slower than a standard BFS.
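The blow-up is easy to reproduce. The sketch below builds a made-up DAG of 10 chained "diamonds" (40 edges) and walks it with UNION ALL, so every distinct path produces its own row; the GROUP BY for the shortest depth happens outside the recursive part. Table and node names are invented for the example.

```python
import sqlite3

def build_diamond_chain(conn, k):
    """s0 -> {a0, b0} -> s1 -> {a1, b1} -> s2 ... : 2**k paths from s0 to sk."""
    conn.execute("CREATE TABLE edges (src TEXT NOT NULL, dst TEXT NOT NULL)")
    conn.execute("CREATE INDEX edges_src ON edges (src)")
    for i in range(k):
        for mid in (f"a{i}", f"b{i}"):
            conn.execute("INSERT INTO edges VALUES (?, ?)", (f"s{i}", mid))
            conn.execute("INSERT INTO edges VALUES (?, ?)", (mid, f"s{i + 1}"))

# Recursive walk from s0 that emits one row per path, tagged with its depth.
WALK = """
    WITH RECURSIVE walk(id, depth) AS (
        SELECT 's0', 0
        UNION ALL
        SELECT edges.dst, walk.depth + 1
        FROM edges JOIN walk ON edges.src = walk.id
    )
"""

conn = sqlite3.connect(":memory:")
build_diamond_chain(conn, 10)

# How many rows does the walk produce for the final node alone?
rows_for_target = conn.execute(
    WALK + "SELECT count(*) FROM walk WHERE id = 's10'"
).fetchone()[0]

# Shortest depth per node, computed by grouping after the recursion.
shortest = dict(conn.execute(WALK + "SELECT id, min(depth) FROM walk GROUP BY id"))

print(rows_for_target, shortest["s10"])
```

The row count for the last node doubles with every extra diamond, while a BFS with a visited set would touch each edge once. Using UNION instead of UNION ALL deduplicates rows that are identical in every column, which only helps when the per-path columns (depth, path string, etc.) happen to coincide.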
I haven't used docker much, but I don't know how it could help here (unless you misunderstood and were referring to locally installable software).
We just ran the backend and web frontend on a single Digital Ocean droplet, scaling as needed (we started with the $5/mo and it was fine)
- SQL/PGQ, a property graph query extension to SQL is planned to be released next year as part of SQL:2021.
- GQL, a standalone graph query language will follow later.
While it is a lot of work to design these languages, both graph database vendors (e.g. Neo4j, TigerGraph) and traditional RDBMS companies (e.g. Oracle [2], PostgreSQL/2ndQuadrant [3]) seem serious about them. And with a well-defined query language, it should be possible to build a SQL/PGQ engine in (or on top of) SQLite as well.
[1] https://www.linkedin.com/pulse/sql-now-gql-alastair-green/
[2] http://wiki.ldbcouncil.org/pages/viewpage.action?pageId=1062...
[3] https://www.linkedin.com/pulse/postgresql-oracle-graph-query...
Gremlin's main focus is defining traversal operations on property graphs. While it supports pattern matching [1], IMHO its syntax is not as clean as Cypher's. Gremlin queries are also difficult to optimize: while it is possible to define traversal rewrite rules, they are more involved than relational optimization rules. The fact that most open-source Gremlin implementations focus on distributed setups (e.g. a typical deployment of Titan/JanusGraph runs on top of Cassandra) also has implications for single-machine performance, which certainly did not help the adoption of Gremlin -- but this is not necessarily the problem of the query language. Overall, Gremlin is great for workloads where complex single-source traversal operations do the bulk of the work but it's less well-suited to global pattern matching queries such as the ones in the LDBC Social Network Benchmark's BI workload [2].
SPARQL focuses on the graph problems of the "semantic web" domain, which include not only pattern matching but semantic reasoning/inferencing. One can use it for pattern matching queries but with the following caveats:
- Its data model is based on triples so if one wants to return a node and its attributes (properties), one has to specify each of these attributes explicitly.
- On the execution side, returning these attributes might necessitate executing a number of self-join operations.
- Many SPARQL implementations also have performance limitations due to the extra complexity introduced by self-joins, lack of intra-query parallelism, etc.
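To show where the self-joins in these caveats come from, here is a sketch of a triple table in SQLite and the relational query needed to return each person together with two attributes. The table layout, prefixes and sample data are assumptions for illustration and do not reflect how any particular triplestore stores its data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical single-table triple layout: subject, predicate, object.
    CREATE TABLE triples (s TEXT NOT NULL, p TEXT NOT NULL, o TEXT NOT NULL);
    CREATE INDEX triples_spo ON triples (s, p, o);

    INSERT INTO triples VALUES
        ('ex:alice', 'rdf:type',     'schema:Person'),
        ('ex:alice', 'schema:name',  'Alice'),
        ('ex:alice', 'schema:email', 'alice@example.org'),
        ('ex:bob',   'rdf:type',     'schema:Person'),
        ('ex:bob',   'schema:name',  'Bob');
""")

# Each attribute to be returned costs one more join against the same table.
# The LEFT JOIN plays the role of an optional pattern for the email.
people = conn.execute("""
    SELECT t.s, n.o AS name, e.o AS email
    FROM triples AS t
    JOIN triples AS n ON n.s = t.s AND n.p = 'schema:name'
    LEFT JOIN triples AS e ON e.s = t.s AND e.p = 'schema:email'
    WHERE t.p = 'rdf:type' AND t.o = 'schema:Person'
    ORDER BY t.s
""").fetchall()

print(people)
```

With a property graph or a conventional row layout, name and email would simply be columns on the same record, which is the contrast the list above is drawing.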
The "RDF* and SPARQL* approach" is an initiative to amend the self-join problem by introducing nested triples in the data model. It's currently being worked on by a W3C community group [3]. SPARQL also has "property paths", which allow regular path queries, i.e. traversals where the node/edge labels conform to some regular expression (the "property" in "property paths" has nothing to do with "property graphs").
SQL/PGQ and GQL target the property graph data model and support an ASCII-art like syntax for pattern matching queries (inspired by Cypher). They also offer some graph traversal operations (including shortest path and regular path queries). Additionally, GQL supports returning graphs so its queries can be composed.
[1] https://en.wikipedia.org/wiki/Gremlin_(query_language)#Decla...
[2] https://ldbc.github.io/ldbc_snb_docs/workload-bi-reads.pdf
[3] https://blog.liu.se/olafhartig/2019/01/10/position-statement...