Introduction to Datalog(blogit.michelin.io) |
Introduction to Datalog(blogit.michelin.io) |
I actually thought that Datalog was so cool that I went to learn Prolog and it completely changed the way I think about programming. Highly recommend trying out logic programming if you haven't before.
Indeed, I get the feeling that the primacy of Prolog as the logic programming language has waned. If you look at back issues of ICLP* proceedings you'll find plenty of work that is nothing to do with resolution- namely, Answer Set Programming, which is wildly popular.
I tend to think it's mainly the database community that kind of ignores the logic-programming nature of datalog.
Oh and btw, when I talk about datalog, I mean definite clauses without functions, not the aberrant syntax in the article above. I'll never understand why people do that.
________________
* ICLP is the International Conference on Logic Programming.
[1] https://stackoverflow.com/questions/29192927/a-graph-db-vs-a... (2015)
https://raw.githubusercontent.com/xtdb/xtdb/master/docs/conc...
Btw, is that 'RocksDB or ?' for the local store current or other storage engines can get plugged in?
p.s. this is datomic's architecture for comparison.
[1] https://thesearch.space/episodes/5-kevin-feeney-on-terminusd...
RDFS and SHACL (and OWL) are optional in a triple store, which expects the subject and predicate to be string URIs, and there is an object datatype and optional language:
(?s ?p ?o <datatype> [lang])
(?subject:URI, ?predicate:URI, ?object:datatype, object_datatype, [object_language])
RDFS introduces rdfs:domain and rdfs:range type restrictions for Properties, and rdfs:Class and rdfs:subClassOf.`a` means `rdf:type`; which does not require RDFS:
("#xyz", a, "https://schema.org/Thing")
("#xyz", rdf:type, "https://schema.org/Thing")
Quad stores have a graph_id string URI "?g" for Named Graphs: (?g ?s ?p ?o)
("https://example.org/ns/graphs/0", "#xyz", a, "https://schema.org/Thing")
("https://example.org/ns/graphs/1", "#xyz", a, "https://schema.org/ScholarlyArticle")
There's a W3C CG (Community Group) revising very many of the W3C Linked Data specs to support RDF-star: https://www.w3.org/groups/wg/rdf-starLooks like they ended up needing to update basically most of the current specs: https://www.w3.org/groups/wg/rdf-star/tools
"RDF-star and SPARQL-star" (Draft Community Group Report; 08 December 2022) https://w3c.github.io/rdf-star/cg-spec/editors_draft.html
GH topics: rdf-star, rdfstar: https://github.com/topics/rdf-star, https://github.com/topics/rdfstar
pyDatalog does datalog with SQLAlchemy and e.g. just the SQLite database: https://github.com/pcarbonn/pyDatalog ; and it is apparently superseded by IDP-Z3: https://gitlab.com/krr/IDP-Z3/
From https://twitter.com/westurner/status/1000516851984723968 :
> A feature comparison of SQL w/ EAV, SPARQL/SPARUL, [SPARQL12 SPARQL-star, [T-SPARQL, SPARQLMT,]], Cypher, Gremlin, GraphQL, and Datalog would be a useful resource for evaluating graph query languages.
> I'd probably use unstructured text search to identify the relevant resources first.
Treats graph edges as binary relations ("graph normal form"), has a Datalog-ish language. Built for managing large interconnected knowledge sets in a declarative way.
I recommend this talk: https://www.youtube.com/watch?v=WRHy7M30mM4
Last time I used `datalog` was years ago, I was developing an internal interactive tool that was used to compare different approaches to solving a certain problem at my employer. I used `datascript`[1] by way of clojurescript to store all experiment data and then interrogated the `datascript` DB via `datalog`. This is something I always remember fondly.
[0] https://www.learndatalogtoday.org/ [1] https://github.com/tonsky/datascript
https://docs.datomic.com/on-prem/query/query.html#why-datalo...
(Some ten years ago worked at a startup that used Datomic. It seemed to work great, although the only queries I ever needed to add to the system were simple copy-paste hacks of existing ones, so I never got to dive into Datalog.)
datahike would be the closest to datomic in terms of features/implementation (support for as-of, transactor etc).
Then in terms of maturity I think the choice is between xtdb and datascript, both are very solid/maintained but they are also vastly different.
HN Thread from 8 months ago: https://news.ycombinator.com/item?id=31448889
I guess the intention is to be better than SQL but then I was left with "under which circumstances?".
With that question in mind I didn't feel the article addressed the issue.
The author might do better to think in terms of "what burning problem are we trying to fix and how did we fix it".
That's the same thing that Prolog does, calculate the LHM of a logic program, but the difference is that datalog programs have finite LHMs, because they don't have functions that can be self-instantiated for ever ( f(x), f(f(x)), f(f(f(x))), ... ) and the bottom-up evaluation, that goes from the "body" to the "head" of a clause, avoids infinite left-recursions.
Prolog, evaluated top-down (clauses are "picked apart" head-first) can get stuck in infinite left-recursions, so datalog's finiteness, and its decidability under TP, is a big gain in efficiency, as anyone who has had to kill a Prolog console session because of an infinite loop will know.
Also, it is not widely recognised but I am the author of the dumbest and most inefficient TP Operator implementation in existence. Obviously I hang my head in shame and will not link to my code. I understand however that there are optimisations that one can perform that make bottom-up execution efficient, and even quite fast. Unfortunately, I don't know what they are :P
Note that Prolog can also be evaluated without fear of left-recursions, by SLG-Resolution (a.k.a. "tabling", a.k.a. memoization) but there is still the danger of infinite right recursions. Prolog is semi-decidable, because it is Turing-complete. Datalog is decidable, sacrificing completeness for, well, efficiency.
So, in short, it's not a problem if you consider the alternative, but of course there are trade-offs, always. It's like growing old, vs. dying young.
(I hope all this is not completely irrelevant to your question).
Actual datalog looks like prolog.
I am more interested in the general category of relational data model + logic programming than I am in any purity about Datalog in particular. In particular I'm very excited by "data/knowledge + behaviour sitting in a tree, k-i-s-s-i-n-g"
It was pretty cool; you could stream in new facts to it over time and it would incrementally and differentially update itself. The key idea was that I wanted the introduction of ground facts to be messages that the database reads (e.g. off of a message bus), and I wanted the database to be able to publish its conclusions onto the same or other message buses. I also wanted to be able to delete ground facts, which meant it could publish withdrawals of the prior-published conclusions. A lot of it was inspired by Frank McSherry's work, although I didn't use timely or differential dataflow. In retrospect I probably should have!
This particular system isn't used anymore because we made a classic monotonicity mistake by making it the brain of a distributed system, and then having it publish and receive messages with a bunch of microservices. The internal consistency model of the datalog engine didn't extend out to the microservices, and the possibility of feedback loops in the system meant that the whole thing could lie to itself and diverge uncontrollably! Despite this particular application of the engine being a failure, the engine itself worked quite well and I hope to one day return to datalog.
I think what a lot of people miss with datalog, and what becomes apparent as you use it more, is just how unpredictable many engines can be with the execution behavior of rules. This is the same problem that you have with a database, where the query planner makes a bad choice or where you lack an index, and so performance is bad. But with datalog, the composition of rules that comes so naturally also tends to compound this issue, resulting in time spent trying to chase down weird performance things and doing spooky re-ordering of your clause bodies to try to appease whatever choices the engine makes.
I personally don't find the Sexpr-based syntax of Datomic's variant of Datalog all that useful here, and yeah, maybe someone working in SQL for a living would struggle at first with that syntax. But it's not intrinsic to the model itself. Have a waltz through my employer's documentation and see what you think: https://docs.relational.ai/rel/primer/overview
I think it's quite understandable (if a bit terse) and many people doing SQL for a living would appreciate the ability to better compose and structure things in this way, not to mention the ability to handle transitive / recursive relationships in a less awkward way.
Everything is about Rich Hickey. Apache 1.0.
Now that Nubank basically owns it and there's very little progress or activity as of late, I don't see why one would chose to use Clojure, Datalog etc.
Also, a lot of functional programming concepts has been since added to big programming languages like Javascript and hell, even Java has lambdas now.
I'm guessing that also hardcore FP people have moved on to Haskell. The ones that like LISP to Racket... and only people tied to the JVM in legacy projects are with Clojure.
Excellent question.
Two of the most common use cases for databases are "transactional processing" (manipulating small numbers of rows in real time) and "analytical processing" (querying enormous numbers of rows, typically in a read-only fashion).
SQL is generally fine for transactional workloads.
But analytical queries sometimes involve multi-page queries, with lots of JOINs and CTEs. And these queries are often automatically generated.
And once you start writing actual multi-page "programs" in SQL, you may decide that it's a fairly clunky and miserable programming language. What Datalog typically buys you is a way to cleanly decompose large queries into "subroutines." And it offers a simpler syntax for many kinds of complex JOINs.
Unfortunately, there isn't really a standard dialect of Datalog, or even a particular dialect with mainstream traction. So choosing Datalog is a bit of a tradeoff: does it buy you enough, for your use case, that it's worth being a bit outside the mainstream? Maybe! But I'd love to see something like Logica gain more traction: https://logica.dev/
One thing that's really nice about Datalog is that it puts the focus on modelling relations between data instead of "tables" which often end up with a jumble of data and then need to be normalized to work properly. It pushes you into good structures by default, which is a really good property to have. It's also much much simpler than SQL, no need to think in terms of inner joins and outer joins and whatever else, it is all relations.
It also easily produces derived data from existing data, without needing any kind of procedural process, and is highly composable.
It's not so much that it's solving a burning problem than SQL, it's just better than SQL...
(note: the clojure syntax used in that page is much more confusing than the native Datalog syntax IMO)
1. Modularity. In Datalog it's very easy to name a common (sub-)query and reuse it in multiple queries. You can have stored procedures / functions in SQL, but the syntax is very different implementation to implementation and users are typically afraid of using the feature (the SQL/PSM does not go far enough to define what users would normally want). Also, SQL/PSM is an imperative extension of otherwise declarative language. The two just don't work well together.
2. Structures other than primitives and tables (you get lists, vectors, hash-maps and sets, but you can also make your own). A lot of practical SQL extensions also offer some of these structures, but they aren't in the standard.
3. Operations on primitive types fall into the same category / work in the same way as operations on complex types. I.e. select / update / insert / delete in SQL only apply to tables, but strings or numbers don't work in the same way. It's more uniform in Datalog (not 100%, but still better).
CTEs are a great and most welcome addition to SQL but they are a bolt-on patch as compared with Datalog where it is a core feature.
I'm not sure that is exactly right. Do you have a reference? (Not trying to put you on the spot, I'm just curious to learn the history!)
The way I see it, SQL's great strength as a query language is selection and projection whereas Datalog's great strength is inference.
It takes a shift in mindset to take advantage of inference. I've started using Datalog when analyzing unfamiliar codebases. So I might set up something like:
writes_to(microservice1, db1).
writes_to(microservice2, microservice1).
writes_to(microservice3, microservice2).
depends_on(X, Y) :- writes_to(X, Z), depends_on(Z, Y).
depends_on(X, Y) :- writes_to(X, Y).
Allowing you to query something like: depends_on(microservice3, X)
and get back db1, microservice1 and microservice2. Of course, you can ultimately do this in SQL as well, but it's far more compact in Datalog.Which brings me to the second half of the motivation - the principal advantage to the logical database model is the open world assumption. The relational model in practice tends to be fairly buttoned down - data is expected to suit a particular schema (which has its own advantages as a transactional system). This makes it easy to extend a logical database and ask more questions without changing the fact formats already specified.
Of course, I can ultimately do all the things that feel natural in Datalog in SQL. I can work through queries like the one above by building out the data model and writing recursive CTEs. It's about strengths, not possibilities which I suppose brings me back to why SQL 'won'. A disproportionate amount of code is written for transactional line-of-business systems. It's really not hard to see why the validation layer and transactional focus would win in those areas.
https://opensource.googleblog.com/2021/04/logica-organizing-...
In short, it's easier to compose (and decompose), which helps with complex queries that can be assembled from simpler, independently tested parts.
No personal experience in this, I just remembered that blog contrasting Datalog and SQL.
Aggregates are implicit groupings, not explicit slots. This wards you from risk of change and speeds up your feedback loop of making changes and experimenting with ideas. Plus you can compose new aggregate variants out of the same primitives (attributes) which is a fairly typical use-case.
Another is query composition. I'm more familiar with SQL and I'm somewhat comfortable with writing nested queries. But we all know how both query builders and similar are often hard to re-use and compose. With these datalog dialects, composition, refactoring and extraction of logical units is much more straight forward, even as a beginner. Queries are "flat", and consist clauses that can be moved around and composed. Think of having more associative and commutative operations.
On top of these, the mentioned technologies (in the article) are temporal and support time travel out of the box. That's something you can do on top of SQL of course but it's quite powerful to have it in-built as a primary feature. Whenever you have an application that does things like (non-exclusive), reporting, undo, revisions etc. You might want that.
In datalog you can. Either by using the comma or semicolon operator.
Foremost, Datalog has great compositionality, which is one of SQL:s big weaknesses (it was designed with completely different goals). Also, things like recursive queries are completely natural in Datalog, whereas in SQL they are bolted on with recursive common table expressions.
There's a good write-up on this aspect (amongst others) here: https://www.scattered-thoughts.net/writing/against-sql/
> recursive queries are completely natural in Datalog, whereas in SQL they are bolted on with recursive common table expressions
And similarly: https://github.com/frankmcsherry/blog/blob/master/posts/2022...
I have no idea who first came up with the name "datalog", and what exactly was their motivation (I've looked, but couldn't find the original reference to datalog), but the burning problem that datalog does fix is Prolog's semi-decidability, or in other words, its tendency to enter infinite recursions.
One half of the reason for that is Prolog's "compound terms" (known as functions, in First-Order Logic terminology) and the other half is Prolog's use of depth-first search for evaluation. Functions, along with logic variables, mean that a predicate's Herbrand base (its set of logical atoms) can be expanded forever: if f(x) is a term, f(f(x)) is a term, f(f(f(x))) is a term... and so on, ad infinitum. The depth-first search evaluation just gets stuck in left-recursions when the first body literal of a clause is a recursive literal: p(x):- p(x), q(x) loops forever, in Prolog with depth-first search.
Datalog is Prolog without functions other than constants, and it can be evaluated "bottom up", both of which overcome Prolog's semi-decidability (but eliminate its completeness).
But of course this is not a very clear motivation for its use as a database language, only its use as an alternative to Prolog.
More to the point, there are many datalog variants. Some that accept negation, some that do not, some that allow recursion, some that do not, and so on. There's a lot of information on different datalogs, seen from the point of view of database research in this free ebook on databases:
See chapters 12 - 15.
And see these lecture notes for a more logic-programm-y discussion of fixpoint semantics of definite logic programs:
https://www.doc.ic.ac.uk/~mjs/teaching/KnowledgeRep491/Fixpo...
Prolog's built-in search is not semidecidable. https://en.wikipedia.org/wiki/Decidability_(logic)#Semidecid...
As you observe, Prolog gets stuck in left-recursions. But iterative deepening for Prolog is semidecidable (among other fair methods). This indeed is restricted to the pure, monotonic subset of (full) Prolog together with all pure, monotonic extensions.
I did a bit of reading about Datalog and I get that itβs useful in things like static analysis where you get the fixed point analysis for free, but apart from that I am not really sure why Iβd use it.
If you work with VIEWs or CTEs a lot, then Datalog is your friend, as every derived relationship is just a VIEW. If you like to encapsulate and reuse queries, then Datalog is ideal. We had plenty discussions on HN about SQL code reuse. Not an issue in Datalog.
And lastly, there are many ways to compile Datalog programs to "executables" that take maybe a few CSV-files as input and give you the results. I'm not saying that this would be always faster than loading the CSVs into SQLite and running SQL against the data, but a lot of work has been done on Datalog query optimization and compilers can emit very efficient code.
There's been equally frequent releases and updates, I'm not sure how you came to this conclusion?
The best I can think of is that you're confusing core language updates with other tech. Clojure is structured differently than other languages, and core language/library updates are (and always have been) relatively rare while the surrounding ecosystem/tooling provided by the Clojure team is active.
Datalog is not a licensed product or software, the same way Answer Set Programming isn't.
JavaScript was "functional" since the beginning. It always had statically scoped first class closures.
But as you say, that's not very special anymore...
If you want to do primarily FP with a data oriented touch you want stuff like:
- immutable, persistent data structures
- a comprehensive library around transforming these
- expression based syntax
- idiomatic (functional style code) performance is good
- a comprehensive library around manipulating, composing and transforming functions
- more stuff that I'm forgetting right now
Clojure gives you all of that and _more_, such as a powerful REPL, macros, multiple dispatch, spec instrumentation, proper namespaces, ...
Personally I write JS in a very straight forward, functional style. But the differences is still night and day in many dimensions. There are fundamental issues with the language that are unfortunately unfixable. Just having first class function objects isn't cutting it.
You say this as if it's not an absolutely massive body of programmers. A lot of large companies are institutionally committed to JVM, and clojure has settled into a niche as the language teams within them use when they don't want to use java.
I expect there are more people working on projects that fit that description than there are ones working in racket and haskell combined.
it's like confessing one doesn't know what a Lisp lambda implies, in particular creation of closures to capture free variables, since Java creates a copy of free variable values, rather than inform the code generator as to allocation of said free variables, in order to honor the closure-extended lifetimes from stack allocation to heap allocation (for example).
Datomic β Datalog
I'm talking about this:
Query 1: SELECT a, b, c FROM table1 JOIN table2 ... Query 2: SELECT d, e, f FROM (SUBSELECT ... ) AS table3 JOIN table4 ...
Now, join those queries together. How? Well you have to analyze the join clauses in from and then combine those together, then rewrite the SELECTION. Okay, great... not so hard. Now use Query 2 in a recursive CTE for Query 1. Oh goodness... much harder. Or what about:
Query 1: SELECT a, b, c FROM table1 JOIN table2 ... Query 2: SELECT d, e, f FROM (SUBSELECT ... ) AS table3 JOIN table4 ... ORDER BY table4.column LIMIT 10
Now join 1 and 2 together... the syntax is much much much different.
Some SQL libraries in Haskell, like Beam, Opaleye, Squeal, etc, demonstrate the kind of composability I mean. In particular, all three of these offer a monadic interface where arbitrary queries can be joined using well-known relational operators (relational cross joins form a monad). I'm not sure of Opaleye and Squeal, but Beam does it's best to produce a human-comprehensible query.
Moreover, the first queries given above are 'easy' in the sense that the portion after the `FROM` in a regular SQL SELECT does actually compose well, since the JOIN operators are basically a direct translation of the underlying relational algebra. However, they start to fail when you want to join up aggregations and the like. While true that aggregations and ordering are not strictly relational, it is of great use to be able to compute with such things.
Again, the libraries above demonstrate the composability. For example, the beam library offers this: https://haskell-beam.github.io/beam/user-guide/queries/aggre...
Notice how we are able (in the second to last query), simply combine the query `all_ (track chinookDb)` with the complex aggregate expression. In this DSL, the entire aggregate could be floated out. In particular, notice how the queries for either the first `all_` or the second complex aggregate would look much different on their own. That's what I mean by SQL doesn't compose.
Of course all these DSLs have the problem that they're trying to work with an existing language SQL that is not nice and orthogonal, and so they all have their varying deficiencies.
Datalog mostly fixes these issues and provides a nice model.
Thanks, yes, I might be fudging terminology a bit.
Generally, when I talk about Prolog I mean definite programs (sets of definite clauses and Horn goals, the latter usually called "definite goals") under SLD-resolution. That's more or less what is usually meant by "pure" Prolog: definite programs, without not/1 implementing Negation-As-Failure, without the cut, and without side effects; and executed by giving an initial Horn goal as a "query". Entailment between definite clauses and satisfiability of definite programs is undecidable.
By "semi-decidable" I mean that an SLD-refutation proof will terminate if there is a branch of the proof that ends in β‘ and is of finite length. If such a branch does not exist, a proof will "loop" forever. That's regardless of whether the branch succeeds or fails, which is a bit of a fudge, but, in practice, there is no other way to decide the success or failure of a proof than to search all its branches.
Left-recursion is one way in which Prolog, with depth-first search, generates branches of infinite length, but you can get those with right-recursion also. There are restrictions of Prolog, like SLG-resolution (similar to DFS with iterative deepening) that don't loop on left-recursions but the general case remains undecidable.
Fortunately, there seem to be at least as many finite proofs as there are infinite ones, or in any case I have never encountered a Prolog program that looped inintially and that couldn't be rewritten to terminate, at least when called in the way it was meant to be called. And that's also a bit of a fudge.
This is what I always thought was the main benefit.
You are able to much easier raise your level of abstraction, by building up a vocabulary of definitions, and model your further queries (or definitions) using them.
In the case of RelationalAI, as far as I understand there is no way to even try it out without becoming a customer? I just wish there were, because I really do like the approach!
I have this fear because there's a history of that with novel query languages and DB platforms tossing in network/hierarchical/"document"/object-oriented features, and creating a dog's breakfast which loses the compositional/expressive power of the relational algebra. Think MongoDB or Redis. Conceptually a big mess.
RDF itself has a history of this as well. Appeals to novelty.
Or even Google's F1, which smashes hierarchical tree-structured protobufs into a SQL DB, and so has really weird behaviour on joins and projections.
Well, whatever, you know my opinions on this stuff, I think :-)
At this point I'd settle (or ask for) for a network available tuplestore which just receives relational-algebraic operators from a client, and optimizes/executes, and returns pure tuples, and the client-side could formulate whatever query language (or API) it wanted on top of that. I started playing with building something like that between the two jobs, but never got far.
Syntax also clearly is not the Prolog-ish syntax of Datalog "proper."
But my point above is, Datomic is but one product and implementation of these concepts. It's a wrong take to make a statement like "The problem with Datalog, and Clojure in general are the licenses"; there is a field of many products, all with multiple licenses, and also weird to make this "Clojure" swipe, when there's only tangential connection between "Datalog" and "Clojure" based on one product (and its offshoots).
I mainly responded because "Datomic β Datalog" is worthy of discussion in itself ...but I now realise this is probably the wrong subthread for that.
Edit: this presentation describes things differently but it doesn't sound quite right to me "Chandra and Harel - 1982 Studied the expressive power of logic programs without function symbols on relational databases" https://www.dbai.tuwien.ac.at/datalog2.0/slides/Kolaitis.pdf
I have a sneaking suspicion that "function-free Prolog" is as old as ordinary Prolog, and "datalog", as an idea separate to Prolog and used as a database language, was born in the database community, but like the OP I have no reference to this.
Basically, Prolog can do more than Datalog. Datalog is like, some subset of concepts of Prolog, but specialized for relational data queries, so it can achieve some optimizations and focus on data retrieval only.
Is that confusing enough?
Datalog originated specifically as a restricted subset of Prolog, which is why the βclassicβ syntax looks prolog-ish.
> Basically, Prolog can do more than Datalog. Datalog is like, some subset of concepts of Prolog, but specialized for relational data queries, so it can achieve some optimizations and focus on data retrieval only.
Datalog is a purely declarative, non-Turing complete subset of Prolog, removing the imperative features and assuring termination. This gives up lots of power, obviously, but it also, well, provides a termination guarantee. Removing imperative features (cut) from Prolog also means that Datalog implementations can use different (or multiple, switchable) strategies, while differing in what kinds of data sets and queries they perform well on, not correctness.
Including distributed strategies, as seen in Bloom http://boom.cs.berkeley.edu/