Introduction to Datalog

Introduction to Datalog (blogit.michelin.io)

362 points by jgrodziski 3 years ago | 89 comments

grose 3 years ago |

Datalog is great for representing authorization rules. Check out Biscuits, which are auth tokens with Datalog embedded in them. This article is what made it 'click' for me: https://www.clever-cloud.com/blog/engineering/2021/04/15/bis...

I actually thought that Datalog was so cool that I went to learn Prolog and it completely changed the way I think about programming. Highly recommend trying out logic programming if you haven't before.
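On the authorization point, here is a minimal Python sketch of what "rules over facts" can look like. It is not Biscuit's syntax or API; the facts and the names `owner`, `grant`, `rights` and `allowed` are all made up for illustration. It only shows the shape of the idea: permissions are derived from facts by rules rather than stored directly.

```python
# Hypothetical facts: who owns which resource, and explicit grants.
owner = {("alice", "file1"), ("bob", "file2")}
owner_ops = {"read", "write"}            # operations every owner gets
grant = {("bob", "file1", "read")}       # (user, resource, operation)


def rights(owner, owner_ops, grant):
    """Derive the right/3 relation from two Datalog-style rules:

        right(U, R, Op) :- owner(U, R), owner_op(Op).
        right(U, R, Op) :- grant(U, R, Op).
    """
    derived = {(u, r, op) for (u, r) in owner for op in owner_ops}
    return derived | grant


def allowed(user, resource, op):
    """Check a request against the derived rights."""
    return (user, resource, op) in rights(owner, owner_ops, grant)
```

A check like `allowed("bob", "file1", "write")` fails here because no rule derives that fact, which is the "deny unless derivable" behaviour you usually want from an authorization policy.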

burakemir 3 years ago | |

Agree but also want to point out that people usually have a narrow view of "logic programming". Datalog can also be understood without the top-down evaluation / resolution that is typically associated with Prolog, which is why it is known to database researchers and in finite model theory. Prolog is great, but bottom-up techniques to evaluate Datalog are awesome, too, and would arguably also qualify as logic programming. It is rare to see this acknowledged.

YeGoblynQueenne 3 years ago | | |

Acknowledged, by whom? As far as I know people in the logic programming community have no trouble recognising datalog as a logic programming language, be it evaluated bottom-up or not (you can still evaluate a datalog program top-down, by resolution, as if it were a Prolog program without "compound" terms, a.k.a. functions).

Indeed, I get the feeling that the primacy of Prolog as the logic programming language has waned. If you look at back issues of ICLP* proceedings you'll find plenty of work that has nothing to do with resolution, namely Answer Set Programming, which is wildly popular.

I tend to think it's mainly the database community that kind of ignores the logic-programming nature of datalog.

Oh and btw, when I talk about datalog, I mean definite clauses without functions, not the aberrant syntax in the article above. I'll never understand why people do that.

________________

* ICLP is the International Conference on Logic Programming.

wslh 3 years ago |

In the last few months mentions of Datalog have increased. I wondered how it differed from graph databases and found a clear answer on SO [1]. I am not an expert but found graph databases and clause approaches interesting.

[1] https://stackoverflow.com/questions/29192927/a-graph-db-vs-a... (2015)

refset 3 years ago | |

XTDB, which is mentioned in the post, is subtly different from the other Clojure-based Datalog systems in this respect, because its Datalog engine executes in terms of multi-way joins using a "Worst-Case Optimal Join" implementation that is ideal for graph processing (vs. a tree of binary hash joins). Therefore, based on statistics and query planning heuristics, it will often perform graph pattern matching before resolving the logic/Horn clauses. (source: I work on the XTDB team)
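To illustrate the difference between a tree of binary joins and a multi-way join, here is a toy Python sketch on the classic triangle query `tri(A, B, C) :- e(A, B), e(B, C), e(A, C)`. This is not XTDB's implementation; the function names are made up, and the second function is only written in the spirit of variable-at-a-time algorithms such as Generic Join.

```python
from collections import defaultdict


def _adjacency(edges):
    """Index edges as node -> set of successors."""
    out = defaultdict(set)
    for a, b in edges:
        out[a].add(b)
    return out


def triangles_binary(edges):
    """Binary plan: join e(A,B) with e(B,C) first, then filter on e(A,C).

    Returns the triangles and the size of the intermediate 2-path result.
    """
    out = _adjacency(edges)
    paths = [(a, b, c) for (a, b) in edges for c in out.get(b, ())]
    tris = {(a, b, c) for (a, b, c) in paths if c in out.get(a, ())}
    return tris, len(paths)


def triangles_multiway(edges):
    """Multi-way plan: bind A, then B, then intersect the candidates for C.

    No intermediate 2-path relation is materialised.
    """
    out = _adjacency(edges)
    tris = set()
    for a in list(out):
        for b in out[a]:
            for c in out[a] & out.get(b, set()):
                tris.add((a, b, c))
    return tris
```

On a graph with one high-degree hub, the binary plan builds a large list of 2-paths through the hub that mostly get thrown away, while the multi-way plan only ever looks at the intersection of two successor sets.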

eternalban 3 years ago | | |

Interesting architecture:

https://raw.githubusercontent.com/xtdb/xtdb/master/docs/conc...

Btw, is that 'RocksDB or ?' for the local store current, or can other storage engines be plugged in?

p.s. this is datomic's architecture for comparison.

https://docs.datomic.com/on-prem/images/clientarch_orig.svg

felixyz 3 years ago | |

I did an interview [1] with Kevin Feeney, one of the founders (no longer active) of TerminusDb, which goes into some depth about the difference between RDF stores and (property) graph databases, where the former is more closely aligned with datalog and logic programming. There are links to a really excellent series of blog posts by Kevin on this topic in the show notes.

[1] https://thesearch.space/episodes/5-kevin-feeney-on-terminusd...

forks 3 years ago | | |

I love The Search Space. Waiting patiently for new episodes!

westurner 3 years ago | | |

With RDF* and SPARQL* ("RDF-star" and "SPARQL-star") how are triple (or quad) stores still distinct from property graphs?

RDFS and SHACL (and OWL) are optional in a triple store, which expects the subject and predicate to be string URIs, and there is an object datatype and optional language:

  (?s ?p ?o <datatype> [lang])

  (?subject:URI, ?predicate:URI, ?object:datatype, object_datatype, [object_language])

RDFS introduces rdfs:domain and rdfs:range type restrictions for Properties, and rdfs:Class and rdfs:subClassOf.

`a` means `rdf:type`; which does not require RDFS:

  ("#xyz", a,        "https://schema.org/Thing")
  ("#xyz", rdf:type, "https://schema.org/Thing")

Quad stores have a graph_id string URI "?g" for Named Graphs:

  (?g ?s ?p ?o)

  ("https://example.org/ns/graphs/0", "#xyz", a, "https://schema.org/Thing")

  ("https://example.org/ns/graphs/1", "#xyz", a, "https://schema.org/ScholarlyArticle")
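As a rough illustration of the `(?g ?s ?p ?o)` shape, here is a small Python sketch of quad pattern matching where `None` plays the role of an unbound variable. It is not how any particular triple store works; `QUADS` and `match` are names invented for this example, object datatype and language tags are left out, and `a` is spelled out as the full `rdf:type` URI.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Each quad is (graph, subject, predicate, object).
QUADS = [
    ("https://example.org/ns/graphs/0", "#xyz", RDF_TYPE,
     "https://schema.org/Thing"),
    ("https://example.org/ns/graphs/1", "#xyz", RDF_TYPE,
     "https://schema.org/ScholarlyArticle"),
]


def match(quads, g=None, s=None, p=None, o=None):
    """Yield every quad that agrees with the bound positions of the pattern.

    A position left as None is treated as a variable and matches anything.
    """
    pattern = (g, s, p, o)
    for quad in quads:
        if all(want is None or want == got
               for want, got in zip(pattern, quad)):
            yield quad
```

For example, `list(match(QUADS, s="#xyz", p=RDF_TYPE))` returns both quads, while adding `g="https://example.org/ns/graphs/1"` narrows it to the named graph that types `#xyz` as a ScholarlyArticle.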

There's a W3C CG (Community Group) revising very many of the W3C Linked Data specs to support RDF-star: https://www.w3.org/groups/wg/rdf-star

Looks like they ended up needing to update basically most of the current specs: https://www.w3.org/groups/wg/rdf-star/tools

"RDF-star and SPARQL-star" (Draft Community Group Report; 08 December 2022) https://w3c.github.io/rdf-star/cg-spec/editors_draft.html

GH topics: rdf-star, rdfstar: https://github.com/topics/rdf-star, https://github.com/topics/rdfstar

pyDatalog does datalog with SQLAlchemy and e.g. just the SQLite database: https://github.com/pcarbonn/pyDatalog ; and it is apparently superseded by IDP-Z3: https://gitlab.com/krr/IDP-Z3/

From https://twitter.com/westurner/status/1000516851984723968 :

> A feature comparison of SQL w/ EAV, SPARQL/SPARUL, [SPARQL12 SPARQL-star, [T-SPARQL, SPARQLMT,]], Cypher, Gremlin, GraphQL, and Datalog would be a useful resource for evaluating graph query languages.

> I'd probably use unstructured text search to identify the relevant resources first.

noduerme 3 years ago | |

That's a really neat example of something I'm not familiar with. Going up a tree from child to parent is often the heaviest part of dealing with regular datasets, and usually requires a mix of queries and application logic. The idea of flattening the data along some pattern like that is of course always possible in a relational db, but it's not usually efficient, especially not for heavy writing. Lateral joins and window partitions can help. But this seems like an interesting approach to removing the app code completely.

flyingsilverfin 3 years ago | |

I work on TypeDB (https://vaticle.com/typedb), and it sits somewhere at this intersection. The exposed query language has elements of both logic programming constructs and graph-like structures. Both amount to a kind of "constraint" programming.

rapnie 3 years ago | | |

A quick peek shows it seems along similar lines as TerminusDB sorta kinda, but they have WOQL [0]. At this time I start to worry again about all the different kinds and flavours of query languages that are emerging.

[0] https://en.wikipedia.org/wiki/TerminusDB#Query_language

felixyz 3 years ago | | |

I really like TypeDB! Haven't been able to use it for anything serious yet, but have a couple of projects brewing where it might fit :)

cmrdporcupine 3 years ago | |

You might be interested in https://relational.ai/

Treats graph edges as binary relations ("graph normal form"), has a Datalog-ish language. Built for managing large interconnected knowledge sets in a declarative way.

I recommend this talk: https://www.youtube.com/watch?v=WRHy7M30mM4

felixyz 3 years ago | | |

Great project, not open source alas. This is another great talk about RelationalAI (and its precursor), highlighting how using powerful databases can simplify complex applications: https://www.hytradboi.com/2022/experience-report-building-en...

muattiyah 3 years ago |

ICYMI, there's an excellent interactive introduction to `datalog` that's referenced in the article's references.[0]

Last time I used `datalog` was years ago, I was developing an internal interactive tool that was used to compare different approaches to solving a certain problem at my employer. I used `datascript`[1] by way of clojurescript to store all experiment data and then interrogated the `datascript` DB via `datalog`. This is something I always remember fondly.

[0] https://www.learndatalogtoday.org/ [1] https://github.com/tonsky/datascript

pavlov 3 years ago |

As mentioned in the article, Datomic is a database that uses Datalog as its query language:

https://docs.datomic.com/on-prem/query/query.html#why-datalo...

(Some ten years ago I worked at a startup that used Datomic. It seemed to work great, although the only queries I ever needed to add to the system were simple copy-paste hacks of existing ones, so I never got to dive into Datalog.)

dmitriid 3 years ago | |

Datascript is the open source analog for Clojure, ClojureScript and JS: https://github.com/tonsky/datascript

simongray 3 years ago | | |

There are many open source alternatives in Clojure using this query language: https://github.com/simongray/clojure-graph-resources#datalog

mpenet 3 years ago | | |

'ish.

datahike would be the closest to datomic in terms of features/implementation (support for as-of, transactor etc).

Then in terms of maturity I think the choice is between xtdb and datascript, both are very solid/maintained but they are also vastly different.

samuell 3 years ago |

A bit related, just stumbled upon Flix, a functional JVM language with Datalog constraints and (somewhat?) Go-like concurrency:

https://flix.dev

HN Thread from 8 months ago: https://news.ycombinator.com/item?id=31448889

refset 3 years ago | |

Flix definitely looks interesting! For comparison, I ported the "Datalog Enriched with Lattice Semantics" example from that homepage to XTDB's (Clojure) Datalog after I saw it posted on HN originally: https://gist.github.com/refset/21b3fc1dec9a6928943073809e133...

ianpurton 3 years ago |

So I struggled with this.

I guess the intention is to be better than SQL but then I was left with "under which circumstances?".

With that question in mind I didn't feel the article addressed the issue.

The author might do better to think in terms of "what burning problem are we trying to fix and how did we fix it".

z5h 3 years ago |

I’ve been using Prolog a bunch recently, and also embedded and extended MicroKanren in a project. Something I came to appreciate was that Prolog’s depth-first search, and Kanren’s lazy stream approach are good with memory even when generating/searching through infinite solutions. It is my understanding that Datalog, on the other hand, will iteratively expand a set of data. Isn’t this a problem?

YeGoblynQueenne 3 years ago | |

"Iteratively expand a set of data"? I'm not sure what you mean here. I think you are probably talking about the "bottom up" evaluation strategy of datalog, right? That's where datalog is evaluated by a so-called TP-operator, which derives the set of all logical consequences of a datalog program by calculating its least fixed point. That's the same as the Least Herbrand Model (LHM) of the program, or, in other words, the set of atoms entailed by the program (atoms in the logical sense, of atomic formulae, not in the Prolog sense of constants).

That's the same thing that Prolog does, calculate the LHM of a logic program, but the difference is that datalog programs have finite LHMs, because they don't have functions that can be self-instantiated for ever ( f(x), f(f(x)), f(f(f(x))), ... ) and the bottom-up evaluation, that goes from the "body" to the "head" of a clause, avoids infinite left-recursions.
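For anyone who wants to see the bottom-up idea run, here is a deliberately naive Python sketch of iterating an immediate-consequence operator to its least fixed point. The program (`parent`/`ancestor`) and the function names are made up for illustration; a real engine would handle arbitrary rules rather than one hard-coded pair.

```python
def tp(facts, parent):
    """One application of the immediate-consequence operator for:

        ancestor(X, Y) :- parent(X, Y).
        ancestor(X, Z) :- parent(X, Y), ancestor(Y, Z).

    `facts` is the set of ancestor/2 tuples derived so far.
    """
    derived = set(parent)
    derived |= {(x, z) for (x, y) in parent for (y2, z) in facts if y == y2}
    return derived


def least_fixed_point(parent):
    """Apply tp repeatedly, starting from no facts, until nothing changes."""
    facts = set()
    rounds = 0
    while True:
        new = tp(facts, parent)
        rounds += 1
        if new == facts:
            return facts, rounds
        facts = new
```

Because there are no function symbols, only finitely many `ancestor` tuples can ever be built from the constants in `parent`, so the loop has to stop, even when `parent` contains a cycle.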

Prolog, evaluated top-down (clauses are "picked apart" head-first) can get stuck in infinite left-recursions, so datalog's finiteness, and its decidability under TP, is a big gain in efficiency, as anyone who has had to kill a Prolog console session because of an infinite loop will know.

Also, it is not widely recognised but I am the author of the dumbest and most inefficient TP Operator implementation in existence. Obviously I hang my head in shame and will not link to my code. I understand however that there are optimisations that one can perform that make bottom-up execution efficient, and even quite fast. Unfortunately, I don't know what they are :P
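One standard optimisation of this kind is semi-naive evaluation: instead of re-joining against every fact in each round, only join against the facts that were new in the previous round. A small Python sketch for the usual ancestor program follows; the names `semi_naive` and `parents_of` are invented here, and the join counter is only there to make the saved work visible.

```python
from collections import defaultdict


def semi_naive(parent):
    """Semi-naive evaluation of:

        ancestor(X, Y) :- parent(X, Y).
        ancestor(X, Z) :- parent(X, Y), ancestor(Y, Z).

    Returns the ancestor relation and the number of join steps performed.
    """
    parents_of = defaultdict(set)       # index: y -> {x | parent(x, y)}
    for x, y in parent:
        parents_of[y].add(x)

    total = set(parent)                 # everything derived so far
    delta = set(parent)                 # facts that are new since last round
    joins = 0
    while delta:
        candidates = set()
        for (y, z) in delta:            # only new facts feed the recursive rule
            for x in parents_of.get(y, ()):
                joins += 1
                candidates.add((x, z))
        delta = candidates - total      # keep only genuinely new facts
        total |= delta
    return total, joins
```

The loop ends as soon as a round produces nothing new, and each derived fact is pushed through the recursive rule only once.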

Note that Prolog can also be evaluated without fear of left-recursions, by SLG-Resolution (a.k.a. "tabling", a.k.a. memoization) but there is still the danger of infinite right recursions. Prolog is semi-decidable, because it is Turing-complete. Datalog is decidable, sacrificing Turing-completeness for, well, efficiency.

So, in short, it's not a problem if you consider the alternative, but of course there are trade-offs, always. It's like growing old, vs. dying young.

(I hope all this is not completely irrelevant to your question).

anon291 3 years ago |

Whatever language this is... This is not datalog. This looks like a particular implementation of datalog in Clojure.

Actual datalog looks like prolog.

cmrdporcupine 3 years ago | |

In reality, people are using "datalog" for a genre of datastore concepts based around Horn clauses, or, basically relations + implicit joins. Datalog as a subset or dialect of Prolog is only one variant of this. And Datomic has made an sexpr-syntaxed variant built around binary relations popular. To the point where some people in this thread can't seem to tell the two apart.

I am more interested in the general category of relational data model + logic programming than I am in any purity about Datalog in particular. In particular I'm very excited by "data/knowledge + behaviour sitting in a tree, k-i-s-s-i-n-g"

bogomipz 3 years ago | | |

Thanks. Is Datomic also a dialect of Prolog then?

anon291 3 years ago | | |

Yeah I know, I'm just being pedantic.

tannhaeuser 3 years ago |

The language discussed in TFA appears to be Datomic's proprietary Clojure DSL, but has nothing to do with Datalog/Prolog.

dimitar 3 years ago | |

It is a datalog dialect, and there are multiple open-source implementations: https://clojupedia.org/#/page/Datalog

k4st 3 years ago |

I created a datalog engine a few years back called Dr. Lojekyll: https://www.petergoodman.me/docs/dr-lojekyll.pdf

It was pretty cool; you could stream in new facts to it over time and it would incrementally and differentially update itself. The key idea was that I wanted the introduction of ground facts to be messages that the database reads (e.g. off of a message bus), and I wanted the database to be able to publish its conclusions onto the same or other message buses. I also wanted to be able to delete ground facts, which meant it could publish withdrawals of the prior-published conclusions. A lot of it was inspired by Frank McSherry's work, although I didn't use timely or differential dataflow. In retrospect I probably should have!
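To make the "publish conclusions, publish withdrawals" idea concrete, here is a very simplified Python sketch. It is not how Dr. Lojekyll or differential dataflow work: it recomputes the whole closure on every message and diffs the result, which is exactly the cost a real incremental engine avoids. The `Engine` class and the reachability rules are invented for illustration.

```python
def closure(edges):
    """reach(X, Y) :- edge(X, Y).
       reach(X, Z) :- edge(X, Y), reach(Y, Z)."""
    reach = set(edges)
    while True:
        new = {(x, z) for (x, y) in edges
               for (y2, z) in reach if y == y2} - reach
        if not new:
            return reach
        reach |= new


class Engine:
    """Holds ground facts; reports added and withdrawn conclusions."""

    def __init__(self):
        self.edges = set()
        self.conclusions = set()

    def _republish(self):
        current = closure(self.edges)
        added = current - self.conclusions
        withdrawn = self.conclusions - current
        self.conclusions = current
        return added, withdrawn

    def insert(self, fact):
        self.edges.add(fact)
        return self._republish()

    def delete(self, fact):
        self.edges.discard(fact)
        return self._republish()
```

Deleting a single ground fact can withdraw several conclusions at once, which hints at why consumers of those messages need to share the engine's consistency model.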

This particular system isn't used anymore because we made a classic monotonicity mistake by making it the brain of a distributed system, and then having it publish and receive messages with a bunch of microservices. The internal consistency model of the datalog engine didn't extend out to the microservices, and the possibility of feedback loops in the system meant that the whole thing could lie to itself and diverge uncontrollably! Despite this particular application of the engine being a failure, the engine itself worked quite well and I hope to one day return to datalog.

I think what a lot of people miss with datalog, and what becomes apparent as you use it more, is just how unpredictable many engines can be with the execution behavior of rules. This is the same problem that you have with a database, where the query planner makes a bad choice or where you lack an index, and so performance is bad. But with datalog, the composition of rules that comes so naturally also tends to compound this issue, resulting in time spent trying to chase down weird performance things and doing spooky re-ordering of your clause bodies to try to appease whatever choices the engine makes.
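A toy Python illustration of why clause order can matter, assuming an engine that evaluates a rule body strictly left to right with nested loops and that an index on the second column of `big` already exists. The relations `big` and `small` and all function names are hypothetical; real engines make their own planning choices.

```python
from collections import defaultdict


def build_index(big):
    """Index big(X, Y) by Y: y -> {x}. Assumed to exist ahead of time."""
    by_y = defaultdict(set)
    for x, y in big:
        by_y[y].add(x)
    return by_y


def big_first(big, small):
    """q(X) :- big(X, Y), small(Y).  Scans every big tuple."""
    out, visited = set(), 0
    for (x, y) in big:
        visited += 1
        if y in small:
            out.add(x)
    return out, visited


def small_first(index_by_y, small):
    """q(X) :- small(Y), big(X, Y).  Starts from the selective atom."""
    out, visited = set(), 0
    for y in small:
        visited += 1
        for x in index_by_y.get(y, ()):
            visited += 1
            out.add(x)
    return out, visited
```

Both orderings return the same answers, but with a large `big` and a tiny `small` the second touches far fewer tuples. Compose a few rules like this and the gap compounds.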

eunos 3 years ago |

Huh cool, I didn't realize it came from that Michelin.

eddieroger 3 years ago | |

I remember being surprised once that Michelin, like the star, was the same as Michelin, like the tire. It's really cool to see a company move beyond their core competency more than once in a meaningful way, even if the first time was to sell more tires. They have a really interesting blog that this is just a single article from.

pbronez 3 years ago | | |

I believe they started the star ratings as a way to encourage people to drive more. They saw it as part of “Travel”.

dataengineer56 3 years ago |

This is a cool concept but I'm not sure I'd fancy upskilling a team of SQL analysts to use this.

cmrdporcupine 3 years ago | |

Many SQL analysts are wasting their skills performing mental and syntactical gymnastics to get around the limitations of SQL in order to grasp at the actual conceptual elegance that lies underneath it. Most people who are writing SQL for a living already understand at least some part of what makes the relational model powerful. But SQL is a relatively poor tool for accessing it.

I personally don't find the Sexpr-based syntax of Datomic's variant of Datalog all that useful here, and yeah, maybe someone working in SQL for a living would struggle at first with that syntax. But it's not intrinsic to the model itself. Have a waltz through my employer's documentation and see what you think: https://docs.relational.ai/rel/primer/overview

I think it's quite understandable (if a bit terse) and many people doing SQL for a living would appreciate the ability to better compose and structure things in this way, not to mention the ability to handle transitive / recursive relationships in a less awkward way.

i_am_toaster 3 years ago |

Maybe it’s just me but I find the SQL much easier to read in all the examples given.

thiago_fm 3 years ago |

The problem with Datalog, and Clojure in general, is the licenses. Terrible licenses.

Everything is about Rich Hickey. Apache 1.0.

Now that Nubank basically owns it and there's very little progress or activity as of late, I don't see why one would choose to use Clojure, Datalog etc.

Also, a lot of functional programming concepts have since been added to big programming languages like Javascript and hell, even Java has lambdas now.

I'm guessing that also hardcore FP people have moved on to Haskell. The ones that like LISP to Racket... and only people tied to the JVM in legacy projects are with Clojure.

  writes_to(microservice1, db1).
  writes_to(microservice2, microservice1).
  writes_to(microservice3, microservice2).
  depends_on(X, Y) :- writes_to(X, Z), depends_on(Z, Y).
  depends_on(X, Y) :- writes_to(X, Y).
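For readers who don't have a Datalog or Prolog system handy, here is a rough Python rendering of the `writes_to` / `depends_on` program above. It answers the query `depends_on(x, Y)` for one given `x`, starting from the goal and following `writes_to` facts, with a `seen` set so that cycles cannot cause an infinite loop. The function name and the choice of an explicit stack are just one possible sketch.

```python
WRITES_TO = {
    ("microservice1", "db1"),
    ("microservice2", "microservice1"),
    ("microservice3", "microservice2"),
}


def depends_on(x, writes_to=WRITES_TO):
    """Return every Y such that depends_on(x, Y) holds.

    depends_on(X, Y) :- writes_to(X, Y).
    depends_on(X, Y) :- writes_to(X, Z), depends_on(Z, Y).
    """
    seen = set()
    stack = [x]
    while stack:
        node = stack.pop()
        for (src, dst) in writes_to:
            if src == node and dst not in seen:
                seen.add(dst)        # a new answer for Y
                stack.append(dst)    # keep following writes_to from it
    return seen
```

Calling `depends_on("microservice3")` walks down the chain to `db1`, while `depends_on("db1")` returns an empty set because `db1` writes to nothing.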