SQL is syntactic sugar for relational algebra (scattered-thoughts.net)

216 points by dmarto 2 years ago | 145 comments

bkanuka 2 years ago |

As someone who learned mathematics first and programming later, I think it took me about 10 years of working in data-intensive programming before I could write really "good" SQL from scratch.

I completely attribute this to SQL being difficult or "backwards" to parse. I mean backwards in the way that in SQL you start with what you want first (the SELECT) rather than what you have and widdling it down. Also in SQL (as the author states) you often need to read and understand the structure of the database before you can be 100% sure what the query is doing. SQL is very difficult to parse into a consistent symbolic language.

The turning point for me was to just accept SQL for what it is. It feels overly flexible in some areas (and then comparatively rigid in other areas), but instead of fighting against this or trying to understand it as a consistent, precise language, I instead just go "oh SQL - you are not like the other programming languages I use but you can do some pretty neat stuff so we can be on good terms".

Writing good SQL involves understanding the database, understanding exactly the end result you want, and only then constructing the subqueries or building blocks you need to get to your result. (then followed by some trial and error of course)

Winsaucerer 2 years ago | |

I feel like a foreigner in another land when I read your comment and others like it. For as long as I can remember using SQL, I can't remember ever finding it more difficult or backwards than anything else I use.

That difference might go some way towards explaining why I prefer a much more database heavy/thick approach to writing apps than my peers.

sodapopcan 2 years ago | | |

I agree. I never even thought about "select what you want first" as a problem until someone else pointed it out.

Programmers seem far too sensitive about wanting everything to work one way. SQL is a very powerful DSL. It has its quirks but nothing that ever enraged me. I don't really care that it doesn't work like some other stuff I use, I just accept that I'm learning the language of a particular domain. This doesn't mean that I don't think there is always room for improvement. Of course I think FROM first would be a little nicer, but so much nicer that I think it's worth changing a whole battle-tested standard? Not at all. The pain is so minimal I don't even feel it.

naasking 2 years ago | | |

> I feel like a foreigner in another land when I read your comment and others like it. For as long as I can remember using SQL, I can't remember ever finding it more difficult or backwards than anything else I use

Learn LINQ or query/list comprehensions and then you'll easily see why SQL is backwards.

htag 2 years ago | | |

I learned SQL before I learned set theory. While learning set theory I remember thinking "oh this notation is just SQL backwards." Afterwards I began to find SQL much harder because I realized there are so many ways to mathematically ask for the same data, but SQL servers will computationally arrive at the end differently and with very different performance. This is a minor deal if you're just doing small transactions on the database, because if you are dealing with pages of 100 objects it's trivial to hit good-enough performance benchmarks, even with a few joins.

I was first introduced to the issue of needing hyper-optimized SQL in ETL-type tasks, dealing with very large relational databases. The company switched to a non-relational database shortly after I left, and it was the first time I professionally witnessed someone make the switch and agreed that it was obviously required for them. We were dealing with very large batch operations every night, and our Fortune 500 customers expected to have the newest data and to be able to do Business Intelligence operations on the data every morning. After acquiring bigger and bigger customers, and collecting longer and longer histories of data, our DBA team had exhausted every trick to get maximum performance from SQL. I was writing BI SQL scripts against this large pool of SQL data to white-glove some high-value customers, and constantly had to ask people for help optimizing the SQL. I did this for a year at the beginning of my career, before deciding to move cities for better opportunities.

Lately, I've begun seeing the requirements of high-performance SQL again with the wave of microservice architectures. The internal dependency chain, even of what would have been a mid-size monolith project a decade ago, can be huge. If your upstream sets a KBI of a response time, it's likely you'll get asked to reduce your response time if your microservice takes up more than a few percentage points of the total end-to-end time. Often, if you are using relational SQL with an ORM, you can find performance increases in your slowest queries by hand-writing the SQL. Many ORMs have a really good library for generating SQL queries that they expose to users, but almost all ORMs will allow you to write a direct SQL query or call a stored procedure. The trick to getting performance gains is to capture the SQL your ORM is generating and show it to the best SQL expert that will agree to help you. If they can write better SQL than the ORM generated, then incorporate it into your app and have the SQL expert and a security expert on the PR. You might also need to do a SQL migration to modify indexes.

So in summary, I think your experience with SQL depends heavily on your mathematical background and your professional experience. It's important to look at SQL as computational steps to reach your required data and not simply as a way to describe the data you would like the SQL server to give you.

mamcx 2 years ago | | |

> I can't remember ever finding it more difficult or backwards than anything else I use.

This is the major problem. SQL looks like it is not "difficult". You don't see (as a user) all its MASSIVE, HUGE problems.

That is why:

- People barely do more than basic SQL

- People can't imagine SQL can be used for more than that, which leads to:

- Doing a lot of hacky, complex, unnecessary stuff on app code (despite the RDBMS being capable of it)

- Trying to layer something "better" in the forms of ORM

- Refusing to use advanced stuff like views, stored procedures, custom types, and the like

- Or using advanced stuff like views, stored procedures, custom types, and the like, but wrongly

- Thinking that SQL means RDBMS

- So when the RDBMS fails, it is because the RDBMS is inferior. But in fact, it is SQL that has failed (you bet the internals of the RDBMS are far more powerful than any NoSQL engine; unfortunately, they are buried forever because SQL is a bad programming interface for the true potential of the engine!)

- So dropping SQL/RDBMS for something better, like JS (seriously?)

- And they are happier with their "cloud scale" NoSQL that rarely performs better, needs major, massive hacks for queries, or reimplements, poorly, ACID again, is more prone to data issues, etc.

And this is not even the start. If you think "it is bad to make a full app, all its code, in the relational model", that is how much brain damage SQL has caused.

---

I can count on my fingers the number of semi-proper DB/SQL usages in my niche (ERPs), and those are mostly mine! (For example: I use dates for dates, not strings, like many of my peers!) And that is taking into account that I actually learned what the heck that "relational" thingy is after 20+ years of professional use.

Go figure!

P.S.: And then I go to my code and see "what the heck, I could have done this in a few lines of SQL" and "what the heck, if only SQL were well designed I could do these dozen lines of SQL in 3!"

whatever1 2 years ago | |

The trial and error is the worst part.

In traditional languages, you can print iteration by iteration the intermediate result and understand if there is something wrong.

In SQL you sample output, and you keep changing the query until you think you get it right. And then 2 years later someone else finds that the query was wrong all this time.

jalk 2 years ago | | |

Common Table Expressions (CTEs) do help a little, as you can query each “table” and inspect the output. Debugging a giant query with deeply nested subqueries is very painful indeed.
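A minimal sketch of that workflow, using Python's built-in sqlite3 module. The `orders` table, its rows, and the `inspect` helper are made up for illustration; the point is only that each named CTE step can be selected from on its own while the rest of the chain stays unchanged.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL);
    INSERT INTO orders (customer, amount) VALUES
        ('ann', 10.0), ('ann', 25.0), ('bob', 5.0), ('cy', 40.0);
""")

# Each CTE is a named intermediate "table" in the query.
CTES = """
WITH totals AS (
    SELECT customer, SUM(amount) AS total
    FROM orders
    GROUP BY customer
),
big_spenders AS (
    SELECT customer, total
    FROM totals
    WHERE total > 30
)
"""

def inspect(step):
    """Run the same CTE chain, but read from one intermediate step.

    `step` is a hard-coded CTE name here, never user input, so plain
    string formatting is acceptable for this sketch.
    """
    query = f"{CTES} SELECT * FROM {step} ORDER BY customer"
    return conn.execute(query).fetchall()

totals_rows = inspect("totals")        # output of the first step
big_rows = inspect("big_spenders")     # output of the second step
print(totals_rows)
print(big_rows)
```

Swapping the name passed to `inspect` is the SQL equivalent of moving a print statement one step further down the pipeline.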

remus 2 years ago | | |

> The trial and error is the worst part.

I don't know about anyone else, but I do this kinda naturally when writing SQL queries. Usually start with a base table, query the first 100 rows to see what the data looks like, start joining on other tables to get info I need, querying as I go to check join conditions, perhaps build out some CTEs if I need to do some more complex work, query those to check the format of the data ... And so on.

It doesn't feel that different to any other programming in that sense. Querying is printing.

fifilura 2 years ago | | |

> you can print iteration by iteration the intermediate result

You would not be able to do that with a multi-threaded/multi-process application.

And this is the reason why e.g. Trino/Presto is so powerful together with SQL.

Instead of telling the computer how to go about getting your result, you tell it what result you want and let it get there in the best way.

The most up-front way of telling a computer "how" is a for-loop. And SQL does not have it. It may seem limiting, but avoiding explicit for loops gives the freedom to the computer. If it sees it fit to distribute that calculation over 200 distributed CPUs it can do that. With an imperative language you need to tell the computer exactly how it should distribute it. And from there it gets really hairy.

Ma8ee 2 years ago | | |

Trial and error is usually a bad idea in all kinds of programming.

r00fus 2 years ago | | |

I mean, I never build a query from front to back. Usually I build it FROM -> JOIN -> WHERE -> SELECT.

thaumasiotes 2 years ago | |

> widdling it down

Whittling. It means to carve something out of wood, with a metaphorical extension, as here, to gradually making something smaller by shaving bits of it away.

Strang 2 years ago | | |

Important distinction. "Widdling" is urination.

mrits 2 years ago | |

I always thought writing SQL from scratch was the easy part. The hard part for me was coming back to my query a few weeks later

arrowsmith 2 years ago | | |

This is true for most programming languages.

Winsaucerer 2 years ago | | |

That's why I try (but sometimes forget) to extensively comment my queries that have any kind of complexity :)

mdcurran 2 years ago | |

This doesn’t totally solve the issue of SELECT’ing first then filtering, but for complex queries I’ve found CTEs very useful (whenever the database/SQL dialect supports it).

icedchai 2 years ago | | |

What I usually do is start with "select *", get the joins and where clause down, then refine the select.

nextaccountic 2 years ago | |

> I completely attribute this to SQL being difficult or "backwards" to parse. I mean backwards in the way that in SQL you start with what you want first (the SELECT) rather than what you have and widdling it down.

> The turning point for me was to just accept SQL for what it is.

Or just write PRQL and compile it to SQL

https://github.com/PRQL/prql

392 2 years ago | |

You may like PRQL, which gives a more composable-atoms based approach. I find it far easier than SQL.

dmead 2 years ago | |

Saying what you want first rather than what you have is evidence of the von Neumann bottleneck, or it was a sign of the times when SQL was being developed on 1970s machines.

Either way, point taken that it is not like a proof.

ako 2 years ago | | |

Covey’s “start with the end in mind” is not bad advice when building something complex. With procedural languages you do the same: you first write the signature, the parameters expected to go in and out, and then you start writing the way to achieve this.

roenxi 2 years ago |

I'm glad that the article concluded "No" to its own headline. Calling SQL "syntactic sugar" is an insult to sugar. The "helpful diagram explaining how the scoping rules work" alone should make people blanch. The language is a syntactic disaster that we've been saddled with out of habit and inertia.

qazxcvbnm 2 years ago |

As someone who has implemented a composable SQL generator from user-defined algebras of (arbitrary SQL) queries using relational algebra, I understand the shortcomings of SQL when viewed from an angle of a relational query language.

However, SQL is a language with many facets (DML, DDL, DCL) other than 'relational' querying. Putting on a less mathematical and more engineering mindset, SQL ingratiates me by its interface to incredibly powerful primitives difficult to find anywhere else. (I've primarily worked with Postgres SQL)

Consider the humble function; in SQL https://www.postgresql.org/docs/current/sql-createfunction.h..., one can declare the function as `stable` or `immutable` to let the runtime optimise repeated calls; as `parallel` to let the runtime consider parallelisation; as `cost ...` and `rows ...` to aid optimiser cost estimation. Imagine if one could do that in C or JavaScript!

Another facet which regularly puts me in awe is the transaction isolation primitives and locking primitives offered by SQL.

I understand that as a database language, SQL necessarily has these within its specialised niche, but it seems to me these aspects of SQL as an interface to a language runtime would be equally useful in the everyday program; in all these areas of functionality, SQL is so much more advanced than nearly every other general purpose programming language.

triska 2 years ago |

Codd's seminal paper, A Relational Model of Data for Large Shared Data Banks, states that a language based on applied predicate calculus "would provide a yard-stick of linguistic power for all other proposed data languages". Quoting from https://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf:

"1.5 Some linguistic aspects

The adoption of a relational model of data, as described above, permits the development of a universal data sub-language based on an applied predicate calculus. A first-order predicate calculus suffices if the collection of relations is in normal form. Such a language would provide a yard-stick of linguistic power for all other proposed data languages, and would itself be a strong candidate for embedding (with appropriate syntactic modification) in a variety of host languages (programming, command- or problem-oriented)."

Languages based on predicate calculus indeed seem extremely suitable for reasoning about relational data. Datalog is a well-known example. It is more directly based on predicate logic, and much simpler than SQL.

refset 2 years ago |

> Lest you think is just one weird corner of the sql spec, I found this helpful diagram explaining how the scoping rules work (from Neumann and Leis, 2023)

It's an excellent diagram, it really conveys the dissonance. Incidentally I interviewed Viktor Leis on a podcast last week about the paper where it's from: https://juxt.pro/blog/sane-query-languages-podcast/

A lot of people seem to believe that LLMs or other ML methods can overcome the complexity challenges of generating SQL accurately, but I'm yet to be convinced that a database-powered AI revolution can happen without somehow bypassing SQL.

KingOfCoders 2 years ago |

Everything is just syntactic sugar for something else. I'm syntactic sugar for the hydrogen atoms in my body.

lkey 2 years ago | |

Your comment is both needlessly dismissive, and worse, incorrect. You cannot get a human being, much less the person that is you, by recursively applying a constant set of rewrite rules on unstructured hydrogen atoms.

If you think to counter my assertion with "The standard model of particle physics and the big bang already did that, I'm here after all", then spare us both the trouble and don't reply. The particular arrangement of all known matter and energy in the universe at t=0 is not a repeatable initial condition.

Some rewriting systems are in fact Turing complete[1], and that's an interesting digression. However, it's far afield from the article's discussion of untangling the syntactic mess that is the SQL standard and bringing it closer in line with the standard expression of its semantics.

[1]: https://www.sciencedirect.com/science/article/pii/0304397592...

KingOfCoders 2 years ago | | |

"constant set of rewrite rules"

We just don't know the rewrite rules. And I didn't say unstructured. And you need Carbon atoms etc. - hydrogen was just a shortcut.

WWWWH 2 years ago | |

And in some cases, sugar sugar, not syntactic sugar.

samsquire 2 years ago |

Thanks for this interesting post.

Intuitively, relational algebra compresses enumeration over data in time that a CPU executing billions of cycles a second can feasibly and efficiently traverse and execute against many collections of millions or billions of records in human perceivable time thanks to indexes.

I've been trying to think of systems communicating with each other as parts of a relational model, in the sense that we can model system behaviour as a series of events and a join is a communication between components.

I would love to talk about this with people.

exabrial 2 years ago |

I’d much rather deal with the peculiarities of SQL than any of the attempted replacements (ones I’ve seen in my limited experience). Elastic, for instance, and other JSON-based languages are absolutely terrible. We lost something when we stopped writing ANSI standards.

We’ve even stayed on InfluxDB OG versions _because of_ the SQL-like syntax, and also their improved languages are a nuclear disaster area.

SQL, despite its flaws (null != null) is pretty good enough!
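The NULL quirk mentioned above can be seen in a few lines with Python's built-in sqlite3 module. This is a small sketch, and the table `t` is invented for the demonstration. Under SQL's three-valued logic, comparing NULL with NULL yields NULL (unknown) for both `=` and `<>`, and a WHERE clause only keeps rows whose condition is true.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Both comparisons evaluate to NULL (None in Python); only IS NULL is true.
comparisons = conn.execute(
    "SELECT NULL = NULL, NULL <> NULL, NULL IS NULL"
).fetchone()
print(comparisons)

# The practical consequence: "= NULL" in a WHERE clause never matches a row.
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (None,)])
eq_count = conn.execute("SELECT COUNT(*) FROM t WHERE x = NULL").fetchone()[0]
is_count = conn.execute("SELECT COUNT(*) FROM t WHERE x IS NULL").fetchone()[0]
print(eq_count, is_count)
```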

jameshart 2 years ago |

Not totally convinced by the ORDER BY obstacles that the author raises.

    table('test').project('a').orderBy('b')

> That's an error, because we can't order by a column that we just projected away. Right?

This assumes that 'projection' completely eliminates part of the underlying relation, but why does that have to be the case?

If a relation includes 'selected fields' and 'hidden fields', and project just 'hides' the fields it doesn't project, while orderBy can operate on either projected or hidden fields, this ends up being perfectly sound.

Even the more complex example which is translated as follows:

   translate('select a+1 as c from test order by b,c')
   =>
   table('test').project('a','b').addColumn('a+1', as='c').orderBy('b','c').project('a')

would work fine as:

   table('test')                 // selected: [a, b, ...], hidden: []
      .addColumn('a+1', as='c')  // selected: [a, b, c, ...], hidden: []
      .project('c')              // selected: [c], hidden: [a, b, ...]
      .orderBy('b','c')          // selected: [c], hidden: [a, b, ...]

(not sure why there's a .project('a') on the end of their version)

Which is a reasonably local, algebraic transformation.
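The selected/hidden idea can be sketched in a few lines of Python. Everything here (the `Relation` class, its method names, and the sample rows) is hypothetical and only mirrors the pseudocode in this comment; it is not the article's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    rows: tuple      # dicts holding every field, selected or hidden
    selected: tuple  # names of the fields that appear in the output

    def add_column(self, name, fn):
        # Compute a new field from each row and mark it as selected.
        rows = tuple({**r, name: fn(r)} for r in self.rows)
        return Relation(rows, self.selected + (name,))

    def project(self, *names):
        # Hide the other fields instead of dropping them.
        return Relation(self.rows, tuple(names))

    def order_by(self, *names):
        # Sorting may use any field, whether selected or hidden.
        key = lambda r: tuple(r[n] for n in names)
        return Relation(tuple(sorted(self.rows, key=key)), self.selected)

    def output(self):
        # Only selected fields are visible to the caller.
        return [tuple(r[n] for n in self.selected) for r in self.rows]


def table(rows):
    return Relation(tuple(rows), tuple(rows[0].keys()))


test = [{"a": 1, "b": 2}, {"a": 5, "b": 1}, {"a": 3, "b": 2}]

# select a+1 as c from test order by b, c
result = (
    table(test)
    .add_column("c", lambda r: r["a"] + 1)
    .project("c")
    .order_by("b", "c")
    .output()
)
print(result)
```

The output contains only `c`, yet the row with the largest `c` comes first because its hidden `b` is the smallest.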

halayli 2 years ago |

Relational algebra IR is implemented in MonetDB and discussed in their paper. Definitely worth reading.

Not trying to be picky, but pure relational algebra doesn't map to SQL, and IMO it's not a good idea to attempt that: relational algebra treats relations as mathematical sets (no ordering, no duplicate tuples), while SQL does not (and also has to deal with nullability).
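The set-versus-bag mismatch is easy to observe. A small sketch with Python's built-in sqlite3 module, using a made-up `test` table: a plain SQL projection keeps duplicate rows, whereas a projection in pure relational algebra yields a set, which in SQL has to be requested explicitly with DISTINCT.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (a INTEGER, b INTEGER)")
conn.executemany("INSERT INTO test VALUES (?, ?)", [(1, 10), (1, 20), (2, 30)])

# SQL projection: a bag, duplicates survive.
bag = conn.execute("SELECT a FROM test ORDER BY a").fetchall()

# Set semantics must be asked for explicitly.
as_set = conn.execute("SELECT DISTINCT a FROM test ORDER BY a").fetchall()

# What a set-based projection would produce, modelled with a Python set.
algebra_projection = {row[0] for row in bag}

print(bag, as_set, algebra_projection)
```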

joking 2 years ago |

A few tweaks here and there and it would be nice enough for me. Most of them are actually implemented by some engines but are not part of the standard. Just changing the order of the FROM and SELECT clauses so autocomplete can know what fields you can use would be a nice enough change.

zvmaz 2 years ago |

I tried to study C. J. Date's books to understand relational theory... suffice it to say that I got nothing from his books, except a deep irritation partly due to his absolute pedantry...

I finally learned SQL with a gentle introduction by Alan Beaulieu. I stumbled upon another book that's about the theory: Applied Mathematics for Database Professionals, by Lex de Haan and Toon Koppelaars. Maybe these authors will benevolently teach me relational theory.

But please avoid C. J. Date's books. And don't be him when writing a book or trying to explain something to another human being.

infogulch 2 years ago |

SQL is pretty good all things considered.

But I've always looked out for languages that can represent relational algebra concepts more directly. Maybe CozoDB is close, though still immature. Any recommendations?

samatman 2 years ago | |

I highly recommend the Third Manifesto. I could link this under most posts in this thread but I'll limit myself to two.

https://www.dcs.warwick.ac.uk/~hugh/TTM/DTATRM.pdf

The only problem there is that you might want to use a D language, and well. You can't. There was a product called Dataphor which one can find some writeups on but, baffling though I find this, there are no robust open-source relational databases which use a D language.

markisus 2 years ago | |

I’ve been using pandas, which exposes a Python slicing syntax for manipulating relational data. It also has a built-in join() function.

“select id, date from orders” is orders[["id", "date"]].

It’s meant for in-memory datasets but the syntax could be extended to work for other backends. I’m not sure if anyone is working on that.

__mharrison__ 2 years ago | | |

Ibis takes the notion of a dataframe and abstracts it from SQL backends.

scythmic_waves 2 years ago |

This is a great write up. There appear to be a few camps forming in the comments and I’m in camp “SQL is confusing and attempts to explain it in terms of relational algebra have felt inadequate to me”.

It also gives me some good follow up material to read. I’m particularly interested in that one link that forms subqueries and lateral joins in terms of a new “dependent join” operator.

zer00eyz 2 years ago | |

Go read: Database Design for Mere Mortals.

ERDs are your friend. Learn how to generate one, and how to read it.

The relations (not relational, not algebra) are IN the design; they are IN the ERD (as a tool to visualize). Even if you're not a visual thinker, the ERD might help you find a path between two distant tables.

Needing a subquery is rare. It happens, but a lot of subqueries would be better off as joins. The moment you grasp the design of something you're less likely to want to subquery.

Explain is your friend. Reading an explain plan is going to give you some good insight into what is going on UNDER the hood. Not only will it help you tune slow queries but it is more insight into how large queries decompose.

Lastly, there is nothing worse than having to query a badly designed DB. If you do a shit job on the first part everything else is going to be painful.

barfbagginus 2 years ago |

Can we call it syntactic ashtray? Because it feels like I'm sucking on 1970s ashtray when I see or use it.

Those who have read their Spivak 2017 will know that databases are just Co-presheaves of Ologs over the Kleisli Category of the Power-Set Monad, the Identity Monad, or the Giry Monad. I would like a QL that acts like it!

breezeTrowel 2 years ago | |

I know some of these words.

lkey 2 years ago | | |

Snark of grandparent aside: https://arxiv.org/abs/1102.1889 if you want to read more.

samatman 2 years ago |

If you're interested in what it would take to put relational databases back on the rigorous footing of relational algebras, the Third Manifesto is a good place to start.

https://www.dcs.warwick.ac.uk/~hugh/TTM/DTATRM.pdf

I find it somewhat sad that an implementation of a database with a proper D language hasn't broken out and become a ubiquitous tool for the profession. There were some proprietary versions shortly after the manifesto's publication, but it never caught on.

aoeusnth1 2 years ago |

I find that most people who object to SQL do not use TVFs. If you don’t have any tools to easily break down the steps of the work, of course SQL will feel like an opaque Write-only language. With TVFs you can easily iteratively add more complex steps to your query while checking your work while you build.

lbourdages 2 years ago | |

What does TVF mean? I haven't been able to find anything on Google; all I get is an Indian streaming service...

"Truth value function"?

housecarpenter 2 years ago | | |

Table-valued function.

keid 2 years ago |

See C.J. Date's "An Introduction to Database Systems," https://www.amazon.com/Introduction-Database-Systems-8th/dp/... This is not news.

xbar 2 years ago |

Discussions of what is/is not syntactic sugar are unapproachable for me because I cannot get past the abuse of sugar's essential functions in the tortured metaphor.

achr2 2 years ago |

You should look at LINQ in C#/.NET. The SQL-like syntax always has a function-first equivalent, which gets this point across fairly eloquently.

r00fus 2 years ago |

That diagram separating the syntactic vs. semantic layers of a SQL statement (from Neumann & Leis paper) is brilliant.

chubot 2 years ago |

An analogy I like is - Are Perl-style regexes (used in Python, Ruby, Java, .NET, etc.) syntactic sugar for regular languages?

The answer is no, because Perl added all sorts of imperative doodads to regexes, which can’t be easily represented and executed in the automata-based paradigm. Trying to do this is like a “research paper generator” (and not in a bad way), e.g.

Derivative Based Nonbacktracking Real-World Regex Matching with Backtracking Semantics - https://dl.acm.org/doi/abs/10.1145/3591262 (2023)

This is until Go and Rust, which used automata-based regexes from the beginning. I don’t think users have lost much.

Purely automata-based engines are kind of pleasant to write, because almost everything is in the compiler, and not in the runtime, e.g. https://github.com/andychu/rsc-regexp/blob/master/py/README....

That is, features like ? + * really are syntactic sugar for repetition. There’s also a lot of syntax sugar around character classes like [^a], and the runtime is very small.
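A quick way to see the sugar claim is to compare each sugared pattern with a desugared form built from concatenation, alternation, and star only. The sketch below uses Python's `re` module, which is a backtracking engine rather than an automata-based one, so it only checks that both forms accept the same sample strings; the sample list and the `same_language` helper are made up for illustration, and agreement on samples is not a proof of equivalence.

```python
import re


def same_language(sugared, desugared, samples):
    """True if both patterns accept exactly the same sample strings."""
    return all(
        bool(re.fullmatch(sugared, s)) == bool(re.fullmatch(desugared, s))
        for s in samples
    )


samples = ["", "a", "aa", "aaa", "b", "ab", "ba", "c"]

pairs = [
    ("a+", "aa*"),          # one or more = one, then zero or more
    ("a?", "(?:a|)"),       # optional = the symbol or the empty string
    ("[ab]", "(?:a|b)"),    # character class = alternation of its members
]

results = [same_language(sugared, plain, samples) for sugared, plain in pairs]
print(results)
```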

---

Likewise, SQL seems to have so many non-relational doodads in its language design, which cause problems for implementers. In this case, I think there’s an incentive problem with SQL: It benefits vendors if their dialect is harder to re-implement. Although certainly they’ve added many useful features too in 4-5 decades!

To me a language design issue is we never really “learned” to compose languages with different paradigms:

- the set-based paradigms like relational algebra and regular languages, with

- Turing-machine like code. (and also I/O!)

We never learned polyglot programming, so each language becomes its own source of “reckless growth” – its own parochial backwater.

Both regexes and SQL should be able to “escape” to normal code, and that would greatly simplify them. This can be done both by language implementers and by application programmers, i.e. “factoring” across languages. It’s not always obvious how to do this, but it certainly can be done more than we do it today.

---

I’d argue the same phenomenon – lack of language composition – leads to programming languages within YAML. GitHub Actions is nominally some kind of “declarative” scheduler specification, or graph (job -> job dependencies), but that’s not enough for many problems.

So it also has a bunch of doodads for escaping that model (to the extent it has a model).

Shell, Awk, and Make also grew many doodads (https://www.oilshell.org/blog/2016/11/14.html), which are not very well designed. They used to be declarative languages, but no longer are.

Although there is some distinction between “formerly set-based languages” like SQL and regex, and other “declarative” non-Turing-complete languages. But I think the language composition problem is approximately the same. Part of it is syntax, but a lot of it is semantics.

(copy of lobste.rs comment)