SQL is syntactic sugar for relational algebra(scattered-thoughts.net) |
SQL is syntactic sugar for relational algebra(scattered-thoughts.net) |
I completely attribute this to SQL being difficult or "backwards" to parse. I mean backwards in the way that in SQL you start with what you want first (the SELECT) rather than what you have and widdling it down. Also in SQL (as the author states) you often need to read and understand the structure of the database before you can be 100% sure what the query is doing. SQL is very difficult to parse into a consistent symbolic language.
The turning point for me was to just accept SQL for what it is. It feels overly flexible in some areas (and then comparatively ridgid in other areas), but instead of fighting against this or trying to understand it as a consistent, precise language , I instead just go "oh SQL - you are not like the other programming languages I use but you can do some pretty neat stuff so we can be on good terms".
Writing good SQL involves understanding the database, understanding exactly the end result you want, and only then constructing the subqueries or building blocks you need to get to your result. (then followed by some trial and error of course)
That difference might go some way towards explaining why I prefer a much more database heavy/thick approach to writing apps than my peers.
Programmers seem far too sensitive about wanting everything to work one way. SQL is a very powerful DSL. It has its quirks but nothing that ever enraged me. I don't really care that it doesn't work like some other stuff I use, I just accept that I'm learning the language of a particular domain. This doesn't mean that I don't think there is always room for improvement. Of course I think FROM first would be a little nicer, but so much nicer that I think its worth changing a whole battle-tested standard? Not at all. The pain is so minimal I don't even feel it.
Learn linq or query/list comprehensions and then you'll easily see why SQL is backwards.
I was first introduced to the issue of needing hyper optimized SQL in ETL type tasks, dealing with very large relational databases. The company switched to non-relational database shortly after I left, and it was the first time I professional witness someone make the switch and agreed that it was obviously required for them. We were dealing with very large batch operations every night, and our fortune 500 customers expected to have the newest data and to be able to do Business Intelligence operations on the data every morning. After acquiring bigger and bigger customers, and collecting longer and longer histories of data, our DBA team had exhausted every trick to get maximum performance from SQL. I was writing BI sql scripts against this large pool of SQL data to white-glove some high value customers, and constantly had to ask people for help optimizing the sql. I did this for a year at the beginning of my career, before deciding to move cities for better opportunities.
Lately, I've began seeing the requirements of high performance SQL again with the wave of microservice architectures. The internal dependency chain, even of what would have been a mid size monolith project a decade ago, can be huge. If your upstream sets a KBI of a response time, it's likely you'll get asked to reduce your response time if your microservice takes up more than a few percentage points of the total end to end time. Often, if you are using relational SQL with an ORM you can find performance increases in your slowest queries by hand writing the SQL. Many ORMs have a really good library for generating sql queries they expose to users, but almost all ORMs will allow you to write a direct sql query or call a stored procedure. The trick to getting performance gains is to capture the SQL your ORM is generating and show it to the best sql expert that will agree to help you. If they can write better SQL than the ORM generated than incorporate it into your app and have the SQL expert and a security expert on the PR. You might also need to do a SQL migration to modify indexes.
So in summary, I think your experiences with SQL depends heavily on your mathematical background and your professional experience. It's important to look at SQL as computational steps to reach your required data and not simply as a way to describe the data you would like the SQL server to give you.
This is the major problem. SQL looks like is not "difficult". You don't see (as a user) all their MASSIVE, HUGE, problems.
That is why:
- People barely do more than basic SQL
- People can't imagine SQL can be used for more than that, which leads to:
- Doing a lot of hacky, complex, unnecessary stuff on app code (despite the RDBMS being capable of it)
- Trying to layer something "better" in the forms of ORM
- Refusing to use advanced stuff like views, stored procedures, custom types, and the like
- Using of using advanced stuff like views, stored procedures, custom types, and the like, but wrongly
- Thinking that SQL means RDBMS
- So when the RDBMS fails, it is because the RDBMS is inferior. But in fact, is SQL that have failed (you bet the internals of the RDBMS are far more powerful than any NoSql engine, unfortunately, they are buried forever because SQL is a bad programming interface for the true potential of the engine!)
- So dropping SQL/RDBMS for something better, like JS (seriously?)
- And they are happier with their "cloud scale" NoSQL that rarely performs better, needs major, massive hacks for queries, or reimplements, poorly, ACID again, is more prone to data issues, etc.
And this is not even starting. If you think "is bad to make a full app, all their code, in relational model" that is how much brain damage SQL has caused.
---
I can count with my fingers the number of semi-proper DBs/SQL usage on my niche (ERPs) and that is mostly mine! (For example: I use dates for dates, not strings, like many of my peers!) and that is taking into account that I actually learned what the heck is that "relational" thingy after +20 years of professional use.
Go figure!
P.D: And then go to my code and see "what the heck, I could have done this in some few lines of SQL" and "what the heck, if only SQL were well designed I could do this dozen lines of SQL in 3!"
In traditional languages, you can print iteration by iteration the intermediate result and understand if there is something wrong.
In SQL you sample output, and you keep changing the query until you think you get it right. And then 2 years later someone else finds that the query was wrong all this time.
I don't know about anyone else, but I do this kinda naturally when writing SQL queries. Usually start with a base table, query the first 100 rows to see what the data looks like, start joining on other tables to get info I need, querying as I go to check join conditions, perhaps build out some CTEs if I need to do some more complex work, query those to check the format of the data ... And so on.
It doesn't feel that different to any other programming in that sense. Querying is printing.
You would not be able to do that with a multi-threaded/multi-process application.
And this is the reason why e.g. Trino/Presto is so powerful together with SQL.
Instead of telling the computer how to go by to get your result, you tell it what result you want and let it do it in the best way.
The most up-front way of telling a computer "how" is a for-loop. And SQL does not have it. It may seem limiting, but avoiding explicit for loops gives the freedom to the computer. If it sees it fit to distribute that calculation over 200 distributed CPUs it can do that. With an imperative language you need to tell the computer exactly how it should distribute it. And from there it gets really hairy.
Whittling. It means to carve something out of wood, with a metaphorical extension, as here, to gradually making something smaller by shaving bits of it away.
> The turning point for me was to just accept SQL for what it is.
Or just write PRQL and compile it to SQL
Either way, point taken that it is not like a proof.
However, SQL is a language with many facets (DML, DDL, DCL) other than 'relational' querying. Putting on a less mathematical and more engineering mindset, SQL ingratiates me by its interface to incredibly powerful primitives difficult to find anywhere else. (I've primarily worked with Postgres SQL)
Consider the humble function; in SQL https://www.postgresql.org/docs/current/sql-createfunction.h..., one can declare the function as `stable` or `immutable` to let the runtime optimise repeated calls; as `parallel` to let the runtime consider parallelisation, as `cost ...` and `rows ...` to aid optimiser cost estimation. Imagine if one could do that in C or Javascript!
Another facet which regularly puts me in awe is the transaction isolation primitives and locking primitives offered by SQL.
I understand that as a database language, SQL necessarily has these within its specialised niche, but it seems to me these aspects of SQL as an interface to a language runtime would be equally useful in the everyday program; in all these areas of functionality, SQL is so much more advanced than nearly every other general purpose programming language.
"1.5 Some linguistic aspects
The adoption of a relational model of data, as described above, permits the development of a universal data sub-language based on an applied predicate calculus. A first-order predicate calculus suffices if the collection of relations is in normal form. Such a language would provide a yard-stick of linguistic power for all other proposed data languages, and would itself be a strong candidate for embedding (with appropriate syntactic modification) in a variety of host languages (programming, command- or problem-oriented)."
Languages based on predicate calculus indeed seem extremely suitable for reasoning about relational data. Datalog is a well-known example. It is more directly based on predicate logic, and much simpler than SQL.
It's an excellent diagram, it really conveys the dissonance. Incidentally I interviewed Viktor Leis on a podcast last week about the paper where it's from: https://juxt.pro/blog/sane-query-languages-podcast/
A lot of people seem to believe that LLMs or other ML methods can overcome the complexity challenges of generating SQL accurately, but I'm yet to be convinced that a database-powered AI revolution can happen without somehow bypassing SQL.
If you think to counter my assertion with "The standard model of particle physics and the big bang already did that, I'm here after all", then spare us both the trouble and don't reply. The particular arrangement of all known matter and energy in the universe at t=0 is not a repeatable initial condition.
Some rewriting systems are in fact Turing complete[1], and that's an interesting digression. However, it's far afield from the article's discussion of untangling the syntactic mess that is the SQL standard and bringing it closer in line with the standard expression of its semantics.
[1]: https://www.sciencedirect.com/science/article/pii/0304397592...
We just don't know the rewrite rules. And I didn't say unstructured. And you need Carbon atoms etc. - hydrogen was just a shortcut.
Intuitively, relational algebra compresses enumeration over data in time that a CPU executing billions of cycles a second can feasibly and efficiently traverse and execute against many collections of millions or billions of records in human perceivable time thanks to indexes.
I've been trying to think of systems communicating with eachother as parts of a relational model in the sense we can model system behaviour as a series of events and a join is a communication between components.
I would love to talk about this with people.
We’ve even stayed on InfluxDB og versions _because of _ the SQL like syntax, and also their improved languages are a nuclear disaster area.
SQL, despite its flaws (null != null) is pretty good enough!
table('test').project('a').orderBy('b')
> That's an error, because we can't order by a column that we just projected away. Right?assumes that 'projection' completely eliminates part of the underlying relation, but why does that have to be the case?
If a relation includes 'selected fields' and 'hidden fields', and project just 'hides' the fields it doesn't project, while orderBy can operate on either projected or hidden fields, this ends up being perfectly sound.
Even the more complex example which is translated as follows:
translate('select a+1 as c from test order by b,c')
=>
table('test').project('a','b').addColumn('a+1', as='c').orderBy('b','c').project('a')
would work fine as: table('test') // selected: [a, b, ...], hidden: []
.addColumn('a+1', as='c') // selected: [a, b, c, ...], hidden: []
.project('c') // selected: [c], hidden: [a, b, ...]
.orderBy('b','c') // selected: [c], hidden: [a, b, ...]
(not sure why there's a .project('a') on the end of their version)Which is a reasonably local, algebraic transformation.
Not trying to be picky but pure relational algebra doesn't map to SQL and IMO it's not a good idea to attempt to do that due to the fact that relational algebra treats tuples as mathematical sets (ordering/uniqueness matters) while SQL does not(and has to deal with nullability).
I finally learned SQL with a gentle introduction by Alan Beaulieu. I stumbled upon another book that's about the theory: Applied Mathematics for Database Professionals, by Lex deHaan, and Toon Koppelaars. Maybe these authors will benevolently teach me relational theory.
But please avoid C. J. Date's books. And don't be him when writing a book or trying to explain something to another human being.
But I've always looked out for languages that can represent relational algebra concepts more directly. Maybe CozoaDB is close, though still immature. Any recommendations?
https://www.dcs.warwick.ac.uk/~hugh/TTM/DTATRM.pdf
The only problem there is that you might want to use a D language, and well. You can't. There was a product called Dataphor which one can find some writeups on but, baffling though I find this, there are no robust open-source relational databases which use a D language.
“select id, date from orders” is orders[“id”, “date”].
It’s meant for in-memory datasets but the syntax could be extended to work for other backends. I’m not sure if anyone is working on that.
It also gives me some good follow up material to read. I’m particularly interested in that one link that forms subqueries and lateral joins in terms of a new “dependent join” operator.
ERD's are your friend. Learn how to generate one, and how to read it.
The relations (not relational, not algebra) are IN the design they are IN the ERD (as a tool to visualize). Even if your not visual thinker the ERD might help you find a path between two distant tables.
Needing a subquery is rare. It happens but a lot of subqueries would be better off as joins. The moment you grasp the design of something you're less likely to want to sub query.
Explain is your friend. Reading an explain plan is going to give you some good insight into what is going on UNDER the hood. Not only will it help you tune slow queries but it is more insight into how large queries decompose.
Lastly, there is nothing worse than having to query a badly designed DB. If you do a shit job on the first part everything else is going to be painful.
Those who have read their Spivak 2017 will know that databases are just Co-presheaves of Ologs over the Kliesli Category of the Power-Set Monad, the Identity Monad, or the Giry Monad. I would like a QL that acts like it!
https://www.dcs.warwick.ac.uk/~hugh/TTM/DTATRM.pdf
I find it somewhat sad that an implementation of a database with a proper D language hasn't broken out and become a ubiquitous tool for the profession. There were some proprietary versions shortly after the manifesto's publication, but it never caught on.
"Truth value function"?
The answer is no, because Perl added all sorts of imperative doodads to regexes, which can’t be easily represented and executed in the automata-based paradigm. Trying to do this is like a “research paper generator” (and not in a bad way), e.g.
Derivative Based Nonbacktracking Real-World Regex Matching with Backtracking Semantics - https://dl.acm.org/doi/abs/10.1145/3591262 (2023)
This is until Go and Rust, which used automata-based regexes from the beginning. I don’t think users have lost much.
Purely automata-based engines are kind of pleasant to write, because almost everything is in the compiler, and not in the runtime, e.g. https://github.com/andychu/rsc-regexp/blob/master/py/README....
That is, features like ? + * really are syntactic sugar for repetition. There’s also a lot of syntax sugar around character classes like [^a], and the runtime is very small.
---
Likewise, SQL seems to have so many non-relational doodads in its language design, which cause problems for implementers. In this case, I think there’s an incentive problem with SQL: It benefits vendors if their dialect is harder to re-implement. Although certainly they’ve added many useful features too in 4-5 decades!
To me a language design issue is we never really “learned” to compose languages with different paradigms:
- the set-based paradigms like relational algebra and regular languages, with
- Turing-machine like code. (and also I/O!)
We never learned polyglot programming, so each language becomes its own source of “reckless growth” – its own parochial backwater.
Both regexes and SQL should be able to “escape” to normal code, and that would greatly simplify them. This can be done both by language implementers and by application programmers, i.e. “factoring” across languages. It’s not always obvious how to do this, but it certainly it can be done more than we do it today.
---
I’d argue the same phenomenon – lack of language composition – leads to programming languages within YAML. Github Actions is nominally some kind of “declarative” scheduler specification, or graph (job -> job dependencies), but that’s not enough for many problems.
So it also has a bunch of doodads for escaping that model (to the extent it has a model).
Shell, Awk, and Make also grew many doodads (https://www.oilshell.org/blog/2016/11/14.html), which are not very well designed. They used to be declarative languages, but no longer are.
Although there is some distinction between “formerly set-based languages” like SQL and regex, and other “declarative” non-Turing-complete languages. But I think the language composition problem is approximately the same. Part of it is syntax, but a lot of it is semantics.
(copy of lobste.rs comment)
The worse part is nobody dares changing anything because it would require application code changes. Don’t blame SQL. Blame incompetent people doing SQL badly.
It’s perplexing to me now why SQL was perplexing. Even knowing about relational theory and enough DBA to pass interviews, it somehow always seemed like we should have been able to treat databases like our OOP’s.
To think of the hours I wasted. Maybe I heard in a meeting “it’s the join table” and another colleague said “just use the association table,” and I was trying the n^2 debugging approach to solve it with model classes.
The best advice I got on this made sense later: “I wouldn’t even try to manage that [dataset] myself. Just figure out how to tell a database to do it—that’s its only job.”
There was a fundamental disconnect. I approached a class project differently not only because I was a DBA, but I am neurodivergent. Because my design didn’t align with his expectations, I got a C.
I had a 4.0 GPA prior to that, but that made me realize I wasn’t a fit for college.
Good thing as well because I am debt free today.
Everything you spoke to is around people doing relational algebra badly. That has little to do with SQL. The same mistakes would lead to the same problems if the database used, say, QUEL instead.
Perhaps it's not SQL that you love at all?
Random saturday ramblings, sorry about that :-D
I think if so that might explain the difference in query structure.
Even writing queries I think in terms of outputs then build the query from there.
I am blanking about the advantages of starting with from, people clearly share your view, but I have no idea why.
Maybe it comes from my C++ background where return type comes first, then function name, then inputs.
Reminds me of reading music actually… many first time noob’s, mostly out of frustration, realize how sub optimal staff notation is and seek to write something better. In the end it’s ‘Good enough’ and accept it’s flaws because no rewrite has proven to be a 10x better solution.
I think it's simply that most businesses and investors don't register SQL as having any real problems, and especially now with a resurgent interest in SQL the idea of attempting anything novel feels too risky.
Shameless plug of one recent attempt to offer something different: XTQL https://docs.xtdb.com/intro/what-is-xtql.html
SQL actually killed alternatives back in the 80s (and 90s, Postgres used to use a dialect of Ingres's QUEL, the "SQL" bit of PostgreSQL specifically denotes the switch over from QUEL to SQL).
But help me with "more concise". SQL has no boilerplate, its keywords are brief, you express the exact logic of the result set you're seeking. There are tons of SQL overlays or replacements embedded in programming languages or BI/reporting tools, and they are universally, more difficult to work with than straight SQL.
The basic idea was based on the "stream fusion" papers. So the relation was a stream in the stream fusion sense so it was pretty trivial to implement the normal relational operators in that paradigm. Changing this type of system to work on "events" as input would be pretty trivial.
The one thing I never managed to get to work was the actual "fusion" compiler hint. I kept trying variants of what he did in the paper but ghc just refused to optimise my stream/unsteam unstream/stream pairs away because it had already done some rewriting to them. I couldn't figure out how to apply the optimisation early enough to be effective.
[1] Which are a fantastic read if you're into CS whatever you think of my idea https://www.cs.tufts.edu/~nr/cs257/archive/duncan-coutts/str... and https://www.researchgate.net/publication/220802863_Rewriting...
I am unfamiliar with the source material and I have recorded the paper you linked to my reading list.
From a description of "stream fusion", it reminds me of Clojure's "transducers".
It also reminds me of Kafka's Table/Stream duality.
Term rewriting is something is really interesting to me.
Communication, protocols and Communicating sequential processes, session types are all ideas I am thinking about and trying to understand.
Makes me wonder just how far people have pushed Foreign Data Wrappers in practice.
They don't really work with indexes but instead regular files stored in partitions (where date is typically one of them).
This means that they only have to worry about the data (e.g. dates) that you are actually querying. And they scale up to the number of CPUs that particular calculation needs. They rarely choke on big query sizes. And big tables are not really an issue as long as you query only the partitions you need.
Of course with 20/20 hindsight that decision is easy to criticize. I suspect their primary concerns were to minimize risk and costs while meeting our customer's requirements. Even today, making a brand new Google product or Facebook backed open source project a hard dependency would be too much risk for an established business.
The bigger framework got me more static analysis and dependency graphs on top of that. Those features saved me lots of time and headaches (type checking, dependency graphs, etc). Mind you, the frameworks I worked with still allowed for raw SQL for those edge cases that still pop up.
If you need threads in production I think you will end up getting rid of your for loops anyway (or possibly, if you really want to, end up in a mutex/semaphore quagmire).
I must say, though, that there are other benefits with a declarative approach than just avoiding threading issues. But I guess it takes some getting used to.
I would say that the same "I cant step through my code" argument also goes for functional style code.
Except that it most likely will be orders of magnitude slower. Most databases are very good at what they are doing.
programmers and language users all have a mental translation layers that they use to go from conceptual->concrete using the syntax and idioms of their language of choice.
javascript doesn’t make sense in a lot of ways. calling it a disaster is hyperbole. go doesn’t make sense in a lot of ways. it’s not a disaster. unfamiliarity or disagreement with choices does not necessarily mean it’s a disaster
the only language with which i’m familiar that is a complete, unmitigated disaster is the english language, but i suppose that’s outside of the scope of this conversation
Out of curiosity, are you a native speaker of english or did you learn it as a second language?
I’m a native speaker and feel sorry for anyone who has to learn all its irregularities, etc. But the few times I’ve mentioned this to anyone who had to learn as a second language, they’d correct me and tell it was easy. Which baffled me.
Obviously as a native speaker, I’m in no place to judge it.
How is it "very" limited?
> I would actively avoid utilizing any query language where I have to count brackets
That's really an editor/tooling problem, solvable in many ways, but I guess a Python-like/Parinfer approach would be your preference? (where whitespace/indentation is significant)
I thought it was a problem as soon as IDEs had good SQL autocomplete. I got so used to depending on just being able to "tab my way through" autocompleting in other languages (e.g. if you do <objectVariable>.<propertyName>, it's obvious the set of property names can be narrowed down based on the type of the variable), that it immediately becomes apparent that doing select first sucks, because autocomplete has no good information until you get to the from clause. A lot of times with a good SQL editor like Datagrip I just do "SELECT * FROM foo" first, and then go back and edit the select columns because it can now autocomplete them quickly.
I now notice this in other places too, like I hate how in JavaScript you do `import { foo } from "moduleBar"`. I'd much rather do `from "moduleBar" import { foo }`.
> I now notice this in other places too, like I hate how in JavaScript you do `import { foo } from "moduleBar"`. I'd much rather do `from "moduleBar" import { foo }`.
Personally I prefer languages that don't make you import at all ;)
If SQL weren't so (needlessly) complex we would see much more competition across the database space.
I think there is more competition across the database space now than back when the SQL spec was less complex (say, in 1989 with SQL-89).
Also, much complexity in the spec comes from complex features; I really like grouping sets and window functions, and sure, that adds complexity; but it does allow users to express certain concepts that allow the database to more efficiently process data than sending everything to the user and letting the user solve the computations.
The big problem with SQL AFAIC is its poor (really complete lack of) composability.
https://news.ycombinator.com/item?id=39798528
et voila: https://www.cidrdb.org/cidr2024/papers/p48-neumann.pdf
see their comments on WITHIN GROUP and redundant declaration of identifiers. As for boiler plate, SQL's turf war over plain English words as reserved names creates this interesting situation of boiler plate identifiers in SQL to disambiguate from reserved keywords.
Some operations on groups are also too verbose, like getting top 5 items in each group.
SELECT results FROM source WHERE criteria
results = source(criteria)
It's rare to see someone want to change assignments in code to be like: source(criteria) -> results
Where I see it as the same thing: the SELECT columns, like the variable assignment, are the interaction point with the following lines of code.And yes, CTE ordering does annoy me because of this. Putting it in the middle is pretty much the worst order.
Indeed, which is why source(criteria) -> results makes more sense: the results definition is right next to the code that's going to be using that definition. If you put the results definition first as with SQL, then you have to scroll up to find the context (although perhaps Python's indentation sensitivity is the tripping point in this case). Not even mentioning the fact that the SQL way completely destroys any chance of code completion.
I'm going to boldly state that the SQL way is literally objectively wrong, in that there is no world in which SQL's choice is superior for general querying.
> or query/list comprehensions
List comprehensions are column first.
return source(criteria).results
In your SQL, `results` isn't the variable you're assigning to, it's the column you're reading from source.If I’m in psql I can tab complete columns in the where clause but not in the select because I haven’t actually given any information about what I’m selecting from yet.
But that is a valid usecase!
But method(object1, object2, object3, object4) is different than object.method().
And sure, multimethods are too complex for naive autocompletion. But it's useful for people to have tooling that can say "given an input tell me what outputs are possible" because that's how we write code.
I also wish we needed to explicitly take locks. In PostgreSQL at least (I think other dialects/engines too), figuring out what operation will take what lock and potentially mean downtime is left as an exercise to the reader. Force me to write WITH EXCLUSIVE TABLE LOCK when I add a non nullable column with a default!
select x, y, z
from Foo
where a or b
Here the opposite is the case: selection-first moves the return definition far from the subsequent code that uses it.So if you're going to support list comprehensions, a monadic do-style notation which lets you chain them and again places select last: