SQL:2023 has been released(iso.org) |
SQL:2023 has been released(iso.org) |
- SQL/PGQ - A Graph Query Language
- JSON improvements (a JSON type, simplified notations)
Peter Eisentraut gives a nice overview here: https://peter.eisentraut.org/blog/2023/04/04/sql-2023-is-fin...
SQL:2023 is finished: Here is what's new - https://news.ycombinator.com/item?id=35562430 - April 2023 (153 comments)
SELECT t.j.foo.bar[2], ... FROM tbl t ...Using a graph DB with many underlying KV-store nodes, you can have a single graph spread over many machines representing e.g. Facebook's social graph, and run a query which "chases around" edges between vertices that live on different nodes, to solve that query, while ensuring that as little of that has to happen as possible — both by rebalancing vertices so that data is sharded at low-connection-degree points in the graph; and by consolidating the steps of queries that occur on the same node into single batch queries, such that the whole thing becomes (close to) a single map/reduce step.
There's nothing in Postgres that knows how to do that; if you had e.g. a graph stored in a Citus hypertable, and did a recursive CTE over it to do graph search, then you'd get pretty dang bad perf.
What you really want is to apply graph processing to data as it is. The SQL 2023 additions are a step in the right direction. I need to find a good detailed description of the constraints and semantics to assess how good it is.
I suspect if Postgres had a solid implementation of SQL/PCQ it would be a similar story for Neo4j.
In particular it is nice to see that a core dev views JSON dot accessing and PCQ as "sensible" future additions to Postgres.
[1] - https://peter.eisentraut.org/blog/2023/04/18/postgresql-and-...
For example, OK, I realize auto-incrementing IDs are not the most important thing in the world, and arguably not even a good approach in many cases. But sometimes you want them, and helpfully almost every database engine I know of has some kind of support for this, even if the semantics may differ. It's a super basic thing to want a unique ID that roughly counts upward on a table. You might have specific needs about re-using numbers and whatnot, but the general idea is very simple.
However: in practice, there is not an excellent way to do it that I can see. The closest thing I could find is `GENERATED BY DEFAULT AS IDENTITY` which, well, works. However, none of SQLite3, MSSQL, nor MariaDB support this to my knowledge.
This is relentlessly annoying.
Is it the standards fault, or the implementations? I honestly can't say. However, I definitely find this annoying, since I was really hoping that by this time, we'd at least have a nice clean subset of standard SQL you could count on anywhere, for popular database engines. Unfortunately, it's not quite there yet, necessitating ugly hacks to this day.
I assume this new standard doesn't really change anything on this regard, since it's a desync with implementations that is a problem, and it does not seem the standards committee really cares too much about this kind of thing. (I could be wrong, though, as I am saying this based on feel and not evidence.)
---
As @rgbgraph points out below, the price is actually several times that. There are several parts to the standard, and that US$230 is per part.
Asking for a friend, of course.
Does anyone else find value in what's in an ISO standard?
This isn't SQL-specific, but this is 100% the problem for me. There's such a big culture gap between the way that we do things in most of the tech world and ISO, and one of the biggest clashes is this weird $180 PDF thing.
If I want to implement a new standards-compliant HTML parser, I can hop right onto whatwg.org and view the complete standard instantly [0]. It's massive and complicated, but it's freely accessible to anyone interested.
In contrast, if I want to implement an ISO 8601-compliant date parser, ISO wants me to buy their PDF for CHF166 (~$180 USD). This spec is for a standard that is orders of magnitude less complex, and they're charging through the nose for it.
I'm unclear what makes the difference between a standard that can be maintained by a community for the benefit of everyone and a standard that needs to be locked behind a paywall.
The only real way of fixing it is for enough people to ignore ISO so they become irrelevant.
If you are building a new DB engine (toy or not), don't use SQL. Either design a new spec or use something that's more openly specified (maybe GraphQL or EdgeQL).
Ridiculous.
It might be interesting to have a comparison of where major databases stand (or plan to be) with respect to SQL/PGQ
Ridiculous.
I think tax-payer money should pay for standards, because they benefit us all. It is like the highway system, or clean air, and water.
PostgreSQL does mostly aim at conforming to the standard. They will invent new syntax when needed, But compared to the those previously mentioned, Postgres seems to prefer to stick closer to the standard whenever possible, including adding standard syntax for existing features whenever possible.
PostgreSQL does have some places where there is deliberate non-conformance (beyond just incompletely implemented features). They document many deliberate deviations (other than unimplemented or partially implemented features) and if they think they can be fixed in the future or not: https://wiki.postgresql.org/wiki/PostgreSQL_vs_SQL_Standard . Looking at the list I'd say only one especially likely to bite a developer is the default escape character for LIKE clauses, or the non-standard trailing space behavior for character(n) datatypes (but who used fixed length character datatypes instead of varchar or text?). And obviously not yet implemented features could bite people, but many such features are optional to implement anyway, so...
I cannot speak about MySQL or MariaDB, due to insufficient familiarity.
It makes no sense to have a standard SQL when nonsensical implementations like MSSQL or MySQL exist.
My mental model is that it's a mixture, but my life experience has been that a "standard" without a test harness or (at bare minimum) a reference implementation is just a bunch of navel gazing. For SQL specifically, that problem is even worse given the number of existing engines that move faster than the specification, so in the absence of leadership they just make stuff up
Natural language is also a catastrophically horrible specification mechanism, since your black/blue is my white/gold
As a result, for databases to compete on features, they must arbitrarily extend the SQL language standard; these modifications to the language then get backfilled into the standard, and runs headfirst into backwards compatibility, and suddenly no one agrees on really what the feature should precisely be so they it becomes an optional part of the standard, which really just means that it isn't standardized.
In any sane language, you wouldn't need different databases to add specific support for GENERATED BY DEFAULT AS IDENTITY, and especially not for 15 different syntax's used in 15 different databases to specify kinda sorta not really the same thing -- it would simply be a function, one you could write yourself, or provided by the standard library. It wouldn't be up to the RDBMS to offer support beyond actual language features -- it'd just be up to you to update your libraries.
Very good point. Non-composable mean non-scalable (language).
I wonder if it would be possible to define a very small robust standard subset of SQL with a standard for extending it. But that would take away the customer-lock-in which is what all big DB-vendors desire.
SQLite SQL != MySQL SQL != ISO SQL
Not all specs are particularly accessible (e.g. ECMAScript is often hard to follow if you haven’t spent a fair bit of time around it—it’s mostly natural-language code that works with a ECMAScript virtual machine), but most of the time, I would much rather have the actual spec over someone’s digested summary that covers what they thought was important, but regularly lacks details important for my situation. Some specs are absolutely magnificent. The HTML Standard is my very favourite as both a piece of spec work and as reference material.
Seriously, specs are really good stuff and it makes me sad how people often ignore them because they assume they’ll be inscrutable. (Similar remarks apply to legislative texts. They’re normally pretty easy to understand, and you find all kinds of sometimes fun and sometimes useful gems that potted summaries exclude.)
My only issue is with the width. While whitespace between the sides and the centre content is very useful, this isn’t the 1990s anymore with its 1024×768 monitors. You _can_ make the centre column responsive to the overall width of the screen. Doing so can also give you a lot more room to do things, and make it easier to read. Your line-height is already great, and is perfect for text blocks a good half again wider if not twice as wide. Even on my vertical monitors, which are only 1500px wide, that centre column is pretty much claustrophobically narrow.
How we could we add Snowflake to the compatibility matrices?
It's popular enough to deserve it, IMHO.
Which is even more hilarious, considering the ISO is already being funded by the tax dollars of member countries.
A bit like having to pay a journal to get access to research papers: ridiculous.
Trying to throw together a list of them all, I'm only finding 5 though:
• https://www.iso.org/standard/76583.html
• https://www.iso.org/standard/76584.html
• https://www.iso.org/standard/76585.html
• https://www.iso.org/standard/76586.html
• https://www.iso.org/standard/76587.html
Any idea about the others?
"Inclusive and sustainable industrialization, together with innovation and infrastructure, can unleash dynamic and competitive economic forces that generate employment and income"
Not clear how an arbitrary levy as this can contribute to an SDG goal. It is the definition of exclusion. A bright new thing can use the $230 to get several good database books.
Surely ISO has enough corporate stakeholders that can defray the cost of running the standardization process.
ISO defends its own model by arguing that it produces greater independence from vendors, being less reliant on them for funding and thus making it easier for them to say "no" to them. One big difference is ISO is not an industry-specific standards body, it has standards for all kinds of things that have nothing to do with computers – screw threads, metallurgy, analytical chemistry, oil and gas pipelines, you name it. Possibly their defence makes more sense for some of those other industries than it does for ours.
If the major SQL players got together (major proprietary vendors and leading open source projects), they could create their own SQL standards process to supersede the ISO one, and release the standard freely. The ISO standard could still exist, but it could turn into one of these ISO standards where ISO just adopts the text of an existing standards organisation - e.g. the ECMAScript standard is developed by ECMA. Initially ISO republished ECMA's standards under their own number (ISO/IEC 16262), now instead they publish a 3 page standard which just incorporates ECMA's by reference (ISO/IEC 22275). They offer it for free download [0], but are also willing to sell you a copy if you are desperate to give them money. ISO's SQL standard could turn into the same thing – but, that would require the SQL community to decide to push for that, and I'm not sure any of the players feel sufficiently motivated to do it.
[0] https://standards.iso.org/ittf/PubliclyAvailableStandards/in...
Somewhat ironically the standards-body also benefits from such bloat because that is the product they are selling. The bigger the standard the more money they can ask for it.
Finally, for the folks who need professional access to this document, such as RDBMS implementors or professional developers using an RDBMS, the price is a pittance. To anyone for whom the price is a problem - perhaps someone writing a free software application - the lack of access to the standard isn’t a problem, because what they really need is documentation on how their implementations work, and two superb implementations - PostgreSQL and SQLite - have voluminous documentation and they’re 100% free of charge.
The problem is that the standards have to be bought. They should be in the public domain. But someone can still be paying for it (eg a government, or an international organization funded by various public bodies).
While obviously standards don't have virgin birth (some costs are involved) it does smack as rent extraction: By definition there is only one standard so you can't ask for competitive pressures to lower the price.
Now with 23c, they are adopting PGQ and one should be able to interface through the regular JDBC connectors. Will see how it shapes up.
It may be interesting to see if real world usage of PGQ pushes the database engines to do interesting things.
But thanks for the feedback. There is a rework pending anyway.
edit: was referring to scrollaway's comment
I'd guess the incentive structure is the opposite of what you're implying; the same reason why every cordless drill manufacturer has their own battery connector: vendor lock in fuels private planes and shareholder reports, versus being compatible means no one is forced to buy your batteries and thus profits are `$total - $forced_purchases`
This situation gets even worse in the case of a standard without any objective way of knowing one is in compliance. Having a badge on the mysql.com website saying "now featuring SQL:2023 compliance!11" sells how many more support contracts exactly?
• Part 2: https://www.iso.org/standard/76584.html
• Part 3: https://www.iso.org/standard/84803.html
• Part 4: https://www.iso.org/standard/76585.html
• Part 9: https://www.iso.org/standard/84804.html
• Part 10: https://www.iso.org/standard/84805.html
• Part 11: https://www.iso.org/standard/76586.html
• Part 13: https://www.iso.org/standard/84806.html
• Part 14: https://www.iso.org/standard/76587.html
• Part 15: https://www.iso.org/standard/84807.html
• Part 16: https://www.iso.org/standard/79473.html
Wonder if Elsevier is involved somewhere? It's how they do things too.
Having that said: free would be better.
So unless you are writing a database yourself, the DMBS documentation is going to be more relevant.
It doesn't have to be SQL based systems on the other end - the most used connector with Trino is to query files on object storage (S3/GCS/Azure Blob).
Disclaimer: I'm one of the maintainers of the project.
The rule of thumb is that Trino aims to provide a uniform layer over whatever sits underneath. So operations which when "pushded down" to the source result in same results as when executed within Trino do get "pushed down" - i.e. executed on the source. But in cases where the results might differ or it's complex to push-down the operation the operation runs within Trino - i.e. pull data from source (minimum needed data) and then perform operation within Trino.
Note that it's not an all-or-nothing case, e.g. a query like:
SELECT n.name, r.name
FROM postgresql.tpch.nation n
LEFT JOIN postgresql.tpch.region r ON r.regionkey = n.regionkey
WHERE n.name > 'A'
AND r.name = 'ASIA'
will result in the following query to Postgres: SELECT n.name, r.name FROM tpch.nation n LEFT JOIN tpch.region r ON r.regionkey = n.regionkey WHERE r.name = 'ASIA'
The rest of the query (n.name > 'A') would be applied in Trino to results fetched from Postgres because the collation in Postgres will affect results if we push complete query down to Postgres and may not match results when entire query would be processed by Trino. With single data source this is not easily appreciated but e.g. if the second table in the query came from SQL Server then you'd want to have a consistent comparision logic regardless of source of table.This is a very simplified example though.
A problem everbody would love to have but pretty much nobody actually has.
Except the people who do have it and need to keep their business running off of one postgres instance.
For example, at one place just over a year ago they were well into this territory. One of weird problems for them was with pgBadger's memory usage (https://github.com/darold/pgbadger). That's written in perl, which doesn't seem to do garbage collection well. So even on a reporting node with a few hundred GB's of ram, it could take more than 24 hours to do a "monthly" reporting run to analyse PG usage for the time period.
There wasn't a solution in place at the time I left, so they're probably still having the issue... ;)
It's not for scalability reasons it's for high-availability.
Which as cloud adoption has increased and server uptime has decreased is even more important.
Anyone with dogmatic opinions about this stuff need to be taken with a grain of salt. If you scale out PeopleSoft, your accounting system will exceed the value of your company. If you’re worried about webscaling your random app, that’s more wasting time navel gazing than accomplishing anything! :)
Postgres uses foreign data wrappers and table partitions to achieve sharding of relational databases over many nodes. It's certainly possible to make the FDW layer smarter wrt. being able to distribute certain queries to the shards, including some recursive queries that happen to be of practical interest.
But a lot of graph databases have sizes where they are fully replicated on each node and some small internal optimizations for the graph are good enough. Given the design of the graph query language (and what you can query, hint: not everything arbitrarily) there is no reason postgres can't gain some additional optimizations to do that effectively as long as it's not facebook scale.
SQL now having syntax for querying graph means Graph databases can support SQL, at least a subset of it, in a standardized way.
It's usually not that data doesn't fit on one machine but that load on the database exceeds what one machine can serve. A failover configuration might enable you to use the spares for some read operations and take a little load of the primary, but you lose ACID semantics when you do that and it generally doesn't help you for long.
a) At some point you will have more data or more users than one instance can handle. And instead of simply adding another node you need to throttle usage in order to do a rolling upgrade. Which is far easier said than done and involves impact to the business.
b) With distributed databases you are constantly testing that everything works in Dev, Test etc environments. With failover you really only test it every now and again usually before you deploy to Production. And in most companies which are hopeless the testing will be guaranteed to be inadequate.
c) Vendors lie. They promise that failover will just work but in my experience it very often doesn't. Which is another reason why b) is such important to validate their claims.
one of the things I would like to see in my lifetime is somehow it should be easier to "run the whole enterprise" from one box. Sure, it will probably be seriously underpowered and I can't do all things at once but for most small to mid-size companies, it should be possible to run all our "code" from one machine.
I think of this as some kind of development or pre-qa environment. It really shouldn't be that big of an ask...
I am thinking most, if not all, companies will be able to fit their entire enterprise on a Supermicro A1+ server with two 96-core processors. Sure, there is no machine in the world that can fit all of YouTube videos but there is no reason why we can't have YouTube, with a limited set of non-production data, running from just one box. Thoughts?
There's also a good article by Martin Fowler about how "NoSQL" was really "NoDBA" for a lot of folks, and I definitely saw that dynamic. JSON fields can also be a good middle ground here, where a DBA can insure good "structural integrity" of your schema, but you don't need to go through the hassle of adding a new column and schema update if you're just adding some "trivial" bit of data.
Another good example is for generalized classified ads, different categories may have additional details, but you don't necessarily want to create the plethora of tables to store said additional details.
Some other controversial thoughts: SQL itself is a really not-ergonomical query language, and also the lack of any decent Rust-like enum typing is really unfortunate. I know lots of people think that databases aren't for typing, but (1) clearly SQL aspires toward that but gives up half way and (2) that's a shame because they have a lot of potential in that capacity. Also while you can sort of hack together something like sum types / Rust enums, it's a lot of work to do it reasonably well and even then there are gaps.
pg_jsonschema is a postgres extension that implements schema validation for JSON columns. I'm not particularly familiar with Rust, so not sure exactly what you mean by "Rust-like enum typing", but postgres has enums, composite types, array types, and custom scalars, so not sure what's missing.
Assuming you've got good integration test coverage of the database schema alterations end up taking a minuscule amount of time and if you lack test coverage than please reconsider and add more tests.
Sibling comment, "is when you want to store/use additional payment processor details for a transaction", is a great example IMO. I've dealt with card processing systems where the card transaction data can be reams of JSON. Now, to be clear, there are a lot of subfields here that are important that I do pull out as columns, but a lot of them are just extra custom metadata specific to the card network. When I'm syncing data from another API, it's awesome that I can just dump the whole JSON blob in a single field, and then pull out the columns that I need. Even more importantly, by sticking the API object blob in a single field, unchanged, it guarantees that I have the full set of data from the API. If I only had individual columns, I'd be losing that audit trail of the API results, and if, for example, the processor added some fields later, I wouldn't be able to store them without updating my DB, too.
Before JSON columns were really standard, saw lots of cases where people would pull down external APIs into something like mongo, then sync that to a relational DB. Tons of overhead for a worse solution where instead I can just keep the source JSON blob right next to my structured data in postgres.
And there are four major reasons still to choose MongoDB over something like PostgreSQL.
a) PostgreSQL has terrible support for horizontal scalability. Nothing is built-in, proven or supported.
b) MongoDB has superior ability to manipulate and query the JSON.
c) MongoDB is significantly faster for document-attribute updates.
d) MongoDB has better tooling for those of us that prefer to manage our schema in the application layer.
And primary-secondary failover in my experience is rarely without issues.
There is a reason almost every new database aims to be distributed from the beginning.
Sharding has nothing to do with high-availability.
I think that's the mainframe idea. There's probably some interesting philosophical question in there about whether or not a data center is just a big mainframe. It sort of feels like it verges on semantics.
> Sure, there is no machine in the world that can fit all of YouTube videos but there is no reason why we can't have YouTube, with a limited set of non-production data, running from just one box. Thoughts?
I'm not sure what you mean exactly, but if you're streaming YouTube's traffic through one box (even if that box isn't directly connected to the disks on which the video is stored) you'll run into I/O bottlenecks--such a machine would need to push terrabytes per second which is probably not trivial. Moreover, having a single machine that can handle YouTube's peak traffic probably means you're underutilizing it most of the time.
Basically, somewhere I can run something and feel safe knowing this is production code except this one change I have made.
1. compute and memory bandwidth to serve complicated queries 2. IO
You can't scale memory bandwidth beyond some pretty low limit on one machine. You can't scale IO bandwidth beyond some limit. To give you an example, I've seen database servers with 20GB of data being so overloaded by compute requirements of complex queries that they needed to be scaled horizontally.
I mean I expect things to be slower I guess but to test for correctness and spec?
That's partly because you can't compete with the existing RDBMSs if you're single node: they are good enough already. Nobody will buy your database if you don't introduce something more novel than PostgreSQL, whether that novelty is worth it or not.
---
And to respond sibling comment about "noticeable" downtime....
Primary-secondary failover in <1m is very feasible. And each minute downtime is a mere 0.002% for the month.
Primary-secondary isn't what is hurting your availability.
And given that the entire industry has moved to a distributed model despite its complexity gives you a hint as to which way the wind has been blowing for the last decade.
Failover is automatic for PG when using e.g. Patroni. Of course you lose active transactions and that might be a showstopper, but monitoring failover? I am curious when you'll have to do that.