Common Mistakes and Missed Optimization Opportunities in SQL(hakibenita.com) |
Common Mistakes and Missed Optimization Opportunities in SQL(hakibenita.com) |
SELECT a.id, COUNT(b.*) FROM a JOIN b ON b.a_id = a.id GROUP BY a.id
is not permitted in Postgres.Sure, I could just use COUNT(b.a_id) since that's what I join on, but a more complicated example might not allow for that. For instance if it was a virtual table.
SELECT a.id, COUNT(b.*), COUNT(c.*)
FROM a JOIN b ON b.a_id = a.id JOIN c ON c.a_id = a.id
GROUP BY a.id
I want to know how many occurrences a_id has in both table b and c. Again in this simple example, I could just count on b.a_id and c.a_id, respectively, but imagine if b and c were complex virtual tables: JOIN (SELECT NULL AS foo, 1 AS bar
UNION SELECT 1 AS foo, NULL AS bar) b ON b.foo = a.id OR b.bar = a.id
This would be useful if we are aggregating data together, where essentially, there are two ways to join the data with the main table, and both columns can be null.Of course, in this example, you could count by going COUNT(b.foo) + COUNT(b.bar), but that's a bit awkward, or a column in table b you know to never be null. But what if you don't? And still have table c next to it?
Yes, in all cases, there would be a way out. In the extreme case, you could wrap it in a virtual table, where you add a column that is just always 0 (not null), so you can count on it. It would just be neat if b.* was possible.
SELECT a.id, count() FROM ...
On a related note, is the MCSE the gold standard for SQL education? Have been looking for a way to brush up and formalize my SQL skills.
for reference: https://sudonull.com/posts/998-Important-changes-in-the-CTE-...
Generally speaking, some clarification would be helpful!
Before Postgres 12 CTEs were always materialized so you did not get any query optimization benefits of CTEs acting like inline subqueries.
After Postgres 12 all CTEs default to NOT MATERIALIZED if only referenced once or MATERIALIZED if referenced more than once. You can override via MATERIALIZED or NOT MATERIALIZED when defining the CTE.
Their example is showing that you can let Postgres (before 12) optimize a CTE for you by writing it as an inline subquery instead of a CTE:
SELECT *
FROM (
SELECT *
FROM sale
) AS inlined
WHERE created_by_id = 1
But with Postgres 12 their "don't" example would result in an index scan without refactoring to the "do" example. Basically their advice on do vs don't applies to before Postgres 12.https://www.postgresql.org/docs/12/queries-with.html is pretty thorough on this
I seriously take issue with "Reference Column Position in GROUP BY and ORDER BY" though. If it is restricted to ad-hoc (AKA messing-about) queries I'd be fine with it, but it won't be. Just don't do it.
# select cust_id as c, sum(avail_balance) as b from account group by c order by b;not indexing, most often, tables are not or poorly indexed.
Implicit conversion can generate a lot of io/leads to poor perf or just not using indexes.
Sql function:sorry but they are most often crap and useless, better to in-line or use TVF, and no its not code logic duplication.
Read uncommitted unless you enjoy not reading rows, multiple times or half of a value (page split and/or LOB values)
I'm not familiar with MS SQL (I've only worked with MySQL / PostgreSQL), can someone explain me how it works?
If you are asking about the timezone shift applied to a date, I think the engine converts the date to 00:00:00 timestamp and then does the timezone conversion.
SELECT
col1
,col2
,COUNT(col3)
FROM t1
JOIN t2 ON ta.pk = t2.fk
WHERE col1 = col2
AND col3 > col4
GROUP BY col1
,col2
HAVING COUNT(col3) > 1Select A.value, B.valuue
from tableA A on A.id = 77
join tableB B on B.id = A.bId
What abomination of a date format is this? I can only assume this is a bug, a typo, or an easter egg for those paying attention. Please let it be one of those. The last thing the world needs is people pushing yet another crazy date format into use.
[1] https://www.ibm.com/support/knowledgecenter/en/SSS28S_8.1.0/...
An ORM only works until the point where you need to join tables. As soon as that's needed the ORM just causes you endless trouble.
Functions have helped me tremendously in SQL server, but you do have to know the issues, which can be taken advantage of to some degree.
Code reuse is the obvious use case, but due to lack of inlining up to SQL Server 2019 meant you could reduce performance compared to hand inlined case statement or whatever. Hopefully now that functions can be inlined in 2019 this will be a non issue going forward
They are an optimization barrier which can be a good thing. I have used to this my advantage to stabilize tricky queries that where using views for code reuse. The performance becomes consistent and predictable rather than going pathological on some databases even though it may be slightly slower on others.
But I agree that implicit conversion is the root of a lot of evil
Please elaborate.
The first time I experienced the lob issue was in 2009. At that time, I didn't know how to really reproduce the issue due to my lack of knowledge (which triggered also deep diving will to sql server internals _not my article,nor his but more generally Paul White /sql kiwi has published a load of great articles on internals, maybe some of the best technical articles I have read. )
This "fix" only helps with the case where new items are inserted at the end. By using alphabetic ordering you increase the chances that a new item will be inserted at the beginning or the middle, in which case it makes no difference where you put the commas.
It only really helps at all because inserting new items at the end is common, whereas removing items from the start is rare, at least in SQL (all it really does is displace the dangling comma problem from the former case to the latter). However, always inserting new items at the end tends to lead to unintuitive ordering, which in turn leads to additional VCS churn when that becomes seen as technical debt.
Commas at the start does give objectively better readability in one sense: it ensures they are aligned vertically. That makes it easier to spot errors at a glance. You might call this "ease of formal reviewability".
However, in practice it seems to be worse for general readability, for the entirely subjective reason you already pointed to. Since code is typically read more often than it is written, it's important to optimise for that first.
Since an error here will always cause a hard fail, it falls into a different category than say, omitting braces in a C-style if statement, which can introduce subtle bugs through bad merges. In the latter case, ease of formal reviewability has to take precedence over subjective aesthetics.
WHERE foo
AND bar
Or with long, complex ones: WHERE
foo
AND
barI did not learn SQL this way. I was shown these tricks and their adoption was near immediate BECAUSE it just made more sense to my brain.
This isn't to devalue/argue against your comment. I just find it interesting that the common denominator for OCD-ish compulsions is that the "right" way for an individual OCD brain is usually specific to that individual brain.
,first=conditionI find this approach allows me to write smaller procs that become reusable in different contexts, effectively composable.
Be careful that for over a thousand records table variables perform poorly, so I usually dump the data into a temp table inside the proc, or a permanent table with a key unique to this execution of the proc, this could be the SPID
Basically they are just very convenient compared to temp tables (they are basically function / procedure scoped temp tables rather than connection scoped) and yes allow you to go for an imperative execution you can count on. They also allow you to share results across statements not just the same statement like a CTE.
I haven't used Postgres very much but SQL's declarative style is great until it isn't. When the optimizer is being dumb its real nice to have the tools to go imperative and just tell it what to in what order like you know a normal programming language.
Maybe its because many SQL server programmers are also .Net programmers and are used to just throwing results in a list or dictionary to process data. This is partially why I use them, I let SQL do it thing, but when it doesn't I can just stuff it into a table variable and do the next step with a separate isolated statement or cursor or whatever thing your not supposed to do in SQL to actually get the job done.
This goes for query hints too, which seems to be a big no no in the PG world.
T.first_name = "Ender" AND
T.last_name = "Wiggins"
versus
WHERE
T.first_name = "Ender"
AND T.last_name = "Wiggins"
you could also do
WHERE 1=1
AND T.first_name = "Ender"
AND T.last_name = "Wiggins"
I will now use that to easily remove filters.
An extension for the SELECT based on the very insightful top comment could be
SELECT NULL
,something1
,something2
This is frequently unavoidable, though. Or more precisely: it could be avoided with a sane database design, but the databases on which I have to work for my day job are the precise opposite of "well-designed", so grouping on complex expressions is unfortunately an inevitability.
It's (IIRC!) something to do with the situation of
select x + y as x
from ...
group by x
which x are we talking about? (Logically that example is crap because only the alias x makes sense, but something like that anyway).You're grouping and counting after the three way join. That join will involve all combinations of child records between the two child tables associated with any given parent row (almost never what is wanted). So any given non-null thing you're counting from one child record will appear multiple times, = the number of child records in the other table associated with the parent row.
I think you just want to use correlated subqueries to count the child records: select a.id, (select count(whatever) from child1 c1 where c1.a_id = a.id), (select count(whatever) from child2 c2 where c2.a_id = a.id) ...
TLDR: You almost never want to join independent children to a common parent, use independent correlated subquery expressions instead.
For little data, that's probably fine, but for a big database, it will be slow. However, the optimiser may be able to handle that? I know Sybase's and MSSQL's had trouble with it, but I've heard Postgres' might be able to.
If there's an inner join, then there is a matching row. It will be counted by a normal COUNT( * ). On an outer join, columns in a 'missing' row will be represented as NULL.
You're saying you wish there were two NULLS, 'missing value in table null' and 'missing row in join null', and that you could count the first one?