Common Mistakes and Missed Optimization Opportunities in SQL (hakibenita.com)

179 points by haki 6 years ago | 66 comments

Svip 6 years ago |

While counting columns will not include NULL columns, how about counting joined tables?

  SELECT a.id, COUNT(b.*) FROM a JOIN b ON b.a_id = a.id GROUP BY a.id

is not permitted in Postgres.

Sure, I could just use COUNT(b.a_id) since that's what I join on, but a more complicated example might not allow for that. For instance if it was a virtual table.

gnud 6 years ago | |

I'm sorry, if you want to count NULL-rows in b.*, how will it ever be different from just COUNT(*)? Maybe I'm misunderstanding what you're after?

Svip 6 years ago | | |

You're right, it's a bad example. Imagine if I were joining two tables:

  SELECT a.id, COUNT(b.*), COUNT(c.*)
    FROM a JOIN b ON b.a_id = a.id JOIN c ON c.a_id = a.id
    GROUP BY a.id

I want to know how many occurrences a_id has in both table b and c. Again in this simple example, I could just count on b.a_id and c.a_id, respectively, but imagine if b and c were complex virtual tables:

  JOIN (SELECT NULL AS foo, 1 AS bar 
        UNION SELECT 1 AS foo, NULL AS bar) b ON b.foo = a.id OR b.bar = a.id

This would be useful if we are aggregating data together, where essentially, there are two ways to join the data with the main table, and both columns can be null.

Of course, in this example you could count with COUNT(b.foo) + COUNT(b.bar), though that's a bit awkward, or count a column in table b that you know is never null. But what if you don't have one? And still have table c next to it?

Yes, in all cases, there would be a way out. In the extreme case, you could wrap it in a virtual table, where you add a column that is just always 0 (not null), so you can count on it. It would just be neat if b.* was possible.
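The "extreme case" workaround above (add an always-non-null column to the virtual table and count on it) can be sketched roughly as follows. This is only an illustration run through Python's built-in sqlite3 module rather than Postgres. The `marker` column name, the sample ids, and the switch to a LEFT JOIN (so that an unmatched row of `a` shows up with a count of 0) are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER PRIMARY KEY);
    INSERT INTO a (id) VALUES (1), (2);
""")

# The derived table b mirrors the UNION from the comment, plus a constant
# non-null column (hypothetical name: marker) that is always safe to count.
query = """
    SELECT a.id,
           COUNT(b.marker) AS b_rows,   -- counts only rows really joined from b
           COUNT(*)        AS all_rows  -- also counts the NULL-padded row
      FROM a
      LEFT JOIN (SELECT NULL AS foo, 1 AS bar, 0 AS marker
                 UNION
                 SELECT 1 AS foo, NULL AS bar, 0 AS marker) b
        ON b.foo = a.id OR b.bar = a.id
     GROUP BY a.id
     ORDER BY a.id
"""
counts = conn.execute(query).fetchall()
print(counts)
```

For the id with no match in b, counting the marker column gives 0 while COUNT(*) still reports the single NULL-padded row, which is the distinction the thread is circling around.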

andreareina 6 years ago | |

Any reason a regular count(*) wouldn't work?

  SELECT a.id, count(*) FROM ...

Svip 6 years ago | | |

It would. I realised after reading the replies that I should have used the scenario with two joined tables, as per my reply to your sibling.

adamiscool8 6 years ago |

Some of these have been learned through trial and error over the years, but a few were new and great to know.

On a related note, is the MCSE the gold standard for SQL education? Have been looking for a way to brush up and formalize my SQL skills.

Dowwie 6 years ago |

Would someone please confirm whether this article is misrepresenting a subquery as an inline CTE? It is my understanding that as of PostgreSQL 12, a programmer marks a CTE as "AS MATERIALIZED" or "AS NOT MATERIALIZED", or uses neither and lets the default behavior apply: the CTE will be inlined by default if its result is used only once.

for reference: https://sudonull.com/posts/998-Important-changes-in-the-CTE-...

Generally speaking, some clarification would be helpful!

rgharris 6 years ago | |

I think the article and you are correct - the article is worded a little oddly though and ignores the fact that in Postgres 12 CTEs that are referenced multiple times are MATERIALIZED by default.

Before Postgres 12 CTEs were always materialized so you did not get any query optimization benefits of CTEs acting like inline subqueries.

From Postgres 12 onward, CTEs default to NOT MATERIALIZED if they are referenced only once (and are non-recursive and side-effect-free), or MATERIALIZED if referenced more than once. You can override the default with MATERIALIZED or NOT MATERIALIZED when defining the CTE.

Their example is showing that you can let Postgres (before 12) optimize a CTE for you by writing it as an inline subquery instead of a CTE:

  SELECT *
  FROM (
    SELECT *
    FROM sale
  ) AS inlined
  WHERE created_by_id = 1

But with Postgres 12 their "don't" example would result in an index scan without refactoring to the "do" example. Basically their advice on do vs don't applies to before Postgres 12.

https://www.postgresql.org/docs/12/queries-with.html is pretty thorough on this
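For anyone who wants to see the two keywords spelled out, here is a rough sketch of the Postgres 12+ syntax applied to the article's `sale` example. The queries are held as Python strings. As a sanity check only, they are also run against SQLite (which accepts the same hints from version 3.35) to show that both forms return the same rows. The planner behaviour discussed above is specific to Postgres and is not demonstrated here, and the `id` column, the index name, and the sample data are assumptions.

```python
import sqlite3

# Postgres 12+ syntax: allow the planner to inline the CTE into the outer query.
INLINED = """
    WITH cte AS NOT MATERIALIZED (SELECT * FROM sale)
    SELECT * FROM cte WHERE created_by_id = 1 ORDER BY id
"""
# Force the pre-12 behaviour: the CTE is computed separately (optimization fence).
FENCED = """
    WITH cte AS MATERIALIZED (SELECT * FROM sale)
    SELECT * FROM cte WHERE created_by_id = 1 ORDER BY id
"""

def run_both():
    """Run both variants on SQLite if this build understands the hints."""
    if sqlite3.sqlite_version_info < (3, 35, 0):
        return None  # hints not supported by this SQLite build
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE sale (id INTEGER PRIMARY KEY, created_by_id INTEGER);
        CREATE INDEX sale_created_by_ix ON sale (created_by_id);
        INSERT INTO sale (id, created_by_id) VALUES (1, 1), (2, 2), (3, 1);
    """)
    return conn.execute(INLINED).fetchall(), conn.execute(FENCED).fetchall()

result = run_both()
print(result)
```

The result set is identical either way; the keywords only change how the query may be planned, which in Postgres is best checked with EXPLAIN on your own data.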

Dowwie 6 years ago | | |

Thanks for confirming

tempguy9999 6 years ago |

This is a pretty trivial list. Useful for beginners I guess.

I seriously take issue with "Reference Column Position in GROUP BY and ORDER BY" though. If it is restricted to ad-hoc (AKA messing-about) queries I'd be fine with it, but it won't be. Just don't do it.

wfriesen 6 years ago | |

It's especially egregious in the ORDER BY, since there you have the option of using column aliases.

commandlinefan 6 years ago | | |

Are you saying you can't use column aliases in group by? What version of Postgres are you using? I just tried it in 11.5 and it worked:

    # select cust_id as c, sum(avail_balance) as b from account group by c order by b;
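A small sketch of the same query, comparing aliases with column positions. It runs on Python's built-in sqlite3 module rather than Postgres 11.5, and the sample rows are invented for the illustration. Only the table and column names come from the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE account (cust_id INTEGER, avail_balance REAL);
    INSERT INTO account VALUES (1, 100.0), (1, 50.0), (2, 20.0);
""")

# Aliases in GROUP BY and ORDER BY, as in the query above.
by_alias = conn.execute(
    "SELECT cust_id AS c, SUM(avail_balance) AS b "
    "FROM account GROUP BY c ORDER BY b"
).fetchall()

# The positional form that the thread argues against outside ad-hoc queries.
by_position = conn.execute(
    "SELECT cust_id AS c, SUM(avail_balance) AS b "
    "FROM account GROUP BY 1 ORDER BY 2"
).fetchall()
print(by_alias, by_position)
```

Both forms return the same rows here, but the alias version keeps working if someone later reorders the select list.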

tempguy9999 6 years ago | | |

I always, always forget which column aliases I can use where. Thanks for the reminder.

irrational 6 years ago | |

It's useful for people new to Postgres, since many of these things are particular to Postgres.

tempguy9999 6 years ago | | |

Not really. Most of it is pretty much standard SQL (the CTE optimisation fence pre-PG12 being one exception, and there are a couple more, but really it's mostly standard stuff).

godshatter 6 years ago |

I'd never run across coalesce before. I usually end up doing nested NVL calls if I'm trying to find the first non-null in a series of expressions (I'm on Oracle, btw). I've now added this function to my toolbox.

oarabbus_ 6 years ago | |

Coalesce and NVL are synonyms for each other.

godshatter 6 years ago | | |

They don't seem to be in Oracle. Giving more than two parameters to NVL gives me an error, but it works fine with COALESCE. Granted, they are basically the same thing if you give each of them two parameters.
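The difference can be sketched like this. The example uses Python's built-in sqlite3 module, not Oracle, and SQLite has no NVL, so its two-argument IFNULL stands in for it here. That substitution is an assumption made purely for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# COALESCE accepts any number of arguments and returns the first non-NULL one.
first_non_null = conn.execute(
    "SELECT COALESCE(NULL, NULL, 'third')"
).fetchone()[0]

# With a strictly two-argument function the calls have to be nested.
# IFNULL is SQLite's two-argument function, standing in for Oracle's NVL.
nested = conn.execute(
    "SELECT IFNULL(IFNULL(NULL, NULL), 'third')"
).fetchone()[0]
print(first_non_null, nested)
```

Both expressions yield the same value; COALESCE just avoids the nesting as the list of candidates grows.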

Foobar8568 6 years ago |

I would add to the common mistakes (these should be generic, but I have more experience with SQL Server):

Not indexing: most often, tables are not indexed or are poorly indexed.

Implicit conversions can generate a lot of I/O, leading to poor performance or to indexes not being used at all.

SQL functions: sorry, but they are most often crap and useless. It's better to inline them or use a TVF, and no, it's not code logic duplication.

Read uncommitted: avoid it, unless you enjoy missing rows, reading rows multiple times, or reading half of a value (page splits and/or LOB values).

esnard 6 years ago |

In the "Avoid Transformations on Indexed Fields" part, I fail to understand how the example can work if you're applying the timezone computation on the right-hand side.

I'm not familiar with MS SQL (I've only worked with MySQL / PostgreSQL). Can someone explain to me how it works?

moron4hire 6 years ago | |

Your only failure is because it's just wrong. It's about the same as trying to change "if((a + b) > c)" to "if(a > (c + b))". If this weren't time zones, it'd obviously be "if(a > (c - b))", because you have to balance the equation by applying the same operation to both sides. But because this is dealing with timezones, the offset of "b" is different depending on the value of "a", so we won't know what to subtract from "c" to get the right comparison. So the right transformation for this "gotcha" is not even possible.

paulclinger 6 years ago | | |

I think the advice will still work, but you'd need to switch from "named" timezones to numeric offsets, so for example replace `PST` with `-08:00`, and then apply the opposite conversion on the right side (as you and I suggested).

paulclinger 6 years ago | |

I don't think it works the way the author expects it to work, as the math is not correct. Think about the comparison `a+1 < 2`. To remove the +1, you need to change it to `a < 2-1`, not to `a < 2+1`; the operation needs to be transformed to the opposite one, which in this case would imply shifting the timezone in the opposite direction.

If you are asking about the timezone shift applied to a date, I think the engine converts the date to 00:00:00 timestamp and then does the timezone conversion.

jlarocco 6 years ago | |

I think the advice is correct, but the examples are not. When the transformation switches to the other side of the comparison it has to be inverted.
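The point about inverting the operation can be checked with a toy example. This sketch uses Python's built-in sqlite3 module and a fixed numeric offset of 8 hours, in the spirit of the `-08:00` suggestion above. It does not cover named timezones, whose offset varies with the date. The `event` table, its `hour` column, and the values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE event (id INTEGER PRIMARY KEY, hour INTEGER);
    CREATE INDEX event_hour_ix ON event (hour);
    INSERT INTO event (hour) VALUES (0), (5), (10), (15), (20);
""")

def ids(where):
    """Return the ids matching a WHERE clause, in id order."""
    sql = f"SELECT id FROM event WHERE {where} ORDER BY id"
    return [row[0] for row in conn.execute(sql)]

original = ids("hour + 8 < 20")      # transformation applied to the indexed column
inverted = ids("hour < 20 - 8")      # same filter, column left bare
not_inverted = ids("hour < 20 + 8")  # offset moved across without flipping its sign
print(original, inverted, not_inverted)
```

The inverted form matches the original filter while leaving the indexed column untouched; moving the offset across without flipping its sign silently selects a different set of rows.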

irrational 6 years ago |

With regard to formatting SQL, I used to do it the way shown, but a coworker formatted the columns in the select with the commas in front. This seemed strange to me until I tried it. I realized that it solved two problems: sometimes a query would be changed and the last item in the select list removed, but the now-trailing comma left behind; or a new item would be added to the end of the select list without a comma being added after the previous last item.

  SELECT
    col1
    ,col2
    ,COUNT(col3)
  FROM
    t1
    JOIN t2 ON t1.pk = t2.fk
  WHERE
    col1 = col2
    AND col3 > col4
  GROUP BY
    col1
    ,col2
  HAVING
    COUNT(col3) > 1

monkeycantype 6 years ago |

I wish I could use ON for the selection criteria for the first table instead of a where clause:

  Select A.value, B.value
  from tableA A on A.id = 77
  join tableB B on B.id = A.bId

kbenson 6 years ago |

> 2019-22-11: Fixed the examples in the "Faux Predicate" section after several keen eyed readers noticed it was backwards.

What abomination of a date format is this? I can only assume this is a bug, a typo, or an easter egg for those paying attention. Please let it be one of those. The last thing the world needs is people pushing yet another crazy date format into use.

jandrese 6 years ago | |

It looks like a typo. Hebrew date style is yyyymmdd[1].

[1] https://www.ibm.com/support/knowledgecenter/en/SSS28S_8.1.0/...

gigatexal 6 years ago |

Edit: “Don’t use an ORM” should be point 1

ars 6 years ago | |

The opposite. Point 1 should be don't use an ORM unless you don't know SQL. But you should know SQL so don't use an ORM.

An ORM only works until the point where you need to join tables. As soon as that's needed the ORM just causes you endless trouble.

gigatexal 6 years ago | | |

Edited, I meant don’t use one.