Things to know about databases

Things to know about databases (architecturenotes.co)

730 points by grech 4 years ago | 241 comments

This article is informative. I have found that databases in general tend to be less sexy than the front-end apps...especially with the recent cohort of devs. As an old bastard, I would pass on one thing: Realize that any reasonably used database will likely outlast the applications leveraging it. This is especially true the bigger it gets, and the longer it stays in production. That said, if you are influencing the design of a database, imagine years later what someone looking at it might want to know if having to rip all the data out into some other store. Having migrated many legacy systems, I tend to sleep better when I know the data is well-structured and easy to normalize. In those cases, I really don't care so much about the apps. If I can sort out (haha) the data, I worry less about the new apps I need to design. I have been known to bury documentation into for-purpose tables...that way I know that info won't be lost. Export the schema regularly, version it, check it in somewhere. And, if you can, please, limit the use of anything that can hold a NULL. Not every RDBMS handles NULL the same way. Big old databases live a looooong time.
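As a rough illustration of the "export the schema, version it" and "bury documentation in a table" advice above, here is a minimal sketch using Python's built-in sqlite3 module. The `customer` table, the index, and the `schema_notes` documentation table are made-up names for the example, and an in-memory database stands in for a real one. Other RDBMSs ship their own schema dump tools, so treat this as one possible shape rather than a recipe.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for a real database file

# Hypothetical application table and index.
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("CREATE INDEX idx_customer_name ON customer (name)")

# Hypothetical "for-purpose" table that keeps documentation next to the data.
conn.execute("CREATE TABLE schema_notes (object_name TEXT NOT NULL, note TEXT NOT NULL)")
conn.execute(
    "INSERT INTO schema_notes VALUES (?, ?)",
    ("customer.name", "Legal name as printed on invoices"),
)

# sqlite_master stores the CREATE statement for every table and index.
rows = conn.execute(
    "SELECT sql FROM sqlite_master WHERE sql IS NOT NULL ORDER BY type DESC, name"
)
schema_sql = "\n".join(sql + ";" for (sql,) in rows)

# In practice this text would be written to a file and checked in somewhere.
print(schema_sql)
```

Because the notes live in an ordinary table, they travel with every backup and export of the data itself.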

mmcnl 4 years ago | |

"Show me your flowchart and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won't usually need your flowchart; it'll be obvious." -- Fred Brooks, The Mythical Man Month (1975)

Aeolun 4 years ago | | |

This man has clearly never seen our database schema.

Show me either flowcharts and/or tables, it doesn’t matter, I’ll continue to be mystified.

SulphurSmell 4 years ago | | |

This is going on my wall. Thanks so much.

emerongi 4 years ago | |

> Realize that any reasonably used database will likely outlast the applications leveraging it.

I love this statement. It's true too, having seen a decades-old database that needed to be converted to Postgres. The old application was going to be thrown away, but the data was still relevant :).

evilduck 4 years ago | | |

About a decade ago I worked for an insurance company. It was an offshoot that was spun out of another insurance company from another state, which itself was decades old. As best as I could infer from my vantage point, my expertise at the time, and the spare time I was willing to investigate the matter, the database schema and a good chunk of the core data tables were first created in the late-80s on a mainframe and had outlived 4 or 5 application rewrites and (at least) two SQL variant migrations. I'm hand-waving exact details because nobody from the original company or that time period was still around even prior to the corporate split and so there was nobody who could answer history questions in detail, but that's also a testament to how persistent data can be. There was one developer from the parent company they slapped with golden handcuffs who knew where most of the bodies were hidden in that software stack, which enabled decent productivity, but even she was lacking a solid 15 years of first-hand experience of its inception. To the best of my knowledge that database is still in use today.

Databases in heavy use will not just outlast your application, they have a strong chance of outlasting your career and they very well may outlast you as a person.

Yhippa 4 years ago | | |

I think this is and will continue to be a common use case. I'm very thankful, for these applications, that the data was still stuck in a crusty old relational database for me to work on top of as I built a new application.

It's going to be interesting when this same problem occurs years from now when people are trying to reverse schemas from NoSQL databases or if they become difficult to extract.

The only sticking point is when business logic is put into stored procedures. On one hand if you're building an app on top of it, there's a temptation to extract and optimize that logic in your new back-end. On the other hand, it is kind of nice to even have it at all should the legacy app go poof.

irrational 4 years ago | |

The NULL issue is so true. We migrated a large database from Oracle to Postgres. It took 2 years. By far and away the biggest issue was rewriting queries to account for the (correct) way Postgres handles NULLs versus how Oracle does it.

Also, in my experience, the database is almost always the main cause of any performance issues. I would much rather hire someone who is very good at making the database perform well than making the front end perform well. If you are seeking to be a full stack developer, devote much more time to the database layer than anything else.

SulphurSmell 4 years ago | | |

>the database is almost always the main cause of any performance issues

I would be careful with the term "cause". There is a symbiotic relationship between the application and the database. Or, if talking to a DBA...a database and its applications. Most databases can store any sets of arbitrary information...but how they are stored (read: structure) must take into account how the data is to be used. When the database designer is told up-front (by the app dev team) how the data will be used, considerations can be made to optimize performance along whatever vector is most desired (e.g. read speed, write speed, consistency, concurrency, etc). Most database performance issues result when these considerations are left out. Related: Just because a query works (i.e. returns the right data) does not mean it's the best query.

Aeolun 4 years ago | | |

It’s like. If the database doesn’t perform well, nothing else performs well either.

If your database is great, at least you have the option of a fast backend.

nijave 4 years ago | | |

>Also, in my experience, the database is almost always the main cause of any performance issues

More generically, state stores are almost always bottlenecks (they tend to be harder to scale without some tradeoff)

fipar 4 years ago | |

> As an old bastard, I would pass on one thing: Realize that any reasonably used database will likely outlast the applications leveraging it.

I’ve been working with and on databases for a long, long time, and I’ve even written about things I think people should know about if they want to do this, yet I never came up with such great insight. This is so true it should be engraved somewhere. Hats off!

SulphurSmell 4 years ago | | |

Thanks. Scar tissue sometimes breeds insight. In further conversation on this phenomenon, I would argue that "long lived databases" are not so as a result of brilliant design. Rather, it happens because the database itself is neglected and largely misunderstood, and gets less investment. And they live on and on...managers come and go...no investment. And then, years later, some poor bastard is stuck with a hideous mess that can't go anywhere. Don't let this happen to you.

hodgesrm 4 years ago | |

The article left out one of the most fundamental topics of databases--clustering of data in storage is everything. Examples:

1. If you store data in rows it's quite fast to insert/update/delete individual rows. Moreover, it's easy to do it concurrently. However, reads can be very slow because you read the entire table even if you scan a single column. That's why OLAP databases use column storage.

2. If you keep data in the table sorted as you insert it, reading ranges based on the sort key(s) is very fast. On the other hand, inserts may spray data over the entire table, (eventually) forcing writes to all blocks, which is very slow. That's why many OLTP databases use heap (unsorted) row organization.

In small databases you don't notice the differences, but they become dominant as volume increases. I believe this fact alone explains a lot of the proliferation of DBMS types as enterprise datasets have grown larger.
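To make the row-versus-column point concrete, here is a toy Python model of the same table in both layouts. It is only a sketch: "values touched" is a made-up stand-in for I/O cost, and real engines work in pages and blocks, not Python lists. The table contents and function names are invented for the example.

```python
# The same 3-column table, stored two ways.
rows = [(i, f"user{i}", i * 10) for i in range(1000)]  # row store: one tuple per row

columns = {  # column store: one list per column
    "id": [r[0] for r in rows],
    "name": [r[1] for r in rows],
    "amount": [r[2] for r in rows],
}

def values_touched_row_scan(table):
    # Reading one column from a row store still drags every field of every row along.
    return sum(len(row) for row in table)

def values_touched_column_scan(table, name):
    # A column store reads only the column being asked for.
    return len(table[name])

row_cost = values_touched_row_scan(rows)
col_cost = values_touched_column_scan(columns, "amount")

# Both layouts give the same answer; only the amount of data read differs.
total_from_rows = sum(r[2] for r in rows)
total_from_columns = sum(columns["amount"])

print(row_cost, col_cost, total_from_rows == total_from_columns)
```

The flip side is visible in the same model: inserting one row appends a single tuple in the row layout, but has to touch every list in the column layout.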

Edit: minor clarification

beckingz 4 years ago | |

I heard about Flywaydb today, which appears to be an open source database versioning tool. Pretty interesting! https://flywaydb.org/

vladsanchez 4 years ago | | |

Pretty open-source, until you need "premium" features like "rollback" :/ (headwall)

zippergz 4 years ago | |

This is one reason that ORMs which wish to own the database schema make me uncomfortable. How much fun is that schema going to be years down the road when that ORM is out of fashion, but you still need an app working with that data? Some are better than others at doing things in a sane way, of course.

Aeolun 4 years ago | | |

Doesn’t really matter though? Even if the ORM is changed, the actual schema is still in the database.

I’ve migrated ORM several times, and the only thing that changes is the entity definition. The database remains the same.

motogpjimbo 4 years ago | | |

Or even worse, when the ORM is written in a programming language your organisation no longer uses and is part of a codebase that is no longer under development.

YetAnotherNick 4 years ago | |

> I have found that databases in general tend to be less sexy than the front-end apps

I don't know if there is a single soul who believes this. If you are designing a database, it is much cooler than front end apps.

SulphurSmell 4 years ago | | |

I think they are wonderful (from the Codd and Date days...) but mostly everyone else disagrees.

vbezhenar 4 years ago | |

I agree. I have some kind of design hierarchy. Database -> Architecture -> Services for outside consumers -> Backend -> Frontend. Things coming first must be designed more thoroughly as they're likely to live longer. Proper database design is paramount. Spend as much time as necessary. Iterate before going live as long as necessary to ensure that design is sound. Because it's so much harder to change database later. Trivial changes often require huge efforts.

bitexploder 4 years ago | | |

Rob Pike’s 5 rules:

https://users.ece.utexas.edu/~adnan/pike.html

Rule 5. Data dominates. If you've chosen the right data structures and organized things well, the algorithms will almost always be self-evident. Data structures, not algorithms, are central to programming.

Jemaclus 4 years ago | |

> And, if you can, please, limit the use of anything that can hold a NULL.

I'm curious: what's the alternative to NULL? I'm struggling to think of a database where NULL wouldn't be super useful. It feels like NULL as a concept is almost required, but I think you're suggesting that's a faulty assumption.

Would love to hear more about this.

goto11 4 years ago | | |

The article probably means: Define anything as non-nullable which can be non-nullable. Unfortunately SQL defaults to nullable, so there is a tendency to define too many columns as nullable. Normalization can also reduce the need for nullable columns in base tables (but you will get them back if you perform an outer join, so it is not a panacea).

But if a column truly has unknown values, NULLs are the best way to represent it. It is sometimes suggested to use "sentinel values" like empty string or -1 to represent missing values, but IMHO this is much worse than NULLs, since these will be treated as regular values by operators. When you have missing values, you want three-valued logic.
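A small sketch of why sentinels are treated "as regular values by operators", using Python's built-in sqlite3 module. The table names and the choice of -1 as the sentinel are assumptions made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Same three readings; the middle one is missing.
conn.execute("CREATE TABLE readings_null (temp REAL)")               # missing -> NULL
conn.execute("CREATE TABLE readings_sentinel (temp REAL NOT NULL)")  # missing -> -1
conn.executemany("INSERT INTO readings_null VALUES (?)", [(20.0,), (None,), (30.0,)])
conn.executemany("INSERT INTO readings_sentinel VALUES (?)", [(20.0,), (-1.0,), (30.0,)])

# AVG skips NULLs, but happily averages the sentinel as if it were a real reading.
avg_null = conn.execute("SELECT AVG(temp) FROM readings_null").fetchone()[0]
avg_sentinel = conn.execute("SELECT AVG(temp) FROM readings_sentinel").fetchone()[0]

print(avg_null)      # 25.0
print(avg_sentinel)  # (20 - 1 + 30) / 3, silently skewed by the sentinel
```

With the sentinel, every query has to remember to filter out -1, and nothing in the schema reminds the author to do it.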

layer8 4 years ago | | |

See here: https://stackoverflow.com/a/4358687

SulphurSmell 4 years ago | | |

Notice I said "limit", and not "eliminate". The concept of NULLs in an RDBMS has been discussed and argued for decades. Three-valued logic is generally not well understood, and as such, it's usually skipped over. Binary logic is easy...0 or 1. It's there, or not. OFF/ON. Three-valued logic introduces a third state: "unknown". It really means that there is no meaningful answer...not yet, anyway.

In simple terms, NULL in an RDBMS is dangerously often equated to 0 (zero) in numeric fields, or "space" in character fields. Neither is true. On occasion, these might behave as such...but you are playing with fire here. What's worse, you can't compare a NULL to a NULL: NULL = NULL is not true (it evaluates to unknown). Which is why you often see the "IS NULL" operator used in DML for such things. What it boils down to is that your applications need to pay careful attention when digging around (read: joining) tables with NULLs. Additional code logic is often required to ensure that things work the way you expect them to when NULLs are involved. Formal primary keys cannot be NULL (this is enforced by the RDBMS) but it does not stop ad-hoc clever queries from including NULL columns as part of the "where..." clause.

So what to do? You can tell your DBA to ensure that all columns are NOT NULL. This really tightens things down, and makes some operations a bit more sane. However, if a column value is actually not known (yet!) then one is forced to populate it with data that may not be correct/relevant. These are often called "sentinel" values and can cause a mess of their own. There are use cases where an RDBMS schema with everything as NOT NULL can make sense. In my experience, databases whose data is never (directly) seen/input by actual people can work. When a human sees a field with "placeholder value" instead of just blank space...it is uncomfortable. My advice is to really understand why something might be NULL, and don't blindly add a mess of columns to a table as NULL because it's easy.

Remember, that shit will live forever. Google around for "three valued logic" and start down the rabbit hole. Long-term (think: migrating from one RDBMS impl to another) you will absolutely find that NULLs don't behave the same. Various operations may or may not be consistent from one to another...and this will break your apps. The key (haha) relationships modeled in your schema...if you strip all the unimportant stuff away...should avoid NULL. The flip side of this is to do a code scan (app side) and search for "IS NULL" and "IS NOT NULL" in the embedded SQL. Especially when there are a lot of "and ____ IS NOT NULL and ___ IS NOT NULL" and so forth. This will indicate those parts of the database that are "hot spots" for NULL issues. I have seen SQL where 80% of the DML is taken up with NULL handling of some kind.
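The three-valued logic described above can be poked at directly. This is a small sketch using Python's built-in sqlite3 module; the table `t` and its contents are invented for the example, and other engines may differ in the details, which is rather the point of the comment.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Comparing NULL with "=" yields unknown, which comes back to Python as None.
eq_result = conn.execute("SELECT NULL = NULL").fetchone()[0]
# IS NULL is the operator that actually answers the question.
is_result = conn.execute("SELECT NULL IS NULL").fetchone()[0]

conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (None,)])

# "= NULL" matches nothing, because unknown is not true.
matched_eq = conn.execute("SELECT COUNT(*) FROM t WHERE x = NULL").fetchone()[0]
matched_is = conn.execute("SELECT COUNT(*) FROM t WHERE x IS NULL").fetchone()[0]

# The NULL row also drops out of a "not equal" filter, which surprises people.
not_one = conn.execute("SELECT COUNT(*) FROM t WHERE x <> 1").fetchone()[0]

print(eq_result, is_result, matched_eq, matched_is, not_one)
```

The last query is the kind that tends to need the extra "OR x IS NULL" handling mentioned above.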

Akronymus 4 years ago | |

I personally strive towards a database design as if the next person knows my address and has anger and control issues.

Fixing up a database step by step is a painful process.

yla92 4 years ago |

Great post. Also highly recommend Designing Data-Intensive Applications by Martin Kleppmann (https://www.amazon.com/Designing-Data-Intensive-Applications...). The sections on "Storage and Retrieval", "Replication", "Partitioning" and "Transactions" really opened up my eyes!

itsmemattchung 4 years ago | |

Second this.

I really like how he (Martin Kleppmann) in the book starts with a primitive data structure for constructing a database design, and then evolves the system slowly and describes the various trade-offs with building a database from the ground up.

lysecret 4 years ago | |

Absolutely loved the book. Can someone recommend similar books?

dangets 4 years ago | | |

I have not read it personally, but I've seen 'How Query Engines Work' highly recommended several times before. I have a procrastinatory tab open to check it out some day.

https://leanpub.com/how-query-engines-work

avinassh 4 years ago | | |

Database Internals is also pretty good.

pixelmonkey 4 years ago | | |

There is a quite-nice interactive browser dataviz here that shows you books similar to the themes, categories, and topics discussed in DDIA:

https://anvaka.github.io/greview/ddia/1/

wombatpm 4 years ago | | |

Database Design for Mere Mortals by Michael J. Hernandez

tiffanyh 4 years ago |

#1 thing you should know, RDBMS can solve pretty much every data storage/retrieval problem you have.

If you're choosing something other than an RDBMS - you should rethink why.

Because unless you're at massive scale (which still doesn't justify it), choosing something else is rarely the right decision.

Merad 4 years ago |

> a dirty read occurs when you perform a read, and another transaction updates the same row but doesn't commit the work, you perform another read, and you can access the uncommitted (dirty) value

It's even worse than this with MS SQL Server. When using the READ UNCOMMITTED isolation level it's actually possible to read corrupted data, e.g. you might read a string while it's being updated, so the result row you get contains a mix of the old value and new value of the column. SQL Server essentially does the "we got a badass over here" Neil deGrasse Tyson meme and throws data at you as fast as it can. Unfortunately I've worked on several projects where someone apparently thought that READ UNCOMMITTED was a magic "go fast" button for SQL and used it all throughout the app.

jiggawatts 4 years ago | |

I really wish SERIALIZABLE was the default transaction isolation level and anything lower was opt in… with warnings.

hodgesrm 4 years ago | | |

SERIALIZABLE is ridiculously slow if you have any level of concurrency in your app. READ COMMITTED is a reasonable default in general. The behavior GP is describing sounds like an out and out bug.

Dirty reads incidentally weren't supported for quite some time in the Sybase architecture (which forked to MS SQL Server in 1992). There was a Sybase effort to add dirty read support around 1995 or so. The project name was "Lolita."

AtNightWeCode 4 years ago |

Not sure how to use these recommendations in practice though even if the info is somewhat correct. SQL is a beast of tech and it is used because of battle history and since there is simply no other viable tech replacing it when it comes to transactions and aggregated queries.

Indexes are a nightmare to get right. Often performance optimizations of SQL databases include removing indexes as much as adding indexes.

larrik 4 years ago | |

Indexes aren't a "make my DB faster" magic wand. They have benefits and costs.

If you are seeing performance gains from removing indexes, then I'm assuming your workload is very heavy on writes/updates compared to reads.
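One way to see the read-side benefit is to ask the database for its query plan before and after adding an index. The sketch below uses Python's built-in sqlite3 module; the `orders` table, its data, and the index name are made up, and the exact plan text varies between SQLite versions, so the code only looks for the index name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 100, float(i)) for i in range(10_000)],
)

def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row is the human-readable detail.
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT total FROM orders WHERE customer_id = 42"

plan_before = plan(query)  # no usable index: a full scan of the table
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = plan(query)   # the planner can now search via the new index

print(plan_before)
print(plan_after)
```

The cost side does not show up in the plan: every INSERT, UPDATE, or DELETE on `orders` now has to maintain `idx_orders_customer` as well, which is why write-heavy workloads can get faster when unused indexes are dropped.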

dspillett 4 years ago | | |

Too many indexes can cause significant performance problems if RAM is short. If the indexes are actually used (rather than sitting idle on disk because other indexes are better choices for all your applications' typical queries) then they will “compete” for memory potentially causing a cache thrashing situation.

But yes, the issue with too many indexes is more often that they harm write performance.

A related issue is indexes that are too wide, either covering many columns or “including” them. As well as eating disk space they also eat extra memory (and potentially cause extra IO load) when used (fewer rows per page, so more pages loaded into RAM for the same query).

Both problems together, too many indexes many of which are too wide, usually comes from blindly accepting recommendations from automated tools (particularly when they are right that there is a problem, and it is a problem that a given index may solve, but fixing the queries so existing indexes are useful could have a much greater effect than adding the indexes).

AtNightWeCode 4 years ago | | |

Mostly because of overlapping indexes. Then if there are include columns it may get out of hand. Not too difficult to achieve. Just blindly follow recommendations from a tool or a cloud service.

roflyear 4 years ago | | |

Or you're using MySQL ;)

vorpalhex 4 years ago | |

It's not that SQL is all that beastly, it's that most tutorials fail to explain the internals and basics and so you just see all these features and interfaces of the system and can't build a mental model of how the system works.

AtNightWeCode 4 years ago | | |

Well, SQL does come with liberties. I worked with expensive commercial software that destroys the performance of databases by doing everything from complicated ad hoc queries to massive amounts of point reads.

donatj 4 years ago |

I still think about my first job out of college. Shopping cart application, we would add indexes exclusively when there was a problem rather than proactively based on expected usage patterns. It's genuinely a testament to MySQL that we got as far as we did without knowing anything about what we were doing.

One of my most popular StackOverflow questions to this day is about how to handle one million rows in a single MySQL table (shudder).

The product I work on now collects more rows than that a day in a number of tables.

mjb 4 years ago |

Introductory material is always welcome, but I suspect this isn't going to hit the target for most people. For example:

> Therefore, if the price isn’t an issue, SSDs are a better option — especially since modern SSDs are just about as reliable as HDDs

This needs a tiny extra bit of detail: if you're buying random IO (IOPS) or throughput (MB/s), SSDs are significantly (orders of magnitude!) cheaper than HDDs. HDDs are only cheaper on space, and only if your need for throughput or IO doesn't cause you to "strand" space.

> Consistency can be understood after a successful write, update, or delete of a row. Any read request immediately receives the latest value of the row.

This isn't the ACID definition of C, and is closer to the distributed systems (CAP) one. I can't fault the article for getting this wrong, though - it's super confusing!

googletron 4 years ago | |

You are absolutely right about the C being more in line with the CAP one.

I have a post in draft to discuss disk trade-offs which digs into this aspect; it's impossible to dig into everything in a post at this level.

thedougd 4 years ago |

I have to plug the "Designing Data-Intensive Applications" book. It dives deep into the inner workings of various database architectures.

https://dataintensive.net/

wrs 4 years ago |

From the SERIALIZABLE explanation: “The database runs the queries one by one … It is essential to have some retry mechanism since queries can fail.”

I know they’re trying to simplify, but this is confusing. If the first part is true, the second part can’t be. In reality the database does execute the queries concurrently, but will try to make it seem like they were done one by one. If it can’t manage that, a query will fail and have to be retried by the application.
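The "retried by the application" part usually looks something like the loop below. This is only a sketch: `SerializationFailure` is a hypothetical stand-in for whatever error a real driver raises when a serializable transaction cannot be committed, and `flaky_transfer` fakes a transaction that fails twice before succeeding.

```python
import random
import time

class SerializationFailure(Exception):
    """Hypothetical stand-in for a driver's serialization-failure error."""

def run_with_retry(txn, max_attempts=5, base_delay=0.01):
    """Run txn(), retrying with jittered exponential backoff on serialization failures."""
    for attempt in range(1, max_attempts + 1):
        try:
            return txn()
        except SerializationFailure:
            if attempt == max_attempts:
                raise  # give up and let the caller decide what to do
            # Back off a little so competing transactions are less likely to collide again.
            time.sleep(base_delay * (2 ** (attempt - 1)) * random.random())

attempts = []

def flaky_transfer():
    # Pretend transaction: conflicts on the first two tries, commits on the third.
    attempts.append(1)
    if len(attempts) < 3:
        raise SerializationFailure("could not serialize access")
    return "committed"

result = run_with_retry(flaky_transfer)
print(result, len(attempts))
```

The important design point is that the whole transaction is rerun from the start, since any values it read on the failed attempt may no longer be valid.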

googletron 4 years ago | |

I believe there was a caveat around this exact point later in the post. It was really tough striking a balance between people learning this for the first time and a more knowledgeable audience without confusing them further.

I do appreciate the feedback and will look to add some more color here! Thank you!

blupbar123 4 years ago | | |

It's kind of saying something which isn't true. Optimally one would find a wording that doesn't confuse beginners but also is factual, IMHO.

bironran 4 years ago |

Nice post, though for the indexing "introduction-deep-dive" I would still recommend newbies to look at https://use-the-index-luke.com/ .

konfusinomicon 4 years ago | |

also check out rick james's mysql documents http://mysql.rjweb.org/

I send those 2 links to coworkers all the time

googletron 4 years ago | |

Great resource! I have it linked as a reference!

jwr 4 years ago |

Some of the explanations are questionable: I think they were overly simplified, and while I applaud the goal, some things just aren't that simple.

I highly recommend reading https://jepsen.io/consistency and clicking on each model on the map. This is the best resource I found so far for understanding databases, especially distributed ones.

petergeoghegan 4 years ago | |

> Some of the explanations are questionable: I think they were overly simplified, and while I applaud the goal, some things just aren't that simple.

I am an expert on the subject matter, and I don't think that the overall approach is questionable. The approach that the author took seems fine to me.

The definition of certain basic concepts like 'consistency' is even confusing to experts at times. This is made all the more confusing by introducing concepts from the distributed systems world, where consistency is often understood to mean something else.

Here's an example of that that I'm familiar with, where an expert admits to confusion about the basic definition of consistency in the sense that it appears in ACID:

https://queue.acm.org/detail.cfm?id=3469647

This is a person that is a longtime peer of the people that invented the concepts!

Not trying to rigorously define these things makes a great deal of sense in the context of a high level overview. Getting the general idea across is far more important.

googletron 4 years ago | |

I would love the feedback: what was questionable? Striking the balance is tough. Jepsen's content is great.

gumby 4 years ago | | |

Everyone can disagree on what is the precise place to slice "this is beginner content" from "this is almost-beginner content". I could stick my own oar in in this regard but I won't.

I think your level of abstraction is quite good for the absolute "what on earth are people talking about when they use that 'database' word?". With an extremely high level understanding, when they encounter more detail they'll have a "place to put it".

Diggsey 4 years ago | | |

One thing that can be surprising is that for "REPEATABLE READ", not all "reads" are actually repeatable.

There are at least two ways (that I'm aware of) that this can be violated. For example, if you run an update statement like this:

    UPDATE foo SET bar = bar + 1

Then the read of "bar" will always use the latest value, which may be different from the value other statements in the same transaction saw.

galaxyLogic 4 years ago |

https://github.com/prql/prql :

" Unlike SQL, it forms a logical pipeline of transformations, and supports abstractions such as variables and functions. It can be used with any database that uses SQL, since it transpiles to SQL. "

jandrewrogers 4 years ago |

> "Scale of data often works against you, and balanced trees are the first tool in your arsenal against it."

An ironic caveat to this is that balanced trees don't scale well, only offering good performance across a relatively narrow range of data size. This is a side-effect of being "balanced", which necessarily limits both compactness and concurrency.

That said, concurrent B+trees are an absolute classic and provide important historical context for the tradeoffs inherent in indexing. Modern hardware has evolved to the point where B+trees will often offer disappointing results, so their use in indexing has dwindled with time.

jrm4 4 years ago |

To go big picture; I'm kind of glad databases are largely like cars in this respect, in ways that other software tooling isn't.

Which is to say they're frequently good enough such that the human working with them on whatever level can safely not know a lot of these details and get a LOT done. Kudos to whoever deserves them here.

charcircuit 4 years ago | |

Isn't that true for almost all software? You only need to know the implementation of a small subset of parts. I would say databases are worse since you need to know how they are implemented else you will start making O(rows) queries or doing other inefficient stuff.

jrm4 4 years ago | | |

Going broadly (which is all I can do because I teach this stuff and don't build in depth) -- "the database" is the part I can most easily "abstract" away as if it were walled off?

As opposed to aspirationally discrete classifications that end up being porous, e.g. MVC, "Object Oriented" etc.

googletron 4 years ago |

This is a quick rundown of database indexes and transactions. Excited to continue sharing these notes with the community!

mgrouchy 4 years ago | |

I have been really enjoying the content so far, any hints on what's coming up?

googletron 4 years ago | | |

We have another couple of notes from a few companies like Temporal, Sentry, and Gadget.

trhoad 4 years ago |

An interesting subject! The article could do with an edit, however. There are lots of grammatical errors.

molly0 4 years ago |

Anyone read this pdf/book https://sql-performance-explained.com and would recommend?

r0b05 4 years ago |

Nicely written and informative!

googletron 4 years ago | |

Thank you!

manish_gill 4 years ago |

What tool was used to create the visuals?

praveenhm 4 years ago | |

I am guessing it was done on iPad

sonofacorner 4 years ago |

This is great. Thanks for sharing!

dennalp 4 years ago |

Really nice guide.

otherflavors 4 years ago |

why is this tagged "MySQL" but not also "SQL"

googletron 4 years ago | |

Thanks! Added!

throwaway787544 4 years ago |

Can anyone give me a brief understanding of stored procedures and when I should use them?

    CASE
      WHEN SUM(daily_revenue) OVER (PARTITION BY department, TRIM(SUBSTR(region, 5)) IN ('North','West','Misc'))
             > AVG(revenue) OVER (ORDER BY sale_time ASC ROWS BETWEEN 28 PRECEDING AND CURRENT ROW)
           AND NOT COALESCE(had_prev_month_party, FALSE)
        THEN pizza_party_points + 5
      WHEN <above> AND had_prev_month_party
        THEN pizza_party_points + 3
      WHEN MIN(sale_time) OVER (PARTITION BY department) = DATE_TRUNC('month', current_date)
        THEN 5
      ELSE GREATEST(pizza_party_points - 1, 0)
    END AS pizza_party_performance_points_current