My notes on Gitlab's Postgres schema design (2022)

My notes on Gitlab's Postgres schema design (2022)(shekhargulati.com)

488 points by daigoba66 2 years ago | 161 comments

> It is generally a good practice to not expose your primary keys to the external world. This is especially important when you use sequential auto-incrementing identifiers with type integer or bigint since they are guessable.

What value would there be in preventing guessing? How would that even be possible if requests have to be authenticated in the first place?

I see this "best practice" advocated often, but to me it reeks of security theater. If an attacker is able to do anything useful with a guessed ID without being authenticated and authorized to do so, then something else has gone horribly, horribly, horribly wrong and that should be the focus of one's energy instead of adding needless complexity to the schema.

The only case I know of where this might be valuable is from a business intelligence standpoint, i.e. you don't want competitors to know how many customers you have. My sympathy for such concerns is quite honestly pretty low, and I highly doubt GitLab cares much about that.

In GitLab's case, I'm reasonably sure the decision to use id + iid is less driven by "we don't want people guessing internal IDs" and more driven by query performance needs.

tetha 2 years ago | |

> I see this "best practice" advocated often, but to me it reeks of security theater. If an attacker is able to do anything useful with a guessed ID without being authenticated and authorized to do so, then something else has gone horribly, horribly, horribly wrong and that should be the focus of one's energy instead of adding needless complexity to the schema.

Yes, but the ability to guess IDs can make this security issue horrible, or much much worse.

If you had such a vulnerability and you are exposing the users to UUIDs, now people have to guess UUIDs. Even a determined attacker will have a hard time doing that or they would need secondary sources to get the IDs. You have a data breach, but you most likely have time to address it and then you can assess the amount of data lost.

If you can just <seq 0 10000 | xargs -I ID curl service/ticket/ID> the security issue is instantly elevated onto a whole new level. Suddenly all data is leaked without further effort and we're looking at mandatory report to data protection agencies with a massive loss of data.

To me, this is one of these defense in depth things that should be useless. And it has no effect in many, many cases.

But there is truely horrid software out there that has been popped in exactly the described way.

dijit 2 years ago | | |

Case in point, a recent security issue Gitlab experienced (CVE-2023-7028; arbitrary password reset by knowing one of the accounts associated mail addresses) was made worse by a feature of gitlab that few people know about; that the "userID" is associated with a meta/internal mail address.

This meant that people could send password resets for any user if they knew their userID. The mail format was like user-1@no-reply.gitlab.com or something.

Since it's a safe bet that "user ID 1" is an admin user, someone weaponised this.

s4i 2 years ago | |

It’s mentioned in the article. It’s more to do with business intelligence than security. A simple auto-incrementing ID will reveal how many total records you have in a table and/or their growth rate.

> If you expose the issues table primary key id then when you create an issue in your project it will not start with 1 and you can easily guess how many issues exist in the GitLab.

no-dr-onboard 2 years ago | | |

Business intelligence isn’t really applicable on the database level with guids..way too many abstraction layers down.

coldtea 2 years ago | |

>I see this "best practice" advocated often, but to me it reeks of security theater.

The idea of "security theater" is overplayed. Security can be (and should be) multilayered, it doesn't have to be all or nothing. So that, when they break a layer (say the authentication), they shouldn't automatically gain easy access to the others

>If an attacker is able to do anything useful with a guessed ID without being authenticated and authorized to do so, then something else has gone horribly, horribly, horribly wrong and that should be the focus of one's energy instead of adding needless complexity to the schema.

Sure. But by that time, it's will be game over if you don't also have the other layers in place.

The thing is that you can't anticipate any contigency. Bugs tend to not preannounce themselves, especially tricky nuanced bugs.

But when they do appear, and a user can "do [something] useful with an ID without being authenticated and authorized to do so" you'd be thanking all available Gods that you at least made the IDs not guassable - which would also give them also access to every user account on the system.

yellowapple 2 years ago | | |

> Security can be (and should be) multilayered, it doesn't have to be all or nothing.

In this case the added layer is one of wet tissue paper, at best. Defense-in-depth is only effective when the different layers are actually somewhat secure in their own right.

It's like trying to argue that running encrypted data through ROT13 is worthwhile because "well it's another layer, right?".

> you'd be thanking all available Gods that you at least made the IDs not guassable - which would also give them also access to every user account on the system.

I wouldn't be thanking any gods, because no matter what those IDs look like, the only responsible thing in such a situation is to assume that an attacker does have access to every user account on the system. Moving from sequential IDs to something "hard" like UUIDs only delays the inevitable - and the extraordinarily narrow window in which that delay is actually relevant ain't worth considering in the grand scheme of things. Moving from sequential IDs to something like usernames ain't even really an improvement at all, but more of a tradeoff; yeah, you make life slightly harder for someone trying to target all users, but you also make life much easier for someone trying to target a specific user (since now the attacker can guess the username directly - say, based on other known accounts - instead of having to iterate through opaque IDs in the hopes of exposing said username).

metafunctor 2 years ago | |

Bugs happen also in access control. Unguessable IDs make it much harder to exploit some of those bugs. Of course the focus should be on ensuring correct access control in the first place, but unguessable IDs can make the difference between a horrible disaster and a close call.

It's also possible to use auto-incrementing database IDs and encrypt them, if using UUIDs doesn't work for you. With appropriate software layers in place, encrypted IDs work more or less automatically.

lordgrenville 2 years ago | |

> The only case where this might be valuable is business intelligence

Nitpick: I would not call this "business intelligence" (which usually refers to internal use of the company's own data) but "competitive intelligence". https://en.wikipedia.org/wiki/Competitive_intelligence

EE84M3i 2 years ago | | |

See also "German Tank Problem" https://en.m.wikipedia.org/wiki/German_tank_problem

remus 2 years ago | |

In general it's a defense-in-depth thing. You definitely shouldn't be relying on it, but as an attacker it just makes your life a bit harder if it's not straightforward to work out object IDs.

For example, imagine you're poking around a system that uses incrementing ints as public identifiers. Immediately, you can make a good guess that there's probably going to be some high privileged users with user_id=1..100 so you can start probing around those accounts. If you used UUIDs or similar then you're not leaking that info.

In gitlabs case this is much less relevant, and it's more fo a cosmetic thing.

serial_dev 2 years ago | | |

> In gitlabs case this is much less relevant (...)

Why, though? GitLab is often self hosted, so being able to iterate through objects, like users, can be useful for an attacker.

worksonmine 2 years ago | |

> What value would there be in preventing guessing?

It prevents enumeration, which may or may not be a problem depending on the data. If you want to build a database of user profiles it's much easier with incremental IDs than UUID.

It is at least a data leak but can be a security issue. Imagine a server doing wrong password correctly returning "invalid username OR password" to prevent enumeration. If you can still crawl all IDs and figure out if someone has an account that way it helps filter out what username and password combinations to try from previous leaks.

Hackers are creative and security is never about any single protection.

yellowapple 2 years ago | | |

> If you can still crawl all IDs and figure out if someone has an account that way it helps filter out what username and password combinations to try from previous leaks.

Right, but like I suggested above, if you're able to get any response other than a 404 for an ID other than one you're authorized to access, then that in and of itself is a severe issue. So is being able to log in with that ID instead of an actual username.

Hackers are indeed creative, but they ain't wizards. There are countless other things that would need to go horribly horribly wrong for an autoincrementing ID to be useful in an attack, and the lack of autoincrementing IDs doesn't really do much in practice to hinder an attacker once those things have gone horribly, horribly wrong.

I can think of maybe one exception to this, and that's with e-commerce sites providing guest users with URLs to their order/shipping information after checkout. Even this is straightforward to mitigate (e.g. by generating a random token for each order and requiring it as a URL parameter), and is entirely inapplicable to something like GitLab.

JimBlackwood 2 years ago | |

I follow this best practice, there’s a few reasons why I do this. It doesn’t have to do with using a guessed primary ID for some sort of privilege escalation, though. It has more to do with not leaking any company information.

When I worked for an e-commerce company, one of our biggest competitors used an auto-incrementing integer as primary key on their “orders” table. Yeah… You can figure out how this was used. Not very smart by them, extremely useful for my employer. Neither of these will allow security holes or leak customer info/payment info, but you’d still rather not leak this.

tomnipotent 2 years ago | | |

> extremely useful for my employer.

I've been in these shoes before, and finding this information doesn't help you as an executive or leader make any better decisions than you could have before you had the data. No important decision is going to be swayed by something like this, and any decision that is probably wasn't important.

Knowing how many orders is placed isn't so useful without average order value or items per cart, and the same is true for many other kinds of data gleamed from this method.

kehers 2 years ago | |

One good argument I found [^1] about not exposing primary keys is that primary keys may change (during system/db change) and you want to ensure users have a consistent way of accessing data.

[^1]: https://softwareengineering.stackexchange.com/questions/2183...

dalore 2 years ago | |

It's also exposes your growth metrics. When using sequential id's one can tell how many users you have, how many users a month you are getting and all sorts of useful stuff that you probably don't want to expose.

It's how the British worked out how many tanks the German army had.

SkyMarshal 2 years ago | |

> This is especially important when you use sequential auto-incrementing identifiers with type integer or bigint since they are guessable.

I thought we had long since moved past that to GUIDs or UUIDs for primary keys. Then if you still need some kind of sequential numbering that has meaning in relation to the other fields, make a separate column for that.

sgarland 2 years ago | | |

Except now people are coming back around, because they’re realizing (as the article mentions) that UUID PKs come with enormous performance costs. In fairness, any non-k-sortable ID will suffer the same fate, but UUIDs are the most common of the bunch.

strzibny 2 years ago | |

Whether you think it's a real problem or not, if you want to solve it somehow, I compiled the current options one have in Rails apps:

https://nts.strzibny.name/alternative-bigint-id-identifiers-...

cyberfart 2 years ago | |

There are also reasons outside infosec concerns. For example where such PKs would be directly related to your revenue, such as orders in an e-commerce platform. You wouldn't want competitors to have an estimate of your daily volume, that kind of thing.

mnahkies 2 years ago | |

It can be really handy for scraping/archiving websites if they're kind enough to use a guessable id

heax 2 years ago | |

It really depends but useful knowledege can be derived from this. If user accounts use sequential ids the id 1 is most likely the admin account that is created as first user.

kelnos 2 years ago |

> For example, Github had 128 million public repositories in 2020. Even with 20 issues per repository it will cross the serial range. Also changing the type of the table is expensive.

I expect the majority of those public repositories are forks of other repositories, and those forks only exist so someone could create pull requests against the main repository. As such, they won't ever have any issues, unless someone makes a mistake.

Beyond that, there are probably a lot of small, toy projects that have no issues at all, or at most a few. Quickly-abandoned projects will suffer the same fate.

I suspect that even though there are certainly some projects with hundreds and thousands of issues, the average across all 128M of those repos is likely pretty small, probably keeping things well under the 2B limit.

Having said that, I agree that using a 4-byte type (well, 31-bit, really) for that table is a ticking time bomb for some orgs, github.com included.

zetalyrae 2 years ago |

The point about the storage size of UUID columns is unconvincing. 128 bits vs. 64 bits doesn't matter much when the table has five other columns.

A much more salient concern for me is performance. UUIDv4 is widely supported but is completely random, which is not ideal for index performance. UUIDv7[0] is closer to Snowflake[1] and has some temporal locality but is less widely implemented.

There's an orthogonal approach which is using bigserial and encrypting the keys: https://github.com/abevoelker/gfc64

But this means 1) you can't rotate the secret and 2) if it's ever leaked everyone can now Fermi-estimate your table sizes.

Having separate public and internal IDs seems both tedious and sacrifices performance (if the public-facing ID is a UUIDv4).

I think UUIDv7 is the solution that checks the most boxes.

[0]: https://uuid7.com/

[1]: https://en.wikipedia.org/wiki/Snowflake_ID

traceroute66 2 years ago |

Slight nit-pick, but I would pick up the author on the text vs varchar section.

The author effectively wastes many words trying to prove a non-existent performance difference and then concludes "there is not much performance difference between the two types".

This horse bolted a long time ago. Its not "not much", its "none".

The Postgres Wiki[1] explicitly tells you to use text unless you have a very good reason not to. And indeed the docs themselves[2] tell us that "For many purposes, character varying acts as though it were a domain over text" and further down in the docs in the green Tip box, "There is no performance difference among these three types".

Therefore Gitlab's use of (mostly) text would indicate that they have RTFM and that they have designed their schema for their choice of database (Postgres) instead of attempting to implement some stupid "portable" schema.

[1] https://wiki.postgresql.org/wiki/Don%27t_Do_This#Don.27t_use... [2] https://www.postgresql.org/docs/current/datatype-character.h...

alex_smart 2 years ago | |

>The author effectively wastes many words trying to prove a non-existent performance difference and then concludes "there is not much performance difference between the two types".

They then also show that there is in fact a significant performance difference when you need to migrate your schema to accodomate a change in length of strings being stored. Altering a table to a change a column from varchar(300) to varchar(200) needs to rewrite every single row, where as updating the constraint on a text column is essentially free, just a full table scan to ensure that the existing values satisfy your new constraints.

FTA:

>So, as you can see, the text type with CHECK constraint allows you to evolve the schema easily compared to character varying or varchar(n) when you have length checks.

traceroute66 2 years ago | | |

> They then also show that there is in fact a significant performance difference when you need to migrate your schema to accodomate a change in length of strings being stored.

Which is a pointless demonstration if you RTFM and design your schema correctly, using text, just like the manual and the wiki tells you to.

> the text type with CHECK constraint allows you to evolve the schema easily compared to character varying or varchar(n) when you have length checks.

Which is exactly what the manual tells you ....

"For many purposes, character varying acts as though it were a domain over text"

exabrial 2 years ago |

Foreign keys are expensive is an oft repeated rarely benched claim. There are tons of ways to do it incorrectly. But in your stack you are always enforcing integrity _somewhere_ anyway. Leveraging the database instead of reimplementing it requires knowledge and experimentation, and it more often than not it will save your bacon.

bluerooibos 2 years ago |

Has anyone written about or noticed the performance differences between Gitlab and GitHub?

They're both Rails-based applications but I find page load times on Gitlab in general to be horrific compared to GitHub.

golergka 2 years ago | |

I used Gitlab a few years ago, but then it had severe client-side performance problems on large pull requests. Github isn't ideal with them too, but it manages to be decent.

wdb 2 years ago | | |

Yeah, good reason to spit up pull requests ;) I do think it improved a lot over the last two years, though

oohffyvfg 2 years ago | |

> compared to GitHub.

this is like comparing chrome and other browsers, even chromium based.

chrome and github will employ all tricks in the book, even if they screw you. for example, how many hours of despair I've wasted when manually dissecting a git history on employer github by opening merge diffs, hitting ctrl F, seeing no results and moving to the next... only to find on the 100th diff that deep down the diff lost they hid the most important file because it was more convenient for them (so one team lead could hit some page load metric and get a promotion)

heyoni 2 years ago | |

I mean GitHub in general has been pretty reliable minus the two outages they had last year and is usually pretty performant or I wouldn’t use their keyboard shortcuts.

There are some complaints here from a former dev about gitlab that might provide insight into its culture and lack of regard for performance: https://news.ycombinator.com/item?id=39303323

Ps: I do not use gitlab enough to notice performance issues but thought you might appreciate the article

vinnymac 2 years ago |

I always wondered what the purpose of that extra “I” was in the CI variables `CI_PIPELINE_IID` and `CI_MERGE_REQUEST_IID` were for. Always assumed it was a database related choice, but this article confirms it.

gfody 2 years ago |

> 1 quintillion is equal to 1000000000 billions

it is pretty wild that we generally choose between int32 and int64. we really ought to have a 5 byte integer type which would support cardinalities of ~1T

klysm 2 years ago | |

Yeah it doesn't make sense to pick something that's not a power of 2 unless you are packing it.

postalrat 2 years ago | | |

I guess that depends on how the index works.

gfody 2 years ago | | |

we usually work with some page size anyways, 64 5-byte ints fit nicely into 5 512-bit registers, and ~1T is a pretty flexible cardinality limit after ~2B

azlev 2 years ago |

It's reasonable to not have auto increment id's, but it's not clear to me if there is benefits to have 2 IDs, one internal and one external. This increases the number of columns / indexes, makes you always do a lookup first, and I can't see a security scenario where I would change the internal key without changing the external key. Am I missing something?

Aeolun 2 years ago | |

You always have the information at hand anyway when doing anything per project. It’s also more user friendly to have every project’s issues start with 1 instead of starting with two trillion, seven hundred billion, three hundred and five million, sevenhundred and seventeen thousand three hundred twentyfive.

josephg 2 years ago |

> As I discussed in an earlier post[3] when you use Postgres native UUID v4 type instead of bigserial table size grows by 25% and insert rate drops to 25% of bigserial. This is a big difference.

Does anyone know why UUIDv4 is so much worse than bigserial? UUIDs are just 128 bit numbers. Are they super expensive to generate or something? Whats going on here?

AprilArcus 2 years ago | |

UUIDv4s are fully random, and btree indices expect "right-leaning" values with a sensible ordering. This makes indexing operations on UUIDv4 columns slow, and was the motivation for the development of UUIDv6 and UUIDv7.

couchand 2 years ago | | |

I'm curious to learn more about this heuristic and how the database leverages it for indexing. What does right-leaning mean formally and what does analysis of the data structure look like in that context? Do variants like B+ or B* have the same charactersistics?

eezing 2 years ago |

We shouldn’t assume that this schema was designed all at once, but rather is the product of evolution. For example, maybe the external_id was added after the initial release in order to support the creation of unique ids in the application layer.

martinald 2 years ago |

Is it just me that thinks in general schema design and development is stuck in the stone ages?

I mainly know dotnet stuff, which does have migrations in EF (I note the point about gitlab not using this kind of thing because of database compatibility). It can point out common data loss while doing them.

However, it still is always quite scary doing migrations, especially bigger ones refactoring something. Throw into this jsonb columns and I feel it is really easy to screw things up and suffer bad data loss.

For example, renaming a column (at least in EF) will result in a column drop and column create on the autogenerated migrations. Why can't I give the compiler/migration tool more context on this easily?

Also the point about external IDs and internal IDs - why can't the database/ORM do this more automatically?

I feel there really hasn't been much progress on this since migration tooling came around 10+ years ago. I know ORMs are leaky abstractions, but I feel everyone reinvents this stuff themselves and every project does these common things a different way.

Are there any tools people use for this?

rob137 2 years ago |

I found this post very useful. I'm wondering where I could find others like it?

aflukasz 2 years ago | |

I recommend Postgres FM podcast, e.g. available as video on Postgres TV yt channel. Good content on its own, and many resources of this kind are linked in the episode notes. I believe one of the authors even helped Gitlab specifically with Postgres performance issues not that long ago.

rglynn 2 years ago | |

I like this, although a bit narrower - https://blog.mastermind.dev/indexes-in-postgresql

cosmicradiance 2 years ago | |

Here's another - https://zerodha.tech/blog/working-with-postgresql/

sidcool 2 years ago |

Great read! And even better comments here.

firemelt 2 years ago |

so anyone use schema.rb in production? even dhh once campfire use .sql instead schema.rb