UUID, serial or identity columns for PostgreSQL auto-generated primary keys?

UUID, serial or identity columns for PostgreSQL auto-generated primary keys? (cybertec-postgresql.com)

204 points by lhenk 5 years ago | 178 comments

3pt14159 5 years ago |

> Now, sometimes a table has a natural primary key, for example the social security number of a country’s citizens.

You know, you think that, but it's never that simple. The field was added incorrectly and nobody noticed until the value was in countless tables that you now need to update simultaneously. Or the value is something that's supposed to be semi-secret, so now low-level support staff can't reference the row when dealing with a request. Or the table's requirements change and now you need to track two different kinds of data, or data that is missing the field.

Me, I always just have the table make its own ID. It is just simpler, even when you think it is overkill.

the_gipsy 5 years ago | |

In Spain each person has a unique ID number assigned at birth. The numbers for newborns are geographically pre-distributed to guarantee uniqueness despite delay in paperwork. It is universally accepted that this ID "number" (it actually has one letter too) is all you need to identify yourself, ever.

Except that I knew a coworker who had a duplicate ID. An extremely rare event, they messed up the pre-assignment and there is another dude somewhere with his same ID. So from time to time, some system would tell him that his ID was already registered. A lot of banks and stuff like private healthcare systems like to use the DNI as usernames.

He tried to get his ID changed, but that was such a foreign concept to any of the involved institutions, that he had to give up because there simply is no such procedure. I guess he could have taken it to court, but the guy decided to just live with it (the justice system is quite slow here).

Cthulhu_ 5 years ago | | |

The fact that it's fixed / can never be changed is a massive problem with social security numbers. That, and the fact it's often used as authentication instead of identification. They're moving away from that slowly, but it's taking a lot of time and effort.

wwweston 5 years ago | |

It does seem that a "natural key" is frequently just a really foreign key in a database you and your org don't manage.

jeff-davis 5 years ago | | |

That's a good observation. The only meaningful distinction between a natural key and a surrogate key is whether the number ever escapes the original system.

For instance, a driver's license number is printed on the card itself, so a human sees it. Therefore, it's a natural key, just like a name.

When you decide that whatever natural keys already exist aren't good enough for your organization, and you make a new key, it's not good to think of that as a surrogate key. The number will make it out somehow (as a "record locator" in a customer support call or something), and eventually become a natural key.

It's best to just plan for any new key to be a natural key, which means using best practices for natural keys. That means it should be something reasonable to print, read, say, and hear; and it should also follow a pattern so it can be distinguished from other special numbers.

Auto-increment is a shortcut, but usually not great in the long term unless it's something that will be well-contained inside the database as an implementation detail (e.g. a join key designed to refer to rarely-accessed fields of a wide table).

dragonwriter 5 years ago | |

> > Now, sometimes a table has a natural primary key, for example the social security number of a country’s citizens.

> You know, you think that, but it's never that simple.

It’s that simple if you’re the Social Security Administration and it’s a table of Social Security Accounts, not people.

Other than that, using SSNs as a primary key is just plain wrong.

Izkata 5 years ago | | |

> It’s that simple if you’re the Social Security Administration and it’s a table of Social Security Accounts, not people.

Nah, they keep track of duplicate usage: https://www.nbcnews.com/technolog/odds-someone-else-has-your...

> The IRS often knows when this happens, when the imposter pays taxes. The Social Security Administration knows, too, for the same reason. And the nation's credit bureaus usually know, because the imposter often ends up applying for some form of credit. Plenty of financial institutions also have access to this information.

chrischen 5 years ago | | |

Maybe using them as passwords is what's wrong.

iratewizard 5 years ago | | |

Right? Credit card numbers as a primary key is far more efficient.

wvenable 5 years ago | |

I finally got our company to standardize on someone's employee number as a primary key for everything employee related. It's a simple monotonically increasing integer value -- the best possible primary key.

We moved to a new HR system and they have a set of "reserved" employee numbers that cannot be used and we have employee numbers in that range. Arg!

robaato 5 years ago | | |

We had a classic situation at a software house I worked at in the '80s - employee numbers were 1-999 and then jumped to 5,000 - because, you guessed it, this "unique" field was used with magic numbers 1,000 - 5,000 being reserved for project ids in various key accounting systems!

And we were supposed to teach our customers good design principles...

berkes 5 years ago | |

In an event-sourced setup, I found that this is less of a problem for projections. It also opens up some possibilities, like more semantic schemas and simpler APIs.

A projection in ES is more of a cache, not your primary store. The primary store is the event log. The latter should, obviously, never use natural IDs.

tomnipotent 5 years ago | |

I've been bitten by using natural keys on several occasions, but I can't think of a time surrogate keys failed me beyond the tediousness of implementation.

Thiez 5 years ago | |

In the Netherlands SSNs are not unique, they handed out some duplicate ones back in the day. So not great as a primary key. Besides, I think using them as primary keys is illegal anyway.

mrweasel 5 years ago | | |

In Denmark they are specifically not allowed to be used as a primary key. I mean many do, but technically you're supposed to have a separate internal ID and then you use the SSN to look up that ID.

Aeolun 5 years ago | | |

Huh? How does sign-in work for people with duplicate ID’s?

sneak 5 years ago | | |

Imagine a man with a stick or gun coming to your office to shackle you in chains for how you arranged your database schema.

magicpointer 5 years ago |

About UUID as Primary Key and performance, the following article has some insights and benchmarks as well: https://www.2ndquadrant.com/en/blog/sequential-uuid-generato...

Essentially, they observed sizeable performance improvements by using UUID generators that are tweaked to produce more sequential results, which leads to better indexes. The article compares sequences, random UUIDs and 2 kinds of sequential-ish UUID generators.

pritambarhate 5 years ago |

A little late to comment here. But for database IDs, I have found that Instagram's technique to generate IDs works very well: https://instagram-engineering.com/sharding-ids-at-instagram-...

They are not serially incrementing but are still sortable, which prevents the index fragmentation issues observed with UUIDs. They are 8 bytes long, so the index is smaller than with UUIDs. So you get all the benefits of serial IDs, but they are not easily guessable, which prevents sequential access attacks.
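For readers who have not opened the linked post: the layout it describes is 41 bits of milliseconds since a custom epoch, 13 bits of logical shard ID, and 10 bits of a per-shard sequence taken modulo 1024. Below is a rough Python sketch of that packing. The epoch constant and the function names are placeholders chosen for illustration, not Instagram's actual code (theirs runs inside PostgreSQL).

```python
# Hypothetical custom epoch (2011-01-01 UTC, in milliseconds); pick your own.
CUSTOM_EPOCH_MS = 1_293_840_000_000

TIME_BITS, SHARD_BITS, SEQ_BITS = 41, 13, 10


def make_id(now_ms: int, shard_id: int, seq: int) -> int:
    """Pack (time, shard, sequence) into one 64-bit integer."""
    elapsed = now_ms - CUSTOM_EPOCH_MS
    if not 0 <= elapsed < (1 << TIME_BITS):
        raise ValueError("timestamp outside the 41-bit window")
    if not 0 <= shard_id < (1 << SHARD_BITS):
        raise ValueError("shard_id needs to fit in 13 bits")
    return (
        (elapsed << (SHARD_BITS + SEQ_BITS))
        | (shard_id << SEQ_BITS)
        | (seq % (1 << SEQ_BITS))  # sequence wraps modulo 1024
    )


def split_id(value: int) -> tuple:
    """Recover (timestamp in ms, shard_id, sequence) from an ID."""
    seq = value & ((1 << SEQ_BITS) - 1)
    shard_id = (value >> SEQ_BITS) & ((1 << SHARD_BITS) - 1)
    elapsed = value >> (SHARD_BITS + SEQ_BITS)
    return elapsed + CUSTOM_EPOCH_MS, shard_id, seq
```

Because the timestamp occupies the high bits, sorting by ID sorts by creation time. Note that split_id() works for anyone who knows the layout, so every ID exposes its own shard and creation time.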

orangepanda 5 years ago | |

> With more than 25 photos and 90 likes every second

What unimaginable scale

giansegato 5 years ago | | |

That was in 2012, when they "only" had 15M users

Today, a decade later, they're at 1.074B

codeflo 5 years ago | |

> they are not easily guessable

I don't see how that's true. From reading the article you linked, you only need a valid shard ID (which you can extract from known IDs), the millisecond (which is guessable) and a 10-bit sequence (which you can easily brute-force).

(And that's completely fine if their security model doesn't require unguessable IDs.)

pritambarhate 5 years ago | | |

>> which you can easily brute-force

It will result in a very high number of 404s. These can be monitored and the origin IPs can be banned.

pmontra 5 years ago |

Meta: this company wrote an impressive number of articles about PostgreSQL since 2013. List at https://www.cybertec-postgresql.com/en/tag/postgresql/

lhenk 5 years ago | |

Also, here's a list of blog posts from Laurenz Albe (the author of the OP post): https://www.cybertec-postgresql.com/en/author/cybertec_albe/ His blog posts are a great read, I'd recommend checking them out!

aidos 5 years ago | |

I just had to do a double take as I was reading a stack overflow post at the same time and recognised it as the same author.

lhenk 5 years ago | | |

Laurenz (the author) was Postgres person of the week not too long ago: https://postgresql.life/post/laurenz_albe/

conradfr 5 years ago |

UUIDs are great when you use the id "publicly" but using an incremental value would be too revealing for different reasons.

So it's good to know that the performance is not bad.

eric4smith 5 years ago |

Simple rules:

Use integer primary keys internally for identifiers and relationships.

Use English/other-language permalinks for URLs.

Use UUIDs in places like APIs, one-time action links, and "private" links that you only want to share with other people.

Worked fine for me for many, many years.

sk5t 5 years ago | |

A vote here against integer/serial PKs, not only because they leak information, but also because they can result in incorrect joins.

IME it's much more often I've quickly made a table with a serial PK and later wished it were uuid; just about never made a uuid and later wished for the compactness or natural clustering of bigint. Maybe for a table of millions and millions of time-ordered events.

eric4smith 5 years ago | | |

Note I said "internal use". But how can primary keys result in incorrect joins?

Unless you're changing a foreign key, joins will always be correct.

Unless I'm doing something wrong in the last 30 years of using SQL.

runeks 5 years ago | | |

> […] but also because they can result in incorrect joins.

Side question: can I get Postgres to throw an error if I try to join on two IDs where neither of the IDs has a foreign key reference to the other?

simonw 5 years ago |

Something I really like about integer incrementing IDs is that you can run ad-hoc "select * from table order by id desc limit 10" queries to see the most recently inserted rows.

I end up doing this a lot when I'm trying to figure out how my applications are currently being used.

Strictly incrementing UUIDs can offer the same benefit.

barrkel 5 years ago |

Another point: if there's any temporal locality to your future access patterns - if you're more likely to access multiple rows which were inserted at roughly the same time - then allocating sequential identifiers brings those entries closer together in the primary key index.

I used to work on a reconciliation system which inserted all its results into the database. Only the most recent results were heavily queried, with a long tail of occasional lookups into older results. We never had a problem with primary key indexes (though this was in MySQL, which uses a clustered index on the primary key for row storage, so it's an even bigger benefit); the MD5 column used for identifying repeating data, on the other hand, would blow out the cache on large customers' instances.

vinayan3 5 years ago | |

To add on: if you are joining on a UUID column, the join becomes quite slow with very large tables, like >10 million rows.

PG will say it's doing a hash lookup and you'd think it'd be fast, but it will take quite some time relative to joining two large tables with integer IDs. With UUIDs, PG will sometimes give up doing a hash lookup and try to do table scans unless you adjust random_page_cost.

In general joining on UUIDs for large tables is a bad idea. It can be great if you are joining a single row to another row.

foresto 5 years ago |

I once pondered how I might generate IDs that were as compact as a machine word, without a value (or small set of values) revealing the size of the data set. One application might be user-visible customer numbers that don't easily reveal how many customers there are.

I eventually came across the idea of using maximal period linear-feedback shift registers to transform an integer variable through every possible value (minus one), but in a non-incremental sequence that depends on the LFSR arrangement.

I never ended up putting the idea to use, but I've always been curious about people who have and how it worked out for them. [Edit to clarify: It was meant for obfuscation, not security against a determined attacker.]
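As an illustration of the idea, here is a minimal Python sketch of a 16-bit maximal-period LFSR. It uses the commonly cited tap set 16, 14, 13, 11; the seed, the 16-bit width, and the function names are arbitrary choices for the example, and a real customer-number scheme would want a wider register.

```python
def lfsr16_next(state: int) -> int:
    """Advance a 16-bit Fibonacci LFSR (taps 16, 14, 13, 11) by one step."""
    bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (bit << 15)


def lfsr16_sequence(seed: int = 0xACE1):
    """Yield each nonzero 16-bit value exactly once, starting at the seed."""
    if not 0 < seed < (1 << 16):
        raise ValueError("seed must be a nonzero 16-bit value")
    state = seed
    while True:
        yield state
        state = lfsr16_next(state)
        if state == seed:  # back at the start: the full cycle is done
            return
```

To hand out IDs, you would persist only the current state and call lfsr16_next() for each new row. Zero is the one value never produced, which is the "minus one" mentioned above. The sequence is fully determined by the taps, so observing enough consecutive values lets someone reconstruct it; that matches the obfuscation-only caveat.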

dpifke 5 years ago | |

I've used a small block cipher like Skip32 or Speck to obfuscate database sequences, either on INSERT or as part of the encoding scheme.

This works well against the German Tank Problem when there's no oracle allowing an attacker to guess lots of IDs quickly (such as when there are reasonable rate limits). It does not provide enough entropy when such an oracle exists (especially an offline one).

For something like a password reset token, it still needs to be paired with suitably random bytes.
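The general shape of this approach can be sketched with a tiny keyed Feistel network. To be clear, the code below is neither Skip32 nor Speck: it is a toy stand-in that uses truncated HMAC-SHA256 as the round function, only to show how a keyed permutation maps a 32-bit sequence value to a scrambled 32-bit value and back. The round count and all names are assumptions made for the example.

```python
import hashlib
import hmac

ROUNDS = 4
HALF_BITS = 16
HALF_MASK = (1 << HALF_BITS) - 1


def _round(key: bytes, round_no: int, half: int) -> int:
    """Keyed round function: HMAC-SHA256 truncated to 16 bits."""
    msg = bytes([round_no]) + half.to_bytes(2, "big")
    digest = hmac.new(key, msg, hashlib.sha256).digest()
    return int.from_bytes(digest[:2], "big")


def encode(key: bytes, value: int) -> int:
    """Map a 32-bit sequence value to a scrambled 32-bit value."""
    if not 0 <= value < (1 << 32):
        raise ValueError("value must fit in 32 bits")
    left, right = value >> HALF_BITS, value & HALF_MASK
    for r in range(ROUNDS):
        left, right = right, left ^ _round(key, r, right)
    return (left << HALF_BITS) | right


def decode(key: bytes, value: int) -> int:
    """Invert encode() by running the rounds backwards."""
    if not 0 <= value < (1 << 32):
        raise ValueError("value must fit in 32 bits")
    left, right = value >> HALF_BITS, value & HALF_MASK
    for r in reversed(range(ROUNDS)):
        left, right = right ^ _round(key, r, left), left
    return (left << HALF_BITS) | right
```

A Feistel construction is a permutation regardless of the round function, so distinct sequence values never collide after encoding. The database can keep its plain sequence internally while only encoded values are shown externally.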

BatteryMountain 5 years ago | |

Please see my previous comment, feel free to give feedback.

So far I haven't encountered any problems in the short term by using the approach described.

slver 5 years ago | |

The problem is that if your encoding algorithm leaks, it’s game over.

topspin 5 years ago |

I just started a little side project and chose to use UUID for Postgresql keys. The schema is highly generic and I anticipate the possibility of merging instances. UUID precludes collisions in such a case.

RedShift1 5 years ago | |

That includes foreign keys?

topspin 5 years ago | | |

Yes.

cratermoon 5 years ago |

Postgres (and other relational DBs) really need to implement something like snowflake[1] or ksuid[2]

1 https://blog.twitter.com/engineering/en_us/a/2010/announcing...

2 https://segment.com/blog/a-brief-history-of-the-uuid/

vbsteven 5 years ago |

I’m currently prototyping a little database+api+cli todo app and I want identifiers that can be abbreviated in the same way as partial git commit hashes can be used on the command line. What should I use?

I was thinking of generating random character strings and simply retry when the db throws duplicate key error on insert. No sharding is necessary and I’d like to have efficient foreign keys. Any thoughts?

alexis2b 5 years ago | |

You could try NanoID[0]? Seems available in many languages.

[0] https://blog.bibekkakati.me/nanoid-alternative-to-uuid

mutatio 5 years ago | |

You could use a serial int and just hex-encode when interacting with the CLI? You could then use range queries to match short hashes by zeroing out the remaining bytes and using >=
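A small sketch of that range trick, assuming IDs fit in 32 bits and are shown as 8 zero-padded hex digits. The width and function names are made up for the example.

```python
ID_HEX_DIGITS = 8  # assumes IDs fit in 32 bits; widen for bigint keys


def to_cli(row_id: int) -> str:
    """Render an ID the way the CLI would display it."""
    return format(row_id, f"0{ID_HEX_DIGITS}x")


def prefix_bounds(prefix: str) -> tuple:
    """Return the inclusive (low, high) ID range matched by a hex prefix."""
    if not 0 < len(prefix) <= ID_HEX_DIGITS:
        raise ValueError("prefix length out of range")
    missing = ID_HEX_DIGITS - len(prefix)
    low = int(prefix, 16) << (4 * missing)    # pad the prefix with zeros
    high = low | ((1 << (4 * missing)) - 1)   # pad the prefix with f's
    return low, high
```

The two bounds can then go into an ordinary indexed range query such as WHERE id BETWEEN low AND high. One caveat: small serial values all share the same leading zeros, so short prefixes are far less selective here than they are with git hashes.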

adav 5 years ago | |

Check out linear congruential generators or other pseudorandom number generators. Then map the resulting number to letters.

rsync 5 years ago |

I have no particular expertise with modern databases and it has been decades since I did any work as a DBA.

However, I cannot imagine creating table entries without a datestamp. No matter what else you are doing, or what you index by, I would want YYYY-MM-DD_HH-MM-SS in every row.

Maybe I'm just weird that way ...

BatteryMountain 5 years ago | |

Same. Every entity always gets a created column at the minimum, that way when we query later we can order by created to see the last few days worth of data first. Can't do that if you don't know when something was created.

NoNotTheDuo 5 years ago | |

And ideally there is a created time stamp and a last updated time stamp.

BenjiWiebe 5 years ago | | |

(at least in my cases) Ideally nothing ever gets updated, there's just a newer version of the row.

BatteryMountain 5 years ago |

I feel the whole debate is overkill: 99% of businesses/systems will never have so much data that they NEED to use uuid's. I personally don't like using integers for keys either as I've been burnt by them before. I also doubt any software I build today or have built in the last 10 years will be used 100 years from now.

Recently I built a new system (typical business-type backend) and was forced to use sqlite + C# + dapper. Using this combination I cannot use guid/uuid, as dapper cannot properly map it back to C# from sqlite, and my dislike of ints got me thinking.

I have a random string generator (I have used it for years for things like OTPs and other reference numbers), where I give it an alphabet + length of the desired string. Using 8 to 12 characters, I can get a few million unique permutations. That is, if used as a primary key, a few million per database table. Then I hear in the back of my head the guys from work who would argue that I would run out of unique combinations or would have to do lookups to see if they exist. So I decided to slap the year and month on it as a prefix, so a key might look like this: 2105HSUAMWPA.

This gets indexed really well too, and there is some inherent information that can be seen from looking at the key: year 21, month 5 and then the unique bits. It's basically 4 lines of code that gets called on every new database entity. I think it will be easy to shard/partition the data too if the need arises in the future, by simply looking at the first 4 digits.

Thus to summarize:

Data is sliced by entity type (customer, invoice, etc), then by date (2105 for May 2021) then by unique string.

What do you guys think about this approach? Anyone been burnt by something like this?
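The original is C#, but a Python sketch of the same scheme might look like the following. The uppercase A-Z alphabet and the 8-character suffix are inferred from the sample key 2105HSUAMWPA; the function names are hypothetical.

```python
import datetime
import secrets
import string

ALPHABET = string.ascii_uppercase  # assumed from the sample key above
SUFFIX_LENGTH = 8


def new_key(today: datetime.date) -> str:
    """Build a key such as '2105HSUAMWPA': YYMM plus a random suffix."""
    suffix = "".join(secrets.choice(ALPHABET) for _ in range(SUFFIX_LENGTH))
    return today.strftime("%y%m") + suffix


def month_of(key: str) -> tuple:
    """Read the (year, month) back out of a key's prefix."""
    return 2000 + int(key[:2]), int(key[2:4])
```

Under those assumptions there are 26**8 = 208,827,064,576 possible suffixes per month, considerably more than a few million. Random draws can still collide well before the space is exhausted, so a unique constraint with a retry on conflict seems worth keeping.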

panny 5 years ago |

It seems like int vs bigint is brushed off rather quickly here. bigint is twice the size of int, therefore indexing will be larger as well. Furthermore, all the FK storage and indexing will also be bloated by this choice. If you design a customer table with a bigint PK, and everything will point to customer (invoices, billing statements, etc), then that's not an insignificant amount of space. While most of us may want to have "billions served" like McDonald's, the reality is my company and your company will never have 2 billion customer accounts, even in the wildest of imaginations. If you ever did reach this point, it's "a good problem to have" and relatively easy to move from int -> bigint. Moving in the reverse direction is likely difficult or impossible.

It would be nice to see real benchmarking on millions of rows to compare the three, but my gut tells me you use int by default, bigint if you outgrow int, and UUID if you have plenty of money for hardware and need distribution capabilities a UUID would enable.

twhitmore 5 years ago | |

In datamodelling, tables can often be categorized by lifetime. 'Business Relationships' eg. customers, suppliers, products have a fairly long lifetime; whereas 'Business Transactions' are created on a much higher frequency.

I'm generally fairly comfortable using int for business relationships, and bigint (long) for transaction data.

For performance, insertion speed often seems to be dominated by 'commit latency' to sync to the disk; rather than by record size. I would agree that record size affects table scan, but for many datamodels keying may often be a relatively small proportion compared to the size of text fields and other data.

I like to model keyspaces to work for 200 years, for the largest foreseeable market growth, times at least a factor of 10 for safety.

JshWright 5 years ago | |

Speaking from personal experience, just use bigint... If you aren't dealing with billions of rows, the size difference isn't that big a deal, and if you are dealing with billions of rows, the int -> bigint migration is definitely not "relatively easy".

One of the most memorable anecdotes of my professional career is a production environment going down because we hit maxint on an important (and busy) table. The dirty hack we used to get the site back up (hint: int is _signed_), and the weeks it took to plan, test, and execute the migration.

strangeattractr 5 years ago |

This is making me reconsider how I do IDs. I thought the performance of sequential IDs was significantly better. So my approach was to use a standard auto-increment primary ID and then obfuscate it by computing id * p mod m, where p and m are coprime and very large. Then I get back the original ID using the modular inverse. Should I just be using UUID?
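For anyone unfamiliar with the trick, here is a minimal Python sketch. The modulus 2**32 and the multiplier are illustrative picks, not values from the comment; any multiplier that is coprime with the modulus works.

```python
import math

M = 2**32              # modulus: IDs are assumed to fit in 32 bits
P = 2_654_435_761      # any odd multiplier is coprime with a power of two
P_INV = pow(P, -1, M)  # modular inverse (needs Python 3.8+)

assert math.gcd(P, M) == 1  # the inverse only exists when this holds


def obfuscate(row_id: int) -> int:
    """Turn an internal sequential ID into its public form."""
    return (row_id * P) % M


def reveal(public_id: int) -> int:
    """Recover the internal ID from the public form."""
    return (public_id * P_INV) % M
```

Python integers have arbitrary precision, so the product cannot overflow here. In a fixed-width column or language, the intermediate product id * p has to fit in the type. Anyone who learns P and M can invert the mapping, so this is obfuscation rather than protection.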

eloff 5 years ago | |

I would use uuid in this case. If p and m are too large you get overflow. If they are too small your keys are guessable. If it matters, use uuid and don't waste time and mental energy on it.

zzzeek 5 years ago |

> You are well advised to choose a primary key that is not only unique, but also never changes during the lifetime of a table row. This is because foreign key constraints typically reference primary keys, and changing a primary key that is referenced elsewhere causes trouble or unnecessary work.

In one sense I agree with the author that things are generally just easier when you use surrogate primary keys. However, they really should note here that the FOREIGN KEY constraint itself is not a problem at all, as you can just use ON UPDATE CASCADE.
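To see what the cascade does in practice, here is a small sketch. It uses Python's built-in sqlite3 module as a stand-in so that it runs without a server; PostgreSQL accepts the same ON UPDATE CASCADE clause. The customer and invoice tables are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite leaves foreign key enforcement off unless asked (PostgreSQL does not).
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    CREATE TABLE customer (code TEXT PRIMARY KEY);
    CREATE TABLE invoice (
        id INTEGER PRIMARY KEY,
        customer_code TEXT NOT NULL
            REFERENCES customer (code) ON UPDATE CASCADE
    );
    INSERT INTO customer VALUES ('ACME');
    INSERT INTO invoice (customer_code) VALUES ('ACME'), ('ACME');
    """
)

# Changing the natural key rewrites every referencing row automatically.
conn.execute("UPDATE customer SET code = 'ACME-EU' WHERE code = 'ACME'")
codes = [row[0] for row in conn.execute("SELECT customer_code FROM invoice")]
print(codes)
```

Both invoice rows end up pointing at the new code. The cost is that a single key change touches every referencing row.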

dragonwriter 5 years ago | |

ON UPDATE CASCADE avoids much developer impact, but it isn’t free and has (potentially quite large) performance impacts.

foobarbazetc 5 years ago |

Always, always use a bigserial.

(Actually, all serials are bigserials, but the “base type” they add to the table differs, and it’ll always come back to bite you later. Ask me how I know…)

ainar-g 5 years ago |

I don't think I've ever seen this mentioned anywhere, but if you need a unique ID for an entity with not a lot of records planned (≤10,000,000), why not use a random int64 with a simple for loop on the application side to catch the occasional collisions? Are there any downsides besides making the application side a tiny bit more complex?
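A sketch of what that application-side loop could look like, using Python's sqlite3 module so it runs standalone. The entity table, the helper name, and the injectable rng parameter are all hypothetical. It draws 63 random bits so the value stays within a signed 64-bit column.

```python
import secrets
import sqlite3


def insert_with_random_id(conn, name, rng=lambda: secrets.randbits(63), max_tries=10):
    """Insert a row under a random 63-bit ID, retrying on collision."""
    for _ in range(max_tries):
        candidate = rng()
        try:
            with conn:  # commits on success, rolls back on error
                conn.execute(
                    "INSERT INTO entity (id, name) VALUES (?, ?)", (candidate, name)
                )
            return candidate
        except sqlite3.IntegrityError:
            continue  # ID already taken; draw another one
    raise RuntimeError("no free ID found; is the key space nearly full?")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entity (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
```

With 10 million rows in a 63-bit space, a collision on any single insert is extremely unlikely, so the retry branch almost never runs. One downside raised elsewhere in this thread still applies: random keys lose the insertion-order locality that sequential keys give an index.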

staticassertion 5 years ago |

Another benefit of using sequential integers is that you can leverage a number of optimizations.

For one thing you can represent a range of data more efficiently by just storing offsets. This means that instead of having to store a 'start' and 'end' at 8 + 8 bytes you can store something like 'start' and 'offset', where offset could be based on your window size, like 2 bytes.

You can leverage those offsets in metadata too. For example, I could cache something like 'rows (N..N+Offset) all have field X set to null' or some such thing. Now I can query my cache for a given value and avoid the db lookup, but I can also store way more data in the cache since I can encode ranges. Obviously which things you cache are going to be data dependent.

Sequential ints make great external indexes for this reason. Maybe I tombstone rows in big chunks to some other data store - again, I can just encode that as a range, and then given a lookup within that range I know to look in the other datastore. With a uuid approach I'd have to tombstone each row individually.

These aren't universal optimizations but if you can leverage them they can be significant.
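The range encoding described above might be sketched like this in Python. It assumes the input IDs are sorted and unique, and the function names are invented for the example.

```python
import bisect


def to_ranges(sorted_ids):
    """Collapse sorted, unique integer IDs into (start, length) runs."""
    ranges = []
    for value in sorted_ids:
        if ranges and value == ranges[-1][0] + ranges[-1][1]:
            start, length = ranges[-1]
            ranges[-1] = (start, length + 1)  # extend the current run
        else:
            ranges.append((value, 1))  # gap found: start a new run
    return ranges


def in_ranges(ranges, value):
    """Binary-search the runs to test whether an ID is covered."""
    index = bisect.bisect_right(ranges, (value, float("inf"))) - 1
    if index < 0:
        return False
    start, length = ranges[index]
    return start <= value < start + length
```

For example, the IDs 1, 2, 3, 7, 8, 20 collapse to three runs. Skipped sequence values simply start a new run, so gaps cost extra entries but do not make lookups wrong.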

lgas 5 years ago | |

Doesn't the offset approach run into trouble when sequence values get skipped due to rollbacks?

staticassertion 5 years ago | | |

It's going to be an optimization that assumes some constraints on how you interact with your database.

rini17 5 years ago |

I'm a fan of generating the primary key by copying the natural key (if it's a single integer) or a hash of the natural key. This is done only once, when the row is created, and the value is never updated, even if the natural key changes. In that case you are left with a valuable bit of information: that something happened to the natural key.

rossmohax 5 years ago |

Another alternative is ULID, which can be stored as UUID on a Postgres side, but is more b-tree friendly.

pm90 5 years ago | |

Are there articles/examples on how to use ULIDs in postgres?

rossmohax 5 years ago | | |

ULIDs and UUIDs are the same size (128 bits), so passing a ULID as a UUID to PostgreSQL works seamlessly.
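As a rough illustration, the sketch below builds a ULID-style value in Python (48-bit millisecond timestamp followed by 80 random bits, per the ULID layout) and wraps it in a uuid.UUID, whose string form can be sent to a uuid column. The function name is made up, and the sketch skips the ULID spec's rule for keeping values monotonic within the same millisecond.

```python
import os
import time
import uuid


def ulid_as_uuid(timestamp_ms=None):
    """Build a ULID-style value (48-bit time + 80 random bits) as a UUID."""
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    if not 0 <= timestamp_ms < (1 << 48):
        raise ValueError("timestamp must fit in 48 bits")
    randomness = int.from_bytes(os.urandom(10), "big")  # 10 bytes = 80 bits
    return uuid.UUID(int=(timestamp_ms << 80) | randomness)
```

Values created in later milliseconds compare greater, which is what makes them friendlier to a b-tree than fully random UUIDs. The version and variant bits are not set here, so the result is not a standards-conformant UUID, although PostgreSQL's uuid type stores any 128-bit value.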

hardwaresofton 5 years ago |

Yeah, just use a UUID unless the bits to store the UUID really are your driving limitation (they're not). Having a UUID that is non-linear is almost always the most straightforward option for identifying things, at the cost of human readability (though you can get some of that back with prefixes and some other schemes). I'm not going to rehash the benefits that people have brought up for UUIDs, but they're in this thread. At this point what I'm concerned about is just... what is the best kind of UUID to use. I've recently started using mostly v1 because the time relationship is important to me (despite the unfortunate order issues) and v6[0] isn't widely adopted yet. Here's a list of other approaches out there worth looking at:

- isntauuid[1] (mentioned in this thread, I've given it a name here)

- timeflake[2]

- HiLo[3][4]

- ulid[5]

- ksuid[6] (made popular by segment.io)

- v1-v6 UUIDs (the ones we all know and some love)

- sequential interval based UUIDs in Postgres[7]

Just add a UUID -- this almost surely isn't going to be what bricks your architecture unless you have some crazy high write use case like time series or IoT or something maybe.

[0]: http://gh.peabody.io/uuidv6/

[1]: https://instagram-engineering.com/sharding-ids-at-instagram-...

[2]: https://github.com/anthonynsimon/timeflake

[3]: https://en.wikipedia.org/wiki/Hi/Lo_algorithm

[4]: https://www.npgsql.org/efcore/modeling/generated-properties....

[5]: https://github.com/edoceo/pg-ulid

[6]: https://github.com/segmentio/ksuid

[7]: https://www.2ndquadrant.com/en/blog/sequential-uuid-generato...

  CREATE TABLE u(i INT8 UNIQUE);

  -- insert random unique value in the range 0..n
  -- into table u, retrying if it's already present
  --
  -- NOTE: this will not terminate if 0..n are all
  -- present
  CREATE OR REPLACE FUNCTION insert_uniq(n INT8) RETURNS VOID
  LANGUAGE plpgsql AS $$
  DECLARE
    x INT8;
  BEGIN
    <<retry_loop>>
    LOOP
      BEGIN
        x := (random() * n)::int8;
        INSERT INTO u VALUES(x);
        RAISE NOTICE 'inserted unique value %', x;
        EXIT retry_loop;
      EXCEPTION WHEN unique_violation THEN
        RAISE NOTICE 'collision with value %; retrying', x;
      END;
    END LOOP;
  END;
  $$;