Why should a Trace-ID be 128 bits? (A Surprisingly Long Answer)(newsletter.signoz.io) |
UUIDs are already generated randomly for exactly the same reason. Rather than inventing something new, they should have just used a UUID.
UUID v7 is a 48 bit timestamp, 74 random bits and 6 bits fixed. Sure, this is a little more complicated, but it's often worth it for many applications because it can be sorted, so keys will be approximately monotonically increasing.
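To make that layout concrete, here is a minimal Python sketch that packs a 48-bit millisecond timestamp, the 6 fixed bits, and 74 random bits into a 128-bit value. `uuid7_sketch` is a hypothetical helper written for illustration, not a standard-library function, and it ignores the optional counter and sub-millisecond uses of the 12 bits after the version field.

```python
import os
import time
import uuid


def uuid7_sketch(unix_ms=None):
    """Build a UUID with the version-7 bit layout (illustrative helper only)."""
    if unix_ms is None:
        unix_ms = time.time_ns() // 1_000_000
    # Start from 80 random bits; 6 of them are overwritten by the fixed fields.
    rand = int.from_bytes(os.urandom(10), "big")
    value = ((unix_ms & ((1 << 48) - 1)) << 80) | rand
    # Version field: 4 bits set to 0b0111.
    value = (value & ~(0xF << 76)) | (0x7 << 76)
    # Variant field: 2 bits set to 0b10.
    value = (value & ~(0x3 << 62)) | (0x2 << 62)
    return uuid.UUID(int=value)


print(uuid7_sketch())
```

Because the timestamp occupies the most significant bits, IDs generated in later milliseconds compare greater than earlier ones, which is where the approximate monotonicity comes from.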
I bring this up because this article starts with "I asked Claude", but it doesn't explore the length of time over which you're generating IDs at all, which is an important aspect to consider when selecting size.
If you don't drop any spans from a trace, you can completely disambiguate a collision since the trace will have two distinct root spans. If you are missing spans, you might have a break in the parent-child links.
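A rough sketch of that disambiguation in Python, assuming each span is a dict with hypothetical `span_id` and `parent_id` keys (`parent_id` is `None` for a root span). It groups spans that share a trace ID by the topmost ancestor that can be reached, so two colliding traces fall into two groups, and orphans from a dropped span end up grouped under the missing parent's ID.

```python
from collections import defaultdict


def split_by_root(spans):
    """Group span IDs by the topmost ancestor reachable via parent links."""
    parent = {s["span_id"]: s["parent_id"] for s in spans}

    def root_of(span_id):
        seen = set()  # guards against malformed, cyclic parent links
        while parent.get(span_id) is not None and span_id not in seen:
            seen.add(span_id)
            span_id = parent[span_id]
        return span_id

    groups = defaultdict(list)
    for s in spans:
        groups[root_of(s["span_id"])].append(s["span_id"])
    return dict(groups)


spans = [
    {"span_id": "a", "parent_id": None},  # root of the first trace
    {"span_id": "b", "parent_id": "a"},
    {"span_id": "c", "parent_id": None},  # root of the colliding trace
    {"span_id": "d", "parent_id": "c"},
    {"span_id": "e", "parent_id": "x"},   # parent "x" was dropped
]
print(split_by_root(spans))
```

More than one group with a real root indicates a collision; a group keyed by an ID that never appears as a span indicates a broken parent-child link.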
Even with infinite retention, your analysis will bucket by time somehow, so a collision might have no effect if the collision doesn't happen at a proximate time. If you are manually looking at traces, it will be very obvious there is a collision unless they happen at the same time.
Also, the birthday paradox only expresses the probability that there is a collision somewhere, but if you are filtering or looking at single spans, then the probability that you actually see a collision is greatly reduced.
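The gap between the two probabilities can be sketched numerically. This is an illustration assuming uniformly random 64-bit IDs; the function names and the figure of one billion traces are made up for the example.

```python
import math


def p_any_collision(k, bits):
    """Probability that at least two of k random IDs collide (birthday bound)."""
    n = 2 ** bits
    return -math.expm1(-k * (k - 1) / (2 * n))


def p_specific_collides(k, bits):
    """Probability that one particular ID matches any of the other k - 1."""
    n = 2 ** bits
    return -math.expm1((k - 1) * math.log1p(-1 / n))


k = 10**9  # hypothetical: one billion traces with 64-bit IDs
print(p_any_collision(k, 64))      # chance of a collision somewhere in the set
print(p_specific_collides(k, 64))  # chance the trace you are looking at is affected
```

The first number grows with k squared, the second only with k, so the trace you happen to inspect is far less likely to be affected than the data set as a whole.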
I think for basically all systems, an additional 64-bits has insignificant additional cost, so you may as well prevent collisions, but I think it could be a reasonable tradeoff if it mattered.
The future is going to be filled with "best practices" trendslop decision-making.
In my opinion, UUID v7 is useful because, per millisecond, you still have 74 bits to split between user-defined data (up to 12 bits) and randomness (minimum 62 bits). If you keep 64 bits of randomness, you can read the numbers straight from the article: 1 million UUIDs per millisecond with less than a one in a million chance of collision, and you still have 10 bits left for additional data, such as which machine generated it.
If you stick with just time and keep the full 74 bits of randomness, you can generate a million (10^6) UUIDs per millisecond with roughly a one in 38 billion chance of collision (2.6 x 10^-11) using UUID v7.
I think the fact that the formula is k^2/2N shows that having a time component makes better use of the bits than a purely randomised space. In this example, we have a lower chance of collision among a million (10^6) UUIDs generated in a given millisecond than among a quadrillion (10^15) UUIDs generated across all time.
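The k^2/2N figures can be checked with a few lines of Python. This is a minimal sketch: `birthday_estimate` is a hypothetical helper, the approximation only holds while the result is well below 1, and the 128-bit space for the all-time case is an assumption about what the comparison refers to. Note that 2.6 x 10^-11 is what the formula gives for 10^6 IDs in a 74-bit space; for 10^12 IDs in the same space the estimate is above 1.

```python
def birthday_estimate(k, bits):
    """Approximate collision probability k^2 / (2N) for k IDs in N = 2**bits."""
    return k * k / (2 * 2 ** bits)


# One million IDs in a single millisecond, 64 random bits.
print(birthday_estimate(10**6, 64))
# One million IDs in a single millisecond, 74 random bits.
print(birthday_estimate(10**6, 74))
# One trillion IDs in a single millisecond, 74 random bits (estimate exceeds 1).
print(birthday_estimate(10**12, 74))
# One quadrillion IDs across all time, 128 random bits (assumed space).
print(birthday_estimate(10**15, 128))
```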
The old schemes generated numbers that weren't uniformly distributed across the 128-bit space, because they were intentionally biased in certain ways, such as by time [0] and MAC address [1]. Most of the IDs generated under those schemes therefore have many bits in common, so the UUIDs already in existence are not uniformly distributed across that 128-bit space [2]. If you just used the whole 128 bits for random data, without using the fixed bits to avoid conflicts with the previous schemes, then random IDs that happened to be valid in the previous schemes would be more likely to collide.
Of course, this only matters if global uniqueness matters to you. For a closed system with a guaranteed scope, sure, who cares? But given that the extra randomness doesn't add any useful value beyond a certain threshold, you might as well use a UUID, because you don't know what that identifier might end up being used for in the future, and you can use off-the-shelf systems to generate them.
[0] Ironically, future proofed time fields with many bits are more likely to be non-linearly distributed - e.g. the original version 0 UUID supported timestamps from 1582AD to 5236AD but was only used from 1987 for around a decade.
[1] With certain manufacturers of network cards massively more popular than others, their MAC address prefixes showed up significantly more frequently. There were also privacy concerns, where you could correlate UUIDs generated on a single machine, and sometimes infer which machines might be on the same network because they had similar MAC addresses and so the cards were probably all from the same manufacturing batch.
[2] Which is fine within the scope of UUIDs as they are still very likely to be globally unique, so it doesn't really matter if bits are wasted in this scheme.
Because the birthday paradox has k^2 as a term, this is less true than you might think. Having a time component reduces the chance of collisions over the long run, albeit at the cost of reducing the number of IDs that can be safely generated in any given quantum.
If you consider a 128-bit random number, you effectively have 64 bits of allocation space before you are likely to get a collision.
If you devote 48 bits to time, which provides millisecond accuracy for around 9000 years, you then have 80 bits of randomness, effectively giving 40 bits of allocation space per millisecond before you are likely to get a collision.
Instead of approx 2^64 allocations across all time before a collision, you instead have 2^40 (about 1 trillion) per millisecond. That sounds like a poor deal, until you realise that the factor between them is only 2^24: after 16777216 ms, or just under 280 minutes, the per-millisecond budgets add up to the same 2^64, and they keep growing from there.
So in reality, reducing the random space and increasing bits that are guaranteed unique is actually a great trade.
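The arithmetic behind that trade can be reproduced directly. This is a back-of-the-envelope sketch using the rule of thumb from the comment (a collision becomes likely after roughly 2^(bits/2) allocations); the function names are made up for the example.

```python
def likely_collision_budget(random_bits):
    """Rule of thumb: about 2**(bits / 2) IDs before a collision becomes likely."""
    return 2 ** (random_bits // 2)


def break_even_ms(total_bits=128, time_bits=48):
    """Milliseconds after which per-millisecond budgets match the all-random budget."""
    all_random = likely_collision_budget(total_bits)           # 2**64
    per_ms = likely_collision_budget(total_bits - time_bits)   # 2**40
    return all_random // per_ms


ms = break_even_ms()
print(ms)            # 16777216, i.e. 2**24
print(ms / 60_000)   # break-even point in minutes

# Range of a 48-bit millisecond timestamp, in years.
print(2**48 / (1000 * 60 * 60 * 24 * 365.25))
```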