Some of the details escape me (this was a decade ago) but it was a fun combination of statistical inference and CS knowledge that I don't get to use often. Whenever integer overflow comes up in a systems engineering context I get a little tickled.
The integer generation was pretty simple, there was a fixed id of each server, and unless I a mistaken we have 5 servers per datacenter. Each id was basically <time><offset><id> where time was a millisecond timer, offset was the number of ids generated in that same millisecond by the same server, and id was the machines unique identifier. When we first talked about this process I thought that offset was going to roll, every id would increment it by one. This was changed to resetting it every millisecond specifically so that it would obscure tweet volumes.
At the time I remember reading a LOT of articles estimating tweet volume and most of them were way, way off. I don't know that we ever really put effort into correcting them though. =)
* - Does not account for changes in the system post 2012.
The offset was actually how we calculated volume, because millisecond collisions become a variant of the german tank problem[1]. A few times when y'all made tweet volumes public it mapped pretty closely with our estimates.
This was around 2011, so your knowledge should be relevant.
But IEEE 754 doubles have a significand that supports a 53-bit range. What am I missing?
In other words, what do you mean that it was done by doing mod 3 of a signed int32? If it was a monotonically increasing or random int32, I don't see how that unevenness would manifest in a meaningful way.
I don’t get how mod 3 affects anything if you’re just incrementing…
Also what number are they using to modulo and where is that happening? Because at that point don’t they already have an incrementing ID before generating another one?
0->A
1->B
2->C
3->A
4->B
5->C
6->A
7->B
A and B get hit three times while C only twice, so it will see 66% utilization compared to A and BEDITED s/once/twice/ thanks CyberDildonics
Eventually, we learned to treat Twitter as a lead generation tool for off-platform activity and apply old school funnel mechanics to it. The next problem became how to build a follower count. Sadly, that problem is what I think led to extremism on the platform. Hence: https://madrox.substack.com/p/yet-another-quitting-twitter
Also collecting data like this can be useful if you want to beat markets.
Seems this was less likely a "someone else will deal with it" problem, and more of a development / QA testing problem.
Then 10 or so years went by...
Whenever I write code like that which may break in say, 5 years, I'll sign it in the comments and put my personal email and phone number inviting future people to call me and I'll fix it for free (cause I take responsibilities for my code pretty seriously). Nobody has ever taken me up on it though...
So am I being naive when I use 64 bit values for unique IDs? Or is it actually plausible that 64 bits is plenty of information for centuries to come?
Edit: Also, technically reddit was using signed int32s. So they really only had 2^16 unique digits. If they used unsigned int32s, then that would have bought them a lot of time.
So, everything worked great until it didn't, and they segment a lot of time future proofing it.
If you open a transaction, INSERT with AUTO_INCREMENT, then rollback the transaction, no data is saved, except the auto generated id is used and the next INSERT uses id+1.
0: https://arstechnica.com/information-technology/2014/12/gangn...
Well, I say that, but actually, "adding an extra bit" is basically what going from signed to unsigned would do. So maybe they just added an extra (32nd) bit?
Small correction: Signed int32s means you have 31 bits for the integer value (2^31 values), not 16 bits (2^16 values). There are 32 bits; 1 bit is used up for the sign, and 31 remain for the number.
I should also probably point out some ambiguity in your analysis. To be clear, under the faulty assumption of linear consumption, if Int32s get exhausted in 10 years, switching to UInt32s will last 20 years of which 10 are already spent. So, under the faulty assumption of linearity, switching to UInt32s buys you an extra 10 years.
64-bit identifiers will last quite a while. Another defensible choice would be to switch to 128-bit unguessable identifiers (such as assigning them via AES-256 in counter mode) in order to provide defence in depth. If something goes wrong with your authentication, but the attacker is likely to spend millennia hammering your REST API with invalid IDs before exploiting an actual asset, then the consequences aren't so bad.
Signed integers have 2^31 positive values (well, just about). It doesn't take 16 bits to store the negative sign :-)
That said, it would still take quite a while to get to 64 bits.
Infinite options at that point.
Isn't the answer to your question buried in your assumptions? For example, you seem to assume the rate of comsumption of unique IDs is static.
All I'm saying is the jump from 2^32 to 2^64 is astronomical. I don't see using 64 bit integers for uids in my hobby code as something to be concerned about. In production code for a company I would use something more significant, but even then I feel like 64 bits will last a very long time.
[0]: https://bernardmarr.com/how-much-data-do-we-create-every-day...
2^31 seconds is 68 years, a middling human lifetime.
2^63 seconds is 290 billion years.
It appears that the original developer thought that for 32 bits :)
I say that as someone who inherited a system that allowed for 2^32 session references stored in the db. Lets just say that sessions were created a lot faster than anticipated for the amount of traffic we had.
So one fine Sunday morning in a November a long time ago we ran out.
Well students uncovered the flaw and had their cards be at negative 1 dollar, which the vending machine read as a very large balance.
That said, I say that in my experience based upon locking selects in MySQL - when the row doesn't exist is a big problem, oftentimes leading to deadlocks. So insert... on duplicate key is a life saver. I don't have as much experience using postgres, so I'm not sure how much of a problem that is there, so it's possible the "on conflict" isn't as useful.
I feel like there's also implications for spam, like you want to target X demographic, all you have to do is try numbers with common name combinations from various cultures. Like in Nathan For You when he made a targeted billboard for a psychic that said "Maria Garcia, I had a dream about you" so all the Maria Garcias who saw the sign would become potential customers.
Yes 128 bits is only 4x times the bit length, but the address space is exponentially bigger.
Some predict if we do run out, it might take ~100 years.
You think we can use 340,282,366,920,938,463,463,374,607,431,768,211,456 addresses in 100 years? Thats ... ahem, ambitious!
And the subnets are in a hierarchy that helps with routing but limits the packing efficiency.
We could feasibly "run out" to the point that we can't keep doing things this way.
Been through it twice in my career…
No process that touches the real world gets into 2^64. No process at all gets into 2^128, at lwast not while we live around a single star.
This happens way too often
You're not wrong; but fyi it only happens in your originating universe. Over here we all experienced it.
A Mandela effect here would be if the explanations themselves that talked about overflow never existed.
Right?
I should have taken more time to read over both posts and respond more clearly though
Yeah, numbers get you fun stuff like overflows and wrap-arounds, etc. But sorting is faster (sorting "99" before "100" is more complex than 99 before 100) and space requirement is lower (6 bytes can store a unique ID for 281 trillion objects, but 6 characters only permits 1 million if you're storing them as strings).
Or is there some datatype that combines the space- and sort-efficiency of a numeric type, but doesn't bring the baggage of assuming they represent quantities?
You want everything to treat that as text? Sure. That text has been 6 characters long for ages, and it's about to hit 7. I personally expect to see more things break when that happens.
The problem of overflow isn't special to numbers.
Suppose you store your ids as a 64 bit number in the database, that's plenty of room, we shouldn't need to put an alarm for when we're reaching the end (although operationally you might want to still) - it's 9,223,372,036,854,775,807 when signed and 18,446,744,073,709,551,615 unsigned. That's a fine, sensible decision to make at the database level. Now suppose you have an API that interacts with these ids in the database and exposes them down to users in JSON. Javascript (and JSON by extension) numbers aren't integers they're double precision floating point numbers -- quick pop quiz, without looking, what's the maximum representable integer before you lose integer precision? Do all of your APIs work with auto increment numbers (which should still be safe) or are some of them vendor supplied, like UPCs or ISBNs? Will your developers remember this when they make a new API or will they look at an old api and think "it's a number there, I should match it"? Do some vendors front pad their numbers with 1s (this happened to us)? There are a lot of variables here and one big juicy thing developers want and know other developers want - consistency of rules. So I'm offering an easy, consistent rule:
Store it however your database lets you store it compactly, know your limits, set alarms, stop apologizing for database lack of features, but MOST IMPORTANTLY: insist to your developers that its type in code is a variable length utf-8 string, regardless of it's contents. Overflows happen, you need to accommodate them at business logic levels all the time, but stop letting developers trust numbers as being self-bound. They're not numbers in the URL, they're utf-8 letters in a string. They're not numbers in JSON, they're utf-8 letters in a string. Give them an a function that verifies the format of a string and stores it however you want in the database, but stop treating IDs like numbers even if they're numeric.
That's my opinion. Not that ints are a bad db row type.
Sorry, I thought I was following you, but I don't understand the last half of this quote. What does "trust as self-bound" mean?
You listed C twice
So initially we used 2 bind servers with authoritative zones serving using round robin. This worked fairly well as the load on each server was high enough to keep the round robin flowing and the connections from ruby would "stick" for 30 seconds anyway. We had a very large pool of front ends so its not like it mattered. Eventually we had to put a local bind cache on each server, but this introduced another super annoying bug. For some reason, the interaction between the authoritative server, and the caching server would end up causing the served record set to be ordered the same. Normally the authoritative server would serve A, B, C for the first query, then B, C, A for the second, etc. When the cache in the middle for some reason it would reset al queries to A, B, C after the refresh happened. So effectively every 30 seconds it would reset to A, B, C and start round robin again. Since the front ends would connect and stay sticky via connection reuse for a while this meant that A got like 50% of the load, B got 33%, and C got 17%. I am guessing you latched on to this by seeing that one of the servers was horribly underutilized in comparison and reproduced the math accidentally. =)
The fix for this was the massively disgusting, but super effective "make every single machine a DNS zone secondary". Rather than having a simple record cache we just fetched the whole zone if it changed. This actually made DNS quite a bit faster as every machine had a local copy of the zone.. Once that happened distribution returned to normal (20/20/20/20/20) fix for all of our internal services which used DNS for discovery, including snowflake.
I will have an especially hard time phrasing questions in a respectful manner for the Adobe devs, though ))
Early in my time there we had a major, unexpected load spike in the middle of the day. People where tweeting like crazy, breaking records for volume, etc. The system was groaning under strain but it mostly held. Turns out Michael Jackson had died. We sustained 452 (or maybe 457, it was roughly 450-something) tweets a second. this quickly became 1 "MJ" worth of tweets. We informally used this to define load the entire time I was there. Within a few months we got to a point where we were never BELOW 1 MJ, within a year I think we had peaked at double digit MJs and sustained several even in the lowest periods. Before I left we had hit 1 MJ in photo tweets, etc.
Around the time I left we did something like 300 MJ's per second, and I was only there 3 years.
[1]https://twitter.com/elonmusk/status/1595505413113323520?s=20...
The event also did not reach 20,000, it "almost" hit it.
Kind of obvious now, but I bet we could detect major world news just by sampling traffic size of chat sites.
If you know the code may break in 5 years, why don't you fix it now? Because your employer doesn't want to pay you to do that (probably for good reason, surely there's higher priorities).
Then why would you do it for free years later?
You can say, anticipate breaking changes that is calendared to happen as an eventually rollover or deprecation
Essentially things have to work one way now and there's solid reason to think it will no longer work that way later and there's no valid path to fix because exogenous conditions change.
This alternatively, can be done in bad faith to guarantee a future paycheck by intentionally placing timebombs in and then charging high rates to come back. There's incentives for abuse.
I'm making it clear and acting in good faith. I can give them 2 hours for free in 5 years, whatever. I'm a competent and responsible person and I get paid well by not being so cheap.
In my first year of university I was working for a very small web company. I got paid per website an amount that was enough for my student needs, nothing much. I wasn't a very good developer in my 1st year, so I often received emails about bugs in past projects. The bugs were my fault, so naturally I fixed them in my own free time.
It took me 2 years to see how badly I was getting owned by the company. If they wanted less bugs, they should have asked someone with experience and a much higher salary.
I like your accountability, but it's sufficient to document the issue in code, open a jira (or equivalent) and move on.
When I asked what in the hell a caviar problem was, he said it meant by the time we had to worry about those things, we'd all be so rich we'd be eating caviar.
Leaving your email and phone number is absurd and should not be promoted as good practice. No sane company would take you up on the offer anyway.
I broke that unspoken rule and got a really intractable problem that held us up for days finally fixed by making a short phone call early in my career and told myself if I was ever causing what we were going through I'd want to make it clear that I'm open to that rule being broken as well.
Apparently even suggesting such a rule should be questioned and to try human solutions to human problems is met with mockery and disdain. Alright, cool.
[1] https://blog.twitter.com/engineering/en_us/a/2013/new-tweets...
Edit to add: whether that's a big enough effect for the use case they're talking about, I don't know. This sort of thing is definitely significant in cryptographic code, though.