Reddit's photo albums broke due to Integer overflow of Signed Int32

Reddit's photo albums broke due to Integer overflow of Signed Int32 (old.reddit.com)

328 points by Rebles 3 years ago | 139 comments

madrox 3 years ago |

A long time ago we discovered Twitter used a round robin of three servers for assigning IDs to tweets. We inferred the round robin was done by doing mod 3 of a signed int32, and because that space doesn't divide neatly by two it meant one of the three servers saw less load than the others and we could map ID assignment volume according to how often it overflowed and hence estimate total tweet volume for a given period.

Some of the details escape me (this was a decade ago) but it was a fun combination of statistical inference and CS knowledge that I don't get to use often. Whenever integer overflow comes up in a systems engineering context I get a little tickled.

liquidgecka 3 years ago | |

I am pretty sure that Snowflake didn't use a mod of a signed int32. It used a service discovery pool as part of Finagle (and prior to that DNS, iirc). The server used a very simple method internally to convert time into an integer (that was 52 bits because of JavaScript). In fact it was completely open source: https://blog.twitter.com/engineering/en_us/a/2010/announcing...

The integer generation was pretty simple: there was a fixed id for each server, and unless I am mistaken we had 5 servers per datacenter. Each id was basically <time><offset><id>, where time was a millisecond timer, offset was the number of ids generated in that same millisecond by the same server, and id was the machine's unique identifier. When we first talked about this process I thought that offset was going to roll, with every id incrementing it by one. This was changed to resetting it every millisecond specifically so that it would obscure tweet volumes.

At the time I remember reading a LOT of articles estimating tweet volume and most of them were way, way off. I don't know that we ever really put effort into correcting them though. =)

* - Does not account for changes in the system post 2012.
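
To make the <time><offset><id> layout described above concrete, here is a minimal sketch of packing and unpacking such an id. The field order follows the comment; the bit widths (OFFSET_BITS, SERVER_BITS) and the function names are assumptions chosen for illustration, not Twitter's actual values.

```python
# Illustrative field widths only; the thread gives the field order, not the sizes.
OFFSET_BITS = 12   # assumption: up to 4096 ids per millisecond per server
SERVER_BITS = 10   # assumption: up to 1024 servers


def make_id(time_ms: int, offset: int, server_id: int) -> int:
    """Pack <time><offset><id> into one integer, most significant field first."""
    if not 0 <= offset < (1 << OFFSET_BITS):
        raise ValueError("offset does not fit in OFFSET_BITS")
    if not 0 <= server_id < (1 << SERVER_BITS):
        raise ValueError("server_id does not fit in SERVER_BITS")
    return (time_ms << (OFFSET_BITS + SERVER_BITS)) | (offset << SERVER_BITS) | server_id


def split_id(value: int) -> tuple:
    """Recover (time_ms, offset, server_id) from a packed id."""
    server_id = value & ((1 << SERVER_BITS) - 1)
    offset = (value >> SERVER_BITS) & ((1 << OFFSET_BITS) - 1)
    time_ms = value >> (OFFSET_BITS + SERVER_BITS)
    return time_ms, offset, server_id
```

Because time sits in the high bits, ids from a later millisecond always compare greater than ids from an earlier one, whichever server issued them. Resetting offset every millisecond means the low bits only reveal how many ids one server issued within that single millisecond.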

madrox 3 years ago | | |

Awesome to hear from someone on the other side of the API with knowledge of this. This is ringing bells for me. Yeah, the id parameter was how we knew how many servers there were, and we saw more assignment in some servers than others that neatly mapped to an int32 max failing to divide by the number of IDs we saw. I thought I recalled Twitter confirming that was how round robin happened but I could be totally misremembering. We never got a contact to talk to Twitter about it. FWIW, we did eventually see this fixed. I imagine it was pretty easy to spot one server seeing less load than others.

The offset was actually how we calculated volume, because millisecond collisions become a variant of the German tank problem[1]. A few times when y'all made tweet volumes public it mapped pretty closely with our estimates.

This was around 2011, so your knowledge should be relevant.

1: https://en.wikipedia.org/wiki/German_tank_problem
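
For readers unfamiliar with the German tank problem, a minimal sketch of the textbook estimator follows. It assumes serial numbers run from 1 to N and are sampled without replacement; how the offsets were actually fed into the estimate is not described in the thread, and the function name is hypothetical.

```python
def estimate_total(observed_serials: list) -> float:
    """Estimate N from distinct serials drawn from 1..N without replacement.

    Uses the standard frequentist estimator m + m/k - 1, where m is the
    largest serial observed and k is the number of observations.
    """
    k = len(observed_serials)
    if k == 0:
        raise ValueError("need at least one observation")
    m = max(observed_serials)
    return m + m / k - 1
```

With the sample [19, 40, 42, 60] the largest serial is 60 and there are 4 observations, so the estimate is 60 + 15 - 1 = 74.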

lathiat 3 years ago | | |

Someone previously created a tweet linking to itself by predicting the likely ID range: https://oisinmoran.com/quinetweet

manigandham 3 years ago | | |

A simpler scheme I've used for adtech (billions of requests per day) is to simply reserve a chunk of numbers for each server from a central source. Easy to implement, very fast since each node can just increment in process, and using a 64-bit integer is effectively infinite.
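
A minimal sketch of the chunk-reservation scheme described above, with an in-process object standing in for the central source. CentralAllocator, NodeIdGenerator and the chunk size are hypothetical names and values; in practice the central source would be a database row or a small service.

```python
import threading


class CentralAllocator:
    """Stand-in for the central source; hands out non-overlapping ranges."""

    def __init__(self, chunk_size: int = 1000) -> None:
        self._chunk_size = chunk_size
        self._next_start = 0
        self._lock = threading.Lock()

    def reserve(self) -> range:
        # Only this step needs coordination between servers.
        with self._lock:
            start = self._next_start
            self._next_start += self._chunk_size
        return range(start, start + self._chunk_size)


class NodeIdGenerator:
    """Per-server generator: increments locally, refills when the chunk is used up.

    Not thread-safe on its own; one instance per worker is assumed.
    """

    def __init__(self, allocator: CentralAllocator) -> None:
        self._allocator = allocator
        self._chunk = iter(())

    def next_id(self) -> int:
        try:
            return next(self._chunk)
        except StopIteration:
            self._chunk = iter(self._allocator.reserve())
            return next(self._chunk)
```

The trade-off is that ids are unique but not globally ordered by time, and a node that restarts leaves the unused remainder of its chunk as a gap.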

duskwuff 3 years ago | | |

Twitter didn't always use Snowflake -- that was introduced in November 2010. There was another, much simpler algorithm used before that which generated much smaller IDs (e.g. under 3e10).

pwdisswordfish0 3 years ago | | |

> 52 bits because of javascript

But IEEE 754 doubles have a significand that supports a 53-bit range. What am I missing?
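
A quick check of the 53-bit claim, as a sketch: a double stores 52 significand bits explicitly plus one implicit leading bit, so integers stay exact up to 2^53. Python floats are IEEE 754 doubles on common platforms, so the same limit can be shown there; survives_double is a hypothetical helper name.

```python
MAX_SAFE = 2**53 - 1  # same value as JavaScript's Number.MAX_SAFE_INTEGER


def survives_double(n: int) -> bool:
    """True if n converts to a double and back without changing."""
    return int(float(n)) == n
```

survives_double(2**53) is still true, while 2**53 + 1 is the first positive integer that fails: it rounds to 2**53.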

vecter 3 years ago | |

Wouldn't that unevenness only affect 2^31 - 2 and 2^31 - 1, so a negligible fraction of the integers? Was that tiny discrepancy enough to make your calculations?

In other words, what do you mean that it was done by doing mod 3 of a signed int32? If it was a monotonically increasing or random int32, I don't see how that unevenness would manifest in a meaningful way.

madrox 3 years ago | | |

In another subthread, we realized my memory was wrong and we were measuring millisecond collisions. The serving ID imbalance was a side-effect. Also, it might've been an int16 I was thinking of but turns out the whole thing was shadows on cave walls.

mashygpig 3 years ago | |

Maybe a dumb question, but I don’t follow what you mean by “the space doesn’t divide neatly by two” and also how that connects with overflowing ints. Asking because I’m genuinely curious and would like to know more about this. Sounds really neat!

_3u10 3 years ago | |

If they were incrementing and modding wouldn’t that server see an extra 1/2 billionth more traffic?

I don’t get how mod 3 affects anything if you’re just incrementing…

manigandham 3 years ago | |

If it’s round robin then it should be an even load, how does the modulo change that exactly?

Also what number are they using to modulo and where is that happening? Because at that point don’t they already have an incrementing ID before generating another one?

andreareina 3 years ago | | |

Take a 3-bit counter:

    0->A
    1->B
    2->C
    3->A
    4->B
    5->C
    6->A
    7->B

A and B get hit three times while C only twice, so C will see about 66% of the load that A and B see.

EDITED s/once/twice/ thanks CyberDildonics
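
The same count can be reproduced in a few lines, and extended to wider counters without enumerating them. This is only a sketch of the arithmetic; the function names are made up for illustration.

```python
from collections import Counter


def assignments(bits: int, servers: int = 3) -> Counter:
    """Count how many counter values in [0, 2**bits) map to each server via modulo."""
    return Counter(value % servers for value in range(2**bits))


def imbalance(bits: int, servers: int = 3) -> tuple:
    """Return (most, fewest) hits per server over one full wrap of the counter."""
    full, extra = divmod(2**bits, servers)
    return (full + 1 if extra else full, full)
```

assignments(3) gives 3, 3 and 2 hits for servers 0, 1 and 2, matching the table. For a 31-bit range, imbalance(31) gives 715827883 versus 715827882, so the skew per wrap is a single id, which supports the point made elsewhere in the thread that the effect is negligible at that width.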

pedrovhb 3 years ago | |

That's really neat, I'd love to hear more about it. Was this something you were actively trying to find out, or was it poking around until something caught your eye?

madrox 3 years ago | | |

A little of both. I was working in social media analytics, and we were collecting everything we could to understand how to communicate the value of this new medium to businesses who could use twitter for marketing. This was still in an era where privacy wasn't at the front of anyone's minds, so there were zero retention policies. Hard to believe that was only 10 years ago.

Eventually, we learned to treat Twitter as a lead generation tool for off-platform activity and apply old school funnel mechanics to it. The next problem became how to build a follower count. Sadly, that problem is what I think led to extremism on the platform. Hence: https://madrox.substack.com/p/yet-another-quitting-twitter

kypro 3 years ago | | |

Not op, but at an ecommerce company I worked for we did similar things to track how well our competitors were doing relative to us so could be something like that.

Also collecting data like this can be useful if you want to beat markets.

ipqk 3 years ago | |

Reminds me of the self-quoting tweet: https://news.ycombinator.com/item?id=25244872

boosteri 3 years ago |

Nostalgic flashback to Premier Manager games, where players' stats decreased as they aged. When they went /below 0/ they flipped around to 127. So a good strategy was to scout out really bad players from lower leagues about to hit age 30+, offer them very long contracts to prevent them retiring, and give most of their stats time to flip around, turning them into superstars.

johnfarrelldev 3 years ago | |

I always found the funniest occurrence of this was in the Civ games, though whether it was originally a bug is disputed.

https://en.wikipedia.org/wiki/Nuclear_Gandhi

vikingerik 3 years ago | | |

The bug never existed at all in Civ 1. It was an urban legend all along. Similar behavior was intentional in Civ 5 as a joke, which convinced everyone that it really did happen in Civ 1 when it never did.

Rebles 3 years ago |

Two days ago, Reddit ids finally incremented past 2,147,483,647, the maximum value of a signed int32. It seems one of Reddit's subsystems, the one that serves its photo albums, broke due to the integer overflow.
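
A small sketch of what happens one past that limit when a value is forced into 32 signed bits. Which layer of Reddit's stack actually failed, and whether it wrapped or raised an error, is not known from the thread; as_int32 is a hypothetical helper that only shows the two's-complement wraparound.

```python
import ctypes

INT32_MAX = 2**31 - 1  # 2,147,483,647


def as_int32(value: int) -> int:
    """Interpret the low 32 bits of value as a signed two's-complement integer."""
    return ctypes.c_int32(value).value
```

as_int32(INT32_MAX) returns the value unchanged, while as_int32(INT32_MAX + 1) returns -2147483648. Many databases reject the out-of-range value instead of wrapping, which also breaks inserts.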

cowsup 3 years ago | |

Strange thing is that photo albums are relatively new. Imgur was the go-to host for Reddit, and then they made their own uploader a looong time later. The "albums" functionality only came out in July of 2020, according to a Google search.

Seems this was less likely a "someone else will deal with it" problem, and more of a development / QA testing problem.

Gigachad 3 years ago | | |

For some reason most stuff still defaults to i32 and a lot of people use them for new code. At this point I'd not be against linters warning against using 32 bit ints unless you have a good reason.

curioussavage 3 years ago | | |

I’m pretty sure the table in question stores image metadata for all user uploaded images. As well as images scraped from posted links which goes back way before images in posts

jimmytucson 3 years ago | | |

A new feature but it wasn’t built on a new codebase. Reddit is a monolith and a lot of things users think of as different “entities” live in the same set of tables.

fdgsdfogijq 3 years ago | |

Someone probably joked they would never reach that scale when they wrote that code

kristopolous 3 years ago | | |

Thinking "if that ever gets anywhere close to a problem we'll have vast resources and plenty of time to fix it" and then, I'm guessing, that person left a few months later and nobody owned that part of the code because it worked.

Then 10 or so years went by...

Whenever I write code like that which may break in say, 5 years, I'll sign it in the comments and put my personal email and phone number inviting future people to call me and I'll fix it for free (cause I take responsibilities for my code pretty seriously). Nobody has ever taken me up on it though...

sph 3 years ago | | |

So you migrate to int64 and one day someone will wonder why the hell did we ever think no one would reach 2^64 rows in a database table. Or that 2^128 IP addresses would be enough for everyone.

_gabe_ 3 years ago |

I'm just curious, I know it's a long running joke about how we're so stupid to think that we would never run out of unique digits with 2^32 possible values, but is this also the case with 64 bit values? Every new bit doubles the amount of information, so if 32 bits lasted reddit 10 years, presumably 33 bits would last them 20 years, 34 would last 40 and so on. Eventually, 64 bits would last them 10×2^32 years, which seems like a safe bet.

So am I being naive when I use 64 bit values for unique IDs? Or is it actually plausible that 64 bits is plenty of information for centuries to come?

Edit: Also, technically reddit was using signed int32s. So they really only had 2^31 unique values. If they used unsigned int32s, then that would have bought them a lot of time.
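
The arithmetic above can be checked directly. This sketch assumes a constant id rate equal to filling the 2^31 non-negative values of a signed int32 in 10 years; the 10-year figure comes from the comment, not from measurement, and real growth is rarely constant.

```python
YEARS_TO_FILL_INT32 = 10  # figure taken from the comment, not measured


def years_to_exhaust(bits: int, signed: bool = True) -> int:
    """Years to use up the non-negative range at the assumed constant rate."""
    usable_bits = bits - 1 if signed else bits
    # rate = 2**31 ids per YEARS_TO_FILL_INT32 years; integer math avoids rounding
    return 2**usable_bits * YEARS_TO_FILL_INT32 // 2**31
```

Under that assumption an unsigned int32 lasts 20 years in total, and a signed int64 lasts 10 × 2^32 years, roughly 43 billion.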

davidjfelix 3 years ago |

A classic case of "ids aren't numbers even if you choose to make them numeric"

knodi123 3 years ago | |

ids being sortable has a lot of advantages over random guids.

marcosdumay 3 years ago | | |

Them being dense is advantageous too. Numbers are a very convenient format for encoding IDs, but that doesn't mean that IDs are numbers.

davidjfelix 3 years ago | | |

This is really a false dichotomy; you don't have to use guid/uuid. I'm saying even if you use sortable auto increment numbers, stop storing them like numbers.

iamdual 3 years ago | | |

There are timestamps for sorting.

NaturalPhallacy 3 years ago | | |

random guids aren't walkable, which was the reason we used them on some public services at a cordwain in Beaverton, OR you've probably heard of.

scrame 3 years ago |

Ha! Slashdot had a similar problem in the early 2000s because they did a difficult migration for user/post ids, but left the indexes at 32(?).

So, everything worked great until it didn't, and they spent a lot of time future-proofing it.

dvh 3 years ago |

-2,147,483,648 photos should be enough for anybody

Thaxll 3 years ago |

The famous AUTO_INCREMENT that you thought you would never reach...

btown 3 years ago | |

Fun fact: if you do a lot of INSERT... ON CONFLICT calls in Postgres from automated systems that are updating much more often than you insert, your autoincrement primary key can increment far far faster than your data volume (since it doesn't de-increment on a conflict) and overflow an int, grinding things to a halt. One of the more maddening outages I've had to deal with!
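
A toy model of the effect described above, not Postgres itself: it only mimics the behaviour that the candidate id is drawn from the sequence before the conflict is detected, so every upsert attempt consumes one value. ToyTable and its fields are hypothetical.

```python
class ToyTable:
    """Toy model only: the id is drawn from the sequence before the conflict check."""

    def __init__(self) -> None:
        self.sequence = 0
        self.rows = {}

    def upsert(self, key: str, value: str) -> None:
        self.sequence += 1           # one sequence value is consumed on every attempt
        candidate_id = self.sequence
        if key in self.rows:         # conflict: update in place, candidate_id is discarded
            self.rows[key]["value"] = value
        else:
            self.rows[key] = {"id": candidate_id, "value": value}
```

After 1000 upserts of the same key and one insert of a new key, the table holds two rows but the new row gets id 1001, which is how a small table can still exhaust a 32-bit key.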

hu3 3 years ago | | |

Similar for MySQL.

If you open a transaction, INSERT with AUTO_INCREMENT, then rollback the transaction, no data is saved, except the auto generated id is used and the next INSERT uses id+1.

BooneJS 3 years ago |

Happened a few years ago to YouTube[0]. I don’t know why counters that start at zero and only increment are stored as signed integers.

0: https://arstechnica.com/information-technology/2014/12/gangn...

akoster 3 years ago |

Reminds me of a similar Chess.com iPad app issue from a few years back https://news.ycombinator.com/item?id=14539770

_iyhy 3 years ago |

It's 2022 and we are still using int32 for anything.

darylteo 3 years ago |

If I'm understanding the shitty change log exactly, was the solution to add an extra bit?

maverwa 3 years ago | |

I guess that's a joke. Adding a single extra bit would usually be more complex than going to 64bit or going to 32bit unsigned.

Well, I say that, but actually, "adding an extra bit" is basically what going from signed to unsigned would do. So maybe they just added an extra (32nd) bit?

dmtroyer 3 years ago |

unique identifiers are so passé.

mhh__ 3 years ago |

For a supposed tech company reddit really are bad.