Leaving the Basement (community.hachyderm.io)
Post mortem on Mastodon outage with 30k users - https://news.ycombinator.com/item?id=33855250 - Dec 2022 (101 comments)
(Offtopic meta note: Alert users will note that that thread was posted later than this one. This is because the second-chance process (https://news.ycombinator.com/item?id=26998308) has a race condition: the events "story makes front page" and "moderator puts story in second-chance pool" sometimes diverge and can happen in any order.)
I've also seen NFS/ZFS on Linux have very... bizarre... issues with locking, latency, and poor handling of errors bubbled up from the block layer taking down clients or even the host.
All of these went away when we redeployed everything into a Solaris-based distro (still exporting ZFS shares to Linux clients via NFS). It does seem something specific to the interaction of these two components under load on a Linux kernel.
Unfortunately, it also only happens under real-world production load, and it was impossible to create a reliable test case with simulated stress tests or benchmarking :(
That said, I think OpenSolaris is technically superior in most ways to any of the BSDs.
Unfortunately we had some strange HBA issues with our disk shelves, which went away with an Illumos downstream. Since our use case for this was basically an isolated box that just supplied NFS shares, the limited ecosystem was much less of a concern than stability :)
But then I realize that they are only getting this many people because they are not driven by commercial interests: even with donations, I can bet they are not collecting enough to keep things afloat and they only keep going because they don't mind spending all this time, money and resources of their own on this project. They can treat it as a (relatively expensive) hobby, and they can keep it running as long as it satisfies them.
The problem is that I think that this is harmful in the long run. Yes, people now are finally seeing the issue with ad-funded social media. But if we want to have a healthy alternative, we need to understand TANSTAAFL, we need to accept that we need to give real money to the people working on this and to have the servers available 24/7 to store and distribute the hot takes and stupid memes that we so bizarrely crave every day.
I worry that if we don't change the mindset quickly, the whole Twitter drama would be a wasted opportunity and Mastodon (and the Fediverse in general) will go back to the status quo, where surveillance capitalism is the norm and truly open systems are just a geeky curiosity.
I wish I could fund a tech-equivalent of the "buy local and organic" campaign. I wish I had more people thinking "ok, I will pay $5/month to this guy and I will bring 10 people to this instance" because it is the ethical thing to do.
The latest post, "Yelping: Action Through Criticism", includes links to additional Hachyderm-related content near the top, then talks about how they handle an obnoxious online behavior they've experienced because of the recent popularity of Hachyderm.
How does that work?
I guess they use Nginx to reroute traffic which by default is targeting aws.amazon.com?
Disclaimer: I have this setup using Humanmade/S3-uploads for WordPress, but am not currently running Mastodon.
Maybe you think if the original system were _really_ beautiful and of high quality, it would have scaled with the adoption on its own, with no need for ugly patches... but in that case the original system would have had the capacity to do a lot more than what was originally required. It would have been overengineered, in other words, and it would have been more beautiful if it had met its original requirements more cheaply.
The notion that a sudden change in requirements that must be dealt with quickly results in an uglier system seems fairly straightforward to me, and certainly not offensive.
As a point of reference, look at what Stack Overflow is run on. As a caveat, SO is probably more read-heavy than Mastodon, but it also serves several orders of magnitude more volume (on a normal day in 2016 they would serve 209,420,973 HTTP requests[0]). They did this on 4 DB servers and 11 web servers. And in fact, it can work (and has worked) serving this volume of traffic on only a single server.
With this setup SO was not even close to maxing out their hardware (servers were under 10% load, approximately). SO also listed their server hardware[1] in 2016. I don't know enough about server hardware to assess the difference, but to my eye they look similar on the web tier with similar amounts of memory, similar disk, etc.
I'm not saying Hachyderm is doing anything wrong, but it makes me wonder if there's a fundamental problem with the design of Mastodon. And to be clear I understand that this particular issue was caused by a disk failure, but that they even had this hardware in place running Hachyderm is surprising to me.
[0] https://nickcraver.com/blog/2016/02/17/stack-overflow-the-ar...
[1] https://nickcraver.com/blog/2016/03/29/stack-overflow-the-ha...
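To put the Stack Overflow figure quoted above on a per-second scale, here is a rough back-of-the-envelope sketch. It assumes traffic is spread evenly over the day and over the 11 web servers, which real traffic is not, so it gives an average rather than a peak.

```python
# Rough average request rate from the daily figure quoted in the comment.
# Assumption: load is uniform across the day and across web servers.
SECONDS_PER_DAY = 24 * 60 * 60


def requests_per_second(requests_per_day: int) -> float:
    """Convert a daily request count into an average per-second rate."""
    return requests_per_day / SECONDS_PER_DAY


daily_requests = 209_420_973  # figure quoted in the comment
web_servers = 11              # web tier size quoted in the comment

avg_rps = requests_per_second(daily_requests)
avg_rps_per_server = avg_rps / web_servers

print(f"{avg_rps:.0f} req/s overall, {avg_rps_per_server:.0f} req/s per web server")
# prints "2424 req/s overall, 220 req/s per web server"
```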
> Our limiting factor in Hachyderm had almost nothing to do with the amount of users accessing the system as much as it did the amount of data we were federating. Our system would have flapped if we had 100 users, or if we had 1,000,000 users. We were nowhere close to hitting limits of DB size, storage size, or network capacity. We just had bad disks.
also hacker news: why would you try to run something in your basement? Just use the cloud!
Note that Dell R620 and R630 servers have been discontinued for a couple of years now, were probably bought used, and can probably be re-sold.
https://news.ycombinator.com/item?id=33855686
I would recommend people read that thread before responding with the same answers.
The fediverse needs to figure out a hubs-and-spokes or supernodes pattern so that service providers can scale up syncing, indexing etc.
I.e., my personal instance should be able to offload most of the message passing to a supernode intermediary that lots of other instances use for federation, so that my instance only needs one connection, and the supernodes only need to connect to each other and their local network.
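A minimal sketch of why a supernode layer would cut connection counts, under strong simplifying assumptions: in the worst case every instance federates directly with every other instance, while in the hub-and-spoke case each instance keeps exactly one link to a supernode and the supernodes are fully meshed. The function names and the numbers used are hypothetical, and real federation graphs are much sparser than a full mesh.

```python
# Compare link counts for a worst-case full mesh versus a supernode layout.
# Assumptions: undirected links, each instance attaches to exactly one
# supernode, and supernodes are fully connected to each other.


def full_mesh_links(instances: int) -> int:
    """Links needed if every instance talks directly to every other one."""
    return instances * (instances - 1) // 2


def hub_and_spoke_links(instances: int, supernodes: int) -> int:
    """Links needed with one spoke per instance plus a supernode backbone."""
    spoke_links = instances
    backbone_links = supernodes * (supernodes - 1) // 2
    return spoke_links + backbone_links


print(full_mesh_links(1_000))          # 499500
print(hub_and_spoke_links(1_000, 10))  # 1045
```

The trade-off is that the supernodes now carry the fan-out load themselves, so this moves the scaling problem to fewer, larger operators rather than removing it.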
> but it makes me wonder if there's a fundamental problem with the design of Mastodon.
I also note the article says,
> During the month of November we averaged 36.86 Mbps in traffic with samples taken every hour
That seems like a large amount of bandwidth to service 30,000 users (who knows what fraction of them are actually active at any given moment). But I guess there's going to be a lot of video and image content. I have tried searching all of their linked blog posts about scaling but can't find any number that might map to requests per second without making huge assumptions.
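As a rough way to read the quoted average, it can be turned into a monthly transfer total. This sketch assumes 1 Mbps means 10^6 bits per second, that the hourly samples are representative of the whole month, and that the instance has the 30,000 users mentioned in this thread. The per-user figure is only a scale check: it lumps together user-facing traffic and server-to-server federation traffic.

```python
# Convert an average bit rate over November into total data transferred.
# Assumptions: decimal megabits, 30-day month, 30,000 users.
avg_mbps = 36.86
days_in_november = 30
users = 30_000

seconds = days_in_november * 24 * 60 * 60
total_bytes = avg_mbps * 1_000_000 * seconds / 8  # bits -> bytes

total_tb = total_bytes / 1e12
mb_per_user = total_bytes / users / 1e6

print(f"{total_tb:.2f} TB for the month, about {mb_per_user:.0f} MB per user")
# prints "11.94 TB for the month, about 398 MB per user"
```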
There's some inherent overhead in a federated model (vs a single-source one), and the ActivityPub protocol Mastodon happens to use wasn't necessarily designed to be the lightest possible thing in all use-cases.
Also, there's just a lot more traffic. My instance said that, after Twitter's major struggles, they saw something like 30x more traffic and 20x more daily registrations. That's for instances that, prior to the influx, were being run by volunteers in their spare time out of people's bedrooms or on small cheap VPSes and such.
These instances weren't necessarily ideally performance-tuned prior to the influx (and even if yours was, the remote ones your users might need to hit to fetch content from may not have been)
I don't see why you don't accept "it's Rails." There are other issues as sibling comments have pointed out, but by starting with an ecosystem known to have performance limitations, this sort of outcome is inevitable, is it not? I'm sure the Mastodon team were never expecting the degree of usage which has been thrust upon its larger instances, but now that it has happened and the limitations have become apparent, I'd encourage people who are interested in setting up fediverse/"Mastodon network" instances to consider the alternatives to Mastodon, however paltry they currently are.
I know that the Pleroma front end and its forks are written in something called Elixir, which I have no idea about but I can't imagine it could be much worse than Ruby. What I'd really like to see is something written in a language known to be actually fast, though - PHP or Lua.
The problem probably starts with the inefficiency of RoR, as you've guessed. Mastodon is a very dynamic site which limits the amount of caching that can be done, and there are hot code paths like filtering streams using a user's block lists and word filters that are not particularly optimized - all this happens in Ruby.
But there are other inefficiencies, compared to SO:
1. Mastodon is a media heavy site, with a lot of uploading by users. Mastodon has to convert user-uploaded media to standardized representations (e.g. JPEG and h.264), which takes a lot of CPU time.
2. Mastodon has a "firehose" feed which is available in the UI and actually used by many users. Filters apply to the firehose feed as well. Obviously this requires quite a lot of bandwidth and processing.
3. Federation is a weakness when it comes to traffic. If user X has an account on server A, and at least one user on each of 1000 other instances follows user X, server A has to immediately send any posts to all 1000 other instances, regardless of whether anyone on the other end will ever deliberately view them. (Of course, some users may view them in their instance's firehose feed.) The receiving instance then has to duplicate this traffic when sending it to the actual subscribed users. By this standard both large non-federated "servers" (like Twitter) and widely federated pull-only servers (think RSS) are more efficient than ActivityPub (the open standard Mastodon uses).
4. Federation is a weakness when it comes to trust. Instances do not (and must not) fully trust each other, except for things like "@x@thisinstance said 'P'". So for example, the little Open Graph based preview cards you're used to seeing on Twitter and elsewhere have to be generated for links per instance. The first time a Mastodon server sees a link, it must fetch that link and generate a preview card itself. Because new posts by popular accounts are syndicated immediately, this is a burden on websites as well. https://www.jwz.org/blog/2022/11/mastodon-stampede/ (note: copy link or disable sending referrers from HN for this site)
5. Scaling is not really a solved problem yet for Mastodon, because in practice it hasn't had to be. It's easy to pass the buck to instance operators, who end up needing a $20/month VPS to run a small instance rather than $5/month. Even the very biggest servers are scarcely larger than 1M users. At that kind of scale you can patch over performance problems by just throwing more hardware at the problem - and e.g. mastodon.social has the funds from Mastodon (the org) to do that. Note that Hachyderm, AFAIK, is an obvious example of this; it was started by a tech worker in Seattle with much better access to expensive hardware than most casual instance operators can dream of. It's not surprising that they can pull the funds together to scale up before they start seeing performance issues.
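The push fan-out described in point 3 can be sketched as a simple count: one delivery per distinct remote instance that hosts at least one follower. This is an illustration of the comment's description, not Mastodon's actual delivery code; the `remote_deliveries` helper and the `user@instance` handles are made up for the example.

```python
# Count push deliveries for one post, assuming one delivery per distinct
# remote instance hosting at least one follower (as point 3 describes).
# The helper and the handles below are hypothetical.


def remote_deliveries(author_instance: str, followers: list[str]) -> int:
    """Return how many remote instances must receive the post."""
    instances = {handle.rsplit("@", 1)[1] for handle in followers}
    instances.discard(author_instance)  # local followers need no remote push
    return len(instances)


followers = [
    "alice@b.example",
    "bob@b.example",    # same instance as alice: no extra delivery
    "carol@c.example",
    "dave@a.example",   # local to the author
]

print(remote_deliveries("a.example", followers))  # 2
```

Under this model the cost of a post grows with the number of instances reached, not with the number of followers, which is why an account followed from many small instances is expensive for its home server.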
I am also closely following https://github.com/nostr-protocol/nostr to see how they go along, because I am growing weary of the "tech elite" that is moving to Mastodon and is pushing for "moderation by committee". I've already gotten into discussions with people who actually want server operators to open federation only to those that abide by some "Covenant". This seems rooted in good intentions, but it reeks of something that might lead to a corporate copout of a network which is supposed to be open.
Is it? I mean, let's assume you post something popular, and 17,000 servers request it. How many people do 17,000 servers cover? Even if we assume only 10 people per server, that's 170,000 people. How many people have 17,000 followers, never mind 170,000? And 10 per server seems implausibly low for an average.
17,000 hits is... not particularly notable from a server perspective, triply so when they're all requesting the same "just posted" item which is still cached.
Sure if like, someone with multiple millions of followers is on your server you're going to have issues, but seriously, twitter had issues with that too.
Also keep in mind that there's no algorithm pushing people towards the same "popular" posts. Things grow organically.
So, at what point are you suggesting that this becomes "fatal" and is that a point that anyone that isn't hosting literal superstars is going to encounter?
The issue is that there's no federated "popularity" metric. Every user that has followers on 17,000 servers has every single one of their posts pushed to all 17,000. Automatically and immediately, not on demand. Posts by users with only a few followers will occasionally go viral, but an outsize portion of the load is due to "whales".
Also, Wikipedia has donation drives, non-profit status and a highly controversial history of how it spends the funds they collect. Lots of high quality contributors already left because they were not recognized in any way and ended up feeling exploited.
I do not agree that just calling for moderators and volunteers means a business is unsustainable.
Perhaps it does in this case - I don't know enough about Hachyderm to know for sure. But it's possible that the roles of moderators and volunteers at Hachyderm might not be (or could be made not to be) so terrible, in which case relying on free labour is a proven and sustainable business model (for some businesses anyway).
Also worth noting that a "highly controversial history of how it spends the funds they collect" at Wikipedia is not directly correlated to Wikipedia's sustainability at all - the history shows otherwise.
This is why they are quitting in droves, and this is why I stopped donating to them.
The problem is that example you gave (Wikipedia) is not a business.
> I don't know enough about Hachyderm to know for sure.
My point is not about Hachyderm in particular. Like I said in the first comment, I think they have what it takes to continue operating and serving their community for the longer term.
My issue is with the overall ecosystem and the expectations of the people coming in from "traditional social media sites". If we want to provide an ethical alternative to Twitter/Facebook/Instagram/WhatsApp, we need to find a way to serve hundreds of millions of users. How are these people going to be spread around the instances? For context, we would need ~15000 Hachyderms to replace Twitter and ~60000 pixelfeds to replace Instagram. Are they all going to be dependent on volunteers? Are all these instances going to be operated by highly paid professionals from SV who can sink a few hundred dollars every month? Or are they going to appeal to "your donation is very important to us" and expect that a few generous souls make up for the free-riders? Or are they all going to eventually cave in, start treating it as a business and start charging their users something that can pay actual salaries for everyone involved?
You then give that person the ability to delete stuff that violates the rules. Repeat with giving the highest trust / highest use users that power until you stop seeing garbage get through.
It is not at all any more stressful than being any other user, but instead of occasionally seeing garbage posted and being disgruntled that it's there, they can just delete it.
That said, although /. lives on, it is only a shadow of its former self AFAIK. Perhaps that indicates unsustainability.