Fly.io Status – Consul cluster outage (status.flyio.net)

126 points by purututu 3 years ago | 118 comments

mrkurt 3 years ago |

This has been a rough week, and I'm sorry we broke people's apps. We had a big Nomad outage on Monday, and then a suspiciously similar Consul outage today. Both tipped over faster than we could detect and mitigate, and we ended up having to do serious surgery to build entirely new Consul/Nomad clusters.

There's nothing to brag about here, I just wanted to let y'all know we're listening (even when things aren't on the HN front page).

wjossey 3 years ago | |

This stuff is hard. As someone who runs infra teams for a living, these are the worst kinds of weeks.

Hang in there. You all will learn from this and be better for it. Your architecture will improve. Customers will give you a second chance. This too shall pass.

Sending positive vibes.

srhyne 3 years ago | | |

Lovely response. Ah, kindness. So refreshing to see.

mrkurt 3 years ago | | |

atonse 3 years ago | |

We had mysterious consul outages (and related nomad outages) causing us to never deploy our new hashicorp stack to production.

Shame cuz we were excited about our nomad+consul+vault setup and invested a lot of money into building it. But just didn’t have the time or enough depth of expertise to babysit it.

Mizza 3 years ago | | |

From my experience with the Hashi stack, I don't think it's a coincidence that Fly has a lot of downtime and are a major Hashi user. Terraform makes excellent bait though.

Still love using Fly, please add static assets hosting/CDN.

suryao 3 years ago |

Fly is building everything in hard mode - since they are not layering on top of an existing cloud like pretty much everyone else (heroku, render, railway, ...).

It's either very smart (if they pull it off) because they will have a ginormous cost advantage or they fail.

I'm personally of the opinion that the ux on top of aws/gcp/... is worse than a doo-doo in a shoe. However, they are as stable as can be (all complex systems go down once in a while). There are very few mature projects that do not rely on aws/gcp/... managed services anyway. Might as well put in the little bit of effort to set yourself up for the future instead of painful migrations. This obviously doesn't hold for hobby projects.

In any case, I have a lot of respect for the engineering that fly does. Kudos.

candiddevmike 3 years ago | |

Are they really building everything in hard mode or do they just have a bad architecture?

mrkurt 3 years ago | | |

Yes. Both. This outage was caused by a bad architectural decision. We had an incident a few weeks ago caused by "hard mode".

faizshah 3 years ago | | |

AWS invested a ton into limiting the blast radius of failures by isolating AZs, regions and using a cellular (service level isolated shards) architecture. I am surprised these ideas have not propagated to newer companies trying to build clouds: https://m.youtube.com/watch?v=swQbA4zub20

AWS isn’t perfect but these lessons were learned by fire because these sorts of global outages can seriously harm reputations.

suryao 3 years ago | | |

They build everything from scratch - on bare metal, including sourcing hardware (though I'd presume they use a data center manager for it). Arch, from their engineering blogs, is pretty sound.

luhn 3 years ago |

Relevant: "Reliability: It's not great" from last week https://news.ycombinator.com/item?id=35044516

They even specifically call out Consul as a source of trouble.

> We propagate app instance and health information across all our regions. That’s how our proxies know where to route requests, and how our DNS servers know what names to give out.

> We started out using HashiCorp Consul for this. But we were shoehorning Consul, which has a centralized server model design for individual data center deployments, into a global service discovery role it wasn’t suited for. The result: continuously stale data, a proxy that would route to old expired interfaces, and private DNS that would routinely have stale entries.

jen20 3 years ago | |

They call out THEIR USAGE of Consul as a source of trouble. This is quite different.

markthethomas 3 years ago |

Been a fan of fly and have had most, if not all, of my side and semi-side projects on there for some time now. But...the ratio of good/fun/snarky blog posts to reliable service has gotten a bit too large for me; I'm starting to look for other providers at this point, just in case they can't turn this trend around. Honestly been a good object lesson for me in the importance of backing up marketing/hype/"mind-share" stuff w/ absolute rock-solid performance/reliability or just forgoing the former for the latter.

As an aside, it's also taking down some decently-load-bearing web infra like unpkg => https://www.unpkg.com/

zachallaun 3 years ago | |

Relevant response from the Fly community forums: https://community.fly.io/t/frequent-outages-is-really-demons...

markthethomas 3 years ago | | |

Yeah, I saw; I've kept up w/ everything pretty closely. Still decently frustrating as a paying customer, but I hope they can figure it out. If they can and can show some real reliability, I'll be an even bigger fan.

ericpauley 3 years ago | |

Wow, part of Delaware’s tax website was hanging on unpkg today, now I know why!

markthethomas 3 years ago | |

(unpkg seems to be up now)

pawelduda 3 years ago |

I really really wanted to like and recommend fly.io but I wouldn't risk deploying anything more than a side project to tinker with, given how many random issues I encountered in a relatively short development time. It was a simple Phoenix app which made me wonder "am I doing things totally wrong?" quite a few times, after exhausting all info sources. But when I tried the same process the next day, it would deploy just fine. Plus the outages that appear to be getting more frequent don't make me optimistic.

At least they're transparent about their issues, gotta give them that. I still kinda root for them, maybe they'll make a comeback.

mrcwinn 3 years ago | |

Same. I’m so disappointed because I’ve been rooting for them. We were close to a major deployment/migration (well, major as in mid four figures per month, not major like Google) but they were removed from the decision set. It would not have been responsible to bet on them at this time. I hope they get this sorted - they’re really good folks!

mrkurt 3 years ago | | |

Thank you! I'm both sorry it didn't work out (because $$$$) and also glad we didn't create any agony for you. Someday, we hope to create mild irritation for you, though, if we can.

drewbug01 3 years ago |

I love this update:

“We are working to build a new Consul cluster with 10x the RAM. We aren't yet sure, but believe a routine DNS change might have created a thundering herd problem causing Consul servers to immediately increase RAM usage by 500%. This is not ideal.”

_This is not ideal._
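
For readers unfamiliar with the term: a thundering herd is what happens when many clients react to the same event (here, possibly a DNS change) and hit a service at the same moment. One common client-side mitigation is exponential backoff with random jitter, so that retries spread out instead of arriving in lockstep. The status update does not say Fly uses or plans this; the sketch below is a generic illustration, and `backoff_delay`, `base`, and `cap` are hypothetical names.

```python
import random
from typing import Optional


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0,
                  rng: Optional[random.Random] = None) -> float:
    """Return a randomized wait (seconds) before retry number `attempt`.

    "Full jitter": pick uniformly between 0 and an exponentially growing
    ceiling, so clients that failed together do not retry together.
    All parameter names and defaults are illustrative assumptions.
    """
    source = rng if rng is not None else random
    ceiling = min(cap, base * (2 ** attempt))
    return source.uniform(0.0, ceiling)


if __name__ == "__main__":
    # Ten simulated clients all on their third retry: the delays differ.
    seeded = random.Random(42)
    for client in range(10):
        print(f"client {client}: wait {backoff_delay(3, rng=seeded):.2f}s")
```

With the defaults, the ceiling for attempt 3 is 4 seconds, and it stops growing once it reaches the 30-second cap.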

gzer0 3 years ago |

Interestingly, Roblox went down for 73 hours due to a "unique" issue with Consul as well [1].

Great read on how the issue was approached, handled, and ultimately remediated.

[1] https://blog.roblox.com/2022/01/roblox-return-to-service-10-...

jeremyjh 3 years ago | |

Most often the issues that take down a site are with core services like network routing, DNS and service discovery. Consul gets mentioned because it’s in that business and isn’t a standard so it gets called out specifically. Zookeeper, HAProxy and various cluster managers also get slagged for this stuff and yeah, sometimes it’s their fault but that’s what it means to be in that business.

throwdbaaway 3 years ago | |

https://github.com/hashicorp/consul/pull/12080 - this should be the Consul issue that brought down Roblox

felixding 3 years ago |

Was affected by the outage. Didn't know about it so I thought it was just another crash on Fly.io.

Tried to restart our app from the command line, only to be told they had disabled the API. And there is no restart feature on their dashboard. So all I could do was watch flyio logs telling me that our apps were down.

Sigh.

We moved from Heroku to Fly.io only this January, and are already considering moving away from it. The reliability is miserable at best. And so many basic features are missing. Yes, it's much cheaper than Heroku, but we ended up paying much more time/resource/money dealing with its glitches. Defeats the purpose of using a PaaS in the first place.

mrkurt 3 years ago | |

I know blocking deploys sucks, I'm sorry. We disabled them to prevent otherwise healthy apps from going down. When Consul fails, we can't boot new app processes. The ones that are already running continue running. A restart is roughly the same as a deploy, in this respect.

satvikpendem 3 years ago |

At this point I'm not sure why one wouldn't use something like Hetzner and slap Coolify or Dokku or something else on it.

kbumsik 3 years ago |

I have seen some issues around Consul these days.

As a person with no background in distributed systems, I am wondering why people choose Consul over alternatives. Are there features that etcd doesn't offer?

mrkurt 3 years ago | |

We chose Nomad and adopted Consul as a result. Nomad and Consul work well together.

I don't believe etcd would have been any better for us, though. Centralized service discovery that runs through raft consensus doesn't make a lot of sense for the things we need to do. And when I've had etcd blow up on me in the past, it's been similarly painful to recover from.

aeyes 3 years ago | | |

Most people only use etcd at small scale. If you try to store 10 or even 100GB in etcd you are going to run into uncommon problems.

Most people don't even know that the Kubernetes control plane by default has a hard limit on etcd size. It used to be 2GB, not sure what it is now.

grrdotcloud 3 years ago | | |

Raft is amazing and totally frustrating.

I think I understand how you're using it, and I'm curious whether you've considered how the AWS STS API handles cross-region syncing.

kbumsik 3 years ago | | |

Thanks for the answer!

AFAIK Consul also uses Raft, doesn't it?

pcthrowaway 3 years ago | |

Etcd is really only for basic config.

If you want apps to discover each other and be able to communicate effortlessly, even across datacenters, Consul, in theory, enables this.

I say in theory because I couldn't get federated Consul actually working.

convolvatron 3 years ago | | |

discovery isn't that hard a problem that you should cede your agency to an external party like Hashicorp

I used consul for a clustered service once, it was worth it for bringup. but when I had problems I just wrote one in a couple days since I'd done so several times before. and it didn't fail for all the years that product was running.

throwaway3838g 3 years ago |

I attempted to deploy a simple app on Fly a couple of weeks ago, but porting it from heroku became a nightmare: servers crashing, cryptic error messages, etc. Maybe I'm in the minority but in any case my experience with Fly definitely left me questioning the hype around it.

mrkurt 3 years ago | |

There are really only a few frameworks where our experience approaches Heroku. And even for those, it's only the newest versions. Phoenix, Rails, Laravel, and Remix are all pretty seamless to launch.

Most others require pretty decent Docker knowledge.

HL33tibCe7 3 years ago |

Respect to anybody who is an SRE at fly.io. Couldn’t pay me enough to do that job

abledon 3 years ago | |

Markdowns forged in god-steel coming out from this incident

abofh 3 years ago | |

They just hired their first if I recall correctly. I feel for their customers more than I do for their shareholders

mrkurt 3 years ago | | |

We've scaled infra ops from 3 to 7 people in the past few weeks. Our very first VP was a VP Infra Ops, because that's the thing we have to get best at to succeed as a business.

Note that we grew the whole company from 25 to 60 over the last six months.

sergiomattei 3 years ago |

I’m rooting for Fly. I use them myself for a project, and love the service.

However, their transparency into outages and service rough edges is a double-edged sword: they’re building a reputation for unreliable software. It’s a shame to see this major outage happen right after last week’s post, it almost confirms the stereotype.

However, even with these flaws, I still think they’re building the best hosting out there. They’re taking bold risks and doing what others aren’t. I wish them the best.

mcsniff 3 years ago | |

> they’re building a reputation for unreliable software

This is a terrific way to word what might be happening unconsciously.

Fly posts about how hard things are during and after service outages -- while I also love the transparency, most people don't want to 'be a passenger on a plane that's being built while it's flying' especially when it comes to their business, myself included.

pm90 3 years ago |

> We are working to build a new Consul cluster with 10x the RAM.

Oh boy. I wouldn’t wanna be the people doing this. Working with infrastructure is hard. Doing it under tight SLAs? Ugh. I really hope the people working on this are being well supported.

mrkurt 3 years ago | |

You wouldn't necessarily know this from the outside, but we have _exceptional_ internal support when things go sideways. This is relatively new; up until about two months ago most incidents were run by 1.5 people. We had 7 people working this one today.

throwdbaaway 3 years ago | | |

I don't really know him, but from what I can tell, https://github.com/wjordan is at least equivalent to 2.0 people.

capableweb 3 years ago | |

1. fly.io SLA only covers users on the Enterprise plan

2. fly.io's SLA commits to 99.9% uptime, meaning they can "afford" ~1.5m of downtime daily, or ~43m monthly. AWS "offers" 99.99% (~4m monthly) if I recall correctly, but their scale is also wildly different obviously.
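
The downtime figures above follow directly from the uptime percentage. A small sketch of the arithmetic, assuming a 30-day month (the function name is made up for illustration):

```python
def downtime_budget_minutes(uptime_percent: float, period_days: float) -> float:
    """Minutes of downtime permitted over `period_days` at a given uptime percentage."""
    unavailable_fraction = 1.0 - uptime_percent / 100.0
    return unavailable_fraction * period_days * 24 * 60


if __name__ == "__main__":
    for label, pct in (("99.9%", 99.9), ("99.99%", 99.99)):
        daily = downtime_budget_minutes(pct, 1)
        monthly = downtime_budget_minutes(pct, 30)  # assumes a 30-day month
        print(f"{label}: {daily:.2f} min/day, {monthly:.2f} min/month")
```

At 99.9% this works out to about 1.44 minutes per day and 43.2 minutes per 30-day month; at 99.99% it is about 4.32 minutes per month.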

js4ever 3 years ago |

That's the issue with centralized infra... I expect it to be less and less stable the more customers they have. I still wish them good luck.

On my side I took the opposite direction, each workload is shared nothing.

Thaxll 3 years ago |

They seem to have a lot of issues with Consul, is it the design of Consul or the way they use it that is the problem?

simonw 3 years ago |

"This impacts queries to our API, including creating and modifying apps, as well as incoming network requests for recently deployed apps."

Would be really interested to understand why it affects recently deployed apps but not apps that are already established - something to do with how the Fly Router works?

mrkurt 3 years ago | |

We still pipe service discovery through Consul, we just propagate it with a different, gossip based mechanism. Services are stored in local sqlite DBs on every host that runs our Proxy. They are designed to keep running, even when we can't get updates to them.

This outage prevented us from writing services to Consul, so we couldn't read them back out. Nomad will only really write service information to Consul, so we're kind of stuck with Consul in the loop until we're fully off Nomad.
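
To illustrate why established apps kept routing while recently deployed ones did not, here is a minimal sketch of a host-local SQLite service table that keeps answering lookups when its upstream feed is down. This is not Fly's code: `LocalServiceCache`, `fetch_updates`, the schema, and the addresses are all hypothetical, and a plain callable stands in for the gossip-based propagation described above.

```python
import sqlite3
from typing import Callable, Dict, Optional


class LocalServiceCache:
    """Host-local service table that survives upstream outages (illustrative only)."""

    def __init__(self, path: str = ":memory:") -> None:
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS services ("
            "name TEXT PRIMARY KEY, address TEXT NOT NULL)"
        )

    def refresh(self, fetch_updates: Callable[[], Dict[str, str]]) -> bool:
        """Try to pull new service records; keep the old rows if the feed fails."""
        try:
            updates = fetch_updates()
        except ConnectionError:
            return False  # upstream is down: serve whatever we already have
        with self.db:  # commit all rows in one transaction
            for name, address in updates.items():
                self.db.execute(
                    "INSERT OR REPLACE INTO services (name, address) VALUES (?, ?)",
                    (name, address),
                )
        return True

    def lookup(self, name: str) -> Optional[str]:
        """Return the last known address for a service, or None if never seen."""
        row = self.db.execute(
            "SELECT address FROM services WHERE name = ?", (name,)
        ).fetchone()
        return row[0] if row else None


def upstream_down() -> Dict[str, str]:
    """Stand-in for a service-discovery feed that cannot be reached."""
    raise ConnectionError("service discovery unavailable")


if __name__ == "__main__":
    cache = LocalServiceCache()
    cache.refresh(lambda: {"established-app": "10.0.0.1:8080"})
    cache.refresh(upstream_down)            # outage begins
    print(cache.lookup("established-app"))  # still resolves from local rows
    print(cache.lookup("new-app"))          # deployed during the outage: unknown
```

In this toy model, an app whose record was written before the outage still resolves, while an app deployed during the outage never reaches the local table, which mirrors the behavior described in the status update.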

pa7ch 3 years ago |

From my experience etcd would have been a better choice for maturity if they don't need the gossip stuff.

beoberha 3 years ago |

This shit is hard. Running a cloud service at one of the Big 3 is hard, I can’t imagine doing it with such a small team with your own infra.