Fly.io Status – Consul cluster outage (status.flyio.net)
There's nothing to brag about here, I just wanted to let y'all know we're listening (even when things aren't on the HN front page).
Hang in there. You all will learn from this and be better for it. Your architecture will improve. Customers will give you a second chance. This too shall pass.
Sending positive vibes.
Shame cuz we were excited about our nomad+consul+vault setup and invested a lot of money into building it. But just didn’t have the time or enough depth of expertise to babysit it.
Still love using Fly, please add static assets hosting/CDN.
It's either very smart (if they pull it off) because they will have a ginormous cost advantage or they fail.
I'm personally of the opinion that the ux on top of aws/gcp/... is worse than a doo-doo in a shoe. However, they are as stable as can be (all complex systems go down once in a while). There are very few mature projects that do not rely on aws/gcp/... managed services anyway. Might as well put in the little bit of effort to set yourself up for the future instead of painful migrations. This obviously doesn't hold for hobby projects.
In any case, I have a lot of respect for the engineering that fly does. Kudos.
AWS isn’t perfect but these lessons were learned by fire because these sorts of global outages can seriously harm reputations.
They even specifically call out Consul as a source of trouble.
> We propagate app instance and health information across all our regions. That’s how our proxies know where to route requests, and how our DNS servers know what names to give out.
> We started out using HashiCorp Consul for this. But we were shoehorning Consul, which has a centralized server model design for individual data center deployments, into a global service discovery role it wasn’t suited for. The result: continuously stale data, a proxy that would route to old expired interfaces, and private DNS that would routinely have stale entries.
As an aside, it's also taking down some decently-load-bearing web infra like unpkg => https://www.unpkg.com/
At least they're transparent about their issues, gotta give them that. I still kinda root for them, maybe they'll make a comeback.
“We are working to build a new Consul cluster with 10x the RAM. We aren't yet sure, but believe a routine DNS change might have created a thundering herd problem causing Consul servers to immediately increase RAM usage by 500%. This is not ideal.”
_This is not ideal._
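For readers unfamiliar with the term: a thundering herd is what happens when many clients react to the same event at the same moment and all hit one backend together. A common general mitigation is randomized ("full jitter") exponential backoff on retries. The sketch below is only an illustration of that idea; the function name and parameters are made up here, and nothing in the status post says this is how Fly.io or Consul handles it.

```python
import random
from typing import Optional


def backoff_with_jitter(
    attempt: int,
    base: float = 0.5,
    cap: float = 60.0,
    rng: Optional[random.Random] = None,
) -> float:
    """Return a retry delay in seconds for the given attempt number.

    The upper bound doubles with each attempt (capped at `cap`), and the
    actual delay is drawn uniformly from [0, upper bound] so that clients
    that failed at the same time do not all retry at the same time.
    """
    source = rng if rng is not None else random
    ceiling = min(cap, base * (2 ** attempt))
    return source.uniform(0, ceiling)


if __name__ == "__main__":
    # Hypothetical usage: print the delays one client would wait.
    for attempt in range(6):
        print(f"attempt {attempt}: sleep {backoff_with_jitter(attempt):.2f}s")
```

Without the random draw, every client would retry after exactly 0.5s, 1s, 2s and so on, which just moves the herd instead of spreading it out.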
Great read on how the issue was approached, handled, and ultimately remediated.
[1] https://blog.roblox.com/2022/01/roblox-return-to-service-10-...
Tried to restart our app from the command line, only to be told they had disabled the API. And there is no restart feature on their dashboard. So all I could do was watch the Fly.io logs telling me that our apps were down.
Sigh.
We moved from Heroku to Fly.io only this January, and are already considering moving away from it. The reliability is miserable at best. And so many basic features are missing. Yes, it's much cheaper than Heroku, but we ended up spending much more time, resources, and money dealing with its glitches. That defeats the purpose of using a PaaS in the first place.
As a person with no background in distributed systems, I am wondering why people choose Consul over alternatives. Are there features that etcd doesn't offer?
I don't believe etcd would have been any better for us, though. Centralized service discovery that runs through raft consensus doesn't make a lot of sense for the things we need to do. And when I've had etcd blow up on me in the past, it's been similarly painful to recover from.
Most people don't even know that the Kubernetes control plane by default has a hard limit on etcd size. It used to be 2GB, not sure what it is now.
I think I understand how you're using it, and I'm curious whether you've considered how the AWS STS API solves its cross-region syncing.
AFAIK doesn't Consul also use Raft?
If you want apps to discover each other and be able to communicate effortlessly, even across datacenters, Consul, in theory, enables this.
I say in theory because I couldn't get federated Consul actually working.
I used Consul for a clustered service once, and it was worth it for bringup. But when I had problems, I just wrote my own in a couple of days, since I'd done so several times before. It didn't fail for all the years that product was running.
Most others require pretty decent Docker knowledge.
Note that we grew the whole company from 25 to 60 over the last six months.
However, their transparency into outages and service rough edges is a double-edged sword: they’re building a reputation for unreliable software. It’s a shame to see this major outage happen right after last week’s post; it almost confirms the stereotype.
However, even with these flaws, I still think they’re building the best hosting out there. They’re taking bold risks and doing what others aren’t. I wish them the best.
This is a terrific way to word what might be happening unconsciously.
Fly posts about how hard things are during and after service outages -- while I also love the transparency, most people don't want to 'be a passenger on a plane that's being built while it's flying' especially when it comes to their business, myself included.
Oh boy. I wouldn’t wanna be the people doing this. Working with infrastructure is hard. Doing it under tight SLAs? Ugh. I really hope the people working on this are being well supported.
2. The fly.io SLA commits to 99.9% uptime, meaning they can "afford" ~1.5m of downtime daily, or ~40m monthly. AWS "offers" 99.99% (~4m monthly) if I recall correctly, but their scale is also wildly different, obviously.
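The arithmetic behind those figures is easy to check. A minimal sketch, assuming a 30-day month and treating the SLA percentages quoted above as given (I have not verified them against either provider's actual SLA text):

```python
def downtime_budget_minutes(availability_pct: float, period_days: float) -> float:
    """Minutes of downtime permitted over `period_days` at a given availability."""
    minutes_in_period = period_days * 24 * 60
    return minutes_in_period * (1 - availability_pct / 100)


if __name__ == "__main__":
    for pct in (99.9, 99.99):
        daily = downtime_budget_minutes(pct, 1)
        monthly = downtime_budget_minutes(pct, 30)  # assumes a 30-day month
        print(f"{pct}%: {daily:.2f} min/day, {monthly:.1f} min/month")
```

That works out to roughly 1.4 minutes a day or 43 minutes a month at 99.9%, and a bit over 4 minutes a month at 99.99%, which matches the rough numbers above.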
On my side I took the opposite direction, each workload is shared nothing.
Would be really interested to understand why it affects recently deployed apps but not apps that are already established - something to do with how the Fly Router works?
This outage prevented us from writing services to Consul, so we couldn't read them back out. Nomad will only really write service information to Consul, so we're kind of stuck with Consul in the loop until we're fully off Nomad.
Also, "self-healing" isn't really one thing. There are hundreds of different problems that can take out such a cluster, and every single one of them needs its own "self-healing" mechanism. These systems are literally the most complicated kinds of systems.
I stayed away from the so-called "stacked" control plane of etcd inside kubernetes because it can turn a tiny fire into a sharkfirenado. Recently, though, I've heard discussions of k3s (which uses dqlite) managing the etcd members, with "formal" kubernetes managing the workloads and pointed at that k3s-stacked etcd. I haven't tried it yet, so I don't know how theory and practice differ.
https://developer.hashicorp.com/consul/tutorials/datacenter-...
Paired with cloud discovery, it makes for a tolerable operational experience when instances are expected to occasionally disappear.
Generally there's a master node or multiple nodes in agreement. If the cluster cannot agree on its current state, the entire system may run multiple versions, be completely unavailable, or provide inconsistent responses, bringing down other systems that rely on it.
Inspection itself is hampered by elections or syncing state or other process/race related/caching/ddossing itself or other services.
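To make "multiple nodes in agreement" concrete: Raft-style systems (both Consul servers and etcd use Raft) need a strict majority of voting members to commit anything. A small sketch of that majority rule, with function names invented for illustration:

```python
def quorum_size(voting_members: int) -> int:
    """Smallest strict majority of a cluster with this many voting members."""
    if voting_members < 1:
        raise ValueError("a cluster needs at least one voting member")
    return voting_members // 2 + 1


def failures_tolerated(voting_members: int) -> int:
    """How many members can be lost while a majority is still reachable."""
    return voting_members - quorum_size(voting_members)


def has_quorum(voting_members: int, reachable: int) -> bool:
    """True if the reachable members still form a majority."""
    return reachable >= quorum_size(voting_members)


if __name__ == "__main__":
    for n in (1, 3, 4, 5, 7):
        print(f"{n} servers: quorum {quorum_size(n)}, "
              f"tolerates {failures_tolerated(n)} failure(s)")
```

Note that a 4-server cluster tolerates no more failures than a 3-server one, which is why odd sizes are the usual recommendation. Once fewer than a majority are reachable, writes stop entirely, which is the "completely unavailable" case described above.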
Meanwhile, hundreds of thousands of Consul, Nomad and Vault clusters used appropriately work perfectly well…
The short version is that "using Nomad and Consul for the type of global workloads we run is not a good choice". I do not believe we'd have the same problems with Nomad + Consul in a single region. But running a single, global cluster of each of these is suboptimal.
The second problem was using some Consul features that forced us to keep it single region. What we actually need is a global view of a single service. Federated Consul doesn't quite give us that. Earlier versions of our infrastructure were using a bunch of Consul watches to update local state, so we couldn't really federate.
Some of this I'd do very differently if we rewound. But we were also building an idea with no actual users. Nomad and Consul gave us a nice platform to experiment on. We just outgrew the "prototype" as we learned what people actually wanted from us.
[1] We're using so little infra at present that we're within their free usage tier. However, I want to clarify that this isn't because we aren't willing to pay, we specifically want to pay for reliable managed offerings. That's actually the entire point! If Fly.io can deliver on their vision, we'd gladly be billed at 100x the current usage rates.
You don’t need to orchestrate a complex cluster to serve thousands or even millions of users. You can scale to hundreds of gigs of memory on a single machine nowadays.
Though I think a lot of this is incidental to just not really knowing the deal, and ops from scratch means you have to make a lot of tiny decisions like "OK how do I get this package over here, how do I set it up, do I wipe the VM on OS-level updates, do I need scripts for resetting the machine..." Having pre-made decisions for a bunch of questions means you aren't spending a bunch of time on tedious stuff when starting up a project.
My gut with Consul is don’t use it for high-load distributed services.
[1] https://blog.roblox.com/2022/01/roblox-return-to-service-10-...
I don't have a relationship with Hashicorp, and have tried using Consul. Everything about it is amazing in theory, but you might need a few years of experience with kube, consul, go, and maybe even the hashicorp stack to even begin debugging when things don't work as advertised.
I still think my company is going to take another stab at consul in the future, because we do need service discovery. But they're advertising a solution to an incredibly hard problem with a shit ton of variations in network topology and infra that it should (theoretically) work on. I imagine if you stay on the happy path everything works out just fine with Consul (even then, maybe only most of the time). The problem is that they don't spell out what the happy path is, and that all the other knobs they expose off to the side actually lead down paths beleaguered by dragons.
It's atlassian from Arkansas, just faster
And AWS has had a few of those in my 13-14 years I have used them :D
Azure reliability sucks more for sure though. Especially networking.
Edit: us-east-1 going down disrupts most of global AWS pretty severely fwiw.
In a disaster scenario, data plane operations can continue, so customer workloads still run while the control plane might experience downtime. This is another lesson: in Fly's case, isolating the control plane (deployment of services) from the data plane (executing customer code), instead of using a global cluster manager, could have limited the blast radius of this fault.
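One way to picture that separation is a proxy that keeps serving from its last known-good routing table when the control plane cannot be reached. The sketch below is purely hypothetical: `RouteTable`, `refresh`, and the fetch callback are names invented for illustration, not a description of how Fly's proxy or AWS actually works.

```python
from typing import Callable, Dict, List

Routes = Dict[str, List[str]]


class RouteTable:
    """Data-plane view of app -> backend addresses, fed by a control plane."""

    def __init__(self) -> None:
        self._routes: Routes = {}
        self.stale = False  # True when the last refresh attempt failed

    def refresh(self, fetch: Callable[[], Routes]) -> None:
        """Try to pull fresh routes; on failure keep the last good copy."""
        try:
            self._routes = fetch()
            self.stale = False
        except ConnectionError:
            # Control plane is down: keep routing with what we already have.
            self.stale = True

    def backends(self, app: str) -> List[str]:
        return list(self._routes.get(app, []))


def control_plane_down() -> Routes:
    raise ConnectionError("control plane unreachable")


if __name__ == "__main__":
    table = RouteTable()
    table.refresh(lambda: {"established-app": ["10.0.0.1:8080"]})
    table.refresh(control_plane_down)
    print(table.backends("established-app"), "stale:", table.stale)
    print(table.backends("new-app"))  # never registered, so nothing to route to
```

In this toy model, apps that were registered before the outage keep working on stale data, while anything deployed during the outage never makes it into the table.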
But these stability issues actually make me more nervous about the fact that I’d have to manage my own postgres cluster and have to learn how to recover it in such an event. AWS RDS has made me soft!
Wishing you guys the best. We’ll still use fly for QA until a few of these issues are sorted out. And until there’s fully managed pg (first party or third party)
However, related to that, for big-time clusters (q.v. https://news.ycombinator.com/item?id=35174655 and https://news.ycombinator.com/item?id=25907312) one should without question move events over into their own etcd cluster: https://openai.com/research/scaling-kubernetes-to-2500-nodes...
There’s also a max object size of 1MB on the apiserver side, I believe.
The single-group raft is the hard limit.
Heh, that kubebrain TODO is some "oh, really?"
* Guarantee consistence in critical cases
but I give them huge props for calling out Jepsen
I have seen very few strongly consistent distributed KV stores that scale beyond 10GB.
As someone who had to do SRE-style work in a smaller company for a long time despite ostensibly being a backend dev, the institutional knowledge you get from "real" SRE people is so valuable, and makes me a bit hopeful for the future.
You're making a claim that Hashicorp sold themselves as being a solution for problems that they can't solve. But the comment from an actual Fly.io employee suggests that isn't the case. They're stating that Fly.io pushed the product beyond its limits, and they don't seem to be projecting any of that as being the fault of Hashicorp the company or of their products.
Hashi is dishonest and diminutive, providing products that generally should've been written off as a loss.
Edit: We're apparently not allowed to interact beyond three replies: I have no beef with any hashi product that is satisfactory.
Your reply quotation explanation is unsatisfactory, you tried to quote something into a thread in the most irresponsible way you could - got called out for it and tried to top post to make it work.
I have never had a client using a hashi stack that was happy about it: on price, quality, or reliability, it's a failure.
I don't begrudge their work; their work is just subpar. Quote that if you'd like. I won't interact with someone that starts with fraudulent misrepresentation.
So yeah, you've built a ton of shit nobody wanted as a product, been there, done that. You've convinced me fly doesn't fit business, we're done.
It seems pretty clear you have beef with Hashicorp and with their products. It’s entirely possible you’re right. But your original claim, that I was replying to, attempted to answer a question about why Fly.io was experiencing issues with Hashicorp products. And your answer doesn’t line up with clear public statements from Fly.io staff.