Ask HN: Plenty of large sites down; Reddit.com, GNU.org, Discord, coincidence?
Remember: the cloud is someone else's computer. When it's broken, you cannot do anything.
This is fine if three nines of availability is all you need. Doesn't matter much if you prefer a big brand employee fixing things or a small brand employee. It doesn't change the outcome.
However, there are a lot of things that simply cannot live with crappy three nines availability. And the only way to do better is to stop relying on any single cloud, which inevitably requires infrastructure engineers aka random devops dudes.
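To put numbers on "three nines", here is a rough back-of-the-envelope sketch in Python. The function names are made up for illustration, and the multi-provider figure assumes outages at different providers are fully independent, which real-world incidents (shared DNS, shared upstreams) often violate.

```python
HOURS_PER_YEAR = 365 * 24  # ignoring leap years

def downtime_hours_per_year(availability: float) -> float:
    """Expected downtime per year, in hours, for a given availability (0..1)."""
    return (1.0 - availability) * HOURS_PER_YEAR

def combined_availability(availability: float, providers: int) -> float:
    """Availability when the service is up as long as at least one provider is up.

    Assumption: provider outages are statistically independent.
    """
    return 1.0 - (1.0 - availability) ** providers

if __name__ == "__main__":
    for label, a in [("three nines", 0.999), ("four nines", 0.9999)]:
        print(f"{label}: {downtime_hours_per_year(a):.2f} hours of downtime per year")
    two = combined_availability(0.999, 2)
    print(f"two independent three-nines providers: {two:.6f}")
```

Under these assumptions three nines works out to a bit under nine hours of downtime a year, and the only way the arithmetic gets much better is redundancy across providers, which is where the extra engineering cost comes in.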
What you're trying to get at is this: would you rather trust your infrastructure to a large organization whose core competency it is to do so, or would you rather manage it yourself? For many companies it makes more sense to have someone else manage it because of division of labor.
If you believe you're better suited to managing your own hardware for cost or capability reasons, you should. But of the arguments in favor of that decision, pointing out that "you cannot do anything" when GCP/AWS/Azure has downtime is a pretty poor one. It's an exceptional circumstance if you're 1) able to achieve better uptime than a cloud provider, 2) at nearly the same cost (in personnel, hardware and software), and 3) while being relatively unaffected by the downtime of major cloud providers anyway.
The companies for which the calculus shifts in favor of managing their own hardware probably don't need to be told "the cloud is just someone else's computer." In contrast, most companies using a cloud provider do not have a readily available alternative because they do not have in-house talent capable of maintaining baremetal hardware (local or colocated).
I consider myself personally capable of maintaining a baremetal distributed system with high availability, because I presently do that. But for the most part I wouldn't encourage companies using a cloud provider to invest in their own infrastructure. It's usually expensive in personnel, time or both.
[0] http://www.digitalattackmap.com/#anim=1&color=0&country=ALL&...
I noticed problems with Reddit earlier, too.
>Edit: Never mind, I see what you mean (on the map). I'd be interested to know too... maybe PL is a big player in their attack monitoring?
I have a static website at https://alexandreviau.net/. It sits behind AWS CloudFront. Good luck taking it down.
I'm Turkish and have been watching the news, but I don't see any reason why someone would correlate large websites being down with Turkey. And with no explanation, either.
Can you elaborate please? This is an honest question and I would like to know if my government is hacking foreign sites in retaliation for sanctions.
They just need to hit you over a longer time scale and avoid making obvious peaks so that you can’t ask for the DDoS refund.
They will change careers after being forced to work on weekends and holidays a few times. Incidentally, today is a Sunday AND the most taken holiday of the year.
The ones who are the single "random DevOps dude" at a small company trying to emulate AWS and Google do have zero balance.
Why would you do anything else for your sysop/sysadmin?
You need at least 10 sysops/sysadmins to achieve anything close to that SLA with a sustainable rota. That is contrary to the parent posters, who believe it can be done with THE right guy.
Hiring some dude won't give you that.
[1] https://techcrunch.com/2017/09/15/why-dropbox-decided-to-dro...
Please wait until Monday 20th to get a status on the issue. Thank you.
You can't look at the status page and believe what it says, so you go and ask people anyway (on irc, reddit, hnews, whatever community you like). Meaning that page might as well not have existed.
Initially the status page worked. But as more and more people subscribed to it, issuing an alert became a bigger deal.
And unfortunately an alert couldn't be sent only to those it was relevant for.
All this led to was the status page not being updated, which made it a useless tool for determining whether an issue was occurring.
Back to Twitter...
I feel the product needs a lot of work in practice, and possibly in implementation and training.
It's insane really; a company puts out a status page to say to their customers "you can trust and rely on us through that dedicated medium to know our status", and if the customers in question buy into the proposition and use it, the very first thing that company does is make it so you cannot trust and rely on them through that dedicated medium. Succeeding is what causes it to ultimately fail.
Status pages should have stayed an undocumented feature for "the little guys" behind the scenes to communicate, and never gotten out into the open world where PR and marketing and decision makers can roam.
I set up mine to automatically monitor my website from another service provider in a different datacenter. That way I know if the server is down for any reason and it updates automatically.
If my server goes down, within 5 minutes the status page is red. End of story.
If it's back up, the status page goes green again.
Manual status pages are a mistake.
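For anyone wondering what an automated setup like that might look like, here is a minimal Python sketch. It is not the poster's actual setup: the names (`site_is_up`, `StatusPage`, `monitor`), the 60-second interval, the three-failure threshold and the example URL are all assumptions for illustration. The idea is to run it on a machine at a different provider than the site being watched.

```python
import time
import urllib.request

def site_is_up(url: str, timeout: float = 10.0) -> bool:
    """Return True if the URL answers with a 2xx/3xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (OSError, ValueError):
        # OSError covers URLError, HTTPError (4xx/5xx) and socket timeouts;
        # ValueError covers malformed URLs.
        return False

class StatusPage:
    """Tracks the color to show: red after N consecutive failed checks."""

    def __init__(self, failures_before_red: int = 3) -> None:
        self.failures_before_red = failures_before_red
        self.consecutive_failures = 0
        self.color = "green"

    def record(self, up: bool) -> str:
        if up:
            # Any successful check flips the page back to green.
            self.consecutive_failures = 0
            self.color = "green"
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failures_before_red:
                self.color = "red"
        return self.color

def monitor(url: str, interval_seconds: float = 60.0) -> None:
    """Poll forever; with the defaults the page turns red after about 3 minutes."""
    page = StatusPage()
    while True:
        color = page.record(site_is_up(url))
        print(f"{time.strftime('%H:%M:%S')} {url} -> {color}")
        time.sleep(interval_seconds)

# Hypothetical usage: monitor("https://example.com/")
```

Requiring a few consecutive failures avoids flapping on a single dropped request while still staying inside the 5-minute window; publishing the color to an actual page is left out here.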
And keep in mind, a 20-minute response doesn't mean you fix the problem in 20 minutes, it means you respond to the callout within 20 minutes.
I think you're a victim of an easy-going startup culture.
I’ve been on call for escalations for like 15 years. That’s miserable enough; IMO frontline guys need fixed schedules and rotation if the volume is high.
I definitely cast my perspective on this and apologize if I came on too strong,