Fastly Outage (fastly.com)
I always have a small stash of my favorites saved locally in case of internet outage or I’m caught in a situation where I don’t have internet but need a few minutes.
On top of that I’ve been really trying to rely less on an app. So I throw a lightly guided or unguided session in every couple days at least where I focus on going solo so I don’t need an app and just need a timer.
I don't understand this.
Teach me your ways, master! /s
Jokes aside, people can do whatever they please. Reddit has a bunch of niche communities around many hobbies and fun things. No need to be bitter about it.
I thought CDNs had fallbacks configured?
What kind of things do you put in place to manage these kinds of centralised issues that are beyond your control?
Is fixed
Edit: nope, just worked for 2-3 requests (10 secs)
Obligatory LOL ...
Edit: Elsewhere in the comments: https://status.fastly.com/incidents/vpk0ssybt3bj
The issue has been identified and a fix is being implemented. Posted 1 minute ago. Jun 08, 2021 - 10:44 UTC
The issue has been identified and a fix has been applied. Customers may experience increased origin load as global services return.
Let's see
That time to find the issue is always the stressful part. < 1 hour is pretty good for weird stuff, and fortunately the east coast of the US is barely online this early (sorry Europe!).
Presumably the BBC has some kind of fallback in place.
The journalists ought to interview their own techies :)
I thought that one of the principles behind the Internet is to be able to reroute around failures, but neither these service providers nor their clients ever seem to learn.
I guess in their mind that only applies to packet routing not services. SMH
It seems like a pattern that CDNs have overly centralized the web and led to issues like this.
Maybe it's time to build a CDN that distributes your static assets to multiple CDNs and has a set of fallback states for service outages.
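A rough sketch of what that fallback could look like, assuming a hypothetical list of CDN base URLs that all serve the same assets and a caller-supplied health probe. The hostnames and the function names `pick_cdn` and `asset_url` are made up for illustration and don't come from any real product.

```python
from typing import Callable, Iterable, Optional


def pick_cdn(base_urls: Iterable[str], is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Return the first base URL whose probe succeeds, or None if all are down."""
    for base in base_urls:
        try:
            if is_healthy(base):
                return base
        except Exception:
            # A probe that raises (timeout, DNS failure) counts as unhealthy.
            continue
    return None


def asset_url(path: str, base_urls: Iterable[str],
              is_healthy: Callable[[str], bool], origin: str) -> str:
    """Build an asset URL on the first healthy CDN, falling back to the origin."""
    base = pick_cdn(base_urls, is_healthy)
    return f"{(base or origin).rstrip('/')}/{path.lstrip('/')}"


if __name__ == "__main__":
    cdns = ["https://cdn-a.example.com", "https://cdn-b.example.com"]
    down = {"https://cdn-a.example.com"}  # pretend the primary is returning 503s
    print(asset_url("/static/app.css", cdns, lambda b: b not in down,
                    "https://origin.example.com"))
```

Falling all the way back to the origin only helps if the origin can take the load, which is exactly the catch other commenters raise.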
Fastly itself has its status page up as well: https://status.fastly.com/
A government should not rely on CDNs like that. In fact government websites should not have any traffic going over third parties. When I want to use/view a government website, I should not be subjected to sharing any data with unwanted third parties and the government should not be affected, when some private company makes mistakes or has outages. It is an unacceptable situation.
They can set up their own state-owned CDN, using the same underlying technology. Compared to where they spend all that tax money, some servers and some engineers would be a very cheap investment, in relation to the independence achieved.
Fastly error: unknown domain: www.reddit.com.
$ nslookup images-eu.ssl-images-amazon.com
Server: 127.0.0.53
Address: 127.0.0.53#53
Non-authoritative answer:
images-eu.ssl-images-amazon.com canonical name = m.media-amazon.com.
m.media-amazon.com canonical name = media.amazon.map.fastly.net.
Name: media.amazon.map.fastly.net
Address: 199.232.177.16
Name: media.amazon.map.fastly.net
Address: 2a04:4e42:1d::272
Reddit, BBC News, Twitch.tv, Twitter emoji CDN(?) are all down: 503 service error
Stack Overflow, The Guardian, Gov.uk too as some other biggish names getting hit.
Anyone know if there is any legitimacy to this?
[1] status.fastly.com
Stackoverflow.com, Reddit, Quora down. (and probably more, those are the ones I tested)
I did that moments ago, and I regret it.
So many companies sweep this sort of thing under the rug if it’s only customer data that’s been breached. If they can’t sweep, they have a high-priced PR agency do the communicating.
I do not trust companies who handle things this way.
connection failure
Not sure if that provides anyone here with more insight into what might have caused this!
Edit: and now "I/O error" on Reddit.
I was assuming there are a couple of services like Fastly, and that companies might have architected with the alternatives in mind too, I guess.
It should be planned for, especially by major tech organizations like reddit, or Amazon, etc.
But I won't fault news organizations, who already don't have boatloads of money, for not having failover CDNs.
Let's use a handful of providers for everything, they said. It will be cheaper, they said. It will be easier to manage, they said.
And it was cheaper, until downtimes began to affect more and more sites when central SPOFs got hit.
And I wonder how much of that need for these centralized SPOFs actually comes from the sheer absurd amount of bloat, ads, code and assets that sites these days "have" to deliver to the customer. I 'member times when pages had 100kb total size, loaded in an instant and were perfectly usable.
What is fastly? Why are a huge number of web sites dependent on them? They are some kind of web host for companies that don’t want to run their own servers/data centers?
Basically the closer the server serving the webpage is to the end user the faster it is for the end user to see and interact with.
But running servers all over the world 1) isn't efficient 2) costs a lot of money.
So a few companies (Fastly, Cloudflare, Akamai) figured: hey, why don't we build a bunch of small data centers all over the world and then provide a distributed way to serve web traffic from them?
It originally was brought about for services like Netflix, but has expanded greatly.
You still host your servers, but a copy of the webpage/media is given to the CDN to serve to customers.
Wouldn’t you build in a failsafe that bypasses Fastly and sends traffic to your own servers in the case of this kind of outage? Or outages are so rare that it’s not worth the trouble?
They literally have their own directly competing CDN product. You'd think they'd be dogfooding it.
Alternatively you could use DNS to fail over to the content you host, instead of another CDN. But in many cases that would be the same as an outage since the CDN exists to reduce the impact of all those requests on your infra
EDIT: Hexdocs is down, elixir-lang.org is down
In fact, you can probably remember most of them if you were given dates.
Plus, going around the CDN can be very complex (depending on the type of content), very expensive (all of a sudden you have a massive data out network traffic that didn't exist previously), and not guaranteed to work (DNS updates can take longer to get to everyone than the actual CDN outage lasts).
There are places where it is worth it and useful, but for a lot of the sites listed it's not useful.
This. I can't remember the last Fastly outage of this dimension, so the time spent on setting up a secondary server serving your assets is probably not really worth it for small-medium companies. Although I'd think otherwise for a company like Shopify.
Edit:
Fastly's incident report status page: https://status.fastly.com/incidents/vpk0ssybt3bj
Fastly Engineer 2: I have some very bad news...
With Reddit however, these days almost all comments are locked behind “view entire discussion” or “continue this thread”. In fact, just now I searched for something for which the most relevant discussion was on Reddit; Reddit was down so I opened the cached version, and was literally greeted by five “continue this thread”s and nothing else. What a joke.
Fastly error: unknown domain: www.fastly.com.
Details: cache-syd10161-SYD
Error 503 Service Unavailable
Service Unavailable
Guru *Mediation*:
Details: cache-lon4236-LON 1623146049 854282175
Varnish cache server
Maybe a good way to work out which versions are being used.
Fastly error: unknown domain: numpy.org.
Details: cache-pdk17841-PDK
edit: 12:05 up again for me, no images or custom fonts loading though ... and down again 1 minute later
edit: 13:01 reliably up again for me
So it is a "performance" issue when all pages give a 503.
https://www.streamingmediablog.com/2020/05/fastly-amazon-hom...
> Statuspage Automation updated third-party component Spreedly Core from Operational to Major Outage.
> Statuspage Automation updated third-party component Filestack API from Operational to Degraded Performance.
Oh, right. :-D
Don't get me wrong, I love the proliferation of APIs and easily-integrated services over the past 20 years. We're all one interdependent family, for better and for worse.
edit: PayPal looks to be back up, at least in US East, but when I turn off my VPN and access from Asia I get "Fastly error: unknown domain: www.paypal.com."
Now I'm seeing a 503
Looks to be working again my end.
"A number of leading media websites are currently not working, including the Guardian, Financial Times, Independent and the New York Times."
:(
Edit: There seems to be a major empathy outage in this thread. Disgusted but not surprised, unfortunately.
The whole idea of the internet was a distributed network impervious to most attacks.
The reality is that a single failure can knock out 90% of the services people use.
This is the page that should be linked:
As of 10:44UTC, this status page has just updated to say the issue has been identified and a fix is being implemented.
Gov.UK is supposed to be a bit like BBC 1 or Radio 1 – in a national emergency they can be taken over to disseminate critical information, like if there was a nuclear attack launched on the UK.
Guys, you are offline with a 503 error, this is a little more than "potential impact to performance".
"some users may experience degraded service" => site completely down for all locations
> CDN Performance Impact
Doesn't seem the status page is automatically updated or perhaps whatever event or polling is used is also broken.
How come we are affected by this in the Netherlands?
>North America (Ashburn (BWI), Ashburn (DCA), Ashburn (IAD)), Europe (Amsterdam (AMS)), and Asia/Pacific (Hong Kong (HKG), Tokyo (TYO), Singapore (QPG)).
It has now been updated to a pretty sizable list.
edit: And now it looks like it includes every location.
https://www.streamingmediablog.com/2020/05/fastly-amazon-hom... (CDN Fastly Wins Content Delivery Business For Amazon.com and IMDB Websites)
Quoting:
> "But with small object delivery, like images loading fast on Amazon’s home page, it’s the opposite. Customers will pay for a better level of performance and in this case, Fastly clearly outperformed Amazon’s own CDN CloudFront. This isn’t too surprising since CloudFront’s strength isn’t web performance, or even live streaming, but rather on-demand delivery of video and downloads."
dig +short www.amazon.com
tp.47cf2c8c9-frontier.amazon.com.
d3ag4hukkh62yn.cloudfront.net.
65.8.70.16
dig +short www.amazon.co.uk
tp.bfbdc3ca1-frontier.amazon.co.uk.
dmv2chczz9u6u.cloudfront.net.
13.224.0.89
dig +short www.amazon.in
tp.c95e7e602-frontier.amazon.in.
d1elgm1ww0d6wo.cloudfront.net.
13.224.9.30
dig +short www.amazon.co.jp
tp.4d5ad1d2b-frontier.amazon.co.jp.
www.amazon.co.jp.edgekey.net.
e15312.a.akamaiedge.net.
104.71.134.162
reddit, stackoverflow, github, paypal, pypi, twitter, twitch, NYT, CNN, BBC, the Guardian...
edit: wow, even Amazon.com relies on Fastly for some of its edge caches!
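If you want to repeat that check across a list of sites, here is a small sketch that classifies a CNAME chain (as printed by `dig +short`) by hostname suffix. The suffix table only covers the names that show up in the lookups pasted in this thread, and the function name `cdn_for_chain` is made up for this example. It does no DNS lookups itself.

```python
from typing import Iterable, Optional

# Suffixes taken from the lookups pasted in this thread; far from exhaustive.
CDN_SUFFIXES = {
    "fastly.net": "Fastly",
    "cloudfront.net": "CloudFront",
    "akamaiedge.net": "Akamai",
    "edgekey.net": "Akamai",
}


def cdn_for_chain(names: Iterable[str]) -> Optional[str]:
    """Return the CDN whose suffix first matches a name in the chain, else None."""
    for name in names:
        host = name.rstrip(".").lower()  # dig prints fully qualified names with a trailing dot
        for suffix, cdn in CDN_SUFFIXES.items():
            if host == suffix or host.endswith("." + suffix):
                return cdn
    return None


if __name__ == "__main__":
    print(cdn_for_chain(["m.media-amazon.com.", "media.amazon.map.fastly.net."]))
    print(cdn_for_chain(["tp.47cf2c8c9-frontier.amazon.com.", "d3ag4hukkh62yn.cloudfront.net."]))
```

Lines that are plain IP addresses in the `dig +short` output simply match nothing, so the whole output can be passed in as-is.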
“This basic architecture is 50 years old, and everyone is online,” Cerf noted in a video interview over Google Hangouts, with a mix of triumph and wonder in his voice. “And the thing is not collapsing.”
The Internet, born as a Pentagon project during the chillier years of the Cold War, has taken such a central role in 21st Century civilian society, culture and business that few pause any longer to appreciate its wonders — except perhaps, as in the past few weeks, when it becomes even more central to our lives.
How will they troubleshoot the error messages now?
dig bbc.co.uk
bbc.co.uk. 193 IN A 151.101.64.81
bbc.co.uk. 193 IN A 151.101.128.81
bbc.co.uk. 193 IN A 151.101.192.81
bbc.co.uk. 193 IN A 151.101.0.81
Good luck to the on call engineers!
The internet is designed for redundancy. Wonder why these companies don't have a failover network. Makes me wonder if cost is a factor considering their already massive infra. But a single point of failure ... <confused>.
Well, the Internet was indeed designed for redundancy, and it worked as intended. At no point in time did it fail to make you reach the server it was supposed to make you talk to.
What are failing are all the application protocols that are running on top of the network.
Seems like this is being resolved; curious to see the details afterwards
fetch http://dl-cdn.alpinelinux.org/alpine/v3.12/main/x86_64/APKIN...
fetch http://dl-cdn.alpinelinux.org/alpine/v3.12/community/x86_64/...
ERROR: http://dl-cdn.alpinelinux.org/alpine/v3.12/main: temporary error (try again later)
Or that companies need to have better DNS strategies.
Except if the HTML/CSS is hosted on that CDN?
At some point some of them will start to become popular.
Which is causing $15+ million in lost product sales for every hour of outage.
Not to mention the loss of any new customers.
A very significant number of people won't go back. It's why the most effective marketing campaign by far is retargeting those people to convince them to come back. Unfortunately that's not possible in this case since you can't track the users as the site is unusable.
This incident affects: North America (Ashburn (BWI), Ashburn (DCA), Ashburn (IAD)), Europe (Amsterdam (AMS)), and Asia/Pacific (Hong Kong (HKG), Tokyo (TYO), Singapore (QPG)).
This happened with Cloudflare before too. I think we are a little too dependent on these services.
/s
A decent number of tries are rejected right at the Varnish front door:
< HTTP/2 503
< server: Varnish
< retry-after: 0
< date: Tue, 08 Jun 2021 10:11:41 GMT
< x-varnish: 271470009
< via: 1.1 varnish
< fastly-debug-path: (D cache-bma1666-BMA 1623147101)
< fastly-debug-ttl: (M cache-bma1666-BMA - - -)
< content-length: 450
<
Service Unavailable
Guru Mediation:
Details: cache-bma1666-BMA 1623147101 271470009
Many more reach some backend system that just dumps "connection failure":
< HTTP/2 502
< content-type: text/plain; charset=utf-8
< content-length: 18
<
connection failure
And a tiny few do get through:
< HTTP/2 200
< content-type: text/html; charset=UTF-8
< cache-control: max-age=0, must-revalidate
< date: Tue, 08 Jun 2021 10:11:43 GMT
< via: 1.1 varnish
< vary: accept-encoding
< set-cookie: ...snip...
< server: snooserv
< content-length: 275036
<
<!doctype html><html>...snip...
It's still early days, but I'm hopeful that it can provide a real solution to today's CDN centralization.
Unless most nodes are high performance, I guess?
Personally I think a distributed database system, where entries are being made redundant in something like a blockchain+dht, would be a good start?
Decentralizing the internet works if it financially makes sense for platforms to build such tools.
I'm grateful for HN. I rebooted my computer. I thought it was my device and then saw this on my phone while rebooting.
Quite a few Fastly customers have more than vanilla requirements though, and may have a lot of business logic performed within the CDN itself. That Fastly is "just Varnish" and you can perform powerful traffic manipulation is one of its main selling points.
With a standard TCP/UDP session, it mostly just works or doesn't and you can get a proper traceroute to know what's up. With these fancy CDNs, there's a whole new can of worms to deal with and from a client's perspective you have no clue what's happening because it's all taking place in their private network space where we have no "looking glass".
Fuck the cloud, I want real Internet.
edit: My whole Twitter timeline is full of posts saying "Twitter outage? what outage?". Same on Reddit and Twitch chat, feels like for a short time I was invited into some exclusive circle lmao. StackOverflow and other StackExchange sites also work so I can look stuff up for you.
>At the core of Fastly is Varnish, an open source web accelerator that’s designed for high-performance content delivery. Varnish is the key to being able to accelerate dynamic content, APIs, and logic at the edge.
Just checked, thank god the NHS vaccine site is still available - vaccines just got rolled out for under 30s today.
I would blame anyone who claimed otherwise or couldn't deal with it while not having a fallback.
I think it's best to show a large amount of support and empathy for the individuals having a really bad day today, and how awful they may feel. Some will probably end up reading this thread (I know I would).
And of course, still hold Fastly the business accountable for their response (but objectively, once we understand what the root cause was, and the long term solution).
I get that you're implying that the job itself is not worth that much concern, but it seems you're ignoring that jobs bring in income, pay your mortgage, etc.
If I lost my job tomorrow I'd be terrified.
It sucks. Working on CDN reliability is like working on wastewater management: the public forgets you exist until something breaks, when they start asking why you weren't doing your job. Fortunately, internal people at least seem to get it -- I hope this is the same at Fastly.
People need to be blamed, and responsibility for actions taken (without covering asses)
I have no empathy for Fastly-the-company. I hate the fact that the Internet is centralized around CDNs. I wish this idea of 'but we _must_ run a CDN for our 1QPM blog!' would die in a fire. But I can still empathize with the Fastly engineers handling this shitstorm right now.
Do a post-mortem, work out root causes, work as a unit to ensure this doesn't happen again.
Obviously if there are levels of gross negligence or misconduct discovered during post-mortem, that will need to be dealt with accordingly, but coming into this with an attitude of "we must find someone to blame and incur repercussions" isn't healthy at all.
We are humans - don't forget that.
edit: forgot some words.
"An atmosphere of blame risks creating a culture in which incidents and issues are swept under the rug, leading to greater risk for the organization."
The best way (in a team), to tackle mistakes, is to ensure the process in place corrects these mistakes. The only way to do that, is a post-mortem/learning from the mistake. If you blame it on some engineer who did it, that guy will eventually be replaced by some other guy, who may make the same mistake.
And we, especially companies, typically only learn if there is something at stake. Stock-price, a job, customers, liability etc.
(Call me old fashioned, but what I learned from it, having no stake in the game, is we are truly demolishing the resilient, decentralised nature of the internet; or already have done so)
Post-mortems make far more interesting submissions IMO, but I suppose people up-vote 'yes down for me too'.
A good leader will take the hit (and the repercussions) for their underlings, compensate customers where compensation can make it better (and offer to make it easy to use fallbacks if this happens again) -- and internally fix the problem so it can't happen again, without throwing anyone to the dogs.
What I think this syntactically invalid sentence is trying to say is:
People need to be blamed, and held responsible for actions taken.
Why do people need to be blamed? Why do we need to make someone the scapegoat? What does being held responsible look like?
Let's say we find some sacrificial engineer to pin this on:
* does the downtime magically disappear?
* does the engineer suffering (say losing his job or whatever) make your downtime meaningful? You'll recoup your revenue somehow from it?
* does the fact that there's a scapegoat mean that everyone else at fastly is perfect and it's ok to keep using them?
Empathy and responsibility are not mutually exclusive.
This. When people talk about "HugOps", "empathy" and all that when a worldwide incident affecting a huge amount of time critical customers (e.g. trading, hft, cargo, food delivery, etc.) is happening for an hour, it has catastrophic consequences.
I hope the engineers also understand the other side and why we are paying huge sums of cash for their service.
Flag and downvote all you want, you know this is true.
The fault is theirs, and they have said that they have failover. This worldwide outage, caused by them, just goes to show you that Fastly does not actually have a failover system in place.
> "Fastly’s network has built-in redundancies and automatic failover routing to ensure optimal performance and uptime." - status.fastly.com
Even their status page was down. Very embarrassing. Fastly did not work as advertised and misled its customers.
Edit: Offended flaggers circling around silencing misled Fastly customers. How pathetic.
I don’t know Fastly at all, but in my experience there’s no such thing as a foolproof failover system that covers all possible scenarios.
What's your SLA with them?
Just assuming things will always work because the marketing copy said so is a recipe for disaster. It's hoping that things never go wrong, and when they inevitably do, being caught with your pants down.
Everything fails sometimes. You must know how much your SaaS provider contractually promises, ensure that any SLA breach is something financially acceptable for you, and ensure that you can handle failure time within SLA.
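To make the SLA point concrete, here is a small arithmetic sketch of how much downtime a given availability percentage still permits. The percentages are illustrative only, not Fastly's actual contractual terms, and `allowed_downtime_minutes` is a name made up for this example.

```python
def allowed_downtime_minutes(availability_pct: float, period_days: float = 30.0) -> float:
    """Minutes of downtime per period that still meet the given availability percentage."""
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability_pct must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1.0 - availability_pct / 100.0)


if __name__ == "__main__":
    # Illustrative tiers; check your own contract for the real number and period.
    for pct in (99.0, 99.9, 99.99):
        print(f"{pct}% over 30 days -> {allowed_downtime_minutes(pct):.1f} min")
```

At 99.9% over a 30-day month the budget is roughly 43 minutes, so an outage of about an hour would already exceed it, while a 99% commitment would absorb it several times over.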
-- https://twitter.com/tveastman/status/1069674780826071040
:-(
ps. "The Internet was built to survive attacks" is not true. It's a myth made popular by Robert Cringely in the early 1990s. The Arpanet was simply a protocol for mainframes used by computer scientists to connect. The Internet is relatively resilient against attacks, but that was not the "whole idea". It was not in the design at all.
Bob Taylor: “In February of 1966 I initiated the ARPAnet project. I was Director of ARPA's Information Processing Techniques Office (IPTO) from late '65 to late '69. There were only two people involved in the decision to launch the ARPAnet: my boss, the Director of ARPA Charles Herzfeld, and me. The creation of the ARPAnet was not motivated by considerations of war. The ARPAnet was created to enable folks with common interests to connect with one another through interactive computing even when widely separated by geography”.
Vint Cerf says the same about the invention of the TCP/IP transport protocol.
Even email has a method baked into the protocol for handling failure.
Fallbacks are good, baking in resiliency is better.
A lot of dynamic sites use Fastly for its programmatic edge control and a near-immediate (~1s-4s, typically around 2s) global cache invalidation for any tagged objects with a single call to the tag. That feature alone simplifies backend logic significantly. Making this feature portable to CDNs that do not support it, and provide only regular cache invalidation, requires a complicated workflow setup which significantly increases the cache-bust time, which in turn removes all the advantages of the "treat dynamic content as static and cache-bust on write" approach.
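For readers who haven't used tag-based invalidation, here is a toy in-memory model of the idea: every cached object carries tags, and one purge call on a tag drops everything carrying it. This is only a sketch of the concept. The class `TaggedCache` and its methods are hypothetical and are not Fastly's API or anyone's production design.

```python
class TaggedCache:
    """Toy cache where each entry carries tags and a tag purge drops all tagged entries."""

    def __init__(self):
        self._values = {}       # key -> cached value
        self._tags_by_key = {}  # key -> set of tags on that entry
        self._keys_by_tag = {}  # tag -> set of keys carrying that tag

    def put(self, key, value, tags=()):
        self.evict(key)  # replace any previous entry and its tag links
        self._values[key] = value
        self._tags_by_key[key] = set(tags)
        for tag in tags:
            self._keys_by_tag.setdefault(tag, set()).add(key)

    def get(self, key):
        return self._values.get(key)

    def evict(self, key):
        for tag in self._tags_by_key.pop(key, ()):
            keys = self._keys_by_tag.get(tag)
            if keys is not None:
                keys.discard(key)
        self._values.pop(key, None)

    def purge_tag(self, tag):
        """Drop every entry tagged with `tag`; return how many were removed."""
        keys = list(self._keys_by_tag.pop(tag, ()))
        for key in keys:
            self.evict(key)
        return len(keys)


if __name__ == "__main__":
    cache = TaggedCache()
    cache.put("/user/1", "<profile>", tags=["user:1"])
    cache.put("/user/1/posts", "<posts>", tags=["user:1", "posts"])
    # On write to user 1, a single call invalidates every page built from that user.
    print(cache.purge_tag("user:1"))
```

The backend only has to know which tag changed, not which URLs were rendered from it, which is the simplification the comment above describes.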
Isn't a CDN fundamentally all about files too?
> Decentralized/distributed generally has slower network performance. Unless most nodes are high performance, I guess?
There is definitely more work to do here before this is really useful, but it's well within the realm of things that IPFS should be able to do at reasonable performance for production sites in future. Good performance still requires a serious CDN node network similar to traditional CDNs today (to seed your content for day to day use) but with IPFS if that CDN goes down then existing users on your site can _also_ serve the site to other nearby users directly, or other CDNs can serve your site too, etc etc. Your DNS wouldn't be linked to any specific CDN in any way, just to the hash of the content itself, so anybody could serve it.
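A minimal sketch of why "linked to the hash of the content" means anybody can serve it: the client knows the expected hash up front, so it can accept bytes from any source and simply reject whatever doesn't match. This uses a plain hex SHA-256 as the identifier, which is a simplification (real IPFS content identifiers use a different encoding), and `content_id` and `fetch_verified` are names made up for this example.

```python
import hashlib
from typing import Callable, Iterable


def content_id(data: bytes) -> str:
    """Identify content by the hex SHA-256 of its bytes (simplified stand-in for a CID)."""
    return hashlib.sha256(data).hexdigest()


def fetch_verified(expected_id: str, sources: Iterable[Callable[[], bytes]]) -> bytes:
    """Try each source in turn; return the first bytes whose hash matches expected_id."""
    for fetch in sources:
        try:
            data = fetch()
        except Exception:
            continue  # this source is down, try the next one
        if content_id(data) == expected_id:
            return data
        # Hash mismatch: the source served something else, so ignore it.
    raise LookupError("no source returned content matching the expected hash")


if __name__ == "__main__":
    page = b"<html>hello</html>"
    wanted = content_id(page)

    def cdn_down() -> bytes:
        raise ConnectionError("503 Service Unavailable")

    # The CDN is down, but a nearby peer holding the same bytes works just as well.
    print(fetch_verified(wanted, [cdn_down, lambda: page]) == page)
```

Because verification happens on the client, it doesn't matter whether the bytes came from a big CDN or from another visitor's machine.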
> Decentralizing the internet works if it financially makes sense for platforms to build such tools.
There's a platform company called Fleek who already do this today: https://fleek.co/hosting/ (no affiliation, and I've never even used the product, just looks cool). Seems to be designed as a Netlify competitor: push code with git and it builds it into static content and then deploys to IPFS.
The benefits don't exist today of course, because no browsers natively support IPFS, so most users can only access the content via an IPFS gateway, which means you're back to fully centralized server infrastructure again... If we can get IPFS support into browsers though then fully decentralized CDN infrastructure for the web is totally possible.
It is open-source software that allows you to keep and read offline static versions of websites in a specialized archive format (zim-files).
It was originally designed to allow you to read wikipedia offline, but there are also dumps of stackoverflow available on the relevant page : https://wiki.kiwix.org/wiki/Content_in_all_languages
int main() {
    int arr[100][200][100]; // allocate on the stack
    return 0;
}
According to the status page.
Fastly error: unknown domain: dashboard.heroku.com.
What a joke!
So they didn't need what they were about to purchase and saved their money. Doesn't sound like a net loss to me.
I was talking about the economy in general, not specific e-commerce sites. People that actually need what they were looking for but don't go back will buy it elsewhere. The money still flows, just somewhere else. And if they don't need the item(s), they'll perhaps use the money for something more useful.
EDIT: Most sites seem fixed now here in Canada. Tested stackoverflow, reddit, GitHub, PayPal, gov.UK and all worked fine.
In the case of the software Microsoft uses, it monitors endpoints for the websites in question and then changes which IP(s) are returned based on the availability of those endpoints, the geographic region and other factors.
Priority for A records would be a nice feature.
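A sketch of what health-aware answers with priorities might look like on the resolver side, assuming each endpoint has a priority (lower is preferred, as with MX records) and a health flag set by some external monitor. The `Endpoint` type and `answer` function are hypothetical, and the IPs are from the documentation ranges.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Endpoint:
    ip: str
    priority: int  # lower value = preferred, like MX preference
    healthy: bool  # assumed to be maintained by an external health monitor


def answer(endpoints: Iterable[Endpoint]) -> List[str]:
    """Return the IPs of the healthy endpoints in the most-preferred priority tier."""
    alive = [e for e in endpoints if e.healthy]
    if not alive:
        return []  # nothing healthy; the caller decides what to serve instead
    best = min(e.priority for e in alive)
    return [e.ip for e in alive if e.priority == best]


if __name__ == "__main__":
    pool = [
        Endpoint("192.0.2.10", priority=10, healthy=False),   # primary CDN, down
        Endpoint("192.0.2.11", priority=10, healthy=False),
        Endpoint("198.51.100.20", priority=20, healthy=True),  # secondary CDN
        Endpoint("203.0.113.30", priority=30, healthy=True),   # origin, last resort
    ]
    print(answer(pool))
```

As noted elsewhere in the thread, resolver caching still limits how quickly clients see the change, so short TTLs matter for this to be useful.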
Can always improve the process for the next outage.
It’s a commentary on work / life balance and the all-too-common phenomenon of employees sacrificing for a company (in this case, feeling such personal stress that they would lose sleep) and contrasting it with the fact that most employers will fire you without a second thought if it’s what’s best for the business (they won’t lose any sleep).
It’s a critique of the asymmetry that often exists and is frequently exploited by companies. This is often seen in statements like, “we are one big family so put in a few more hours for this launch” coupled with announcements like, “profit projections didn’t meet expectations so we are downsizing 5% of the work force.” You are family when they need you to work hard, and an expendable free market agent when your continued employment might risk hitting the quarterly goal.
It is, of course, reasonable to lose sleep if you think your employment is in jeopardy. Very few companies, especially in the competitive SV market are firing engineers because of a single outage, even a bad one, because you just paid a bunch of money to train those engineers how to see this coming and fix it.
I think Varnish uses mediation intentionally though, it was this way 7 years ago when I last used Varnish.
https://en.wikipedia.org/wiki/Guru_Meditation
Or did one of you already edit the Wikipedia page to reflect this discussion on hn?
Probably going to short the hell out of $FSLY.
They proudly stated this from their own website to their customers:
> "Fastly’s network has built-in redundancies and automatic failover routing to ensure optimal performance and uptime."
If that isn't one huge lie, I don't know what is.
I expect huge clients to be knocking on Fastly's door lining up for answers because of this.
We do not have a system that adjusts to "oops"
Engineers are paid because their companies have customers. It is pure madness that #hugops is a thing. I sincerely hope that Fastly's customers whack it $$-wise so hard that it actually affects #hugops engineering culture.
At least HFT traders don't get paid to spy on their own customers with trackers littered everywhere, I find that very unethical that engineers get paid to even do that sort of thing, and every damn website has these trackers because engineers put them there.
> They're both pretty privileged jobs and HFT is not known for having tons of benefits to society
So HFT firms don't have their own foundations and grants to give to charities and organisations then?
I proposed and led our multi-CDN project at Pinterest for both static and dynamic content and I can tell you, many many times over, it has been well worth the effort. Everybody should do this, if only for contract negotiating leverage.
Cache invalidation is fast enough on all CDNs now for most use cases (yes, including Akamai). But realistically, most sites (Pinterest included) are not using clever cache invalidation for dynamic content because it’s not worth the integration effort (and it’s very difficult to abstract for large 1k+ engineering teams). Most customers are just using DSAs for the L4/L5 benefits (both security and perf). In that case, it’s not complicated to implement multi-cdn.
The basic design of BGP is very vulnerable to malicious attacks. Email security is nonexistent.
Many here have been responsible for web service outages albeit on much smaller scales, and in my experience it feels awful while it's happening but you quickly forget about it because so does everyone else.
> you quickly forget about it because so does everyone else
This is definitely not the case here, and the experiences are bound to be very different.
But I think our disagreement mainly stems from how we interpreted the parent comment. I thought it was very two-sided: on the one hand claiming to show support, on the other hand emphasizing how big a catastrophe this was.
I just wanted to say that I think it most likely was a completely natural mistake, only exacerbated by the scale of the company, and that while you should take some action to prevent it in the future, you should not spend so much time dwelling on it. Shit happens, it's fine.
> Notices will be posted here when we re-route traffic, upgrade hardware, or in the extremely rare case our network isn’t serving traffic. - status.fastly.com
The extremely rare case happened for an hour, which is a very long time in internet time.
- ignoring warnings
- acting against known-to-them best practices
- repeating a previous mistake
But, again, these are just indicators, not a checklist.
Interestingly, any of these can happen also due to stress, burnout and generally broken company/team culture. Including something like a CYA culture where if they don't do something fast, they will be blamed for it, and thus they need to move fast and break things.
.. but of course XKCD is down too.
1: https://blog.emojipedia.org/content/images/2018/04/microsoft...
https://en.wikipedia.org/w/index.php?title=Guru_Meditation&d...
And it seems to be incorrect, since this "spelling variation" is only used by Fastly and not part of Varnish?...
It's normal to have downtimes but they are usually scheduled and quick (think <10 minutes per month for rebooting and/or hardware parts replacement). I'm pretty sure most non-profit hosts like disroot.org or globenet.org have similar or better 9's than all these fancy cloud services.
nslookup m.media-amazon.com
Name: media.amazon.map.fastly.net
It is very interesting that they are not using CloudFront!
Amazon is also known to use Akamai. Sure, Amazon relies heavily on AWS, but why should it surprise anyone that a retail website obsessed with instant loading of pages decides to use non-AWS CDNs if the performance is better.
Even if CloudFront became the default, I'm certain amazon.com would keep contracts with fastly and akamai just so they can weight traffic away from CloudFront in an outage.
People must be held accountable to have good incentives to reduce such outages in the future.
I do agree though that we should always be compassionate and realistic with other humans.
How do you make sure that mistakes don't happen, then? Do you blame and fire people who make mistakes, and hope that the next person put in the same spot doesn't make a mistake? Or do you figure out what caused that person to make the mistake and ensure there are processes in place so that next time this is less likely to happen?
Extrinsic motivators like 'we will give you a bonus' or 'we will fire you' are surprisingly bad at getting people to not fuck things up.
Maybe it's a cultural thing. I hear about a lot of firing in the US. I am from Europe.
v2. "The issue was caused by a previously unidentified pathway that caused a feedback loop and overloaded our servers in a cascading fashion (or whatever). We have implemented a fix for this and updated our testing and deployment processes to stop similar cascades."
Which solves the problem long term?
As an architect making product choices, v2 wins every time.
(With the caveat that if the cause was something that reveals a fundamental problem with the larger processes/professionalism/culture of the company, especially to do with security concerns, then I'm not buying that product, and I'm migrating away if we already use it.)
Otherwise you develop internal process that's entirely scar tissue, and only stops your teams doing their jobs.
I am criticizing myself all the time for stuff. No hurt feelings there.
Holding specific people "accountable" for outages doesn't incentivize reducing outages; it incentivizes not getting caught for having caused the outage.
As a result, post-mortems turn into finger-pointing games instead of finding and resolving the root cause of the issue, which costs the company more money in the long run when a political scapegoat is found but the actual bug in the code is not.
I feel like this requires some nuance.
Don't blame an IC for introducing a bug or misconfiguration that led to the outage.
Do consider blaming (and firing!) management if, during the postmortem, it turns out that it was in the way of fixing systemic problems.
Ultimately, rule #1 should be: don't blame somebody unless malice or gross negligence is proven. Rule #2 should be the assumption that ICs will not have done either. Rule #3 is that sometimes, individual responsibility is required.
For example email - the other big "internet-user" is technically not part of the WWW, but most (? I don't have any stats, just a guess) of our mailclients run on the WWW, nonetheless.
There are roads (or shall I say tubes?). There are cars and busses on the road. Over time, almost everyone has migrated to just a few bus companies. One of them suffers a complete collapse for a few hours. Yes, this means chaos when it comes to transporting people. But the roads are just fine.
This doesn't mean that the situation is fine and that people aren't affected. But it would be entirely different if the roads had been washed away or something.
I must admit, it has been strange seeing my US peers getting the vaccine months before I can in the UK, but I guess I take comfort knowing that both countries are still doing pretty well!
https://ig.ft.com/coronavirus-vaccine-tracker for reference.
What's important is to share vaccines with all nations, and non-nations.
It could be a typo or an attempt to be clever.
> or in the extremely rare case our network isn’t serving traffic.
Reports also came in that this was a service configuration[1] issue, so not only was there no failover system, there wasn't even any validation automation in place that could have prevented this.
[0] https://status.fastly.com [1] https://twitter.com/fastly/status/1402221348659814411
I'm not sure what the native clients for Netflix and Spotify actually run, but I use their WWW clients mostly. Making most of my internet bits&bytes go over the WWW.
The scar tissue: this is where good choices come in because it's certainly not a rule that a change as a result of an incident review is an impediment to work. These definitely occur, and sometimes linger after the root cause is phased out. But best practices often reduce cognitive & process overheads.
A rough example is that there are still people out there FTPing code to servers, having to manually select which files from a directory to upload. Replacing this error prone process with a deployment pipeline leads to a massive reduction in the likelihood of errors and will actually speed up the deployment process. It's all about making the right choices, not knee-jerk protections, and sometimes the choice is to leave things as they are.
Depends whose security. I value my security dearly and that's why i use the Tor Browser. Cloudflare has decided i cannot browse any of their websites if i care about my security (they filter out tor users and archiving bots aggressively) so i'm not using any cloudflare-powered website. Is it good for security that we prevent people from using security-oriented tooling, and let a single multinational corporation decide who gets to enter a website or not? In my book creating a SPOF is already bad practice, but having them filter out entrances is even worse.
Also, are all of these CDNs and other cloud providers solving the right problems?
If you want your service to be resilient against DDOS attacks, you don't need such huge infrastructure. I've seen WP site operators move to Cloudflare because they had no caching in place, let alone a static site.
If you want better connectivity in remote places where our optic fiber overlords haven't invested yet, P2P technology has much better guarantees than a CDN (content-addressing, no SPOF). IPFS/dat/Freenet/Bittorrent... even multicast can be used for spreading content far and wide.
Why do sysadmins want/use CDNs? Can't we find better solutions? Solutions that are more respectful to spiders and privacy-minding folks with NoScript and/or Tor Browser?
https://news.ycombinator.com/item?id=13718752
Only discovered, we should not forget, thanks to the good graces of Google Project Zero.
A certain dose of skepticism towards any technical offer out there would be advised.
Fastly's free offering gives you "$50 worth of traffic" whereas Cloudflare has a perpetually free option. And for Akamai you have to apply for a free trial.
Sorry what?
You've just witnessed almost the entire internet break because of a catastrophic cascading outage that affected lots of huge companies, since third party services used and trusted Fastly.
Shopify stores couldn't accept payments on their websites, Coinbase Retail/Pro transactions and trading apps failed to load, and delivery apps stopped loading all of a sudden. These are just a few of the problems this outage has caused, and now you are trying to pin this on me for not checking their SLA when millions were indirectly affected by this?
Fastly offered a product, their main product which is a CDN which took down lots of websites. I don't care if everything fails sometimes. There are sites that should NOT go down because of this configuration issue which they messed up.
You can say you don't care for reality, but it's not going to help you have better systems.
> There are sites that should NOT go down
Then they surely either engineered their system to not 100% rely on Fastly or negotiated appropriate terms with Fastly (Or decided Fastly going down was an acceptable business risk, which it is for nearly everybody). Everything else would be negligent, and surely nobody would be negligent when operating a site that "should NOT go down"?
Nowhere in my sentence did I say this, so quit the strawman argument.
I know a client using a service that has had 100% uptime for the year and that also serves huge clients. I don't understand why Fastly can't guarantee at the very least a failover system to counteract this; clearly it didn't work (or didn't even exist).
> (Or decided Fastly going down was an acceptable business risk, which it is for nearly everybody).
Then why did this cascade to almost everybody, even indirectly? Surely their advertised failover system should have kept this from dragging on, yet it lasted longer than it should have.
I don't think a store, exchange or trading desk not accepting payments from people for an hour is acceptable at all.
Blame the companies that relied on Fastly being up 100% of the time, even though Fastly explicitly states that they might be down any number of hours, and they will even give you money back for that [1]. If they did offer a 100% SLA, it would probably be out of budget for most users, as those kinds of systems are prohibitively expensive to run.
Depending on a single CDN like Fastly is building an SPOF into your product. It is no less of a design blunder than whatever Fastly did internally to have an outage. If Shopify lost millions because of a short, simple third-party outage, they have at least as much of a high-priority postmortem to write and issues to address as Fastly.
[1] - https://docs.fastly.com/products/service-availability-sla
Why didn't this trigger? Where was this system that was supposed to prevent further cascading failures?
> Blame the companies that relied on Fastly
So it's everybody's fault Fastly went down now? That is a new one.
We understand you're upset and passionate about this, perhaps now when more information has been published you understand better the circumstances that caused this problem.
> pReTtY CeRtAiN
This. The wording in and of itself shows you have absolutely no clue whatsoever about Netflix's culture.
Twitter is a medium between people. Removing emoji representation differences on user devices is a way to hopefully reduce misunderstandings between users.
It's pretty easy: browse marked up documents, not applications. If some developer conflates the first for the second, move on.
Using Tor doesn't imply that your machine is also a Tor exit node.
So, the rarest of cases (our network isn’t serving traffic) just happened right now, and their failover system just took a snooze then, but 'it exists apparently' according to you.
Tell that to the huge clients that lost sales because of this, when all you have to say is: "wE DoN'T kNoW..."
Tell these clients that they should've carefully read their contract with Fastly, especially the 'Service Level Agreement' part.
So if it went down, it would cripple a vast amount of the internet.
To that end I've only used cloudflare and netlify. The others have too much friction to try out. I expect I would get experience on the job if necessary.
A worldwide outage happened that affected almost all locations and everybody, so the SLA is actually meaningless in this case. Where was the extra redundancy? Where was the failover system? Why were other companies indirectly affected?
As far as I know, Fastly's status page was even down during the outage. The fact that the best answer to this is 'we don't know' tells you everything you need to know. Maybe stop victim blaming in this situation and focus on the main culprit.
Okay? some proof please? This is not far off from a baseless character attack which isn't really effective when trying to convince me about your point on you knowing about Netflix's culture.
If you really want a proper answer, the truth is, unfortunately for you I am in management (previously was an engineer) and have always known Netflix to have a stellar performance oriented (and fear driven) culture, their playbook operates like a sports team. Not for everyone, but that's the point and it works for them.
Maybe you should look inward if you're so vexed with me that you call me silly names, and that you can't handle the truth or the culture about why some companies like Netflix adopt this.
Peace.
And back to the main point, So I assume you agree that Netflix did go completely down the other day then right? It seems according to you that you know better of Netflix's management culture.
> I'm pretty certain this is not how Netflix's culture is.
Would you be willing to share your expert insight of this if you know better then?
But now getting back to Netflix, they have post-mortems and they don't fire people willy-nilly over mistakes. Sure it's not hugops (a term I don't care for either), but they don't just up and fire people over a mistake. I never said anything about netflix going up or down on that day, but they also have problems just like everyone else. Their SLA is not 100% uptime and neither is Fastly.
In closing, you are being a pedantic little bitch who wants to argue minutiae and I'm done with your trolling. I'm done responding to you, feel free to have the last reply as I really don't care anymore.
Arrange the html so that the list of comments is at the end (via css). Keep the http connection open, have the show more button send some kind of request, and when you receive that request send the rest of the page over the original http connection.
As usual, solve people problems via people, not tech.
Maybe css to load an image on :active or is there some better way?
> As usual, solve people problems via people, not tech.
So true..
“View entire discussion” couldn’t be implemented perfectly with <details> in its present form, but you can get quite close to it with a couple of different approaches.
I think the infinite scrolling of subreddits is about the only thing that would really be lost by shedding JavaScript. Even inline replies can be implemented quite successfully with <details> if you really want.
https://old.reddit.com/robots.txt
is very different from this:
I guess there is a market for a search engine (maybe accessed through tor) which does not care about robots.txt, DMCAs, the right to be forgotten etc. Bootstrapping it should not be that hard, since it can also provide better results for some queries as nobody is fighting about the position until it's widely known.
I'm not sure how far we are from being able to do full text internet search. Or rather even quote search, preferably with some fuzziness options. That would be cool, Google's quotation marks were really neat back when they were working.
$ curl https://old.reddit.com/robots.txt
User-Agent: *
Disallow: /
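What that two-line file means for a well-behaved crawler can be checked with Python's standard urllib.robotparser. A minimal sketch; the crawler name and the example URL are made up for illustration:

```python
from urllib.robotparser import RobotFileParser

# The rules exactly as returned by the curl command above.
ROBOTS_TXT = """\
User-Agent: *
Disallow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# "examplebot" is a hypothetical crawler name; any agent matches "*".
print(is_allowed(ROBOTS_TXT, "examplebot", "https://old.reddit.com/r/programming/"))
```

With a blanket `Disallow: /` for every user agent, any compliant crawler is told to stay away from the whole host.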
Also, even if search engines are allowed, old.reddit.com pages are not canonical (<link rel="canonical"> points to the www.reddit.com version, which is actually reasonable behavior), so pages there would not be crawled as often or at all.
User-Agent: bender
Disallow: /my_shiny_metal_ass
User-Agent: Gort
Disallow: /earth
That’s not going to happen before Cloudflare is dethroned. See this recent thread for some perspective: https://news.ycombinator.com/item?id=27153603
And even if there’s no Cloudflare, large sites that people want to search will always find ways to block bad bots.
The only thing I can think of that might work is using crowd-sourced data, with all the problems that come with crowdsourcing.
There is a solution for all this mess and I'm blocking HN and a few different domains until I implement at least the first step after which I can share it here.
/etc/hosts
reddit.com old.reddit.com
www.reddit.com old.reddit.com
np.reddit.com old.reddit.com
Sync is so much better than the official app it's not even funny.
① A submit button or link targeting an iframe which is visually hidden. (Or even don’t hide it. If only seamless iframes had happened, or any other way of auto-resizing an iframe: relevant spec issues are https://github.com/whatwg/html/issues/555 and https://github.com/w3c/csswg-drafts/issues/1771.)
② A submit button or link to a URL that returns status 204 No Content.
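Option ② is easy to sketch on the server side. A minimal example using Python's standard http.server, assuming a hypothetical /ack path (the path, the function names and the port are illustrative, not anything Reddit uses):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def route(path):
    """Return (status, body) for a request path; /ack is a hypothetical endpoint."""
    if path.split("?", 1)[0] == "/ack":
        # 204 No Content: the server has handled the request and there is
        # nothing to display, so the browser stays on the current page.
        return 204, b""
    return 404, b"not found\n"


class AckHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, body = route(self.path)
        self.send_response(status)
        if body:
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        if body:
            self.wfile.write(body)

    # Form submissions via POST get the same answer; the request body is ignored.
    do_POST = do_GET


def serve(port=8000):
    """Run the sketch locally; call serve() and then open /ack in a browser."""
    HTTPServer(("127.0.0.1", port), AckHandler).serve_forever()
```

A plain `<a href="/ack?item=1">` or a form submitting to that URL then registers the action without JavaScript and without navigating away.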
(CSS image loading in any form is not as robust because some clients will have images disabled. background-image is probably (unverified claim!) less robust than pseudoelement content as accessibility modes (like high contrast) are more likely to strip background images, though I’m not sure if they are skipped outright or load and aren’t shown. :active is neither robust nor correct: it doesn’t respond to keyboard activation, and it’s triggered on mouse down rather than mouse up. Little tip here for a thing that people often get wrong: mouse things activate on mouseup, keyboard things on keydown.)
.button:active { background-image: url('/some-reference-thats-actually-a-tracker'); }