How Stack Overflow plans to survive the next DNS attack(blog.serverfault.com) |
Also, it'd be a great public service to publish the results. Even if it's just enabled for a day per year or so, the results would probably be appreciated by many. And you could always sell your altruism as the need to continually monitor the situation :)
However, when our relationship with CloudFlare ended and we moved to Fastly, one of the reasons we did so was unreliable lag with DNS updates at CloudFlare. Sometimes it would take minutes or hours for DNS changes to take effect in their system, and they could never provide a satisfactory reason why.
Do you think that after the Dyn outage everyone's sysadmins are running around adding redundancy, too worried to trust the uptime of their site to CloudFlare alone?
[1] http://nickcraver.com/blog/2016/02/17/stack-overflow-the-arc...
Source:
https://meta.stackoverflow.com/questions/323537/cloudflare-i...
Cloudflare was unable to diagnose this or give us any comfort that it would improve.
I ended up with a Dyn / Route53 configuration. We used libcloud to sync everything together. We also added the exported zone to Cloudflare but did not enable it.
We had actually planned for this, but in no way did we ever come close to your in-depth testing. The @ Azure issue - thank you for uncovering this for the rest of us.
We should be open sourcing this rather shortly, so stay tuned.
Sorry, I'm not who you asked, but that is how we are doing it at Stack Overflow now.
Here's the math for expected number of tries if half of the servers are offline. (It's a hypergeometric distribution but I couldn't find a closed formula)
E(2 server) = 1 * 1/2 + 2 * 1/2 = 1.5
E(4 server) = 1 * 2/4 + 2 * 2/4 * 2/3 + 3 * 2/4 * 1/3 = 1.67
E(8 server) = 1 * 4/8 + 2 * 4/8 * 4/7 + 3 * 4/8 * 3/7 * 4/6 + 4 * 4/8 * 3/7 * 2/6 * 4/5 + 5 * 4/8 * 3/7 * 2/6 * 1/5 = 1.8
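These sums are easy to check mechanically. The following is a minimal sketch, assuming the resolver tries servers in a uniformly random order, never retries the same one, and that exactly `offline` of `total` servers are down. The function names are mine, not from the article. Strictly speaking the number of tries follows a negative hypergeometric distribution, which does have a closed form for the mean: (N + 1) / (K + 1), with N servers in total and K of them online. The sketch compares both.

```python
from fractions import Fraction


def expected_tries(total: int, offline: int) -> Fraction:
    """Expected number of queries until the first online server answers."""
    online = total - offline
    if online <= 0:
        raise ValueError("at least one server must be online")
    expected = Fraction(0)
    p_all_failed = Fraction(1)  # probability that every earlier try hit an offline server
    # At most `offline` failures can happen, so try number offline + 1 always succeeds.
    for k in range(1, offline + 2):
        remaining = total - (k - 1)  # servers not yet tried
        p_success_now = Fraction(online, remaining)
        expected += k * p_all_failed * p_success_now
        p_all_failed *= Fraction(offline - (k - 1), remaining)
    return expected


def closed_form(total: int, offline: int) -> Fraction:
    """Mean of the negative hypergeometric distribution: (N + 1) / (K + 1)."""
    return Fraction(total + 1, total - offline + 1)


for n in (2, 4, 8):
    print(n, expected_tries(n, n // 2), closed_form(n, n // 2))
```

With half the servers offline this gives 3/2, 5/3 and 9/5, so the 8-server case comes to 1.8 once the fifth try is counted, and the value approaches 2 as the server count grows.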
You are correct in saying that more empirical data could be used here. We might even end up changing our minds. I'm not much of a numbers person but I might pass this on to some of the people in our company who love solving problems like this.
Hurricane Electric supports this but most of the providers mentioned in this article do not.
And as others have said, while Cloudflare may not be for everyone, their DNS is possibly the fastest. Not sure why SO decided to drop them.
*Some old Data http://www.dnsperf.com/
I also wonder about the performance of DNSimple, but they don't seem to put much emphasis on performance.
EdgeCast was dropped due to pricing, and because there's talk of Verizon selling the EdgeCast services again.
DNSimple didn't make it to performance testing because they only had 5 POPs, as opposed to 20+ of other providers.
CloudFlare's DNS was consistently one of the fastest, you are correct about that. If you read my responses to other comments here, you'll find that we decided not to use their DNS service because of some fairly pervasive API issues we had with it.
Last commit of substance was in Sept 2015.
I wonder what Netflix is doing instead.
Very good analysis of SO and a smart move to roll this out _before_ a new DNS outage!
If you could have a unified API that would create the records on multiple providers that would be money, it's just that you'd lose out on some things like Route 53 health checking, etc.
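A rough sketch of what such a unified layer could look like, assuming each provider can be wrapped behind the same three operations (list, create, delete). `Record`, `InMemoryProvider` and `sync` are hypothetical names for illustration only; no real provider API is called, and provider-specific features such as Route 53 health checks have no place in this lowest-common-denominator model.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Record:
    name: str
    rtype: str
    value: str
    ttl: int = 300


@dataclass
class InMemoryProvider:
    """Stand-in for a real provider client (hypothetical, keeps records in memory)."""
    name: str
    records: set = field(default_factory=set)

    def list_records(self):
        return set(self.records)

    def create(self, record):
        self.records.add(record)

    def delete(self, record):
        self.records.discard(record)


def sync(desired, providers):
    """Make every provider match `desired`; return (created, deleted) counts per provider."""
    report = {}
    for provider in providers:
        current = provider.list_records()
        to_create = desired - current
        to_delete = current - desired
        for record in to_delete:
            provider.delete(record)
        for record in to_create:
            provider.create(record)
        report[provider.name] = (len(to_create), len(to_delete))
    return report


desired = {
    Record("www.example.com.", "A", "192.0.2.10"),
    Record("example.com.", "MX", "10 mail.example.com."),
}
provider_a = InMemoryProvider("provider-a")
provider_b = InMemoryProvider("provider-b", {Record("old.example.com.", "A", "192.0.2.99")})
print(sync(desired, [provider_a, provider_b]))
```

Treating one record set as the source of truth and diffing it against each provider keeps the providers from drifting apart, at the cost of only using features they all share.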
It's up to the client resolver to handle failover, so it's not perfect in terms of availability, but better than nothing.
For example:
$ dig ns amazon.com
amazon.com. 3599 IN NS ns4.p31.dynect.net.
amazon.com. 3599 IN NS ns1.p31.dynect.net.
amazon.com. 3599 IN NS ns3.p31.dynect.net.
amazon.com. 3599 IN NS ns2.p31.dynect.net.
amazon.com. 3599 IN NS pdns1.ultradns.net.
amazon.com. 3599 IN NS pdns6.ultradns.co.uk.
(note that this is also TLD redundant, since there's a .co.uk included)
Managing whitelists between multiple 3rd party DNS providers is likely to break frequently as servers move around, are added, removed, etc.
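The redundancy in the amazon.com NS set quoted above can be checked with a few lines of code. This is a sketch under a loud assumption: `MULTI_LABEL_SUFFIXES` is a tiny hand-written list, where a real tool would consult the full Public Suffix List. It groups by registered domain, so one provider operating under two domains (as UltraDNS does here) shows up as two groups.

```python
from collections import defaultdict

# Assumption: hand-written stand-in for the Public Suffix List.
MULTI_LABEL_SUFFIXES = {"co.uk"}


def registered_domain(hostname: str) -> str:
    """Return the registrable domain of a hostname, e.g. 'dynect.net'."""
    labels = hostname.rstrip(".").lower().split(".")
    suffix_len = 2 if ".".join(labels[-2:]) in MULTI_LABEL_SUFFIXES else 1
    return ".".join(labels[-(suffix_len + 1):])


def summarize(nameservers):
    """Group nameservers by registered domain and collect their top-level domains."""
    by_domain = defaultdict(list)
    tlds = set()
    for ns in nameservers:
        by_domain[registered_domain(ns)].append(ns)
        tlds.add(ns.rstrip(".").rsplit(".", 1)[-1].lower())
    return dict(by_domain), tlds


nameservers = [
    "ns4.p31.dynect.net.",
    "ns1.p31.dynect.net.",
    "ns3.p31.dynect.net.",
    "ns2.p31.dynect.net.",
    "pdns1.ultradns.net.",
    "pdns6.ultradns.co.uk.",
]
by_domain, tlds = summarize(nameservers)
for domain, hosts in sorted(by_domain.items()):
    print(domain, len(hosts))
print(sorted(tlds))
```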
Interestingly, Hurricane Electric would have been one of our top choices if they had a first class API and a commercial SLA. Their ability to support zone transfers is admirable and did not go unnoticed. DNS Made Easy also supports zone transfers.
Hurricane Electric supports zone transfers and requires you to allow AXFRs from only a single host -- slave.dns.he.net (IPv4: 216.218.133.2, IPv6: [2001:470:600::2]). NOTIFYs should not be sent to slave.dns.he.net but instead to ns1.he.net.
n.b.: ns1.he.net is not anycasted, but ns[2-5] are. In addition, ns1 does not have an AAAA RR.
We (ISP) currently run our own authoritative name servers in our own facilities but I've been seriously debating adding another provider into the mix so "secondary" service is an important feature to me.
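The single-host AXFR rule described above amounts to a two-entry allowlist. A minimal sketch using only the addresses quoted in this comment; `axfr_allowed` is a hypothetical helper, not part of any name server, and a real deployment would express this in the server's own ACL configuration instead.

```python
import ipaddress

# Addresses quoted above for slave.dns.he.net.
AXFR_ALLOWED = {
    ipaddress.ip_address("216.218.133.2"),
    ipaddress.ip_address("2001:470:600::2"),
}


def axfr_allowed(source: str) -> bool:
    """Return True if a zone transfer request from `source` should be served."""
    try:
        return ipaddress.ip_address(source) in AXFR_ALLOWED
    except ValueError:
        # Not a valid IP address at all.
        return False


print(axfr_allowed("216.218.133.2"))
print(axfr_allowed("192.0.2.1"))
```

Parsing with `ipaddress` instead of comparing strings means that differently written forms of the same IPv6 address still match.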
Everyone seems to be inventing their own custom API for this, which I guess is the 'modern developer friendly' approach, but it results in a bit of a mess. Example: Caddy's implementation of the Let's Encrypt / ACME dns-01 challenge has all these plugins: https://caddyserver.com/download
We ended up running our own authoritative nameservers, which is not ideal. But at least cloud offerings allow you to spread across regions.
That's one of the reasons why the DNS hosting I support, which uses git-hooks to trigger updates, currently only pushes the DNS data to Amazon's Route 53 infrastructure.
At the time of the most recent Dyn outage I looked at allowing users to support multiple back-ends, to abstract away the pain of redundancy, but it seemed there was surprisingly little interest.
We (ISP) run our own authoritative name servers. Ideally, I'd have a single hidden ("stealth") master (maybe two, w/ anycast) and all of the public name servers would simply slave from that one. If you run PowerDNS -- which supports MySQL/PostgreSQL backends, among others -- you can keep everything in a local database and use standard tools (or write your own) to manage it.
(If I was pretty much anywhere besides an ISP, I'd definitely be using a provider with a fully-featured API. I use Route 53 now for my personal domains but I manage the zones by hand in the console since the RRs practically never change.)
As you can whitelist Tor traffic in Cloudflare, it seems to be down to these 503 errors (edge to origin). But haven't heard that before, so not sure if it's a problem that occurs more often.
For Imgur, I could imagine that purging by cache tag is just too restrictive at Cloudflare (the limit is very low, even for Enterprise clients). Fastly doesn't have a limit there, they encourage you to cache everything and purge where needed. Makes it much easier to cache APIs and HTML pages.
Similarly, there are a fair number of DNS providers that don't allow you to use all DNS record types. For something so simple, providers can really go out of their way to screw it up.
So... I wouldn't invest time in it unless you want to take over stewardship (no one else has offered in the last 7 months).
It sounds like you're happy enough with your personal domains as-is, and for an ISP I expect you'd not want to outsource something as critical as DNS.
Though with a decent API it wouldn't be hard to write the glue to do it - I've certainly converted from BIND to my own representation, then from that to Route53.
It's just a shame we all have to keep reinventing the wheel.
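As an illustration of that kind of glue, here is a sketch that parses a simplified zone-file subset into tuples and then builds a dictionary in the shape of a Route 53 change batch. It is not the commenter's code. The assumptions are loud ones: one complete `name ttl IN type rdata` record per line, no `$ORIGIN` or `$TTL` directives, no multi-line records, and no semicolons inside record data. The function names are hypothetical.

```python
def parse_zone_lines(text):
    """Parse 'name ttl IN type rdata' lines into (name, ttl, type, rdata) tuples."""
    records = []
    for line in text.splitlines():
        line = line.split(";", 1)[0].strip()  # drop trailing comments and blank lines
        if not line:
            continue
        name, ttl, klass, rtype, rdata = line.split(None, 4)
        if klass.upper() != "IN":
            raise ValueError(f"unsupported class: {klass}")
        records.append((name, int(ttl), rtype.upper(), rdata))
    return records


def to_route53_change_batch(records):
    """Group records by (name, type) and emit one UPSERT per record set."""
    grouped = {}
    for name, ttl, rtype, rdata in records:
        # Assumption: the first TTL seen for a record set wins.
        entry = grouped.setdefault((name, rtype), {"ttl": ttl, "values": []})
        entry["values"].append(rdata)
    changes = []
    for (name, rtype), entry in grouped.items():
        changes.append({
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": name,
                "Type": rtype,
                "TTL": entry["ttl"],
                "ResourceRecords": [{"Value": v} for v in entry["values"]],
            },
        })
    return {"Changes": changes}


zone_text = """
www.example.com. 300 IN A 192.0.2.10
www.example.com. 300 IN A 192.0.2.11   ; second address
example.com.     3600 IN MX 10 mail.example.com.
"""
batch = to_route53_change_batch(parse_zone_lines(zone_text))
print(batch)
```

The resulting dictionary is meant to be passed as the `ChangeBatch` argument of a Route 53 `change_resource_record_sets` call. Keeping the parse step separate from the output step is what makes it practical to add a second provider later.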
Pretty sure the captcha can be switched off for enterprise clients (which I presume SO would be).
Regarding caching, SO's caching is extremely aggressive. It's especially problematic when editing answers more than once, because when you click edit you'll be presented with a cached copy of your answer, potentially excluding your latest revisions. So you have to refresh the edit page and then edit the answer.
It might seem like an edge case, and it certainly teaches you to be meticulous about answering lest you have to go through the refresh hell, but it's not what some would designate as 'good UX'.
I am never surprised by SO's reports about how they run their busy website from a mere handful of machines precisely because of the caching they do.