Too many things can go wrong, and you are all-around better off outsourcing this to something like Pingdom. You don't have sufficient levels of reliability: you aren't dual-homed across Twilio and another phone system. Maybe the cause of your outage is that AWS is having issues. Now your site and your monitoring are both down.
Much better to outsource to people who obsess over doing this right and making sure they are properly redundant.
Completely agree! I often have to fight that "I could just build that myself" mentality, which glosses over the points you made so well.
It's the same as a "Twitter clone" that just posts messages with a 140-char limit, or "build a blog in 15 minutes".
Alerting on a downed website is sorta like a glacier: there's so much under the surface that if you just see the surface, you're missing out.
1. Multiple locations 2. Multiple check intervals 3. SMS/email provider switch on fail 4. Auto recovery of your checkers 5. Multiple providers with a single storage.
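To make points 1 and 3 a bit more concrete, here is a minimal sketch, not anyone's actual implementation. The names `site_is_down` and `notify_with_fallback`, the quorum of 2, and the idea of passing providers as plain callables are all assumptions for illustration.

```python
from typing import Callable, Dict, List


def site_is_down(results: Dict[str, bool], quorum: int = 2) -> bool:
    """Decide whether to alert, given per-location check results.

    `results` maps a location name to True if the check from that
    location succeeded. The site counts as down only when at least
    `quorum` locations failed, so one flaky vantage point does not page.
    """
    failures = sum(1 for ok in results.values() if not ok)
    return failures >= quorum


def notify_with_fallback(message: str,
                         providers: List[Callable[[str], None]]) -> int:
    """Try each provider in order and stop at the first that works.

    Returns the index of the provider that accepted the message.
    Raises RuntimeError if every provider raised.
    """
    errors = []
    for index, send in enumerate(providers):
        try:
            send(message)
            return index
        except Exception as exc:  # any provider failure triggers fallback
            errors.append(exc)
    raise RuntimeError(f"all {len(providers)} providers failed: {errors}")
```

Even this leaves out points 2, 4, and 5, which is rather the point of the list.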
You make valid points about redundancy and levels of reliability but keep in mind that even Pingdom can go down: http://royal.pingdom.com/2016/10/24/ddos-attack-affects-ping...
Diversify to avoid cascading failures ;)
Also, if you want a 'proper' ops alerting SaaS, you're looking at something along the lines of $50/user/mo or $15/server/mo, neither of which is trivial.
(It has. Completely and silently stopped processing against Kinesis queues for a few hours recently. Guess what AWS Step is built on?)
It could be very useful to, for example, keep an eye on your monitoring system. At $work, we have a pretty extensive monitoring system that we've built out. We use an external service to watch over the monitoring system, though, to alert us of any issues with it that we haven't otherwise caught.
Besides, like he said, it's "fun" and kinda neat.
Just running this docker image on a server you want to monitor is enough.
Instead of Twilio it uses Simplepush (https://simplepush.io).
EDIT: Just seen that it is Android only! :-/
I use it for some personal automation scripts that might need to get my attention if something goes wrong.
Most providers have SMTP gateways for SMS services. Verizon runs @vtext.com
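A rough sketch of the email-to-SMS route, using only the Python standard library. The `vtext.com` domain is the one mentioned above; the sender address, the SMTP host, the 160-character truncation, and the function names are assumptions, and your relay may also require a login step that is not shown.

```python
import smtplib
from email.message import EmailMessage


def build_sms_email(number: str, body: str, sender: str,
                    gateway: str = "vtext.com") -> EmailMessage:
    """Build an email addressed to <digits>@<gateway>."""
    digits = "".join(ch for ch in number if ch.isdigit())
    if not digits:
        raise ValueError("phone number contains no digits")
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = f"{digits}@{gateway}"
    msg["Subject"] = "ALERT"
    # Assumption: keep the body to a single SMS worth of characters.
    msg.set_content(body[:160])
    return msg


def send_alert(msg: EmailMessage, host: str, port: int = 587) -> None:
    """Hand the message to an SMTP relay (host is whatever relay you use)."""
    with smtplib.SMTP(host, port, timeout=10) as smtp:
        smtp.starttls()
        smtp.send_message(msg)
```

Delivery through carrier gateways is best effort, so this is a second channel at most, not a replacement for one.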
But my bigger point here is that you're essentially asking "well how do you monitor your monitor?" At which point up the chain do you have enough? Also, I think the original post was simply a demo of what is possible. Yet whenever someone posts something, people go in the comments to belittle it. "Yeah, you built a monitoring solution... Well what happens if that goes down?"
Which is a legitimate question. But obviously if your production service is that critical to your business, you won't be monitoring it with a service that costs $0.0000002 per execution.
I think you underestimate the interdependency of services in AWS. Historically, if there were problems with S3 or EBS in us-east-1, you could expect the entire API to be flaky, and things like autoscaling to fail. These have been better distributed, but failures still cascade.
> I think the original post was simply a demo of what is possible
No, it wasn't a demo, it was an actual production issue. No alarms, no error logs, no way to tell it wasn't working other than someone noticing the queues were getting larger and contacting AWS.
> people go in the comments to belittle it
Only because the original project presents AWS Lambda as "the solution" for such problems, not realizing that it is just as fallible a solution as everything else.
> Well what happens if that goes down?
The solution to this is well known - two monitoring systems in physically separate locations that monitor each other as well as mission critical systems. Nagios, Icinga, and a dozen other well-tested solutions work remarkably well for these roles, yet people keep writing "new" solutions over and over and over.
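The "monitor each other" part reduces to a heartbeat staleness check that each monitor runs against its peer. A minimal sketch, assuming each monitor records the time it last heard from the other; the name `stale_peers` and the 120-second threshold are hypothetical, and tools like Nagios or Icinga have their own mechanisms for this.

```python
from typing import Dict, List


def stale_peers(last_seen: Dict[str, float], now: float,
                max_age: float = 120.0) -> List[str]:
    """Return the peers whose last heartbeat is older than `max_age` seconds.

    `last_seen` maps a peer name to the timestamp (in seconds) of the
    most recent heartbeat received from it.
    """
    return sorted(name for name, seen in last_seen.items()
                  if now - seen > max_age)
```

Each site would run this against the other and alert through its own channel, so losing one location still leaves a monitor able to complain.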
> But obviously if your production service is that critical to your business, you won't be monitoring it with [this] service
Then what's its value, other than as an intellectual exercise?
The whole setup will still cost $0/month.
> The solution to this is well known - two monitoring systems in physically separate locations that monitor each other as well as mission critical systems. Nagios, Icinga, and a dozen other well-tested solutions work remarkably well for these roles, yet people keep writing "new" solutions over and over and over.
Because not everyone needs heavy solutions to do something simple. Side projects, small sites, etc. And some people enjoy implementing old use cases using new technology. When Go was rising in popularity, half the posts on the front page were re-implementing fairly common features in Go.
Even if you're not going to implement this yourself, there can still be some value for other readers.
I hope not. But then it's not just Lambda triggered by CloudWatch alarms anymore. You'd probably have to set up something to ensure that Lambda, when called via CloudWatch alarms, is being triggered properly. Useful, but suddenly a lot more complicated.
> The whole setup will still cost $0/month.
Unlikely. A small amount, but certainly not 0. Especially when you start adding Lambda heartbeats.
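For a sense of scale, here is the arithmetic using only the $0.0000002 per execution figure quoted earlier in the thread. This is a sketch under loud assumptions: it counts requests only, ignores any free tier, and does not model duration, CloudWatch, or SMS charges, so treat it as a lower bound rather than a bill.

```python
from decimal import Decimal

# Per-execution figure as quoted in the thread; not verified pricing.
PRICE_PER_EXECUTION = Decimal("0.0000002")


def monthly_request_cost(interval_seconds: int, functions_per_run: int = 1,
                         days: int = 30) -> Decimal:
    """Request-only cost of running checks every `interval_seconds`.

    `functions_per_run` is how many functions fire per interval, for
    example 2 if a heartbeat function runs next to the check itself.
    """
    runs = (days * 24 * 60 * 60) // interval_seconds
    return PRICE_PER_EXECUTION * runs * functions_per_run
```

With a one-minute interval, `monthly_request_cost(60)` comes to under one cent, and adding a heartbeat function doubles it. Small, but not zero, and the unmodeled charges are where the rest would come from.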
> And some people enjoy implementing old use cases using new technology.
Which is fine; call it an experiment, call it exploration, I have no problem with that. It's frustrating to see such a stripped down article treating it like it's going to be the one, without reasonable discussions about how it could fail. There are a minimum of three failure points in this system alone, with no discussion on how to compensate for them.
Multiple ping locations are helpful in bringing more data points, but they don't address the problem of explaining what the data means. For example, Pingdom could provide triangulation of the failure if fault identification were part of its business model for monitoring.
I would describe the criticism of Pingdom as a failure of expectations. Pingdom is not a security service, a monitoring service, or a fault identification service. It is a single test, and the data you get back is useless unless interpreted and verified.
If you're providing a service to your users, and they say that the service is down according to Pingdom, you should be looking into it, not just saying "Works on my machine".
I mean, what qualifies as “being up”? If some random link in the middle of the Internet goes down, and you suddenly, for 30 seconds, are unreachable for the few hundred people going through that exact link because it happens to be the best path between those people and your server, can they claim that you have failed to provide adequate uptime? If such a fault happens, are you then responsible to troubleshoot it? I say no. The Internet is the ISP’s responsibility, and the only faults actually meaningful to report to your ISP are the repeatable or long-lasting ones. Small stuff like this is not worth anybody’s time (except ISPs) to go digging into.
If Pingdom can't get to your site, it's highly likely your users can't either.
You seem to think that you have to investigate the issues. On the contrary, you bump it up to your ISP to investigate. If your ISP is regularly having these issues, then it might be time to change ISPs to one with a better peering agreement.
1. Reported as being experienced by an actual user of a web site,
2. Longer than a couple of minutes at most (usually just a few seconds),
3. or happened more frequently than a few times per month,
then I might consider reporting it to my ISP. As it is, it’s not worth it. “Cosmic rays, man.” (https://www.joelonsoftware.com/2001/07/31/hard-assed-bug-fix...).
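Those three criteria can be written down as a simple filter. This sketch reads them as "any one is enough", which is one interpretation; the 120-second and 3-per-month thresholds stand in for "a couple of minutes" and "a few times per month" and are assumptions, as is the function name.

```python
def worth_reporting(user_reported: bool, longest_outage_s: float,
                    outages_this_month: int,
                    max_outage_s: float = 120.0,
                    max_per_month: int = 3) -> bool:
    """Return True if a blip is worth escalating to the ISP."""
    return (user_reported
            or longest_outage_s > max_outage_s
            or outages_this_month > max_per_month)
```

A 30-second blip that no user noticed and that happened once this month returns False and gets filed under cosmic rays.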