Things We Forgot to Monitor(word.bitly.com) |
2) Whether your slave DB stopped replicating because of some error.
3) Whether something is screwed up in your SOLR/ElasticSearch instance so that it doesn't respond to search queries but still responds to simple heartbeat pings.
4) If your Redis db stopped saving to disk because of lack of space, or not enough memory, or you forgot to set overcommit memory.
5) If you're running out of space in a specific partition where you usually store random stuff, like /var/log.
I've had my ass bitten by all of the above :)
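For (5), a minimal sketch of a per-partition check, assuming Python is available on the box. The function name, the 90% threshold, and the list of paths are made up for illustration; the only real API used is `shutil.disk_usage` from the standard library.

```python
import shutil


def partitions_over(paths, max_used_pct=90.0, usage_fn=shutil.disk_usage):
    """Return {path: used_pct} for each path whose filesystem is fuller than max_used_pct.

    usage_fn is injectable so the logic can be exercised without real disks;
    it must return an object with .total and .used attributes (bytes).
    """
    over = {}
    for path in paths:
        usage = usage_fn(path)
        used_pct = 100.0 * usage.used / usage.total
        if used_pct > max_used_pct:
            over[path] = round(used_pct, 1)
    return over


if __name__ == "__main__":
    # Hypothetical list of mount points; check each partition separately,
    # because "/" having room says nothing about /var/log.
    print(partitions_over(["/", "/var/log", "/tmp"]))
```

The point is only that each mount point gets its own threshold check rather than one number for the whole machine.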
Augh. I ran one of my servers hard into that wall, and now it's something I watch. At least I learned from that mistake.
Fixed by using the undocumented option isolate_network=NO in vsftpd.conf.
* lack of proper/default monitoring advocated for your tools (2), (4).
* Choosing poor (default/recommended) settings (1), (4).
* Keeping stateless server/instances when you don't need to (5), (6).
* Not tracking performance as part of monitoring (3), (4)
Albeit, I have made the same mistakes too.
edit: formatting
One thing that drives me nuts is how frequently monitoring agents/dashboards report and graph only free memory on Linux, which gives misleading results. It's fine to report it, but to make sense of it, you have to stack free memory along with cached and buffered memory, if you care about what's actually available for applications to use.
Another often-overlooked metric that's important for web services in particular is the TCP accept queue depth, per listening port. Once the accept queue is full, remote clients will get ECONNREFUSED, which is a bad place to be. This value is somewhat difficult to obtain, though, because AFAIK Linux doesn't expose it.
Even that is misleading. It's actually non-trivial to find out exactly how much "freeable" memory one has on a Linux system these days, as not all the cached memory bits are truly freeable.
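A rough sketch of the "stacking" described above, working on the text of /proc/meminfo. The function names are invented for illustration. It reports MemFree alone, MemFree + Buffers + Cached, and the kernel's own MemAvailable estimate when that field is present (it was added in Linux 3.14 and older kernels do not have it). As noted, the stacked figure overstates what is truly freeable, so treat it as an upper bound rather than a fact.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field_name: integer value}."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def memory_summary(fields):
    """Summarise memory in kB: raw free, free + buffers + cached, and MemAvailable."""
    stacked = fields["MemFree"] + fields.get("Buffers", 0) + fields.get("Cached", 0)
    return {
        "free_kb": fields["MemFree"],
        "free_plus_cache_kb": stacked,
        # None when the kernel does not report MemAvailable.
        "available_kb": fields.get("MemAvailable"),
    }


if __name__ == "__main__":
    with open("/proc/meminfo") as handle:
        print(memory_summary(parse_meminfo(handle.read())))
```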
Extremely valuable when something is acting up.
Starting at a new shop, one of the first things I'll do is:
1. Set up a high-level "is the app / service / system responding sanely" check which lets me know, from the top of the stack, whether or not everything else is or isn't functioning properly.
2. Go through the various alerting and alarming systems and generally dial the alerts way back. If it's broken at the top, or if some vital resource is headed to the red, let me know. But if you're going to alert based on a cascade of prior failures (and DoS my phone, email, pager, whatever), then STFU.
In Nagios, setting up the relationships between services and systems, setting thresholds appropriately for the services that alert, etc., is key.
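To make the "don't alert on a cascade" idea concrete, here is a toy sketch of the logic only. It is not Nagios configuration and not how Nagios implements dependencies; the check names and the `depends_on` mapping are hypothetical. Given the set of failing checks and what each check depends on, it keeps only the failures that have no failing dependency underneath them.

```python
def root_causes(failing, depends_on):
    """Return the failing checks that have no failing dependency, direct or transitive.

    failing:    iterable of check names currently failing
    depends_on: {check: [checks it depends on]}
    """
    failing = set(failing)

    def has_failing_ancestor(check, seen):
        for parent in depends_on.get(check, ()):
            if parent in seen:
                continue  # already visited; also guards against dependency cycles
            seen.add(parent)
            if parent in failing or has_failing_ancestor(parent, seen):
                return True
        return False

    return sorted(c for c in failing if not has_failing_ancestor(c, {c}))


if __name__ == "__main__":
    deps = {"web": ["app"], "app": ["db"], "db": ["switch"]}
    # Only "db" is worth a page; "web" and "app" are fallout.
    print(root_causes({"web", "app", "db"}, deps))
```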
For a lot of thresholds you're going to want to find out why they were set to what they were and what historical reason there was for that. It's like the old pot roast recipe where Mom cut off the ends of the roast 'coz that's how Grandma did it. Not realizing it was because Grandma's oven was too small for a full-sized roast....
Sadly, that level of technical annotation is often lacking in shops, especially where there's been significant staff turnover through the years.
I'm also a fan of some simple system tools such as sysstat which log data that can then be graphed for visualization.
I was off work for a few months recently (motorcycle wreck) and removed my e-mail accounts from my phone. Now, I have all my alerts go to a specific e-mail address and those are the only mails I receive on my phone. It has really helped me overcome the problem of ignoring messages.
And the best general advice I have is split your alerts into "stuff that I need to know is broken" and "stuff that just helps me diagnose other problems". You don't want to be disturbing your on-call people for stuff that doesn't directly affect your service (or isn't even something you can fix).
1. A vendor tomcat application had a memory leak, consumed all the RAM on a box, and crashed with an OOM
2. The warm standby application was slightly misconfigured, and was unable to take over when the primary app crashed
3. Our nagios was configured to email us, but something had gone wrong with ssmtp 2 days prior, and it was unable to contact google apps
3a. No one was paying any attention to our server metric graphs / We didn't have good enough "pay attention to these specific graphs because they are currently outside the norm" alerting
A very embarrassing day for us that one.
We're now working on better graphing, and have set up a basic ssmtp check to SMS us if there is an issue. Monitoring is hard.
And what will happen when the network (or the alert server) is down?
You must put some check outside your network, on independent infrastructure. Adding another protocol on the same net is still subject to Murphy's law.
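One way to sketch the "check from outside" idea is a dead man's switch: the internal alert server reports in periodically, and a small watcher on independent infrastructure complains when it stops hearing from it. The class name, source names, and the five-minute window below are all invented for illustration; how heartbeats travel (HTTP, SMS, whatever) is left out on purpose.

```python
import time


class HeartbeatWatcher:
    """Tracks the last time each source reported in and flags the silent ones."""

    def __init__(self, max_age_seconds):
        self.max_age_seconds = max_age_seconds
        self.last_seen = {}

    def beat(self, source, now=None):
        """Record a heartbeat from source (now defaults to the current time)."""
        self.last_seen[source] = time.time() if now is None else now

    def stale_sources(self, now=None):
        """Return the sources that have been silent for longer than max_age_seconds."""
        now = time.time() if now is None else now
        return sorted(
            source
            for source, seen in self.last_seen.items()
            if now - seen > self.max_age_seconds
        )


if __name__ == "__main__":
    watcher = HeartbeatWatcher(max_age_seconds=300)
    watcher.beat("nagios-main")
    print(watcher.stale_sources())  # nothing is stale immediately after a beat
```

Silence is the alarm here, so a dead network or a dead alert server shows up the same way as any other failure.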
http://support.opsgenie.com/customer/portal/articles/759603-...
HTTP/1.1 200 OK
Date: Mon, 10 Feb 2014 20:13:28 GMT
Server: Apache
Content-Length: 15
Content-Type: text/plain; charset=UTF-8
X-RTFM: Learn about this site at http://bit.ly/14DAh2o and don't abuse the service
X-YOU-SHOULD-APPLY-FOR-A-JOB: If you're reading this, apply here: http://rackertalent.com/
X-ICANHAZNODE: icanhazip2.nugget
Would seem only fair. :D
Sometimes api providers change the damn response format. Or their urls change. Or they blocked your ip without notifying you.
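A small sketch of a check for the "they changed the response format" case, assuming the API returns a JSON object. The function name and the expected keys in the example are hypothetical; fetching the body is deliberately left out so the check can be pointed at whatever client you already use.

```python
import json


def response_problems(body, expected):
    """Compare a JSON response body against {key: expected_type}.

    Returns a list of human-readable problems; an empty list means the
    response still has the shape we depend on.
    """
    try:
        data = json.loads(body)
    except ValueError as exc:
        return ["not valid JSON: %s" % exc]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]
    problems = []
    for key, kind in expected.items():
        if key not in data:
            problems.append("missing key: %s" % key)
        elif not isinstance(data[key], kind):
            problems.append("wrong type for %s: %s" % (key, type(data[key]).__name__))
    return problems


if __name__ == "__main__":
    # Hypothetical shape for some third-party API.
    shape = {"status_code": int, "data": dict}
    print(response_problems('{"status_code": 200, "data": {}}', shape))
    print(response_problems("<html>You have been blocked</html>", shape))
```

An HTML block page or a renamed field both come back as a non-empty list, which is the thing worth alerting on.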
Are you arguing that alerts are useless, and that we must fix the issue once and for all? Because if so, I'd point out that some things cannot be fixed (because the Earth is finite, we don't know all things, etc.) and you are better off alerted sooner rather than later.
Now, if you are arguing that email is not the right medium for an alert, well, what medium is better? Really, I can't think of any single candidate. Yeah, email may go down; that's why you complement it with some system external to your network (a VPS is cheap, a couple of them at different providers is almost flawless, and way cheaper than any proprietary dashboard). Yes, there is some delay involved, but it should be a few minutes at most, because you create some addresses specifically for the alerts and make all hell break loose when a message gets there. Some standard IM protocol that federated across your whole net (and the external point of control), could be reached from anywhere, and had plenty of support on all kinds of computers would be better, but it does not exist.
For airline pilots, an excessive number of warnings (bells, alarms, audible warnings) is itself known to distract the pilots and cause errors.
Once you start sending emails for things, you start sending emails for everything. It's easy to fall into the trap of not accurately categorizing what is critical (like real, real, critical, I mean it this time guys!) and what are merely statuses. So what happens is everything starts being ignored, and your systems become obscure black boxes again.
I would recommend an SMS sent via GSM modem for out-of-band emergency notifications.
NPR covered this a few days back, I've written on it at more length:
http://www.npr.org/blogs/health/2014/01/24/265702152/silenci...
http://www.reddit.com/r/dredmorbius/comments/1x0p1b/npr_sile...
Reminds me of a few times the email queues got backed up to hell and beyond. Fuck you, Yahoo.
I was thinking some sort of end-point test myself, hadn't considered the specific case of APIs.
A pageout might suggest memory pressure, but not nearly as much as a swapout does. (pgmajfault is a better indicator.) Writing dirty pages is just something the kernel does even when there's no memory pressure at all. Also, unfortunately you can't use pgpgout for anything useful as ordinary file writes are counted there.
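A sketch of what watching those counters could look like, assuming a Linux box where /proc/vmstat exposes `pgmajfault`, `pswpin` and `pswpout`. The function names and the 10-second interval are made up. The counters are cumulative since boot, so the useful number is the rate between two samples; `pgpgout` is left out for the reason given above.

```python
import time


def parse_vmstat(text):
    """Parse /proc/vmstat-style 'name value' lines into {name: int}."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def pressure_rates(before, after, interval_seconds,
                   names=("pgmajfault", "pswpin", "pswpout")):
    """Per-second rate of each named counter between two samples.

    Counters missing from either sample are skipped rather than guessed.
    """
    return {
        name: (after[name] - before[name]) / interval_seconds
        for name in names
        if name in before and name in after
    }


if __name__ == "__main__":
    def sample():
        with open("/proc/vmstat") as handle:
            return parse_vmstat(handle.read())

    first = sample()
    time.sleep(10)
    print(pressure_rates(first, sample(), 10))
```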
Where I saw it crop up was 32K folders under /tmp on a cluster system. So no it's not a limit on number of directories entirely (that's inodes), but rather how many subdirectories you can have.
http://en.wikipedia.org/wiki/Ext4#Features <-- Fixes 32K limit
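If you are stuck on a filesystem with that kind of cap, a crude check is to count the immediate subdirectories of the usual suspects and warn well before the ceiling. This is only a sketch: the function names, the `limit=32000` default, and the 80% warning fraction are assumptions for illustration, so set the limit to whatever your filesystem actually enforces.

```python
import os


def count_subdirectories(path):
    """Count immediate subdirectories of path (symlinks to directories are not counted)."""
    with os.scandir(path) as entries:
        return sum(1 for entry in entries if entry.is_dir(follow_symlinks=False))


def near_subdir_limit(path, limit=32000, warn_fraction=0.8):
    """True once path holds at least warn_fraction * limit subdirectories."""
    return count_subdirectories(path) >= limit * warn_fraction


if __name__ == "__main__":
    for directory in ("/tmp", "/var/tmp"):
        print(directory, count_subdirectories(directory), near_subdir_limit(directory))
```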
20+ years of experience tells me most monitoring tools aren't reasonable.
You're wrong.
As a sysadmin, I typically receive something on the order of 1,000 to 10,000 emails daily (the specifics vary by the system(s) I'm administering). Staying on top of my email stream is a significant part of my job, both in not ignoring critical messages which have been lost, misfiled, or spamfiltered, and in not getting bogged down in verbose messages which convey no real information.
Alerts which tell me nothing have a negative value: they obscure real information, they don't convey useful information, and each person who comes on to the team has to learn that "oh, those emails you ignore", write rules to filter or dump them, etc.
Worse: if the alerts might contain useful information, that fact has to be teased out of them.
The problem with emails such as that is that they're logging or reporting data. They should be logged, not emailed, and with appropriate severity (info, warning, error, critical). Log analysis tools can be used to search for and report on issues from there.
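A minimal sketch of "log it with a severity, and only bother a human for the top level", using Python's standard `logging` module. `PageHandler`, `build_logger`, and the `notify` callback are invented names; `notify` stands in for whatever actually reaches a person (mail, SMS, pager).

```python
import logging


class PageHandler(logging.Handler):
    """Forwards only CRITICAL records to a notify callable."""

    def __init__(self, notify):
        super().__init__(level=logging.CRITICAL)
        self.notify = notify

    def emit(self, record):
        self.notify(self.format(record))


def build_logger(name, notify, log_handler=None):
    """Logger that records everything from INFO up and pages only on CRITICAL."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    for handler in list(logger.handlers):
        logger.removeHandler(handler)  # avoid duplicate handlers on repeated calls
    # Everything goes to the log (stderr by default); swap in a file or syslog handler.
    logger.addHandler(log_handler or logging.StreamHandler())
    logger.addHandler(PageHandler(notify))
    return logger


if __name__ == "__main__":
    log = build_logger("disk", notify=lambda message: print("PAGE:", message))
    log.info("disk at 40%")          # logged, nobody is disturbed
    log.warning("disk at 85%")       # logged, available to log analysis
    log.critical("disk full on /var/log")  # logged and paged
```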
As I said: in a mature environment, much of my work goes into removing alerts, alert emails, etc., which are well-intentioned but ultimately useless.
Sorry, but you're not a very good sysadmin then. You have chosen poor tools or do not understand how to distill the information. Knowing that, I can see why you think email alerts don't work. They are effectively broken FOR YOU.
It's usually not the system administrators who get to decide what the Corporate Overlords purchase or who they do business with. So I think it's pretty unfair to blame the admins for "choosing poor tools".