Things We Forgot to Monitor

Things We Forgot to Monitor (word.bitly.com)

232 points by jehiah 12 years ago | 61 comments

AznHisoka 12 years ago |

Also: 1) Maximum # of open file descriptors

2) Whether your slave DB stopped replicating because of some error.

3) Whether something is screwed up in your SOLR/ElasticSearch instance so that it doesn't respond to search queries but still responds to simple heartbeat pings.

4) If your Redis DB stopped saving to disk because of a lack of disk space, not enough memory, or because you forgot to set overcommit_memory.

5) If you're running out of space in a specific partition where you usually store random stuff, like /var/log.

I've had my ass bitten by all of the above :)
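For item 1, a minimal sketch of a per-process descriptor check, assuming Linux (it lists /proc/self/fd) and an arbitrary 80% alert threshold. The function names are made up for illustration, and this only covers the process that runs it.

```python
import os
import resource


def open_fd_count():
    """Count this process's open descriptors by listing /proc/self/fd (Linux only).

    The listing itself briefly opens one descriptor, so the count can be off by one.
    """
    return len(os.listdir("/proc/self/fd"))


def fd_usage_ratio(open_fds, soft_limit):
    """Fraction of the soft limit in use; an unlimited limit counts as 0.0."""
    if soft_limit == resource.RLIM_INFINITY or soft_limit <= 0:
        return 0.0
    return open_fds / soft_limit


def should_alert(open_fds, soft_limit, threshold=0.8):
    """True once usage reaches the (assumed) threshold fraction of the soft limit."""
    return fd_usage_ratio(open_fds, soft_limit) >= threshold
```

To use it, read the soft limit with resource.getrlimit(resource.RLIMIT_NOFILE)[0] and pass it to should_alert() along with open_fd_count().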

contingencies 12 years ago | |

6) Free inodes (as distinct from space) per filesystem.
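Space and inodes (items 5 and 6) can both be read from one statvfs call per mount point. A rough sketch, assuming a Unix-like system; fs_usage is a hypothetical helper name.

```python
import os


def fs_usage(path):
    """Return (percent_blocks_used, percent_inodes_used) for the filesystem at path.

    Either value is None when the filesystem reports a total of 0, which some
    filesystems do for inodes because they allocate them dynamically.
    """
    st = os.statvfs(path)
    # f_bavail / f_favail are what an unprivileged user can still allocate.
    blocks_used = 100.0 * (1 - st.f_bavail / st.f_blocks) if st.f_blocks else None
    inodes_used = 100.0 * (1 - st.f_favail / st.f_files) if st.f_files else None
    return blocks_used, inodes_used
```

Run it against every mount point you care about (/, /var/log, and so on), not just the root filesystem.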

caw 12 years ago | | |

Similar to free inodes, you should also check for maximum number of directories. dir_index option helps, but I've seen it become a problem.

Gracana 12 years ago | |

> Maximum # of open file descriptors

Augh. I ran one of my servers hard into that wall, and now it's something I watch. At least I learned from that mistake.

apaprocki 12 years ago | | |

Related to this, if you've ever built/run anything on Solaris, you probably found out the hard way that even in modern times, fdopen() in 32-bit apps only allows up to 255 fds because they oh so badly want to preserve an ages-old ABI. Funny bug to hit at runtime in production when you aren't aware of this compatibility "feature".

wtracy 12 years ago | | |

I learned the hard way that MySQL creates a file descriptor for every database partition you create. Someone had a script that created a new partition every week...

teddyh 12 years ago | |

X) Number of cgroups. We were getting slow performance, apparently related to slow IO, but nothing stood out as being the culprit. Turns out, since vsftpd was creating cgroups and not removing them, the pseudo-filesystem /sys/fs/cgroup had myriads of subdirectories (each representing a cgroup), and whenever something wanted to create a new cgroup or access the list of cgroups, this counted as listing that pseudo-directory, which counted as IO.

Fixed by using the undocumented option isolate_network=NO in vsftpd.conf.

DrJ 12 years ago | |

Feels like this list (and the original post) are problems caused by:

* lack of proper/default monitoring advocated for your tools (2), (4).

* Choosing poor (default/recommended) settings (1), (4).

* Keeping stateless server/instances when you don't need to (5), (6).

* Not tracking performance as part of monitoring (3), (4)

Albeit, I have made the same mistakes too.

edit: formatting

otterley 12 years ago |

Swap rate (as opposed to space consumed) is probably the #1 metric that monitoring agents fail to report.

One thing that drives me nuts is how frequently monitoring agents/dashboards report and graph only free memory on Linux, which gives misleading results. It's fine to report it, but to make sense of it, you have to stack free memory along with cached and buffered memory, if you care about what's actually available for applications to use.

Another often-overlooked metric that's important for web services in particular is the TCP accept queue depth, per listening port. Once the accept queue is full, new connection attempts from remote clients will fail (ECONNREFUSED, or a timeout if the kernel silently drops them), which is a bad place to be. This value is somewhat difficult to obtain, though, because AFAIK Linux doesn't expose it.

InclinedPlane 12 years ago | |

> One thing that drives me nuts is how frequently monitoring agents/dashboards report and graph only free memory on Linux, which gives misleading results. It's fine to report it, but to make sense of it, you have to stack free memory along with cached and buffered memory, if you care about what's actually available for applications to use.

Even that is misleading. It's actually non-trivial to find out exactly how much "freeable" memory one has on a Linux system these days, as not all the cached memory bits are truly freeable.

rodgerd 12 years ago | |

Even then there are some wrinkles; the anon shared memory used by e.g. the Oracle SGA will show up as cached memory, but evicting it is a no-no.
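Newer Linux kernels (3.14 and later, as far as I know) publish their own estimate as MemAvailable in /proc/meminfo. A small sketch that prefers that field and falls back to the free + buffers + cached stack described above; the helper names are invented, and the fallback overestimates for the reasons given in this subthread.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo text into a dict of {field: value} (values mostly in kB)."""
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[name.strip()] = int(parts[0])
    return fields


def available_kib(fields):
    """Prefer the kernel's MemAvailable estimate when present.

    The fallback stacks MemFree + Buffers + Cached, which overestimates:
    shared memory is counted in Cached but cannot simply be dropped.
    """
    if "MemAvailable" in fields:
        return fields["MemAvailable"]
    return fields["MemFree"] + fields["Buffers"] + fields["Cached"]
```

Feed it the contents of /proc/meminfo, e.g. available_kib(parse_meminfo(open("/proc/meminfo").read())).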

justincormack 12 years ago | |

Yes, I can't find the socket backlog anywhere in Linux. FreeBSD exposes it via kqueue http://www.freebsd.org/cgi/man.cgi?query=kqueue through the data item in EVFILT_READ.

otterley 12 years ago | | |

With FreeBSD it's even easier; you can use "netstat -L".

marcosdumay 12 years ago | |

Swap rate still looks like the wrong metric. It'd be better to have the rate of swap lookups, excluding all writes.

otterley 12 years ago | | |

swap-in rate, to be more specific. swap-outs aren't incredibly worrisome.
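A sketch of measuring the swap-in rate on Linux by sampling the cumulative pswpin counter in /proc/vmstat twice. The 5-second interval and the function names are arbitrary choices for illustration.

```python
import time


def parse_vmstat(text):
    """Parse /proc/vmstat text ("name value" per line) into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def rate_per_second(before, after, elapsed_seconds, key="pswpin"):
    """Pages per second for one cumulative counter between two samples."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return (after[key] - before[key]) / elapsed_seconds


def sample_swap_in_rate(interval=5.0, path="/proc/vmstat"):
    """Read the counters twice, `interval` seconds apart, and return pages/s swapped in."""
    with open(path) as f:
        before = parse_vmstat(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_vmstat(f.read())
    return rate_per_second(before, after, interval)
```

Passing key="pswpout" to rate_per_second() gives the swap-out rate if you want to graph both.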

bradleyland 12 years ago |

Interestingly, an out-of-the-box Munin configuration on Debian contains nearly all of these. I recommend setting up Munin and having a look at what it monitors by default, even if you don't intend to use it as your monitoring solution.

hansjorg 12 years ago | |

Installation on Debian/Ubuntu is also as simple as installing the munin package (munin-node for subsequent hosts) and pointing a webserver at the right directory.

Extremely valuable when something is acting up.

tantalor 12 years ago |

Some people, when confronted with a problem, think “I know, I'll send an email whenever it happens.” Now they have two problems.

dredmorbius 12 years ago |

The corollary of this post is "things we've been monitoring and/or alerting on which we shouldn't have been".

Starting at a new shop, one of the first things I'll do is:

1. Set up a high-level "is the app / service / system responding sanely" check which lets me know, from the top of the stack, whether or not everything else is or isn't functioning properly.

2. Go through the various alerting and alarming systems and generally dial the alerts way back. If it's broken at the top, or if some vital resource is headed to the red, let me know. But if you're going to alert based on a cascade of prior failures (and DoS my phone, email, pager, whatever), then STFU.

In Nagios, the key is to set up dependencies between services and hosts so that alerts follow those relationships, and to set thresholds appropriately.
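The idea of not alerting on a cascade can be sketched independently of any tool: given which checks are failing and what each one depends on, notify only for failures with no failing dependency. This is an illustration of the principle, not how Nagios implements it; the check names and the function are hypothetical.

```python
def root_cause_alerts(failing, depends_on):
    """Return the failing checks that have no failing direct dependency.

    failing: set of check names currently failing.
    depends_on: dict mapping a check name to the checks it directly depends on.
    """
    roots = set()
    for check in failing:
        parents = depends_on.get(check, ())
        if not any(parent in failing for parent in parents):
            roots.add(check)
    return roots
```

With depends_on = {"web": ["app"], "app": ["db"], "db": ["disk"]} and web, app, and db all failing, only db is reported, so one page goes out instead of three.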

For a lot of thresholds you're going to want to find out why they were set to what they were and what historical reason there was for that. It's like the old pot roast recipe where Mom cut off the ends of the roast 'coz that's how Grandma did it. Not realizing it was because Grandma's oven was too small for a full-sized roast....

Sadly, that level of technical annotation is often lacking in shops, especially where there's been significant staff turnover through the years.

I'm also a fan of some simple system tools such as sysstat which log data that can then be graphed for visualization.

jlgaddis 12 years ago |

Be sure to monitor your monitoring system as well (preferably from outside your network/datacenters)! If you don't have anything else in place, you can use Pingdom to monitor one website/server for free [0].

I was off work for a few months recently (motorcycle wreck) and removed my e-mail accounts from my phone. Now, I have all my alerts go to a specific e-mail address and those are the only mails I receive on my phone. It has really helped me overcome the problem of ignoring messages.

[0]: https://www.pingdom.com/free/

comice 12 years ago |

We monitor outgoing smtp and http connections from anything that requires those services.

And the best general advice I have is split your alerts into "stuff that I need to know is broken" and "stuff that just helps me diagnose other problems". You don't want to be disturbing your on-call people for stuff that doesn't directly affect your service (or isn't even something you can fix).

mnw21cam 12 years ago |

Also: are your backups working?

jsmeaton 12 years ago |

We had a perfect storm of problems only 2 weeks ago.

1. A vendor tomcat application had a memory leak, consumed all the RAM on a box, and crashed with an OOM

2. The warm standby application was slightly misconfigured, and was unable to take over when the primary app crashed

3. Our Nagios was configured to email us, but something had gone wrong with ssmtp 2 days prior, and it was unable to contact Google Apps

3a. No one was paying any attention to our server metric graphs, and we didn't have good enough "pay attention to these specific graphs because they are currently outside the norm" alerting

A very embarrassing day for us that one.

We're now working on better graphing, and have set up a basic ssmtp check to SMS us if there is an issue. Monitoring is hard.

berkay 12 years ago | |

You may want to check OpsGenie heartbeat monitoring, or essentially implement the same idea yourself. Our heartbeat monitoring expects to receive messages (via email or API) from monitoring tools periodically and notifies you via push/SMS/phone if we don't receive one for over 10 minutes. I think this pattern is very useful to ensure that alert notifications are working.

marcosdumay 12 years ago | |

> and have set up a basic ssmtp check to SMS us if there is an issue.

And what will happen when the network (or the alert server) is down?

You must put some check outside your network, with independent infrastructure. Adding another protocol on the same net is still subject to Murphy's law.

berkay 12 years ago | | |

Independent infrastructure is a good idea but not always feasible for everyone. At OpsGenie, to resolve this problem, we came up with a solution we refer to as "heartbeat monitoring". This basically allows monitoring tools to send periodic heartbeat messages to us that indicate that the tool is up and can reach us. If we don't receive heartbeat messages from them in 10 minutes, we generate an alert and notify the admins. Not out-of-band management, but it does the trick to prevent situations like the one jsmeaton described.

http://support.opsgenie.com/customer/portal/articles/759603-...
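For anyone implementing the same idea themselves, the receiving side of a heartbeat check is small. A minimal sketch, assuming the 10-minute timeout mentioned above; the class and method names are made up and this is not OpsGenie's API. It only helps if it runs somewhere that doesn't share a failure domain with the thing sending the heartbeats.

```python
import time


class HeartbeatMonitor:
    """Track the last heartbeat per source and report sources that went quiet."""

    def __init__(self, timeout_seconds=600, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self.clock = clock  # injectable so the logic can be tested without waiting
        self.last_seen = {}

    def beat(self, source):
        """Record that `source` (e.g. a monitoring server) just checked in."""
        self.last_seen[source] = self.clock()

    def overdue(self):
        """Return the sorted names of sources silent for longer than the timeout."""
        now = self.clock()
        return sorted(
            source
            for source, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )
```

Call beat() from whatever endpoint receives the heartbeat messages, and poll overdue() on a timer to decide whom to notify.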

sp332 12 years ago |

You're using icanhazip.com in production? I see from a quick Google search that Puppy Linux seems to use it in some scripts, but how reliable is it?

jphines 12 years ago | |

    $ curl -i -k -L icanhazip.com
    HTTP/1.1 200 OK
    Date: Mon, 10 Feb 2014 20:13:28 GMT
    Server: Apache
    Content-Length: 15
    Content-Type: text/plain; charset=UTF-8
    X-RTFM: Learn about this site at http://bit.ly/14DAh2o and don't abuse the service
    X-YOU-SHOULD-APPLY-FOR-A-JOB: If you're reading this, apply here: http://rackertalent.com/
    X-ICANHAZNODE: icanhazip2.nugget

Would seem only fair. :D

toomuchtodo 12 years ago | |

jsonip.com is also usable in production.

baruch 12 years ago |

About reboot monitoring, I suggest using kdump to dump the oops information and save it for later debugging and understanding of the issue. It may even be an uncorrectable memory or PCIe error you are seeing, and the info is logged in the oops but is hard to figure out otherwise. Also, if you consistently hit a single kernel bug, you may want to fix it or work around it.

lincolnpark 12 years ago |

Also, are your API endpoints working properly.

dredmorbius 12 years ago | |

Can you expand on that?

AznHisoka 12 years ago | | |

Ha I can.

Sometimes API providers change the damn response format. Or their URLs change. Or they block your IP without notifying you.
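A cheap guard against silent format changes is to check the shape of the response, not just the status code. A sketch, assuming the API returns a JSON object; the expected keys and the function names are placeholders you would replace with the fields you actually rely on.

```python
import json


def shape_errors(payload, expected):
    """Compare a decoded JSON value against {key: type}; return a list of problems."""
    if not isinstance(payload, dict):
        return [f"expected object, got {type(payload).__name__}"]
    errors = []
    for key, expected_type in expected.items():
        if key not in payload:
            errors.append(f"missing key: {key}")
        elif not isinstance(payload[key], expected_type):
            errors.append(f"wrong type for {key}: {type(payload[key]).__name__}")
    return errors


def check_body(body, expected):
    """Decode a response body and validate its shape; an empty list means OK."""
    try:
        payload = json.loads(body)
    except ValueError as exc:  # json.JSONDecodeError is a ValueError subclass
        return [f"not valid JSON: {exc}"]
    return shape_errors(payload, expected)
```

An HTML error page or a block notice fails the JSON decode, a moved URL usually fails the same way, and a renamed field shows up as a missing key.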

jlgaddis 12 years ago |

I have gear in three different facilities, and I'm typically not visiting any of them unless I'm installing hardware or replacing it. Shortly after starting at $job, I realized there was no monitoring of the RAID arrays in the servers we have. That could have ended badly.

herokusaki 12 years ago |

How oversold your VPS provider's server is. It's commonly blamed for slowdowns but rarely measured.
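One way to put a number on this from inside a Linux guest is CPU steal time, the eighth counter on the aggregate cpu line of /proc/stat. A sketch under the assumption that the hypervisor reports steal time at all (not all do); the function names are invented.

```python
def parse_cpu_line(stat_text):
    """Return the aggregate 'cpu' counters from /proc/stat text as a list of ints."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            return [int(value) for value in parts[1:]]
    raise ValueError("no aggregate cpu line found")


def steal_percent(before, after):
    """Percentage of CPU time stolen by the hypervisor between two samples.

    Field order: user nice system idle iowait irq softirq steal [guest guest_nice].
    """
    deltas = [b - a for a, b in zip(before, after)]
    total = sum(deltas[:8])  # guest time is already included in user/nice
    if total <= 0 or len(deltas) < 8:
        return 0.0
    return 100.0 * deltas[7] / total
```

Sample /proc/stat twice a few seconds apart and graph the result; a steal percentage that stays high is a hint that the host is contended.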

stephengillie 12 years ago |

Between PRTG and Windows, almost all of that is handled for us. And PRTG can call OMSA by SNMP.