Thursday, October 27, 2011

Monitoring, part I: What and why, with example

Originally published at the Sysart DevBlog; please comment there.

One important part of administering servers is monitoring. Monitoring tells you when something goes wrong and why it went wrong, preferably before users notice the failure.

Monitoring has two somewhat distinct parts: alerts and statistics. At a high level, alerts notify operations staff when something is wrong, for example when a server stops responding. Statistics then give you some idea of what went wrong.

We are currently using Icinga as an alerting tool and Munin as a statistics gatherer. Installing and configuring these tools is a little complicated, but by following the documentation it is possible to have a pretty decent monitoring system running in a few hours.
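To give an idea of what the configuration involves, here is a minimal sketch of an Icinga (Nagios-style) host and service definition. The host name, address, and check are hypothetical, not our actual setup:

```
define host {
    use        generic-host
    host_name  app-server          ; hypothetical host
    address    192.0.2.10
}

define service {
    use                  generic-service
    host_name            app-server
    service_description  HTTP
    check_command        check_http   ; standard plugin shipped with Icinga
}
```

With a definition like this, Icinga polls the service on a schedule and notifies the configured contacts when the check starts failing.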

Here's an actual case where monitoring proved to be essential. We had a situation where one server would stop responding after running for a few weeks, and the only way to resolve this was to restart the server software. To make things worse, this particular system wasn't monitored at all and was running some business-critical services. So the first step was activating Icinga on that server so we would get alerts when the software stopped responding; after that, Munin was installed with a pretty basic configuration.

After a few weeks, the following graphs were visible:

[Munin "Netstat" graphs of active network connections, including "Netstat - by month"]
These graphs show the number of active network connections. As can clearly be seen from "Netstat - by month", we had a pretty serious resource leak. We added a new alert to Icinga that monitored the active connection count and sent a warning early enough that we could restart the server before resources were exhausted. The time window for restarting was pretty easy to deduce, because the Munin graphs showed both how fast the connection count was climbing and the upper limit for connections.
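A connection-count alert like this can be sketched as a small Nagios/Icinga-style plugin. This is an illustration rather than our actual check: the thresholds are made up (in practice you would read them off the Munin graphs), and the netstat invocation varies by platform:

```python
import subprocess
import sys

# Nagios plugin exit codes that Icinga understands.
OK, WARNING, CRITICAL = 0, 1, 2

def classify(count, warn, crit):
    """Map an active-connection count to a Nagios-style status and message."""
    if count >= crit:
        return CRITICAL, "CRITICAL - %d established connections" % count
    if count >= warn:
        return WARNING, "WARNING - %d established connections" % count
    return OK, "OK - %d established connections" % count

def established_connections():
    """Count ESTABLISHED TCP connections via netstat (flags vary by platform)."""
    out = subprocess.run(["netstat", "-tn"],
                         capture_output=True, text=True).stdout
    return sum(1 for line in out.splitlines() if "ESTABLISHED" in line)

# As a plugin, the main body would print the message and exit with the
# status code, which Icinga interprets:
#   status, message = classify(established_connections(), 500, 800)
#   print(message)
#   sys.exit(status)
```

The warning threshold should sit well below the critical one, so the alert fires while there is still time to schedule a restart.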

The actual reason for this is still somewhat of a mystery, but it was tracked down to one specific proxy server that one of our clients used: it just didn't close connections properly. So we added a timeout for idle connections, which should have been there from the beginning. The monthly graph currently shows this:


[Munin "Netstat - by month" graph after the fix]
Just try to guess when the fix was deployed.