perjantai 28. lokakuuta 2011

Pecking order and crane flogs.

I just read a interesting article about software teams and pecking order.

Even though it is said in the article that software teams aren't like sport teams or flocks, I want to share a view from one ice-hockey coach, Hannu Aravirta. He described his management style to be flying in V-formation, like flock of cranes. The person who is in the front of the formation is changed regularly to give breathing time for everyone. Even though in ice-hockey team (in this case, national team of Finland) there is a formal ranking given by the organization, everyone will take turns in the front of the flock. And everyone must lead the flock in the same direction.

In real world, the rotation is not random, though. The stronger birds might be in the front for a longer time (I'm not sure if this is true). This happens also in the context of software development. The flock, or team, will face variety of problems and issues. Some of those are technical, some are business related, and so on. So it should be the person who has the best capabilities to resolve the current issue who is in front of the flock, giving the direction where to go.

Like all analogies, this isn't complete. It gives an impression that the current leader of the flock will make all decisions alone, which is not the case.

torstai 27. lokakuuta 2011

Monitoring, part I: What and why, with example

Originally published at Sysart DevBlog, please comment there.

One important part of administrating servers is monitoring. Monitoring tells you when something goes wrong and why it went wrong, preferably before users notice the failure.

There's two somewhat distinct parts in monitoring, alerts and statistics. At higher level, alerts will notify operations staff when something is wrong, for example server stops responding. Then statistics gives you some ideas what went wrong.

We are currently using Icinga as a alerting tool and Munin as a statistics gatherer. Installation and configuration of these tools is little bit complicated, but by following documentation it is possible to have pretty decent monitoring system running in few hours.

Here's an actual case where monitoring proved to be essential. We had a situation where one server would stop responding after running few weeks and only way to resolve this was restarting server software. To make things worse, this particular system wasn't monitored at all and was running some business critical services. So first step was activating Icinga on that server so we could get alerts when the software stopped responding and after that, Munin was installed with pretty basic configuration.

After few weeks, following graphs were visible:

These graphs show information about network connections which were active. As it can clearly be seen from the "Netstat - by month", we had a pretty serious resource leak. We added a new alert to Icinga, which monitored the active connection count and sent an warning early enough so we could restart that server before resources were exhausted. The time window for restarting was pretty easy to deduct, because we could use graphs from Munin to see how fast the connection rate was going up and the upper limit for connections.

The actual reason for this is still somewhat a mystery, but it was tracked into one specific proxy server that one of our client used: it just didn't close connections properly. So we added an timeout for idle connections, which should have been there from the beginning. And currently monthly graph shows this:

Just try to guess when the fix was deployed.