Thursday, 9 November 2017

Complex systems, root cause analysis and failure

I just read http://www.michaelnygard.com/blog/2017/11/root-cause-analysis-as-storytelling/ and it reminded me of the classic "How Complex Systems Fail" (http://web.mit.edu/2.75/resources/random/How%20Complex%20Systems%20Fail.pdf).

We are building complex systems all the time, and it's actually scary how many defenses against failure are built into them. These defenses can be as simple as checking the return value of a function, or more complex, with fallbacks and alternative implementations. They aren't scary because they are there; they are scary when you realize that if even one of those defenses is missing, things go bad pretty quickly.
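To make the two kinds of defenses a bit more concrete, here is a minimal sketch in Python. Everything in it (`parse_port`, `port_from_config`, `with_fallback`, the default port 8080) is a made-up example of mine, not code from any real system or from the linked articles. It only shows what "checking a return value" and "falling back to an alternative implementation" can look like.

```python
from typing import Callable, Optional


def parse_port(raw: str) -> Optional[int]:
    """Return a valid TCP port, or None if the input is unusable."""
    try:
        port = int(raw)
    except ValueError:
        return None
    return port if 0 < port < 65536 else None


def port_from_config(raw: str, default: int = 8080) -> int:
    """Simple defense: check the return value instead of assuming success."""
    port = parse_port(raw)
    if port is None:
        # Hypothetical default, chosen only for this illustration.
        return default
    return port


def with_fallback(primary: Callable[[], int], fallback: Callable[[], int]) -> int:
    """More complex defense: use an alternative implementation if the primary raises."""
    try:
        return primary()
    except Exception:
        return fallback()
```

For example, `port_from_config("abc")` quietly returns the default, and `with_fallback(lambda: int("x"), lambda: 7)` returns the fallback value. Remove the `None` check or the `try`/`except` and the same bad input turns into a crash, which is pretty much the point: each defense is tiny, and each one is carrying weight.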

Currently, humans are still better than anything else at defending these systems. They create workarounds and processes that avoid potential failures. It might be really interesting to apply machine learning in these situations, to try to find the sets of actions that lead to failures.

But meanwhile, we have to learn from our systems ourselves, so try to avoid hunting for that one root cause.