Planning for Failure

When developing applications, it is easy for programmers to think only in terms of success. Code is built to expect certain results because the programmer is thinking of how the application should work. Whether the specifications are being outlined or the code itself is being tested before use in production, it's always easier to assume that there are only so many use cases and that the application just needs to account for those. This line of thought can result in buggy code and frustrated developers, because you're planning for a perfect environment that never truly exists.

I've had the opportunity to participate in disaster planning in a few different positions. We looked at a number of possibilities, ranging from catastrophic events to simple human errors, but it was the manner of thought that really got me thinking about application development. One way to approach disaster planning is to imagine certain familiar scenarios and play out how the system would react. The other, and in my mind the better, approach is to analyze the system and identify its weaknesses. Instead of thinking 'a fire could burn down the data center' you can think 'the data center is unavailable'. How the data center goes down is not important. It's just gone. It could be fire or flood or an EMP, but as far as you and your customers are concerned, the data center is gone.

Going down to the code level, I feel that unit testing suffers from the same fallacy. If a method is looking for an integer and should fail elegantly when a string is passed instead, you can write a test that throws strings at the method and ensures that an elegant failure comes back. As far as the developer and the end customer are concerned, though, it shouldn't matter whether a string, array, object, or anything else is passed in. Only an integer should work, and everything else should fail elegantly.
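
To make that concrete, here is a rough sketch in Python. The function name parse_quantity and its (ok, result) return shape are made up for illustration, not taken from any real project. The point is that the check is written as "anything that is not an integer fails the same way" rather than "strings fail":

```python
def parse_quantity(value):
    """Hypothetical example: accept only an integer, reject everything else.

    Returns (True, value) on success or (False, message) on failure.
    """
    # bool is a subclass of int in Python, so it is excluded explicitly.
    if isinstance(value, int) and not isinstance(value, bool):
        return True, value
    return False, f"expected an integer, got {type(value).__name__}"


# Rather than testing only the one bad input we happened to imagine
# (a string), confirm that every non-integer takes the same failure path.
bad_inputs = ["42", [1, 2], {"n": 1}, object(), None, 3.5, True]
for bad in bad_inputs:
    ok, message = parse_quantity(bad)
    assert not ok, f"{bad!r} should have been rejected"

# The single supported case still works.
assert parse_quantity(42) == (True, 42)
```

The list of bad inputs is only a sample. What matters is that the method defines success narrowly and treats everything outside of it as one failure case.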

Now, I'm not saying that unit testing is a bad idea for development. There are many positive attributes to unit testing, especially in a team development environment. However, when it comes to quality assurance or code review, just thinking in terms of how the code should work is not enough. It is also not enough to think of specific cases that would break the code. Instead, you should look at the system as something that can, and will, fail at some point. The first job of a code reviewer or team lead is to determine what happens when the system fails, and to make sure that backup processes and graceful error handling are in place.
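
The same idea can be sketched for a backup process. This is only an illustration in Python with hypothetical names (fetch_with_fallback, primary_data_center, backup_data_center), and it assumes a backup actually exists to fall back on. The wrapper does not care why the primary failed, only that it did:

```python
def fetch_with_fallback(primary, backup):
    """Hypothetical helper: call primary(); on any failure, use backup()."""
    try:
        return primary()
    except Exception as exc:
        # The cause is reported for later root-cause analysis, but it does
        # not change the response: the primary is simply unavailable.
        print(f"primary unavailable ({exc!r}); using backup")
        return backup()


def primary_data_center():
    # Stand-in for a real call. Fire, flood, or EMP all look the same here.
    raise ConnectionError("data center is unavailable")


def backup_data_center():
    return "served from backup"


result = fetch_with_fallback(primary_data_center, backup_data_center)
```

Catching every exception this broadly is a deliberate simplification for the sketch. In real code you would likely want to log the cause properly and decide which failures the backup can actually cover.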

Once all points of failure have been addressed, you can start looking at root causes. What would have caused this specific failure? Is there a user action or logic mistake that led to it? The beauty of this method is that even if you don't find all of the root causes immediately, there are always failsafes in place to protect the integrity of the system. Once you have a plan for what happens when the data center is lost, you can start tracing back the individual reasons why it might disappear and try to avoid that costly and inconvenient occurrence.