I love the Air Crash Investigation series. The acting is a bit cheesy, but the reenactments are detailed and fascinating. Mostly, though, I love watching the detective work, the process and, above all, the systemic change (in most cases anyway) that leads to greater safety in air travel. There's something reassuring about that. I wish we took the same approach to passenger vehicle road safety; now there's an area that could do with a bit more systemic culture improvement!
The software industry likes to model its major incident processes on that same airline industry. Blameless post-mortems, safety culture, continuous improvement... all worthwhile objectives that make workplaces and products better, and improve the outcomes for customers as a result. If things burst into a towering column of flames and NewRelic error logs, there's a process for that! You work the problem, solve one challenge, then the next, then the next. And if you solve enough problems, Matt Damon gets rescued from Mars and you get to write an incident report!
Incident Reports Will Not Save You
There's something about writing an incident report that feels - at a practice level at least - like a warm, snuggly duvet that will keep you protected from the elements next time. You fill out a really good, thorough template with all the right headings, timelines and a nice clean layout to boot. And you save it somewhere sensible, like a Google Drive folder named "Incident Reports" or a Confluence page under a section coincidentally also titled "Incident Reports". In an ideal world, all the people who were involved in the incident contribute something to that incident report, like a secret society whose members each contribute some deeply personal trinket. The incident reports become an archaeological archive of ancient knowledge known only to a few: a fount of lost wisdom the likes of which an ancient Incan civilisation would have been proud to curate.
John Allspaw has a long, documented history in resilience engineering and is one of the most respected names in the space. One thing he is not recognised for is lengthy conversations endorsing incident reports. Rather, he (and other prominent leaders in the safety and resilience sphere) talks about:
- Learning
- Observation
- Systems thinking
- Analysis
An incident report typically explains what went wrong, how it was fixed and maybe what measures will be put in place to avoid this specific thing next time. But that's the problem with software: it changes. Software engineering is predicated entirely on change. It's an adaptive system responding to human factors in a complex ecology. The things you put in place now are not guaranteed to be the things that prevent a new issue next time, and the steps you took to identify and fix this issue won't necessarily match the conditions you encounter in future.
Rather than asking only what you learned or uncovered, some better questions would be:
- "how did we learn?"
- "who was learning?"
- "what insights were shared by people involved? What ones were different?"
- "when / where does the learning happen?"
Writing an incident report is OK if that's what your process requires, but put some thought into why your process requires it. Is it because higher-ups expect some tangible artifact showing due diligence? Maybe other stakeholders do? If the report is designed to inform and educate your teams, then it needs to be written and shared in a way that is engaging and clear, and that gives a full picture of how multiple factors flowed together to create the right conditions for this issue to happen. That means getting out of the habit of using the words "root cause". Events don't happen in isolation; a system has multiple factors that need to be true before things actually break. Often those factors include:
- how observable the patterns leading up to the issue are
- how well-understood the "normal" modes of operation are
- how well-shared and comprehensive people's knowledge of the system is
And that's just to name a few. Providing a "lessons learned" section in a templated incident report, or filing the report in an archive where people are expected to take time out of their busy day to read it, does NOT create conditions for your people to learn. Those lessons need to be given context, narrated in an engaging way, and woven into a story of how this river of conditions flowed to this particular outcome, so that your people can better see the important signals for what they are.
The objective isn't to stop incidents from happening. Every time you introduce a change, you create a new system. The objective is to exercise your organisational muscle enough that your responses and reactions become stronger and more resilient.