The Opiates of Root Cause and Counterfactual Reasoning

April 1, 2025

It's frustrating, but it doesn't surprise me, that we're still seeing incidents framed in terms of a "root cause". The concept itself is misleading and obstructive because the term "root", as in "getting to the root of the problem", suggests a single origin point from which all other events then emerge. It's an unhelpful short-hand because it stops us from exploring further back than the perceived origin point: as though there were no way to break the problem down into further constituent parts.

I've been in several incident reviews recently where a template was filled out and the section for "Root Cause Analysis" was dutifully completed with something along the lines of "a technical component failed" or "a vital infrastructure system ran out of resources". And that's it: we've explained what went wrong and why customers were angrily submitting support tickets. Once we've identified the part of the system that failed, we might look at why that particular part failed and then stop exploring. It's obvious, it's easy. Job done.

Except it's never that easy and often the actual contributing factors aren't that obvious.

During an incident review today, as multiple people discussed the "root cause" and the conversation orbited around one technical "component failure" event, I listened as other stakeholders from our customer support team said things like "I was focused on the impacts of the actual change rather than the possible risks of it", "we'd released part of the change weeks ago and assumed that customers wouldn't be affected by this next piece", and "we had talked about it and agreed as a group that the messaging around the change was ok. I didn't know we had to also consider that other thing". When you listen closely, you start hearing verbal cues that lead you to wonder whether a big part of what contributed to the problem wasn't the technical failure described in the Root Cause Analysis section of the template at all. The conditions for the failure had begun to form several weeks before the change was actually implemented.

Root Cause

Conceptually, most people instinctively think of a root cause as a starting point: either the launching point for an investigation or the point of origin of a failure. The former is better, but it's still not an ideal interpretation because the phrase doesn't really fit either definition. If anything, the "root cause" is the arbitrary point in time at which several related sequences of prior events and decisions came together to produce an undesirable outcome. It's like saying the sinking of the Titanic and the ensuing loss of life happened because of icebergs. If icebergs are your root cause, you're going to spend a lot of time and effort trying to reduce the probability of encountering icebergs in future maritime transport (although one might very reasonably argue humanity has done especially well at exactly that, but that's an entirely separate blog post).

The idea of a singular root cause is problematic in multiple ways:

  • It assumes there's a single instigating trigger to the event rather than a confluence of multiple factors
  • It stops further, deeper analysis, and therefore learning
  • It typically attributes problems to technical or procedural factors at the exclusion of human and psychological ones
  • It assumes that one person's mental picture of how an incident occurred and how it impacted them is consistent for everyone involved
  • It frames the analysis of events in negative terms such as failure or error, which can lead people to exclude positive factors and outcomes for involved systems or processes
  • It centres the story of the event around the storyteller

But for all that, a root cause makes us feel like we understand the problem space. It provides a comforting sense of security that suggests we've plugged the leak and can go back to normal operations.

Counterfactual Reasoning

Counterfactual reasoning is an analytical trap that frames the discussion and analysis of an incident or event in terms of what didn't happen, rather than what did and why. This form of reasoning prioritises what the system did not do at the expense of exploring what did happen and - by extension - looking at why it made sense for the decision-makers to proceed the way they did.

The problem with this is that it's harder to draw practical, actionable insights and lessons from a set of conditions that did not happen. By avoiding discussing what actually did happen and why, counterfactual reasoning essentially attributes blame to an imaginary set of circumstances that were never present in the first place. This analytical trap makes it seem easier to explain what led to a particular outcome, because more thorough discussion and investigation feels less necessary. It's a placebo that helps the storyteller feel comfortable that they've provided an adequate explanation for a set of conditions or outcomes.

Summary

Short-hand phrases like "root cause" and quick-solve habits that dwell on what didn't happen, or talk in hypotheticals like "if only X had happened, then things would have been fine", reduce the effectiveness of incident reviews and, in turn, reduce opportunities for learning. While these tools offer a quick sugar hit in terms of answering difficult questions or providing faster explanations, they ultimately don't do as much to move your team forward as taking the time to explore your incident's context and history deeply and patiently.