RCA, a.k.a. What the bleep just happened in our code
Errors happen. They always have and they always will, and the IT world is no exception. The increasing complexity of our systems and solutions makes them more vulnerable to human, machine, and integration mistakes. We can (and should!) do our best to avoid them by writing good tests, simulating malfunctions, doing regular code reviews, etc., but we should also be prepared for problems. And not only to fix them: it is even more important to learn from them. In this article, I want to share my thoughts about the process called Root Cause Analysis (RCA), one of the most useful techniques for learning your lesson after a problem.
Why should you do RCA?
Let’s be honest: no one likes talking about their mistakes. Also, we (as developers) prefer building stuff to sitting through long and arduous discussions and analyses. But trust me, fixing our processes is worth that effort.
There is a popular analogy describing why you need RCAs. Imagine a huge company producing cars. One day, one of the employees slips on the floor and breaks their leg. The immediate action is to take them to the hospital and put the leg in a cast. But is that the end of the story? Not yet. Let’s say the company changes nothing after that event. A few days later another person breaks their leg, and a week after that one of the machines stops working, causing delays for the whole production line. As you may expect, this is not a random series of events. The investigation discovers that people slipped on oil leaking from the machine, and the missing oil caused a piston in the machine to seize. Finding the real root cause of the first accident could have prevented the second injury, the production delays, and huge financial problems for the company.
The same story can happen to your system. If you do not find the real root of the errors in your system but just patch them superficially, they will recur. Also, these small problems may be just the first symptoms of a bigger one. Do not ignore them! The faster you react, the smaller the impact and the repair costs will be. You should not only fix the problem but also take a while to understand why it happened and how to improve your processes to avoid it in the future.
What can the RCA process look like?
OK, we know why it’s important, but the question is: how do you conduct a useful root cause analysis? Fortunately, many people have asked that question before, and we can use their knowledge. There are many different approaches and frameworks, but I’ll describe how I approach these problems, based on the 6-step method defined by the American Society for Quality (ASQ) and the 5 whys method developed by Sakichi Toyoda of Toyota. The two approaches are not mutually exclusive; in fact, they complement each other. ASQ describes the big picture and the steps to take, while the 5 whys method focuses on the third step of ASQ’s system.
The 6-step system by ASQ
Let’s be honest: this system is not creative or breathtaking. After some time you could probably come up with something similar. But I like simple and well-defined checklists to follow. Thanks to them, you know you didn’t miss any important step and you can focus on the areas that are more problematic or specific to your problem. (Btw, I strongly recommend the book “The Checklist Manifesto” to fully understand why checklists are so great.) So, what is the RCA framework defined by ASQ?
- Define the event: clarify the problem and define its scope. Check that everyone is on the same page and understands it the same way.
- Find causes: find potential causes of the event defined in the previous step.
- Find the root cause (I’ll say more about this step below).
- Find solutions: what can you do to fix the root problem and prevent it in the future?
- Take action: talking about changes is not enough. Nothing will change without action!
- Verify solution effectiveness: how can you verify that the solution really works?
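If you like your checklists executable, the six steps can be sketched as a tiny Python helper. This is only an illustration under my own assumptions: the `RcaChecklist` class, its methods, and the sample incident are hypothetical and not part of anything ASQ defines.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# The six steps, in the order they are listed above.
RCA_STEPS = (
    "Define the event",
    "Find causes",
    "Find the root cause",
    "Find solutions",
    "Take action",
    "Verify solution effectiveness",
)


@dataclass
class RcaChecklist:
    """Hypothetical helper that records one note per RCA step, in order."""

    event: str
    notes: Dict[str, str] = field(default_factory=dict)

    def next_step(self) -> Optional[str]:
        """Return the first step without a note, or None when all are done."""
        for step in RCA_STEPS:
            if step not in self.notes:
                return step
        return None

    def complete_next(self, note: str) -> str:
        """Attach a note to the next open step and return that step's name."""
        step = self.next_step()
        if step is None:
            raise ValueError("all steps are already complete")
        self.notes[step] = note
        return step


# Made-up incident, used only to show the flow.
checklist = RcaChecklist(event="Checkout service returned errors after a deploy")
checklist.complete_next("Scope: checkout service only, for about 20 minutes")
print(checklist.next_step())  # Find causes
```

Forcing the steps to be completed in order is a deliberate choice here: it makes it harder to jump straight to "Take action" before the root cause has been written down.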
Let’s focus on the third step from the list above, “Find the root cause”. It sounds easy, but how do you actually do it? Here we can use the 5 whys method mentioned earlier. It’s super easy to remember and to perform. The only thing you have to do is ask the short but powerful question “why?” several times (five in the original method, but sometimes fewer or more). Let’s try it on the earlier example of the employee who broke their leg at the company.
Our event: One of the employees broke their leg while working in the factory.
- Why? Because they slipped and fell.
- Why? Because the floor was wet.
- Why? Because there was oil on it.
- Why? Because oil leaked from the machine.
- Why? Because the casing of the machine was rusty and a hole had formed in it.
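If you want to keep such chains in your incident notes, here is a minimal Python sketch of the same exercise. The `five_whys` function is a hypothetical helper of mine, not a standard tool, and it simply treats the last answer as the root cause candidate, which is an assumption you should always double-check with your team.

```python
from typing import List, Tuple


def five_whys(event: str, answers: List[str]) -> Tuple[List[str], str]:
    """Format a why-chain and return it with the last answer as the candidate root cause."""
    if not answers:
        raise ValueError("at least one answer is needed")
    lines = [f"Event: {event}"]
    for depth, answer in enumerate(answers, start=1):
        # Indent each level a bit more to show how deep we have dug.
        lines.append(f"{'  ' * depth}Why? Because {answer}")
    return lines, answers[-1]


report, root_cause = five_whys(
    "One of the employees broke their leg while working in the factory.",
    [
        "they slipped and fell.",
        "the floor was wet.",
        "there was oil on it.",
        "oil leaked from the machine.",
        "the casing of the machine was rusty and a hole had formed in it.",
    ],
)
print("\n".join(report))
print("Root cause candidate:", root_cause)
```

The function accepts any number of answers on purpose, since the chain does not always need exactly five questions.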
As you can see, solving only the first issue would be ineffective: the oil would keep leaking and cause additional problems. The same applies to our systems. It’s important not only to put a band-aid on the wound but also to heal the real problem. And that’s why RCAs are so useful!
Sharing the knowledge
The best way to learn is from mistakes… made by others ☺ Be a good citizen and share what you learned from your mistakes, so others can avoid them. Write a post on your company’s Slack, send an email with your thoughts, or publish a Medium article. Who knows? Maybe you’ll become a superhero for someone and save their system.
P.S. If you found this article useful, please hit the clap button below or share it with your friends. Thanks!