Failure is not bad


I'm going to start this article in the most illogical manner possible, by telling you a cautionary tale about a time when learning from your mistakes could actually be a bad thing.

Sometimes learning from mistakes can be dangerous. For example, at university, when I was studying nuclear engineering, we would often experiment with new shielding techniques. To do this, we would set up shielding plates of alloy or plastic in a heavily shielded testing chamber, place a sealed casket of Caesium-137 in front of the experiment and close the chamber door. Next, using a robotic arm, we would remove the lid of the casket and take readings from the counter on the wall of the containment chamber. This showed us how effective the shielding was at blocking or deflecting radiation. Afterwards there was always a process of replacing the cap and watching the counter decrease to a safe level; this was how we knew we had correctly replaced the casket lid.

One of the other students on the course, a guy we will call 'Brad', liked to rush this step. Instead of waiting for the counter to drop to the correct background level, which could take 20 minutes or so, he would wait 5 minutes for it to drop to around 40% above background, then open the chamber, make adjustments to the shielding and start over. The count took time to drop because the instruments we were using were sensitive. In reality, once the top was securely back on the casket there was very little risk from nuclides and beta/gamma; however, the procedure was always to wait until the sensors said it was safe.

One day, Brad came up with a new shielding configuration, which deflected the radiation from the source, and wanted to test it. Now we have finally arrived at the key point, thanks for sticking around this long. Having safely capped the source, Brad replaced the shielding plates with what he thought was a slightly thicker alloy, but he had instead picked up the lead shielding plates, which are capable of stopping at least 85% of the radiation from the source.

Brad closed the drawer and, using the robotic arm, removed the cap. The sensors weren't reading much at all from the source, so Brad figured something wasn't right. Thinking 'OK, I'll just check the plates', he placed the cap back onto the casket, only this time the arm slipped a little and the lid wasn't completely on, leaving a gap of at least 10mm. Usually this would cause the sensors to continue to report high levels of radiation in the chamber, but because of the shielding mistake they were reading as if the cap was securely on. Brad glanced at the readout, saw his expected low count of 20%, walked over to the chamber and opened it.

He was instantly greeted by loud alarms sounding behind him as the lab's safety alarms detected dangerous levels of radiation. Brad was confused and, instead of retreating to the corridor as procedure required (the corridors in this building all have lead linings), he proceeded to look inside the chamber. He had been exposed to the full force of the Caesium-137 for a full 20 seconds before he realised what was happening. Again, with the opportunity to retreat, he decided to try to knock the cap into place with his hand. He succeeded. The alarms promptly stopped, and moments later members of the safety team came rushing in and Brad was taken to hospital.

Brad had been exposed to 35 seconds of direct contact with a very nasty source. He almost lost his hand but, fortunately, thanks to good medical attention, he recovered with his hand intact. He had exposed himself to 3 times the expected lifetime dose for someone working in the nuclear industry before even getting his degree. He doesn't work in the nuclear industry now. OK, no more nuclear examples from now on, I promise.

Why did I start this article with a horror story about how a mistake nearly cost someone their hand? Because I wanted to highlight just how extreme things need to be for you to be in a situation where you shouldn't learn from a mistake. Those situations are rare, like really rare, and usually there are clear markers to indicate 'Hey, this is one of those don't-learn-from-your-mistakes kind of things'. There were countless points where Brad was breaking the processes put in place or ignoring safety warnings. Within your lifetime you will make countless mistakes on a weekly basis, all of which have at least one positive outcome: lessons learned. And hopefully you won't lose any hands making them!

This is why it's important to understand that failure is part of the learning and maturity process. When you are trying to create a new process or mature an existing one, looking at where the process failed or almost failed can quickly identify where it can be improved, and it can also leave hints about how it can be improved. Take, for example, the Cloudflare outage of 2019, which was caused by the deployment of a bad configuration to the WAF rules. For a short while it broke a large chunk of the internet. Services were restored well within 24 hours (Cloudflare's own post-mortem puts the outage at roughly half an hour), but not before causing massive disruption to systems across the globe. We know what failed, that part is easy, but we need to make sure we learn from the mistake: how can we better protect ourselves when an external vendor we rely on goes down, which is totally out of our control?

The answer for the company I was working for at the time was simple. Have a failover: either turn off the protection Cloudflare offers temporarily, or switch to an alternative provider. The important takeaway was: 'Have a process in place and ready. It doesn't matter what, but make sure we have a quick reactive process in place'. Learning from mistakes can sometimes be complex, especially if the mistake is complex, so here is my take on how to do it. It may not work for everyone, but it works for me:

Dive in as soon as you can

When you've managed to get the servers back up, and the customers have stopped screaming at you, the temptation is to mentally go 'Right, that's that'. But the best time to do an in-depth analysis of the situation is usually as soon as it's resolved. If I've had a meeting with folks and it's not gone well, perhaps because I didn't put the points across correctly or missed something out, I always have a meeting directly afterwards with my team to go over what we said, why we said it and how we can improve, which brings me onto the next point.

Get feedback

Feedback from other people often helps you gain perspective. In some cases asking a customer how badly you failed them would be totally inappropriate, but asking members of the team 'Hey, how do YOU think we could have resolved or prevented this?' is a productive question, and it usually elicits a data-rich answer.

Develop a deep understanding

Once you've deep dived into the problem and gathered as much information as you can, it's time to develop an understanding of what you can learn from it. In software engineering, understanding often comes in the form of a problem statement and a solution design, so why not adopt that in real life? Define the problem you encountered, and build a solution.

Practise, Practise, Practise

More often than not, if you complete this process without practice, you will never improve. Practice can take the form of table top exercises, exposure to a specific task as often as possible, or reading and understanding processes and policies more. After I had a bad meeting where I demoed a new process or app to customers, I took away some key points of failure and then volunteered for every demo for the next few months. This helped me engage better, understand what management wanted from me and become a better communicator. Remember, no one in the history of anything ever used the line:

"The more I practised, the worse I got"

So let's sum up! Dive in, get feedback, develop an understanding and practise, practise, practise. I hope this methodology helps you as much as it helped me.

 

TL;DR

Embrace failures as learning opportunities when refining processes. Analyse mistakes promptly to identify improvements. For instance, after the Cloudflare outage of 2019, companies should prepare fallback plans for critical services. Key steps include immediate analysis post-incident, gathering feedback for perspective, deepening understanding of the issue, and practising solutions to enhance skills and preparedness.