Cloudflare is right to not fire anyone over their recent outage
Cloudflare recently had a major outage. The faulty code made it past the usual automated systems and tests because no one anticipated that this kind of code could cause an outage.
Cloudflare’s CTO tweeted, “I gave that little team a pep talk and said I didn't need to know the name of the individual who wrote the bad rule.” It seems that whoever typed out the faulty regex won’t be reprimanded or fired, despite people on internet forums calling for just that.
Protecting the individual employee is the right response to this incident, and it is important to understand why. When a problem occurs in complex distributed software like the kind Cloudflare runs, the fault lies not with any individual’s action, or with any single failure, but with the system. An outage requires multiple failures: one person’s faulty code was merged, automated testing and human review did not reveal a problem, and the code was deployed into production. Depending on their internal process, there could be other checkpoints that also failed to prevent this outage.
By their very nature, complex software systems are unpredictable, and there is nothing we humans can do to stop things from breaking sometimes. At the same time, we naturally want a cause-and-effect explanation for accidents. This desire has manifested itself in the world of software many times over the years. It's human nature to call for whoever wrote the code that caused the outage to be fired. Hindsight bias (point 8 in How Complex Systems Fail) is the feeling, while reading a postmortem, that the cause should have been obvious at the time. This bias is powerful and blinds us all at times.
But firing someone over an outage would simply make them a scapegoat, without improving reliability. Cloudflare’s goal is to make their system as reliable as possible. To do so they should (and likely will) address every failure in the chain that caused this outage, preventing this combination of failures in particular from ever repeating. Ideally, their changes will make the system more resilient without introducing new faults.
Attributing an outage to a single individual is a fallacy. Beyond that, an employee involved in a system failure and its recovery will have gained valuable experience that they will use to prevent future failures. That goes double if the employee feels responsible. They can apply that experience either at Cloudflare or at their next employer. Why not take advantage of their knowledge? After all, you already paid for it.
The author is a software engineer and recovering lawyer. He can be found on Twitter.
Further reading on complexity
How Complex Systems Fail, by Richard I. Cook (5-page PDF)