Failure. Now that’s a word with a negative vibe. In engineering and construction, it conjures up the Titanic sinking, the Tacoma Narrows bridge twisting in the wind, or the space shuttle Challenger exploding. These were all failures of engineering design or management.
Most failures in the pure software realm don’t produce such visceral imagery, but they can have widespread financial and human costs all the same. Think of the failed Healthcare.gov launch, the Target data breach, or any number of multimillion-dollar projects that never delivered a working system. In 2012, the US Air Force scrapped an ERP project after racking up $1 billion in costs.
In cases like these, playing the blame game is customary. Even when most of those involved don’t literally go down with the ship—as in the case of the Titanic—people get fired, careers get curtailed, and the Internet has a field day with both the individuals and the organizations.
But how do we square that with the frequent admonition to embrace failure in your DevOps culture? If we should embrace failure, how can we punish it?
Not all failure is created equal. Success depends on understanding the different types of failure and structuring the environment and processes to minimize the bad kinds. The key is to “fail well,” as Megan McArdle writes in The Up Side of Down.
In that book, McArdle describes the Marshmallow Challenge, an experiment originally concocted by Peter Skillman, the former VP of design at Palm. In this challenge, groups receive 20 sticks of spaghetti, one yard of tape, one yard of string, and one marshmallow. Their objective is to build a structure that gets the marshmallow off the ground, as high as possible.
Skillman conducted his experiment with all sorts of participants, from business school students to engineers to kindergartners. The business school students did the worst. I’m a former business school student, and this does not surprise me. According to Skillman, they spent too much time arguing about who was going to be the CEO of Spaghetti, Inc. The engineers did well, but did not come out on top either. As someone who also has an engineering degree and has participated in similar exercises, I suspect that they spent too much time arguing over the optimal structural design.
By contrast, the kindergartners didn’t sit around talking about the problem. They just started building to determine what works and what doesn’t. And they did the best.
Setting up a system and environment that allows and encourages such experiments enables successful failure in agile software development. It doesn’t mean that no one is accountable for failures. In fact, it makes accountability easier because “being accountable” needn’t equate to “having caused some disaster.” In this respect, it changes the nature of accountability.
Designing for accountability
We should consider five principles when we think about such a system: scope, approach, workflow, incentives, and culture.
The right scope is about constraining the impact of a failure and keeping it from cascading into additional failures. This is central to encouraging experimentation because it limits what any single failure can cost. (And, if you don’t have failures, you’re not experimenting.) In general, you want to decouple activities and decisions from each other. From a DevOps perspective, this means making deployments incremental, frequent, and routine events, in part by deploying small, autonomous services with bounded contexts (that is, microservices or similar patterns).
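To make the idea of constrained scope concrete, here is a minimal sketch of an incremental rollout that stops widening as soon as a stage looks unhealthy. It is an illustration under stated assumptions, not a description of any particular deployment tool: the names staged_rollout, RolloutResult, and error_rate_at are hypothetical, as are the stage percentages and the 1% error threshold.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class RolloutResult:
    succeeded: bool
    max_exposure: int  # highest percentage of traffic that saw the new version


def staged_rollout(
    stages: Sequence[int],
    error_rate_at: Callable[[int], float],
    max_error_rate: float = 0.01,
) -> RolloutResult:
    """Widen a release one stage at a time and halt at the first bad reading.

    `error_rate_at` is a hypothetical probe: given a traffic percentage, it
    returns the error rate observed while the new version serves that share.
    """
    exposure = 0
    for percent in stages:
        exposure = percent
        if error_rate_at(percent) > max_error_rate:
            # Stop here so the failure never reaches the later, larger stages.
            return RolloutResult(succeeded=False, max_exposure=exposure)
    return RolloutResult(succeeded=True, max_exposure=exposure)


# A release that only misbehaves under heavier load is caught at 50%, not 100%.
result = staged_rollout((1, 10, 50, 100), lambda p: 0.05 if p >= 50 else 0.0)
print(result)
```

The point of the sketch is the shape rather than the numbers: each stage is a small experiment, and a failed one costs only the exposure reached so far.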
The right approach is about continuously experimenting, iterating, and improving. This is the philosophy that DevOps and agile development bring from the Toyota Production System’s kaizen (continuous improvement), and other manufacturing antecedents. The most effective processes have continuous communication—think scrums and kanban—and allow for collaboration that can identify failures before they happen. At the same time, when failures do occur, the process allows for feedback to continuously improve and cultivate ongoing learning.
The right workflow automates repeatable tasks for consistency and thereby reduces the number of failures attributable to inevitable casual mistakes like a mistyped command. This allows for a greater focus on design errors and other systematic causes of failure. In DevOps, much of this takes the form of a Continuous Integration/Continuous Delivery (CI/CD) workflow that uses monitoring, feedback loops, and automated test suites to catch failures as early in the process as possible.
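As a rough sketch of what catching failures “as early in the process as possible” can look like, the snippet below runs a series of checks in order and stops at the first one that fails, so nothing downstream executes. Real CI/CD systems do far more than this; run_pipeline, stage, and the stage names are hypothetical stand-ins chosen only for illustration.

```python
from typing import Callable, List, Optional, Tuple

# A check is a name plus a function that returns True when the check passes.
Check = Tuple[str, Callable[[], bool]]


def run_pipeline(checks: List[Check]) -> Optional[str]:
    """Run checks in order; return the first failing check's name, or None."""
    for name, check in checks:
        if not check():
            return name  # later stages, such as a deploy, never run
    return None


executed: List[str] = []


def stage(name: str, passes: bool) -> Check:
    """Build a stand-in check that records that it ran."""
    def check() -> bool:
        executed.append(name)
        return passes
    return (name, check)


failed = run_pipeline(
    [stage("lint", True), stage("unit tests", False), stage("deploy", True)]
)
print(failed, executed)
```

Because the same checks run the same way every time, the failure that surfaces here is more likely to be a real design problem than a mistyped command.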
The right incentives align rewards and behavior with desirable outcomes. Incentives (such as advancement, money, and recognition) need to reward trust, cooperation, and innovation. The key is that individuals have control over their own success. Failure is not always a positive outcome, however. When it results from repeatedly ignoring established processes and design rules, actions still have consequences.
The right culture is, at least in part, about building organizations and systems that allow for failing well—and thereby make accountability within that framework a positive attribute rather than part of a blame game. This requires transparency. It also requires an understanding that even good decisions can have bad outcomes. A technology doesn’t develop as expected. The market shifts. An architectural approach turns out not to scale. Stuff happens. Innovation is inherently risky. Cut your losses and move on, avoiding the sunk cost fallacy.
Properly dealing with accountability and failure in agile IT does require appropriate architectures, tools, and processes to be in place. Low-impact experimentation on a fragile monolithic application will be difficult, and it will be hard to avoid costly failures and subsequent blame. However, the culture of an organization still plays an outsized role. Legendary management consultant Peter Drucker is widely credited with saying that “Culture eats strategy for breakfast.” Culture has a similar appetite for many aspects of the software development process.
This article is part of Opensource.com’s forthcoming guide to open organizations and IT culture.