Chaos Monkey for Business Agility
Introduce tensions to test team and organization resilience
DevOps and the Chaos Monkey
For the past several months I’ve been working on improving the DevOps culture and practices of the teams I work with. Our initial work was inspired by The DevOps Handbook and has lately been guided by the capabilities of high-performing teams presented in Accelerate.
In the research I did around reliability engineering, I came across a tool that Netflix introduced in 2011 to test the resilience of its IT infrastructure. They named it Chaos Monkey, and it was used to unpredictably fail parts of the Netflix production environment to test how the rest of the system would adapt and keep the streaming service running for subscribers. The suite of tools grew over time and became known as the Simian Army, although the project is no longer actively maintained.
Chaos Engineering
Let’s consider a definition of chaos engineering:
Chaos engineering is the discipline of experimenting on a software system in production in order to build confidence in the system’s capability to withstand turbulent and unexpected conditions [1].
Chaos engineering as a practice in software systems is anything but chaotic. The objective is to break things on purpose in order to identify potential weaknesses before they cause real failures, so a disciplined approach is applied to the process. This means the experiments are well thought out, planned, and controlled.
Generally an experiment will follow this pattern:
- Form a hypothesis related to a steady state in the system (aim for “real-world” scenarios of what could go wrong)
- Define the scope to minimize the “blast radius”; choose metrics to observe (aim for the smallest test and related outputs that will teach something)
- Introduce variables, run the experiment, and analyze (apply the learning and/or increase scope)
At least in software systems, another common step is to look for opportunities to automate the experiment.
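The pattern above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how Chaos Monkey itself is implemented: the `Experiment` class, the `run_experiment` function, the instance names, and the "at least two healthy instances" steady state are all hypothetical, and the failure is simulated on a copy of a dictionary rather than on real infrastructure.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Experiment:
    hypothesis: str                                  # what we expect to stay true
    blast_radius: int                                # max number of instances to disrupt
    steady_state: Callable[[Dict[str, bool]], bool]  # metric check on the system


def run_experiment(exp: Experiment, instances: Dict[str, bool], rng: random.Random) -> dict:
    # Do not start if the system is already outside its steady state.
    if not exp.steady_state(instances):
        return {"ran": False, "victims": [], "hypothesis_held": None}

    # Introduce the variable: fail a random subset, capped by the blast radius.
    count = min(exp.blast_radius, len(instances))
    victims = rng.sample(sorted(instances), k=count)

    # Work on a copy so the original (simulated) fleet is left untouched.
    trial = {name: healthy and name not in victims
             for name, healthy in instances.items()}

    # Analyze: does the steady state still hold after the failure?
    return {"ran": True, "victims": victims,
            "hypothesis_held": exp.steady_state(trial)}


def at_least_two_healthy(instances: Dict[str, bool]) -> bool:
    # Hypothetical steady state: the service needs two healthy instances.
    return sum(instances.values()) >= 2


fleet = {"web-1": True, "web-2": True, "web-3": True}
small = Experiment("Service survives the loss of one instance", 1, at_least_two_healthy)
result = run_experiment(small, fleet, random.Random(0))
print(result)
```

Starting with a blast radius of one keeps the test small. Raising it to two in this toy setup would break the steady state, which is the kind of learning that informs whether to increase scope.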
Applying Chaos Monkey to Business Agility
Let’s go back to our definition of chaos engineering and focus on the part “build confidence in the system’s capability to withstand turbulent and unexpected conditions”. This is aligned with the goals of business agility and organizational resilience: the ability to respond and adapt to disruptions or change.
How might we take the practices of chaos engineering in software systems and apply them to experiment on team and organization resilience?
Flight Levels
We could think of business agility and organizational improvement through the lens of flight levels as described by Klaus Leopold.
“The Flight Level model is an instrument of communication that reveals the effect of specific improvement steps at different levels, and for finding the most useful starting point within the organization to begin with improvements.” [2]
Flight levels help make sure we are running our business agility experiments at different touch points within the organization, whether at the individual team level, across the overall customer value stream, or from an organizational strategy perspective.
Safe to Try Experiments
Just like in software systems, we want our approach to testing business agility to be planned and safe to try. Remember we are attempting to manage complexity and the resilience of our system by making incremental improvements. The key is that the experiments be proactive.
Below are just a few examples of the types of “chaos monkey” tension experiments that could be tried at different flight levels.
Operational Level
- Solution architect on the team suddenly has a health issue and will be absent for two weeks
- The WIP limit on one of the workflow steps is doubled
- As part of a retrospective someone suggests a new estimating technique
- How would lead time metrics change if our dev team tried pair programming?
Coordination Level
- The upstream UX/design team that the dev team is dependent on has their priorities changed mid-sprint and their deliverable will be delayed
- A project manager proposes a visualization approach to better understand the overall value stream and dependencies
- A feature was changed and released in the product but marketing was not informed
Strategic Management Level
- A competitor product releases a new feature that is still in our backlog
- There is a new market possibility with significant growth potential but the organization is at capacity servicing existing customers
Note that in some cases these could be just thought experiments that happen as part of retrospectives. If running an actual live experiment remember to follow the pattern and do the upfront work to define the hypothesis, determine the scope/metrics, and be clear on how the experiment will be managed.
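As one way to make the hypothesis and metrics concrete for a team-level experiment, here is a minimal Python sketch that compares lead times before and during a trial such as pair programming. Everything in it is an assumption for illustration: the `evaluate_lead_time` function, the 10% threshold, and the lead time figures are made up and do not come from a real team.

```python
from statistics import median


def evaluate_lead_time(baseline_days, experiment_days, max_increase_pct):
    """Check the hypothesis that median lead time does not rise by more
    than max_increase_pct percent during the experiment."""
    base = median(baseline_days)
    trial = median(experiment_days)
    change_pct = (trial - base) / base * 100
    return {
        "baseline_median": base,
        "experiment_median": trial,
        "change_pct": round(change_pct, 1),
        "hypothesis_held": change_pct <= max_increase_pct,
    }


# Hypothetical lead times in days per work item (not real data).
baseline = [5, 7, 6, 9, 8]
with_pairing = [4, 6, 5, 7, 6]

report = evaluate_lead_time(baseline, with_pairing, max_increase_pct=10)
print(report)
```

The median is used here rather than the mean so that a single unusually slow item does not dominate a small sample. With only a handful of work items the result is a conversation starter for a retrospective, not statistical proof.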
What other experiments can you think of to proactively test business agility?
Brice Walsh is a change agent interested in open leadership, agility, organization design, and capabilities for doing our best work.