Chaos Monkey for Business Agility
Photo by Brian Mann on Unsplash

Introduce tensions to test team and organization resilience

DevOps and the Chaos Monkey

For the past several months I’ve been working on improving the DevOps culture and practices of the teams I work with. Our initial work was inspired by The DevOps Handbook and has lately been guided by the capabilities of high-performing teams as presented in Accelerate.

In my research on reliability engineering, I came across a tool that Netflix introduced in 2011 to test the resilience of its IT infrastructure. They named it Chaos Monkey, and it was used to unpredictably fail parts of the Netflix production environment to test how the rest of the system would adapt and keep the streaming service running for subscribers. The suite of tools grew over time and became known as the Simian Army, although that project is no longer actively maintained.

Chaos Engineering

Let’s consider a definition of chaos engineering:

Chaos engineering is the discipline of experimenting on a software system in production in order to build confidence in the system’s capability to withstand turbulent and unexpected conditions [1].

Chaos engineering as a practice in software systems is anything but chaotic. The objective is to break things on purpose (to identify potential weaknesses before they cause real failures), so a disciplined approach is applied to the process. This means the experiments are well thought out, planned, and controlled.

Generally an experiment will follow this pattern:

  • Form a hypothesis related to a steady state in the system (aim for “real-world” scenarios of what could go wrong)
  • Define the scope to minimize the “blast radius”; choose metrics to observe (aim for the smallest test and related outputs that will teach something)
  • Introduce variables, run the experiment, and analyze (apply the learning and/or increase scope)

At least in software systems, another common step is to look for opportunities to automate the experiment.
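The pattern above can be sketched in a few lines of Python. This is only an illustrative toy, not Netflix's Chaos Monkey or any real tool: the `Experiment` class, the `run_experiment` function, and the in-memory `healthy` dictionary standing in for a system are all hypothetical names made up for this example.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Experiment:
    """Hypothetical container for one chaos experiment."""
    hypothesis: str                    # what we expect to stay true
    blast_radius: List[str]            # the only targets we allow ourselves to break
    steady_state: Callable[[], bool]   # True when the observed metrics look normal
    inject: Callable[[str], None]      # introduce the failure on one target
    rollback: Callable[[str], None]    # undo the failure


def run_experiment(exp: Experiment, rng: Optional[random.Random] = None) -> Dict:
    rng = rng or random.Random()
    # Never start from an unhealthy system: the result would teach nothing.
    if not exp.steady_state():
        return {"ran": False, "target": None, "hypothesis_held": None}
    target = rng.choice(exp.blast_radius)   # the "unpredictable" part
    exp.inject(target)
    try:
        held = exp.steady_state()           # did the system absorb the failure?
    finally:
        exp.rollback(target)                # always restore, even on errors
    return {"ran": True, "target": target, "hypothesis_held": held}


# Toy system: three replicas; "steady state" means at least two are healthy.
healthy = {"replica-a": True, "replica-b": True, "replica-c": True}

experiment = Experiment(
    hypothesis="The service stays available when any single replica fails",
    blast_radius=["replica-a", "replica-b"],   # replica-c is out of scope
    steady_state=lambda: sum(healthy.values()) >= 2,
    inject=lambda t: healthy.__setitem__(t, False),
    rollback=lambda t: healthy.__setitem__(t, True),
)

result = run_experiment(experiment, random.Random(0))
print(result)
```

Keeping the blast radius as an explicit list and wrapping the rollback in `finally` mirrors the idea that the experiment is planned and controlled: only named targets can be touched, and the system is restored regardless of the outcome.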

Applying Chaos Monkey to Business Agility

Back to our definition of chaos engineering, and in particular the part about building “confidence in the system’s capability to withstand turbulent and unexpected conditions”: this aligns with the goals of business agility and organizational resilience, namely the ability to respond and adapt to disruption or change.

How might we take the practices of chaos engineering in software systems and apply them to experiment on team and organization resilience?

Flight Levels

We could think of business agility and organizational improvement through the lens of Flight Levels, as described by Klaus Leopold.

Photo by James Youn on Unsplash

“The Flight Level model is an instrument of communication that reveals the effect of specific improvement steps at different levels, and for finding the most useful starting point within the organization to begin with improvements.” [2]


Flight levels help make sure we are running our business agility experiments at different touch points within the organization, whether at the individual team level, across the overall customer value stream, or from an organizational strategy perspective.

Safe to Try Experiments

Just like in software systems, we want our approach to testing business agility to be planned and safe to try. Remember we are attempting to manage complexity and the resilience of our system by making incremental improvements. The key is that the experiments be proactive.

Below are just a few examples of the types of “chaos monkey” tension experiments that could be tried at different flight levels.

Operational Level

  • Solution architect on the team suddenly has a health issue and will be absent for two weeks
  • The WIP limit on one of the workflow steps is doubled
  • As part of a retrospective someone suggests a new estimating technique
  • How would lead time metrics change if our dev team tried pair programming?

Coordination Level

  • The upstream UX/design team that the dev team is dependent on has their priorities changed mid-sprint and their deliverable will be delayed
  • A project manager proposes a visualization approach to better understand the overall value stream and dependencies
  • A feature was changed and released in the product but marketing was not informed

Strategic Management Level

  • A competitor product releases a new feature that is still in our backlog
  • There is a new market possibility with significant growth potential, but the organization is at capacity servicing existing customers

Note that in some cases these could be just thought experiments that happen as part of retrospectives. If you are running an actual live experiment, remember to follow the pattern and do the upfront work: define the hypothesis, determine the scope and metrics, and be clear on how the experiment will be managed.
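One lightweight way to hold a team to that upfront work is to write the experiment down in a fixed shape and check it for gaps before anything goes live. The sketch below is only an assumption about how that might look: the `TensionExperiment` class, its field names, and the example hypothesis and metrics are hypothetical, and only the WIP-limit tension is taken from the list above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TensionExperiment:
    """Hypothetical record for one 'chaos monkey' tension experiment."""
    flight_level: str                   # e.g. "Operational", "Coordination", "Strategic"
    tension: str                        # the disruption being introduced
    hypothesis: str = ""                # what we expect to happen
    scope: str = ""                     # who and what is affected (the blast radius)
    metrics: List[str] = field(default_factory=list)  # what we will observe
    owner: str = ""                     # who manages the experiment and can stop it

    def missing_upfront_work(self) -> List[str]:
        """Return the parts of the plan that are still undefined."""
        gaps = []
        if not self.hypothesis.strip():
            gaps.append("hypothesis")
        if not self.scope.strip():
            gaps.append("scope")
        if not self.metrics:
            gaps.append("metrics")
        if not self.owner.strip():
            gaps.append("owner")
        return gaps


# A draft that names the tension but nothing else is not ready to run.
draft = TensionExperiment(
    flight_level="Operational",
    tension="The WIP limit on one of the workflow steps is doubled",
)
print(draft.missing_upfront_work())

# The same experiment with illustrative (made-up) planning details filled in.
ready = TensionExperiment(
    flight_level="Operational",
    tension="The WIP limit on one of the workflow steps is doubled",
    hypothesis="Lead time gets longer because more items wait in that step",
    scope="One team, one workflow step, two weeks",
    metrics=["lead time", "items in progress"],
    owner="Team lead",
)
print(ready.missing_upfront_work())
```

An empty list from `missing_upfront_work()` does not make an experiment safe to try on its own; it only confirms that the hypothesis, scope, metrics, and ownership have been written down and can be discussed.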

What other experiments can you think of to proactively test business agility?

Brice Walsh is a change agent interested in open leadership, agility, organization design, and capabilities for doing our best work. Follow Brice on LinkedIn or Medium.
