Chaos Monkey for Business Agility

Brice Walsh

Head of Product @ Carbonmark

Published Apr 15, 2019

Introduce tensions to test team and organization resilience

DevOps and the Chaos Monkey

Over the past number of months I’ve been working on improving the DevOps culture and practices of the teams I work with. Our initial work was inspired by The DevOps Handbook and has lately been guided by a focus on the capabilities of high-performing teams as presented in Accelerate.

In my research on reliability engineering, I came across a tool that Netflix introduced in 2011 to test the resilience of its IT infrastructure. Named Chaos Monkey, it was used to unpredictably fail parts of the Netflix production environment to test how the rest of the system would adapt and keep the streaming service running for subscribers. The suite of tools grew over time and became known as the Simian Army, although that project is no longer actively maintained.

Chaos Engineering

Let’s consider a definition of chaos engineering:

Chaos engineering is the discipline of experimenting on a software system in production in order to build confidence in the system’s capability to withstand turbulent and unexpected conditions [1].

Chaos engineering as a practice in software systems is anything but chaotic. The objective is to break things on purpose (to identify potential weaknesses before they cause real failures), so a disciplined approach is applied to the process. This means the experiments are well thought out, planned, and controlled.

Generally an experiment will follow this pattern:

Form a hypothesis related to a steady state in the system (aim for “real-world” scenarios of what could go wrong)
Define the scope to minimize the “blast radius”; choose metrics to observe (aim for the smallest test and related outputs that will teach something)
Introduce variables, run the experiment, and analyze (apply the learning and/or increase scope)

At least in software systems, another common step is to look for opportunities to automate the experiment.
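The pattern above can be sketched in a few lines of Python. This is only an illustrative simulation, not how Chaos Monkey itself is implemented: the Experiment class, the run_experiment function, the instance names, and the 75% threshold are all hypothetical and chosen for the example. It shows a hypothesis tied to a steady-state metric, a bounded blast radius, and a check of the outcome.

```python
import random
from dataclasses import dataclass


@dataclass
class Experiment:
    """Hypothetical record of one chaos experiment (names are illustrative)."""
    hypothesis: str           # expected steady-state behaviour
    blast_radius: int         # maximum number of instances we allow to fail
    min_healthy_ratio: float  # metric threshold that defines "steady state"


def run_experiment(exp, instances, rng):
    """Fail up to blast_radius random instances and check the hypothesis.

    Returns the observed healthy ratio and whether the hypothesis held.
    """
    if not instances:
        raise ValueError("need at least one instance to experiment on")
    # Introduce the variable: pick victims at random, bounded by the blast radius.
    victims = set(rng.sample(instances, k=min(exp.blast_radius, len(instances))))
    healthy = [name for name in instances if name not in victims]
    # Observe the metric and compare it with the steady-state threshold.
    ratio = len(healthy) / len(instances)
    return ratio, ratio >= exp.min_healthy_ratio


exp = Experiment(
    hypothesis="The service stays available if one instance fails",
    blast_radius=1,
    min_healthy_ratio=0.75,
)
instances = ["web-1", "web-2", "web-3", "web-4"]  # made-up instance names
ratio, held = run_experiment(exp, instances, random.Random(42))
print(f"healthy ratio: {ratio:.2f}, hypothesis held: {held}")
```

Keeping the blast radius as an explicit field makes it easy to start with the smallest possible test and widen the scope only after the first run has taught something.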

Applying Chaos Monkey to Business Agility

Let’s go back to our definition of chaos engineering and focus on the part “build confidence in the system’s capability to withstand turbulent and unexpected conditions”. This is aligned with the goals of business agility and organizational resilience: the ability to respond and adapt to disruptions or change.

How might we take the practices of chaos engineering in software systems and apply them to experiment on team and organization resilience?

Flight Levels

We could think of business agility and organizational improvement through the lens of flight levels as described by Klaus Leopold.

“The Flight Level model is an instrument of communication that reveals the effect of specific improvement steps at different levels, and for finding the most useful starting point within the organization to begin with improvements.” [2]

Flight levels help make sure we run our business agility experiments at different touch points within the organization, whether at the individual team level, across the overall customer value stream, or from an organizational strategic perspective.

Safe to Try Experiments

Just like in software systems, we want our approach to testing business agility to be planned and safe to try. Remember we are attempting to manage complexity and the resilience of our system by making incremental improvements. The key is that the experiments be proactive.

Below are just a few examples of the types of “chaos monkey” tension experiments that could be tried at different flight levels.

Operational Level

Solution architect on the team suddenly has a health issue and will be absent for two weeks
The WIP limit on one of the workflow steps is doubled
As part of a retrospective someone suggests a new estimating technique
How would lead time metrics change if our dev team tried pair programming?

Coordination Level

The upstream UX/design team that the dev team is dependent on has their priorities changed mid-sprint and their deliverable will be delayed
A project manager proposes a visualization approach to better understand the overall value stream and dependencies
A feature was changed and released in the product but marketing was not informed

Strategic Management Level

A competitor product releases a new feature that is still in our backlog
There is a new market possibility with significant growth potential but the organization is at capacity servicing existing customers

Note that in some cases these could be just thought experiments that happen as part of retrospectives. If running an actual live experiment, remember to follow the pattern and do the upfront work to define the hypothesis, determine the scope and metrics, and be clear on how the experiment will be managed.
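As a rough illustration of that upfront work, here is a minimal Python sketch of how a team might record one tension experiment, using the pair programming example from the operational level, and compare lead times before and during the trial. Everything in it is an assumption made for the example: the TensionExperiment fields, the evaluate function, the 20% tolerance, and the lead time figures are invented, not real data.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class TensionExperiment:
    """Hypothetical template for a business agility experiment."""
    flight_level: str           # operational, coordination, or strategic
    hypothesis: str             # what we expect to stay true
    metric: str                 # what we will observe
    max_relative_change: float  # tolerated increase in the metric (0.20 = 20%)


def evaluate(exp, baseline, trial):
    """Compare median metric values and report whether the hypothesis held."""
    base = median(baseline)
    observed = median(trial)
    change = (observed - base) / base
    return change, change <= exp.max_relative_change


exp = TensionExperiment(
    flight_level="operational",
    hypothesis="Pair programming raises median lead time by at most 20%",
    metric="lead time in days",
    max_relative_change=0.20,
)

# Invented lead times (in days) for work items before and during the trial.
baseline_lead_times = [5, 6, 7, 8, 6]
trial_lead_times = [6, 7, 6, 8, 7]

change, held = evaluate(exp, baseline_lead_times, trial_lead_times)
print(f"median lead time changed by {change:.0%}, hypothesis held: {held}")
```

Writing the tolerated change down before the experiment starts keeps the retrospective honest, because the team agrees on what counts as a surprise before seeing the numbers.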

What other experiments can you think of to proactively test business agility?

Brice Walsh is a change agent interested in open leadership, agility, organization design, and capabilities for doing our best work. Follow Brice on LinkedIn or Medium.
