Chaos engineering to improve application resiliency using Azure Chaos Studio

Published Wednesday, June 8, 2022

Digital transformation is reshaping the way we work, helping us create new, innovative, and sustainable ways of working and living. This transformation has led modern application development to become cloud native. Any application in the cloud must be designed to recover from failures and continue to function without customer downtime. This blog focuses on chaos engineering as a methodology that helps build resilient, well-architected applications. I discuss how Azure Chaos Studio, a recent Azure service, can be used as a tool to implement chaos engineering.

Why reliability?

An application designed for the cloud must meet certain architectural guidelines to operate effectively. Reliability is one of the well-architected principles of Azure; it aims to harden your application against failures. Reliable uptime is more critical than ever in this modern digital era. Building a reliable application in the cloud is different from traditional application development. In the cloud, we acknowledge that failures happen and aim to minimize the effects of a single failing component.

A well-architected application must pass a reliability assessment, which is an end-to-end review covering compute, network, storage, DevOps, and other areas through a reliability lens. The assessment focuses on Service Level Agreements (SLAs), Service Level Objectives (SLOs), and recovery targets such as Recovery Time Objectives (RTOs) and Recovery Point Objectives (RPOs). These metrics ensure that application reliability aligns with business requirements.
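To make these targets concrete, the arithmetic behind an availability objective can be sketched in a few lines of Python. This is only an illustration: the function names, the 99.9% objective, the 30-day window, and the 15-minute RTO are assumptions chosen for the example, not values taken from any Azure SLA.

```python
from datetime import timedelta


def allowed_downtime(slo_percent: float, period: timedelta) -> timedelta:
    """Return the downtime budget implied by an availability SLO over a period."""
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be greater than 0 and at most 100")
    return period * (1 - slo_percent / 100)


def meets_rto(observed_recovery: timedelta, rto: timedelta) -> bool:
    """Check whether a measured recovery time stays within the RTO."""
    return observed_recovery <= rto


# Hypothetical targets for illustration only.
budget = allowed_downtime(99.9, timedelta(days=30))
recovered_in_time = meets_rto(timedelta(minutes=12), rto=timedelta(minutes=15))
```

With these assumed numbers, a 99.9% objective over 30 days leaves a budget of about 43 minutes of downtime, which is the kind of figure a chaos experiment can be measured against.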

Why Chaos engineering?

Customers need a modern testing methodology for resilience. Chaos engineering is one such methodology, and it helps you attain consistent reliability. Embracing chaos through experimentation helps increase confidence in an application’s ability to handle it. By conducting experiments in a controlled manner, you can identify issues that are likely to arise during the application lifecycle. This helps you mitigate risk proactively and gracefully, and reduce the blast radius of failures.

Chaos Picture 2.png

Identifying the right applications for Chaos engineering

Chaos experiments should not be run on just any application, because they can have adverse effects if done without proper planning. Choosing the right application is therefore key to the success of chaos engineering. The following approaches can help you pick the right candidate:

Analyze past incidents and look for common patterns.
Review reliability assessment recommendations and invest in testing trouble areas.
Prioritize applications with greater business impact (such as mission-critical workloads).
Take a holistic approach that covers people, practices, and processes; the application; and the platform and infrastructure.
Analyze the impact versus the likelihood of a fault and prioritize based on what you care about most.
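The impact-versus-likelihood analysis in the last point can be sketched as a simple scoring exercise. The following is a minimal sketch, assuming a 1 to 5 scale for both dimensions and a score equal to their product; the scale, the scoring rule, and the scenario names are all hypothetical and should be replaced with whatever your team actually uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FaultScenario:
    name: str
    impact: int      # assumed scale: 1 (minor) to 5 (severe)
    likelihood: int  # assumed scale: 1 (rare) to 5 (frequent)

    @property
    def risk_score(self) -> int:
        # Assumed scoring rule: impact multiplied by likelihood.
        return self.impact * self.likelihood


def prioritize(scenarios):
    """Order scenarios by risk score, breaking ties in favor of higher impact."""
    return sorted(scenarios, key=lambda s: (s.risk_score, s.impact), reverse=True)


# Hypothetical backlog of candidate experiments.
backlog = [
    FaultScenario("VM CPU pressure", impact=2, likelihood=4),
    FaultScenario("Database failover", impact=5, likelihood=2),
    FaultScenario("Storage failover", impact=4, likelihood=1),
]
ranked = prioritize(backlog)
```

Sorting the backlog this way gives a starting order for experiments; the tie-break on impact reflects one possible reading of "what you care about most" and is a design choice, not a rule.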

Game Day in Chaos engineering

The goal of a Game Day is to practice how you, your team, and your supporting systems deal with real-world fault scenarios. A Game Day is not just a random event in which things break; it is a controlled, safe, and observed experiment that collects information about how a system responds to fault scenarios. Here are some of the key steps in a successful Game Day event:

To start with, pick a hypothesis to explore.
Decide who participates in your Game Day and who observes.
Decide where your Game Day is going to happen.
Decide when your Game Day is going to start, and how long it will last.
Get approval from key stakeholders!

Collaboration is key for chaos engineering. Cross-team sharing, and even reuse of chaos experiment backlogs and plans across teams and environments, is encouraged.

Chaos engineering with Azure Chaos Studio

Given that resilience is a shared responsibility, our job as the cloud provider is to improve the resilience of our own services and to help customers improve theirs by providing the right tooling and best practices. Hence, we introduced a service called Azure Chaos Studio, which has been in public preview since November 2, 2021. It is a fully managed service to measure, understand, and build services that are resilient to real-world incidents. With Chaos Studio, you can perform chaos engineering experiments that inject faults into your service, and then monitor how the service responds to the disruptions. The service is expected to be generally available in the latter half of 2022. Within months of its release, it has seen strong customer adoption, helping customers build resilient cloud services.

Azure Chaos Studio overview

Azure Chaos Studio.png

Chaos Studio is deeply integrated with other Azure services such as Azure Resource Manager (ARM), Azure Monitor, and Azure AD. It has an ARM-compliant REST API and an Azure portal-integrated user interface that enable you to design, execute, monitor, and view the results of chaos experiments. Chaos Studio experiments provide orchestrated parallel and sequential fault injection.

The fault library is an expandable list of faults across the entire Azure stack. For now, these faults include agent-based and service-direct faults. Agent-based faults affect the operating system of a VM, or of a similar type of compute where you have direct access to the OS. Examples include stress tests on the processor, memory, and so on. The agent comes in two flavors: Windows and Linux. Service-direct faults don’t require any installation or instrumentation in your application; they simply interact with Azure resources at the service level. They focus on common Azure service issues that you may want to build resilience against. Examples include Cosmos DB cluster failover and Azure Storage failover.

Chaos Studio Experiments

Chaos Studio experiments are orchestrated scenarios of faults applied to resource targets. Experiment metadata is a container for settings such as the Azure region where the experiment is deployed and the identity to be used. Steps run sequentially, whereas branches run in parallel within a step. Actions execute a fault or add a time delay. Selectors group a set of target resources for fault injection. Targets describe Azure resources onboarded to Chaos Studio.

Experiment Metadata.png
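The hierarchy described above can be modeled in a few lines of Python to show how the pieces fit together. This is a conceptual sketch, not the Chaos Studio API or its experiment schema: the class names, fault names, selector ids, and resource placeholders are hypothetical, and the model assumes that actions inside a single branch run one after another.

```python
from dataclasses import dataclass, field
from datetime import timedelta
from typing import Dict, List, Optional


@dataclass
class Action:
    name: str
    duration: timedelta
    selector_id: Optional[str] = None  # a time delay targets no resources


@dataclass
class Branch:
    actions: List[Action]


@dataclass
class Step:
    branches: List[Branch]


@dataclass
class Experiment:
    steps: List[Step]
    # Selector id mapped to the target resources it groups.
    selectors: Dict[str, List[str]] = field(default_factory=dict)


def branch_duration(branch: Branch) -> timedelta:
    # Assumption: actions in one branch run sequentially.
    return sum((a.duration for a in branch.actions), timedelta())


def experiment_duration(experiment: Experiment) -> timedelta:
    # Steps run sequentially; branches in a step run in parallel,
    # so each step lasts as long as its slowest branch.
    return sum(
        (max((branch_duration(b) for b in s.branches), default=timedelta())
         for s in experiment.steps),
        timedelta(),
    )


def unknown_selectors(experiment: Experiment) -> set:
    """Return selector ids used by actions but not defined on the experiment."""
    used = {a.selector_id for s in experiment.steps for b in s.branches
            for a in b.actions if a.selector_id is not None}
    return used - set(experiment.selectors)


# Hypothetical experiment: two parallel branches, then a failover step.
experiment = Experiment(
    selectors={"web-vms": ["<onboarded VM>"], "database": ["<onboarded database>"]},
    steps=[
        Step([
            Branch([Action("cpu pressure", timedelta(minutes=10), "web-vms")]),
            Branch([Action("delay", timedelta(minutes=2)),
                    Action("vm shutdown", timedelta(minutes=5), "web-vms")]),
        ]),
        Step([Branch([Action("failover", timedelta(minutes=15), "database")])]),
    ],
)
```

Under these assumptions the first step lasts as long as its slower branch and the second step starts only after the first completes, which is what "sequential steps, parallel branches" means in practice.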

For more details, check out the sample tutorial for how to run an experiment.

You can also view the video Getting started with Chaos Studio.

Controlled chaos with Chaos Studio

Controlled chaos is key to running a successful chaos experiment that yields the desired results. To ensure that proper guardrails are in place before faults are injected, Chaos Studio provides a three-layer permission model:

The user must have RBAC permission to create and start an experiment.
The chaos experiment resource itself needs permission to execute a fault against a resource.
The resource into which the fault is injected must be explicitly onboarded as a target of a chaos experiment, and must have the capability for that particular fault enabled.
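The three layers above can be read as a sequence of checks that must all pass before a fault is injected. The sketch below models that reasoning in plain Python. It is a conceptual illustration only: it does not call any Azure API, and the names `Target`, `blocked_by`, the role string, and the capability string are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Target:
    resource_id: str
    onboarded: bool = False
    capabilities: Set[str] = field(default_factory=set)  # faults enabled on this target


def blocked_by(user_can_run_experiments: bool,
               experiment_roles: Set[str],
               required_role: str,
               target: Target,
               fault: str) -> List[str]:
    """Return the guardrails that would block the fault; empty means allowed."""
    reasons = []
    # Layer 1: the user needs permission to create and start the experiment.
    if not user_can_run_experiments:
        reasons.append("user lacks permission to create and start the experiment")
    # Layer 2: the experiment resource needs permission on the target resource.
    if required_role not in experiment_roles:
        reasons.append("experiment lacks permission on the target resource")
    # Layer 3: the target must be onboarded and have the fault capability enabled.
    if not target.onboarded:
        reasons.append("target resource is not onboarded")
    elif fault not in target.capabilities:
        reasons.append("fault capability is not enabled on the target")
    return reasons


# Hypothetical values for illustration.
vm = Target("<resource id of a VM>", onboarded=True, capabilities={"shutdown"})
allowed = blocked_by(True, {"vm-contributor"}, "vm-contributor", vm, "shutdown")
denied = blocked_by(True, {"vm-contributor"}, "vm-contributor", vm, "cpu-pressure")
```

Returning the list of failed checks, rather than a single yes or no, makes it easier to see which guardrail stopped an experiment when you plan a Game Day.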

In addition to the above guardrails, Chaos Studio also has a feature to stop and roll back experiments to prevent outages. Note, however, that although this feature can undo the faults injected into an application, it can’t guarantee that the application comes back fully functional; that depends on how the application is configured.

What’s next?

Get started by visiting our documentation or by trying Chaos Studio in the Azure portal. Let the chaos begin!

About the Author

Harshitha Putta is a Senior Cloud Solutions Architect in the Customer Success Unit at Microsoft. As the cloud business continues to experience hyper-growth, she helps customers build, grow, and enable their cloud.
