Building Resilient Systems with Chaos Engineering

In this blog post, we'll explore the concept of chaos engineering, why we need it, and how to get started with it.

What?

Predictable systems are a myth. Failures are inevitable, but we can be prepared for them by building resilient systems. That's where chaos engineering comes in. But what is chaos engineering, you may ask? It's a Site Reliability Engineering technique that simulates unexpected system failures to test a system's behavior and recovery plan. By learning from these tests, organizations can design interventions and upgrades to strengthen their technology.

Why?

But why do we need chaos engineering? Consider an example: one of our e-commerce customers saw their applications crash during a Black Friday sale, even though there was no CPU or memory spike. The root cause turned out to be a container that ran out of disk space because logs were being written to a file inside it.

In the world of microservices, it's not uncommon for one slow service to drag up the latency of the whole chain of systems. And with today's microservice architectures and ecosystems, we've moved from a single point of failure in monolithic systems to multiple points of failure in distributed systems. To create scalable, highly available, and reliable systems, we need newer methods of testing.

How?

But how does chaos engineering work? Think of it like a vaccine: a mild form of the "disease" is injected into the system to prepare it for better availability, stability, and resilience. The most common problems that applications suffer from are CPU or memory spikes, network latency, clock changes such as daylight saving time transitions, reduced disk space, and application crashes. So the first step is to make the infrastructure resilient enough to overcome these disasters at the application level.
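
To make the vaccine analogy concrete, here is a minimal Python sketch of injecting a mild fault at the application level. The names `chaos`, `InjectedFault`, `get_price`, and `get_price_with_fallback` are hypothetical and not taken from any tool. The decorator adds latency and randomly fails calls, so we can check that the caller degrades gracefully instead of crashing.

```python
import random
import time
from functools import wraps


class InjectedFault(Exception):
    """Raised when the chaos wrapper deliberately fails a call."""


def chaos(latency_s=0.0, failure_rate=0.0, rng=None, sleep=time.sleep):
    """Wrap a function so each call is delayed and may fail on purpose.

    Hypothetical helper for illustration only, not a real library API.
    """
    if rng is None:
        rng = random.Random()

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if latency_s > 0:
                sleep(latency_s)  # simulate network latency
            if rng.random() < failure_rate:
                raise InjectedFault(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)

        return wrapper

    return decorator


@chaos(latency_s=0.05, failure_rate=0.2, rng=random.Random(42))
def get_price(item_id):
    # Stand-in for a real downstream call.
    return {"item_id": item_id, "price": 9.99}


def get_price_with_fallback(item_id):
    """Resilient caller: return a degraded answer instead of crashing."""
    try:
        return get_price(item_id)
    except InjectedFault:
        return {"item_id": item_id, "price": None}
```

Passing `rng` and `sleep` as parameters is a design choice that keeps the fault reproducible and lets a test replace the real delay.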

There are four major steps when running a chaos test:

  • define a steady state
  • introduce chaos
  • verify the steady state
  • roll back the chaos

If the application passes the test, it's evidence that the system is resilient. But if it fails, we recommend following the red-green testing cycle: identify the weakness, fix it, and rerun the test.
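
The four steps above can be sketched as a small Python harness. This is an illustration under stated assumptions, not the API of any chaos tool: `run_experiment`, `ExperimentResult`, and `ToyService` are hypothetical names, and the "system" is an in-memory stand-in for a service with replicas.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    steady_before: bool
    steady_during: bool
    steady_after: bool

    @property
    def passed(self) -> bool:
        return self.steady_before and self.steady_during and self.steady_after


def run_experiment(
    steady_state: Callable[[], bool],
    inject: Callable[[], None],
    rollback: Callable[[], None],
) -> ExperimentResult:
    # Step 1: confirm the steady state before touching anything.
    if not steady_state():
        # Do not inject chaos into a system that is already unhealthy.
        return ExperimentResult(False, False, False)

    try:
        # Step 2: introduce chaos.
        inject()
        # Step 3: verify the steady state while the fault is active.
        during = steady_state()
    finally:
        # Step 4: roll back the chaos, even if a step above raised.
        rollback()

    # Confirm the system returned to its steady state after the rollback.
    return ExperimentResult(True, during, steady_state())


class ToyService:
    """In-memory stand-in for a service with replicas (illustrative only)."""

    def __init__(self, replicas: int):
        self.healthy = replicas

    def kill_one(self) -> None:
        self.healthy -= 1

    def restore_one(self) -> None:
        self.healthy += 1

    def serves_traffic(self) -> bool:
        return self.healthy >= 1


service = ToyService(replicas=2)
result = run_experiment(service.serves_traffic, service.kill_one, service.restore_one)
print("resilient" if result.passed else "weakness found")
```

With two replicas the toy service keeps serving traffic when one is killed, so the experiment passes. With a single replica the check during the fault fails, which is the "red" result that starts the fix-and-rerun cycle.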

So, how do we start chaos testing? If teams are just starting to adopt chaos engineering, we suggest using a simple shell script. But as the practice matures, we recommend using one of the many open-source or commercial tools like Gremlin, Litmus Chaos toolkit, Istio service mesh, or the AWS Fault Injection Simulator. Chaos testing is ideally run in production, but we recommend starting in a lower environment and conducting controlled experiments in production later. It took one of our clients months to learn and implement chaos engineering. But trust me, the payoff was worth it. So let's embrace the chaos and build resilient systems together.
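
As a starting point in the spirit of the simple script mentioned above, here is a minimal sketch of a disk-space experiment. It uses Python rather than shell, and everything in it is an assumption made for illustration: the function names, the filler file name, and the thresholds. It writes a filler file into a scratch directory, checks the remaining free space, and always cleans up afterwards.

```python
import os
import shutil
import tempfile


def free_bytes(path):
    """Return the free space, in bytes, on the filesystem holding path."""
    return shutil.disk_usage(path).free


def fill_disk(directory, size_bytes, chunk_bytes=1024 * 1024):
    """Write size_bytes of random data into directory; return the file path."""
    path = os.path.join(directory, "chaos_filler.bin")  # hypothetical name
    written = 0
    with open(path, "wb") as f:
        while written < size_bytes:
            n = min(chunk_bytes, size_bytes - written)
            f.write(os.urandom(n))
            written += n
    return path


def run_disk_experiment(size_bytes, min_free_bytes):
    """Fill a scratch directory, check free space, and always clean up."""
    workdir = tempfile.mkdtemp(prefix="chaos-")
    try:
        filler = fill_disk(workdir, size_bytes)
        return {
            "workdir": workdir,
            "filled_bytes": os.path.getsize(filler),
            # Steady-state check: is there still enough headroom on the disk?
            "enough_free_space": free_bytes(workdir) >= min_free_bytes,
        }
    finally:
        # Roll back the chaos: remove the filler file and scratch directory.
        shutil.rmtree(workdir, ignore_errors=True)


if __name__ == "__main__":
    report = run_disk_experiment(
        size_bytes=5 * 1024 * 1024,
        min_free_bytes=100 * 1024 * 1024,
    )
    print(report)
```

The sizes here are deliberately small. In a real lower environment you would point the filler at the volume your application logs to and watch how the application behaves as free space shrinks.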

Conclusion

We've just scratched the surface of chaos engineering and how it can help us build resilient systems. Remember, predictable systems are a myth and failures are inevitable, but with chaos engineering, we can prepare for them and ensure that our systems can withstand the unexpected. I encourage you to start experimenting with chaos engineering in a lower environment and then gradually move to production.

This is the third blog post in the Chaos Engineering mini-series. If you want to explore chaos engineering further, check out the previous posts: