Bang!! All of a sudden, one of your production servers goes down. What happens next? How does your system respond? Fortunately, the outage was only in your test environment, at least this time. It was introduced randomly, but deliberately, by chaos engineering.
To understand the need for chaos testing, it’s worth thinking about a few questions:
- Do you know what would happen if your production system suffered an outage?
- Are you vulnerable to failures or attacks in your live infrastructure?
- How do you respond to CPU failures, network failures, software service failures, etc.?
How Chaos Testing Helps
Incidents like those mentioned above have often been seen as too difficult, too expensive, or too low a priority to test. If you did test attacks and failures, you were in the minority, and you typically had one shot to get it right.
Chaos engineering allows you to throw a spanner in the works. Easily, repeatedly and affordably.
Chaos engineering proactively tests how a system responds under stress and infrastructure failure. It lets you identify and fix weaknesses before they become outages and end up in the news.
What is Chaos Engineering?
Chaos engineering introduces failures into a distributed computing system to test its resilience.
Also known as chaos testing, chaos engineering was popularised by Netflix when they introduced ‘Chaos Monkey’ to test their redundant architecture. If you are running customer-facing systems, chaos testing is a useful and straightforward process. Of course, you need to have the right tools and testing processes in place.
It is worth noting that chaos engineering is essentially a modern take on ‘resilience testing’, which has been around the tech industry since the year dot.
Usually, with software testing you are assessing your solution’s ability to carry out your business processes. However, you assume that the underlying architecture is fully operational.
With chaos engineering, you still assess your business processes, but failures are deliberately introduced into the underlying architecture.
Incidents tested by chaos engineering include:
• Severed network connections
• Server outages
• Software component crashes
• Component degradation
• And many more
Why is Chaos Engineering Important?
We told you earlier about Netflix’s involvement. Since then, chaos testing has been adopted by Google, Microsoft, Amazon, LinkedIn, Facebook, and many others. As well as tech companies, finance and retail are also seeing significant benefits.
There is often an underlying fear with software systems, an elephant in the room. What happens if part of your system goes down? How resilient are your systems?
Most systems nowadays are heavily interconnected, with multiple customer access points. If even a small component were to fall over, it could lead to much larger issues.
A lot of companies choose to bury their heads in the sand, ignoring these potential issues in the hope that they won’t happen. Unfortunately, though, they do happen.
Chaos engineering directly addresses these issues.
When you randomly introduce failures into your systems, you can:
• Understand how failures affect your solution
• Test your redundant systems (if you have any)
• Identify components that need additional resilience
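To make the idea of random failure injection concrete, here is a minimal toy sketch in Python. It does not use any real chaos engineering tool: the replica names, the failover rule and the function names are all assumptions made up for illustration. Each round, a random set of replicas is taken down and the sketch checks whether a request could still be served by a surviving replica.

```python
import random


def call_with_failover(replicas, down):
    """Return the first healthy replica name, or None if every replica is down."""
    for name in replicas:
        if name not in down:
            return name
    return None


def run_experiment(replicas, failures_per_round, rounds, seed=0):
    """Take random replicas down each round; return the fraction of rounds still served."""
    rng = random.Random(seed)  # fixed seed so the experiment is repeatable
    served = 0
    for _ in range(rounds):
        down = set(rng.sample(replicas, failures_per_round))
        if call_with_failover(replicas, down) is not None:
            served += 1
    return served / rounds


# Hypothetical replica names, for illustration only.
replicas = ["app-1", "app-2", "app-3"]

# Losing one of three replicas per round: redundancy absorbs the failure.
print(run_experiment(replicas, failures_per_round=1, rounds=100))

# Losing all three replicas per round: nothing is left to serve requests.
print(run_experiment(replicas, failures_per_round=3, rounds=100))
```

A real experiment would inject failures into actual infrastructure rather than a Python set, but the shape is the same: introduce a failure, observe whether the business process still completes, and note which components need more resilience.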
How Can I Get Started with Chaos Testing?
As part of its ongoing development, Micro Focus LoadRunner Professional (LRP) now integrates with Gremlin, one of the leading chaos engineering tools, so that users can introduce chaos testing during performance testing.
LoadRunner Professional uses integrated Gremlin APIs to orchestrate chaos testing. Once you have added Gremlin scenarios to a test, LoadRunner sends a request to Gremlin to execute each predefined scenario.
You can then compare how your solution responds during an attack against how it performs normally.
Visit the Micro Focus LoadRunner Help Center to learn how to incorporate Gremlin attacks into your LoadRunner Professional scenarios.
How Does Gremlin Chaos Engineering Work?
Gremlin is one of the foremost chaos engineering tools and helps you test how your system responds under stress. By incorporating Gremlin attacks into your LRP load tests, you can understand how unexpected failures will impact your infrastructure and applications.
Once a Gremlin disruption event has been added to your LRP scenario, it will affect your chosen component for a specified duration.
For example, you could choose to run a CPU attack event for 5 minutes, starting after half an hour. When you run this scenario script, at the 30-minute mark, Gremlin will attack the chosen CPU.
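The timing in that example can be sketched in a few lines of Python. This is only a model of the schedule, not LoadRunner or Gremlin configuration: the `AttackWindow` class and its field names are hypothetical, chosen to mirror the "5 minutes, starting after half an hour" example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttackWindow:
    """Hypothetical model of one disruption event within a load test."""
    kind: str          # e.g. "cpu"
    start_min: int     # minutes after the scenario starts
    duration_min: int  # how long the attack lasts

    @property
    def end_min(self):
        return self.start_min + self.duration_min

    def is_active(self, elapsed_min):
        """True while the attack is running (start inclusive, end exclusive)."""
        return self.start_min <= elapsed_min < self.end_min


# A CPU attack lasting 5 minutes, starting at the 30-minute mark.
cpu_attack = AttackWindow(kind="cpu", start_min=30, duration_min=5)

for minute in (29, 30, 34, 35):
    print(minute, cpu_attack.is_active(minute))
```

Knowing exactly when the window opens and closes matters later, because it tells you which slice of the load test results to compare against normal running.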
Gremlin allows you to define virtual attacks on different elements of your solution. You can perform high-level attacks on systems such as databases or web servers. You can also perform more specific attacks on components such as the CPU, disk, memory, etc.
Attacks will generally impact the regular workflow, limiting response or reducing performance, so that the web server works more slowly than usual and there are fewer successful transactions.
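One way to quantify that impact is to summarise transactions recorded during normal running and during the attack, then compare the two. The Python sketch below assumes a very simple data shape, a list of (response time in seconds, succeeded) pairs, and the sample values are invented purely for illustration. They are not measurements from LoadRunner or Gremlin, and the function names are hypothetical.

```python
from statistics import mean


def summarise(samples):
    """Summarise (response_time_s, succeeded) pairs for one test phase."""
    ok_times = [t for t, ok in samples if ok]
    return {
        "mean_response_s": mean(ok_times) if ok_times else None,
        "success_rate": len(ok_times) / len(samples),
    }


def compare(baseline, attack):
    """Return how much slower and less successful the attack phase was."""
    b, a = summarise(baseline), summarise(attack)
    slowdown = None
    if b["mean_response_s"] and a["mean_response_s"]:
        slowdown = a["mean_response_s"] / b["mean_response_s"]
    return {
        "slowdown_factor": slowdown,
        "success_rate_drop": b["success_rate"] - a["success_rate"],
    }


# Invented sample data, for illustration only.
baseline = [(0.125, True), (0.25, True), (0.375, True), (0.25, True)]
attack = [(0.5, True), (1.0, True), (2.0, False), (3.0, False)]

print(summarise(baseline))
print(summarise(attack))
print(compare(baseline, attack))
```

In this made-up data the successful transactions are three times slower during the attack and half of them fail, which is the kind of difference you would look for in real test results.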
By using chaos engineering during performance testing, you add value to the process and learn about how your systems will cope with failures or attacks. Chaos engineering allows you to establish how resilient you are when something goes… bang!