Challenge
At the end of 2021, Smiles approached us to discuss the possibility of using Chaos Engineering techniques to improve the resilience and availability of its systems. While this term is known in the market, these techniques are still relatively uncommon, especially in Brazil. Therefore, Smiles was seeking a partner to assist them on this journey.
Beyond questions such as tool selection and explaining the method and its techniques, the main challenge was to create a progression plan that delivered results without introducing risks to Smiles' operations.
Chaos Engineering
First and foremost, it's important to understand the concept. Chaos Engineering is the discipline of conducting experiments on a system to build confidence in its ability to withstand turbulent conditions in production.
In practice, it involves a set of techniques used to introduce controlled adverse conditions into a system to validate its behavior and identify opportunities for improvement. More information can be found in the Principles of Chaos.
To understand how your application will behave when [adverse] events happen, the best form of testing is chaos engineering.
Dr. Werner Vogels - Amazon CTO
Tool
The first step was to determine which tools would support the project. Because the field is relatively new, the available tools vary widely in functionality and maturity. It was therefore necessary to organize a Proof of Concept (POC) to test and compare the main available solutions. The solutions tested during this period were:
- LitmusChaos;
- Gremlin;
- AWS Fault Injection Simulator (FIS).
After this testing period, AWS FIS was selected. The deciding factors were its native integration with the AWS environment, the platform hosting Smiles' services, and the availability of the required fault-injection features.
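In practice, an FIS experiment is described by an experiment template that defines targets, actions, and stop conditions. The sketch below shows what a minimal template could look like; the account IDs, role and alarm ARNs, and tag values are placeholders, not Smiles' actual configuration.

```json
{
  "description": "Stop one tagged EC2 instance and observe recovery",
  "roleArn": "arn:aws:iam::123456789012:role/fis-experiment-role",
  "targets": {
    "pilotInstances": {
      "resourceType": "aws:ec2:instance",
      "resourceTags": { "Environment": "staging" },
      "selectionMode": "COUNT(1)"
    }
  },
  "actions": {
    "stopInstance": {
      "actionId": "aws:ec2:stop-instances",
      "targets": { "Instances": "pilotInstances" }
    }
  },
  "stopConditions": [
    {
      "source": "aws:cloudwatch:alarm",
      "value": "arn:aws:cloudwatch:us-east-1:123456789012:alarm:steady-state-breached"
    }
  ]
}
```

The stop condition is what keeps an experiment controlled: if the referenced CloudWatch alarm fires, FIS halts the experiment automatically.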
Project History
Since 2016, Smiles has been working on the evolution of DevOps, covering infrastructure, pipelines, application architecture, etc.
As part of this process, there had always been a desire to start adopting Chaos Engineering techniques. Finally, in 2021, a project was initiated to conduct some tests, still in non-production environments, to validate whether the results would be satisfactory.
The results were surprising, strengthening the initiative and leading to new project phases and a gradual handover to the ongoing operations team. The project was divided into the following phases:
- Application survey;
- Creation of the load test plan;
- Creation and measurement of the steady state;
- Experimentation in the non-production environment;
- Analysis of results;
- Definition of improvements per application;
- Presentation of the report.
Subsequently, the ongoing operations team was responsible for implementing the identified improvements, bringing real benefits to Smiles.
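A key phase above is the creation and measurement of the steady state: a baseline of normal behavior against which each experiment is judged. A minimal sketch of such a check is shown below; the metric names and thresholds are illustrative, not Smiles' actual values.

```python
# Minimal sketch of a steady-state check: compare metrics observed during a
# chaos experiment against the baseline tolerances defined beforehand.
from dataclasses import dataclass

@dataclass
class SteadyState:
    max_error_rate: float      # fraction of failed requests tolerated
    max_p95_latency_ms: float  # 95th-percentile latency budget

def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty sample list."""
    ordered = sorted(samples)
    rank = max(0, round(0.95 * len(ordered)) - 1)
    return ordered[rank]

def within_steady_state(state: SteadyState,
                        latencies_ms: list[float],
                        errors: int,
                        total: int) -> bool:
    """True if the observed metrics stay inside the steady-state bounds."""
    error_rate = errors / total if total else 1.0
    return (error_rate <= state.max_error_rate
            and p95(latencies_ms) <= state.max_p95_latency_ms)

# Example: a baseline tolerating 1% errors and a 300 ms p95 latency.
state = SteadyState(max_error_rate=0.01, max_p95_latency_ms=300.0)
print(within_steady_state(state, [120.0, 150.0, 180.0, 250.0],
                          errors=0, total=400))  # → True
```

In a real project these metrics would come from an observability stack rather than in-memory lists, but the decision logic is the same: an experiment only "passes" if the system stays within its steady state while the fault is injected.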
Tests and Discoveries
For the first round of tests, two applications were selected for the pilot: a CMS and a microservice, both running in containers, one on EKS and the other on ECS. The following experiments were conducted on these applications:
- 100% CPU injection;
- 80% memory injection;
- Increased disk usage;
- Killing ECS tasks;
- Shutting down EC2 instances.
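As an illustration, CPU and memory pressure on EC2-backed workloads can be injected through FIS's SSM integration, while task kills use a native ECS action. The fragment below sketches what such actions could look like inside an experiment template; the region, document ARN, and target names ("pilotInstances", "pilotTasks") are illustrative assumptions.

```json
"actions": {
  "cpuStress": {
    "actionId": "aws:ssm:send-command",
    "parameters": {
      "documentArn": "arn:aws:ssm:us-east-1::document/AWSFIS-Run-CPU-Stress",
      "documentParameters": "{\"DurationSeconds\": \"300\"}",
      "duration": "PT5M"
    },
    "targets": { "Instances": "pilotInstances" }
  },
  "killEcsTask": {
    "actionId": "aws:ecs:stop-task",
    "targets": { "Tasks": "pilotTasks" }
  }
}
```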
Running these experiments led to the following discoveries:
Application 1 - Microservice
- Slight increase in response time related to CPU consumption;
- Significant increase in response time related to memory consumption;
- No impact related to disk I/O consumption;
- Partial container failures: Low error percentage for less than a minute;
- Total container failures: 100% errors for at least four minutes;
- Single cluster node failure: Low error percentage for a few seconds.
Application 2 - CMS
- No impact related to CPU consumption;
- Slight increase in response time related to memory consumption;
- No impact related to disk I/O consumption;
- Partial container failures: Low error percentage for a few minutes;
- Total container failures: 100% errors for at least three minutes;
- Single cluster node failure: Total unavailability of the application, requiring technical team intervention.
The node-failure test thus uncovered a critical availability flaw in the CMS, caused by an architectural weakness. This discovery triggered an improvement plan that evolved the architecture and resolved the flaw.
Additionally, other less critical deficiencies were identified and addressed by the ongoing team, providing more resilience to the applications, ultimately leading to a better user experience.
Conclusion
Chaos Engineering proved to be an excellent technique for identifying flaws that would not otherwise be detected.
This brought more resilience to the applications and also instilled confidence in the teams that the applications would be able to withstand business demands, even in unfavorable situations.