In the world of cloud infrastructure, hope is not a strategy. Systems fail, networks degrade, and resources become constrained, often at the most inconvenient times. At ExclCloud Solutions, we’ve long advocated for proactive resilience testing rather than reactive firefighting. That’s why, when AWS launched its Fault Injection Simulator (FIS), we were eager to put it to work for our clients. This post details how we used AWS FIS to transform the reliability posture of one of our financial services clients.
AWS Fault Injection Simulator (since renamed AWS Fault Injection Service) is a fully managed service for running controlled chaos engineering experiments on AWS workloads. If you’re unfamiliar with chaos engineering, it’s the practice of deliberately injecting failures into systems to test their resilience and recovery capabilities. Think of it as “vaccinating” your infrastructure against potential failures by exposing it to small, controlled doses of disruption.
AWS FIS allows you to:
All of this happens within a controlled environment with built-in safety mechanisms and rollback procedures.
Our client, a mid-sized financial services company (we’ll call them “FinSecure” to maintain confidentiality), approached us with a challenge familiar to many in regulated industries. They had migrated their core transaction processing platform to AWS over the previous year, but while their infrastructure was now cloud-native, their resilience testing still relied on legacy manual processes.
Their challenges were clear:
After assessing FinSecure’s environment, we proposed implementing AWS FIS in phases, starting with non-critical workloads and gradually moving toward their core financial processing applications. Here’s how we approached the implementation:
We began by mapping dependencies between services and creating a comprehensive resilience scorecard for each component. This helped identify the most critical areas where failures would have the biggest business impact.
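As a rough illustration of how a dependency map can feed a scorecard, the sketch below ranks services by how many other services sit upstream of them. The service names and the "count of transitive dependents" score are assumptions made for the example, not FinSecure's actual inventory or scoring model.

```python
from collections import deque

# Hypothetical service map: each key depends on the services in its list.
DEPENDS_ON = {
    "web-frontend": ["payments-api", "accounts-api"],
    "payments-api": ["ledger-db", "payment-gateway"],
    "accounts-api": ["ledger-db"],
    "ledger-db": [],
    "payment-gateway": [],
}


def transitive_dependents(depends_on):
    """Map each service to the set of services that rely on it, directly or not."""
    # Invert the edges: service -> services that call it directly.
    callers = {name: set() for name in depends_on}
    for service, deps in depends_on.items():
        for dep in deps:
            callers.setdefault(dep, set()).add(service)

    impact = {}
    for service in callers:
        seen = set()
        queue = deque(callers[service])
        # Breadth-first walk up the call chain.
        while queue:
            current = queue.popleft()
            if current in seen:
                continue
            seen.add(current)
            queue.extend(callers.get(current, ()))
        impact[service] = seen
    return impact


def rank_by_blast_radius(depends_on):
    """Order services from most to fewest transitive dependents."""
    impact = transitive_dependents(depends_on)
    return sorted(impact, key=lambda s: (-len(impact[s]), s))


if __name__ == "__main__":
    for name in rank_by_blast_radius(DEPENDS_ON):
        print(name)
```

A real scorecard would weigh more than fan-in (transaction volume, regulatory exposure, recovery history), but even this simple count makes it obvious which components deserve the first experiments.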
Next, we defined clear resilience objectives for each service, such as:
These objectives formed the foundation of our experiment success criteria.
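To show how an objective can double as a success criterion, here is a minimal sketch that compares measurements taken during an experiment against a stated objective. The field names, services, and thresholds are hypothetical; the actual objectives used for FinSecure are not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResilienceObjective:
    service: str
    max_recovery_seconds: float
    max_error_rate: float  # fraction of failed requests, 0.0 to 1.0


@dataclass(frozen=True)
class ExperimentResult:
    service: str
    recovery_seconds: float
    error_rate: float


def evaluate(objective, result):
    """Return a list of violated criteria; an empty list means success."""
    violations = []
    if result.recovery_seconds > objective.max_recovery_seconds:
        violations.append(
            f"{result.service}: recovered in {result.recovery_seconds}s, "
            f"limit is {objective.max_recovery_seconds}s"
        )
    if result.error_rate > objective.max_error_rate:
        violations.append(
            f"{result.service}: error rate {result.error_rate:.2%}, "
            f"limit is {objective.max_error_rate:.2%}"
        )
    return violations


if __name__ == "__main__":
    # Illustrative numbers only.
    objective = ResilienceObjective("payments-api", 60.0, 0.01)
    result = ExperimentResult("payments-api", 95.0, 0.004)
    for line in evaluate(objective, result):
        print(line)
```

Writing objectives down in a machine-checkable form like this is what lets an experiment report a pass or fail rather than a subjective impression.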
With AWS FIS, all chaos experiments are defined as “experiment templates.” We created a library of templates for FinSecure, each targeting specific failure scenarios:
Each template included:
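For readers who have not seen one, a template of this kind can be sketched as a plain Python dictionary that follows the shape of the FIS `CreateExperimentTemplate` request. The ARNs, tag values, and template name below are placeholders, and this is an illustrative example rather than one of the templates built for FinSecure.

```python
import json


def build_stop_instance_template(role_arn, alarm_arn, tag_key, tag_value):
    """Build an FIS-style template that stops one tagged EC2 instance."""
    return {
        "description": "Stop one tagged EC2 instance and restart it after 5 minutes",
        "roleArn": role_arn,  # IAM role that FIS assumes to run the actions
        "targets": {
            "taggedInstances": {
                "resourceType": "aws:ec2:instance",
                "resourceTags": {tag_key: tag_value},
                "selectionMode": "COUNT(1)",  # limit the blast radius to one instance
            }
        },
        "actions": {
            "stopInstances": {
                "actionId": "aws:ec2:stop-instances",
                "parameters": {"startInstancesAfterDuration": "PT5M"},
                "targets": {"Instances": "taggedInstances"},
            }
        },
        # If this CloudWatch alarm fires, FIS stops the experiment.
        "stopConditions": [
            {"source": "aws:cloudwatch:alarm", "value": alarm_arn}
        ],
        "tags": {"Name": "stop-one-instance"},
    }


if __name__ == "__main__":
    # Placeholder identifiers; substitute real ARNs before use.
    template = build_stop_instance_template(
        role_arn="arn:aws:iam::111122223333:role/example-fis-role",
        alarm_arn="arn:aws:cloudwatch:us-east-1:111122223333:alarm:example-error-rate",
        tag_key="ChaosReady",
        tag_value="true",
    )
    print(json.dumps(template, indent=2))
```

Keeping the template as data makes it easy to review and version. Assuming boto3 is available, a dictionary in this shape should be usable as keyword arguments to the `fis` client's `create_experiment_template` call, though that step is left out here so the sketch runs without AWS credentials.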
Simply running chaos experiments provides limited value without proper observability and integration into development workflows. We integrated AWS FIS with:
The real game-changer was integrating resilience testing into FinSecure’s deployment pipelines. New code couldn’t reach production unless it passed not only functional tests but also resilience tests.
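One way such a gate could work is sketched below: the pipeline step starts an experiment, polls its state, and fails the build unless the experiment finishes cleanly. The `get_status` callable is an assumption standing in for a real lookup (for example, reading `experiment.state.status` from the FIS `GetExperiment` response); the polling intervals are arbitrary, and this is not the pipeline code used at FinSecure.

```python
import time

# States in which an FIS experiment is still progressing. Anything else is
# treated as terminal, so unfamiliar states fail the gate instead of hanging it.
IN_PROGRESS_STATES = {"pending", "initiating", "running", "stopping"}


def wait_for_experiment(get_status, poll_seconds=15.0, timeout_seconds=1800.0,
                        sleep=time.sleep):
    """Poll until the experiment leaves the in-progress states or time runs out."""
    waited = 0.0
    while True:
        status = get_status()
        if status not in IN_PROGRESS_STATES:
            return status
        if waited >= timeout_seconds:
            return "timed_out"
        sleep(poll_seconds)
        waited += poll_seconds


def gate_exit_code(final_status):
    """Return 0 only for a completed experiment; any other outcome blocks the deploy."""
    return 0 if final_status == "completed" else 1


if __name__ == "__main__":
    # Simulated status sequence in place of real API calls.
    fake_statuses = iter(["pending", "running", "running", "completed"])
    final = wait_for_experiment(lambda: next(fake_statuses), sleep=lambda _: None)
    print(final, gate_exit_code(final))
```

Treating `stopped` as a failure is deliberate in this sketch: an experiment halted by a stop condition means the workload breached its alarm threshold, which is exactly the signal the gate exists to catch.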
Initially, there was significant skepticism from both operations and development teams. The idea of deliberately causing failures seemed counterintuitive. However, the results spoke for themselves:
One particularly memorable discovery came during a network disruption experiment. We found that while the application had proper retry logic for most API calls, it would hang indefinitely when connections to a particular third-party payment gateway timed out. This issue would likely have caused a major outage during a real network event, but we were able to address it proactively.
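The general remedy for this class of bug is to put an explicit timeout on every outbound call and cap the number of retries, so a slow dependency produces a fast, visible error instead of a hang. The sketch below shows the pattern; the `send` callable, retry counts, and exception name are hypothetical, and the specific fix applied to FinSecure's gateway client is not described here.

```python
import time


class GatewayUnavailable(Exception):
    """Raised when every attempt to reach the gateway timed out."""


def call_with_bounded_retries(send, attempts=3, timeout_seconds=2.0,
                              backoff_seconds=0.5, sleep=time.sleep):
    """Call send(timeout_seconds), retrying on timeouts up to a fixed limit."""
    last_error = None
    for attempt in range(attempts):
        try:
            # The callee must honor the timeout; without one it can block forever.
            return send(timeout_seconds)
        except TimeoutError as error:
            last_error = error
            if attempt < attempts - 1:
                # Exponential backoff between attempts: 0.5s, 1s, 2s, ...
                sleep(backoff_seconds * (2 ** attempt))
    raise GatewayUnavailable(f"gave up after {attempts} attempts") from last_error


if __name__ == "__main__":
    calls = []

    def flaky_gateway(timeout):
        # Simulated dependency: times out twice, then succeeds.
        calls.append(timeout)
        if len(calls) < 3:
            raise TimeoutError("simulated gateway timeout")
        return "payment accepted"

    print(call_with_bounded_retries(flaky_gateway, sleep=lambda _: None))
```

A network disruption experiment is a natural way to verify this behavior: with the gateway unreachable, the caller should surface `GatewayUnavailable` within a predictable window rather than holding the request open.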
The CTO at FinSecure summarized it best: “We went from hoping our recovery plans would work to knowing they work because we test them continuously.”
Through our implementation with FinSecure and subsequent projects with other clients, we’ve developed several best practices for AWS FIS:
Start small and expand gradually: Begin with non-critical workloads and simple failure scenarios before advancing to critical systems.
Make safety a priority: Always implement stop conditions and rollback procedures for experiments.
Measure everything: Define clear metrics before experiments to quantify impact and improvement.
Run experiments regularly: Resilience testing should be continuous, not a one-time exercise.
Use tagging strategically: Develop a consistent tagging strategy to precisely target resources for experiments.
Involve the right stakeholders: Security, operations, development, and business teams should all participate in planning.
Document everything: Build a knowledge base of failure scenarios, their impact, and mitigation strategies.
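To make the tagging practice concrete, the sketch below filters a resource inventory by tags, assuming a resource qualifies only when it carries every required tag with the expected value. The inventory, tag names, and values are hypothetical.

```python
def matches_tags(resource_tags, required_tags):
    """True when the resource carries every required tag with the expected value."""
    return all(resource_tags.get(key) == value
               for key, value in required_tags.items())


def select_targets(resources, required_tags):
    """Return the IDs of resources eligible for an experiment."""
    return [r["id"] for r in resources if matches_tags(r["tags"], required_tags)]


if __name__ == "__main__":
    # Hypothetical inventory; IDs and tags are illustrative.
    inventory = [
        {"id": "i-0aaa", "tags": {"Environment": "staging", "ChaosReady": "true"}},
        {"id": "i-0bbb", "tags": {"Environment": "production", "ChaosReady": "true"}},
        {"id": "i-0ccc", "tags": {"Environment": "staging"}},
    ]
    print(select_targets(inventory, {"Environment": "staging", "ChaosReady": "true"}))
```

Requiring an explicit opt-in tag such as `ChaosReady` alongside an environment tag means a resource is never targeted by accident: a team has to mark it as ready before any experiment can select it.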
Building on our success with AWS FIS, we’ve now implemented quarterly “Game Days” for FinSecure, where cross-functional teams respond to increasingly complex failure scenarios. These exercises have proven invaluable for building the muscle memory needed during real incidents.
Next, we’re working toward fully automated resilience testing that targets newly deployed resources and services as they appear, so that resilience does not quietly degrade as the system evolves.
If you’re running critical workloads on AWS, AWS FIS is likely a good fit for you. While there are other chaos engineering tools available (like Chaos Monkey or Gremlin), AWS FIS provides native integration with AWS services and a managed experience that reduces the risk of unintended consequences.
We’ve found AWS FIS to be particularly valuable for:
At ExclCloud Solutions, we believe that resilience is not a feature you add to systems; it’s a property that emerges from disciplined testing and continuous improvement. AWS FIS has become an essential tool in our arsenal for helping clients achieve true cloud resilience.
Interested in learning more about how AWS FIS might fit into your resilience strategy? Contact our team for a no-obligation assessment of your current AWS environment.
You’re one step closer to optimizing your IT operations in the cloud.