Building Cloud Resilience: How We Implemented AWS Fault Injection Simulator for a Financial Services Client

Published on April 22, 2025

In the world of cloud infrastructure, hope is not a strategy. Systems fail, networks degrade, and resources become constrained, often at the most inconvenient times. At ExclCloud Solutions, we’ve long advocated for proactive resilience testing rather than reactive firefighting. That’s why, when AWS launched its Fault Injection Simulator (FIS), we were eager to implement it for our clients. This post details how we used AWS FIS to transform the reliability posture of one of our financial services clients.

What is AWS FIS and Why Should You Care?

AWS Fault Injection Simulator (since renamed AWS Fault Injection Service; the FIS abbreviation is unchanged) is a fully managed service that enables controlled chaos engineering experiments on AWS workloads. If you’re unfamiliar with chaos engineering, it’s the practice of deliberately injecting failures into systems to test their resilience and recovery capabilities. Think of it as “vaccinating” your infrastructure against potential failures by exposing it to small, controlled doses of disruption.

AWS FIS allows you to:

  • Simulate AWS service disruptions (like API throttling or service unavailability).
  • Stress test compute resources by maxing out CPU or memory.
  • Terminate EC2 instances or Kubernetes pods.
  • Disrupt network connectivity.
  • Simulate latency in AWS API calls.

All of this happens within a controlled environment with built-in safety mechanisms and rollback procedures.

The Client Challenge: Financial Services with Five-Nines Requirements

Our client, a mid-sized financial services company (we’ll call them “FinSecure” to maintain confidentiality), approached us with a challenge familiar to many in regulated industries. They had migrated their core transaction processing platform to AWS over the previous year, but while their infrastructure was now cloud-native, their resilience testing still followed legacy manual processes.

Their challenges were clear:

  • Regulatory requirements demanded 99.999% uptime for core services.
  • Their disaster recovery testing was quarterly, manual, and disruptive.
  • They had no systematic way to verify that auto-scaling, load balancing, and self-healing mechanisms actually worked as expected.
  • Engineers were hesitant to make changes due to fear of breaking production systems.
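
To put the first requirement in perspective, a short calculation shows how little downtime an availability target leaves. This is an illustrative sketch only: the function name and the 365-day year are our assumptions, not part of FinSecure’s tooling.

```python
def allowed_downtime_minutes(availability_pct: float, period_days: float = 365.0) -> float:
    """Return the downtime budget, in minutes, for an availability target.

    availability_pct is a percentage such as 99.999; period_days is the
    measurement window (assumed here to be a 365-day year).
    """
    total_minutes = period_days * 24 * 60
    return (1.0 - availability_pct / 100.0) * total_minutes


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        budget = allowed_downtime_minutes(target)
        print(f"{target}% -> {budget:.2f} minutes per year")
```

At 99.999%, the budget works out to roughly 5.26 minutes per year, which leaves very little room for manual recovery steps.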

Our Approach: Controlled Chaos with AWS FIS

After assessing FinSecure’s environment, we proposed implementing AWS FIS in phases, starting with non-critical workloads and gradually moving toward their core financial processing applications. Here’s how we approached the implementation:

Phase 1: Discovery and Planning

We began by mapping dependencies between services and creating a comprehensive resilience scorecard for each component. This helped identify the most critical areas where failures would have the biggest business impact.

Next, we defined clear resilience objectives for each service, such as:

  • Recovery time objectives (RTO).
  • Recovery point objectives (RPO).
  • Acceptable performance degradation thresholds.

These objectives formed the foundation of our experiment success criteria.
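
One way to make such objectives machine-checkable is to record them as data and compare them with what an experiment observed. The sketch below is a minimal illustration under our own assumptions: the class names, field names, and the example numbers are hypothetical and are not FinSecure’s actual scorecard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResilienceObjective:
    service: str
    rto_seconds: float         # maximum tolerated time to recover
    rpo_seconds: float         # maximum tolerated window of lost data
    max_p99_latency_ms: float  # acceptable degradation while the fault is active


@dataclass(frozen=True)
class ExperimentObservation:
    recovery_seconds: float
    data_loss_seconds: float
    p99_latency_ms: float


def violations(objective: ResilienceObjective, observed: ExperimentObservation) -> list[str]:
    """Return the names of the objectives that were missed; empty means success."""
    missed = []
    if observed.recovery_seconds > objective.rto_seconds:
        missed.append("RTO")
    if observed.data_loss_seconds > objective.rpo_seconds:
        missed.append("RPO")
    if observed.p99_latency_ms > objective.max_p99_latency_ms:
        missed.append("latency")
    return missed


if __name__ == "__main__":
    # Made-up numbers, purely for illustration.
    objective = ResilienceObjective("payments-api", 60.0, 0.0, 800.0)
    observed = ExperimentObservation(45.0, 0.0, 950.0)
    print(violations(objective, observed))
```

Keeping the criteria in data rather than in a document means the same check can run unchanged after every experiment.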

Phase 2: Building the FIS Experiment Templates

With AWS FIS, all chaos experiments are defined as “experiment templates.” We created a library of templates for FinSecure, each targeting specific failure scenarios:

  1. Infrastructure Failures: Terminating EC2 instances across availability zones.
  2. Network Disruption: Introducing latency and packet loss between services.
  3. Resource Exhaustion: Maxing out CPU/memory on critical services.
  4. Dependency Failures: Simulating failures in external APIs and databases.

Each template included:

  • Precisely targeted resources (using resource tags).
  • Stop conditions to automatically halt experiments if they exceeded safety thresholds.
  • IAM roles with least-privilege permissions.
  • Scheduled maintenance windows to minimize business impact.
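
As a rough illustration of tag-based targeting, a stop condition, and a dedicated IAM role in one template, the sketch below builds the request body for an instance-termination experiment. It is not one of FinSecure’s templates: the function name, the tag key and value, and the ARNs are placeholders we invented. The dictionary follows the shape accepted by the FIS CreateExperimentTemplate API, but treat it as a starting point to check against the current AWS documentation.

```python
def build_terminate_template(role_arn: str, alarm_arn: str,
                             tag_key: str, tag_value: str,
                             percent: int = 25) -> dict:
    """Build an FIS experiment template body that terminates a share of tagged EC2 instances."""
    if not 1 <= percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    return {
        "description": f"Terminate {percent}% of instances tagged {tag_key}={tag_value}",
        "roleArn": role_arn,  # role that FIS assumes; keep its permissions minimal
        "targets": {
            "tagged-instances": {
                "resourceType": "aws:ec2:instance",
                "resourceTags": {tag_key: tag_value},
                "selectionMode": f"PERCENT({percent})",
            }
        },
        "actions": {
            "terminate": {
                "actionId": "aws:ec2:terminate-instances",
                "targets": {"Instances": "tagged-instances"},
            }
        },
        # The experiment halts automatically if this CloudWatch alarm fires.
        "stopConditions": [{"source": "aws:cloudwatch:alarm", "value": alarm_arn}],
    }


if __name__ == "__main__":
    template = build_terminate_template(
        role_arn="arn:aws:iam::111122223333:role/example-fis-role",                    # placeholder
        alarm_arn="arn:aws:cloudwatch:us-east-1:111122223333:alarm:example-error-rate",  # placeholder
        tag_key="chaos-ready",  # hypothetical tag
        tag_value="true",
    )
    print(template["targets"]["tagged-instances"]["selectionMode"])
    # With credentials configured, the body could be submitted with boto3:
    #   boto3.client("fis").create_experiment_template(clientToken=str(uuid.uuid4()), **template)
```

Building the body as plain data keeps templates reviewable in version control before anything is created in an AWS account.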

Phase 3: Integration with Monitoring and CI/CD

Simply running chaos experiments provides limited value without proper observability and integration into development workflows. We integrated AWS FIS with:

  • CloudWatch dashboards: Custom metrics tracking system behavior during experiments.
  • X-Ray traces: Detailed analysis of request flows during disruptions.
  • EventBridge rules: Automated notifications and response activities.
  • CI/CD pipelines: Resilience tests triggered automatically after deployments.

The real game-changer was integrating resilience testing into FinSecure’s deployment pipelines. New code couldn’t reach production unless it passed not only functional tests but also resilience tests.
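
A pipeline gate of this kind can be sketched as a small polling loop that waits for an experiment to reach a terminal state and fails the build unless it completed cleanly. The function names, the polling interval, and the rule that only "completed" passes are our assumptions for illustration, not a description of FinSecure’s pipeline. The status strings are the ones FIS reports for an experiment’s state.

```python
import time
from typing import Callable

# Experiment states after which no further progress is expected.
TERMINAL_STATES = {"completed", "stopped", "failed", "cancelled"}


def wait_for_experiment(get_status: Callable[[], str],
                        poll_seconds: float = 15.0,
                        timeout_seconds: float = 1800.0,
                        sleep: Callable[[float], None] = time.sleep,
                        clock: Callable[[], float] = time.monotonic) -> str:
    """Poll get_status() until the experiment reaches a terminal state or the deadline passes."""
    deadline = clock() + timeout_seconds
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"experiment still '{status}' after {timeout_seconds}s")
        sleep(poll_seconds)


def gate_passes(final_status: str) -> bool:
    """Assumed policy: only a clean completion passes. A 'stopped' experiment
    (for example, one halted by a stop condition) fails the gate."""
    return final_status == "completed"


if __name__ == "__main__":
    # Simulated status sequence. With boto3, get_status could instead be:
    #   lambda: fis.get_experiment(id=experiment_id)["experiment"]["state"]["status"]
    simulated = iter(["pending", "running", "running", "completed"])
    final = wait_for_experiment(lambda: next(simulated), sleep=lambda _: None)
    print(final, gate_passes(final))
```

Injecting the status function, sleep, and clock keeps the gate logic testable without touching an AWS account.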

Real Results: From Skepticism to Resilience Champions

Initially, there was significant skepticism from both operations and development teams. The idea of deliberately causing failures seemed counterintuitive. However, the results spoke for themselves:

  • 27 previously unknown failure modes discovered during the first month of experiments.
  • 85% reduction in mean time to recovery (MTTR) for common failure scenarios.
  • Elimination of false positive alerts by tuning monitoring based on real failure data.
  • Increased deployment frequency as teams gained confidence in their system’s resilience.

One particularly memorable discovery came during a network disruption experiment. We found that while the application had proper retry logic for most API calls, it would hang indefinitely when connections to a particular third-party payment gateway timed out. This issue would likely have caused a major outage during a real network event, but we were able to address it proactively.
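
The general remedy for this class of bug is to give every outbound call an explicit timeout and a bounded number of retries. The sketch below shows that pattern in isolation. It is not the fix applied at FinSecure: the function name, the default values, and the convention that the wrapped callable accepts a `timeout` keyword are all assumptions.

```python
import time
from typing import Any, Callable


def call_with_deadline(request: Callable[..., Any],
                       timeout_seconds: float = 3.0,
                       max_attempts: int = 3,
                       backoff_seconds: float = 0.5,
                       sleep: Callable[[float], None] = time.sleep) -> Any:
    """Call request(timeout=...) with a bounded timeout and bounded retries.

    Retries only on TimeoutError, doubling the wait between attempts, and
    raises TimeoutError once max_attempts is exhausted instead of hanging.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return request(timeout=timeout_seconds)
        except TimeoutError as exc:
            last_error = exc
            if attempt < max_attempts:
                sleep(backoff_seconds * 2 ** (attempt - 1))
    raise TimeoutError(f"gave up after {max_attempts} attempts") from last_error


if __name__ == "__main__":
    attempts = []

    def flaky_gateway(timeout: float) -> str:
        # Stand-in for a gateway call: times out twice, then succeeds.
        attempts.append(timeout)
        if len(attempts) < 3:
            raise TimeoutError("simulated timeout")
        return "authorized"

    print(call_with_deadline(flaky_gateway, sleep=lambda _: None))
```

For payment requests in particular, retries are only safe when the call is idempotent, for example through an idempotency key that the gateway honors.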

The CTO at FinSecure summarized it best: “We went from hoping our recovery plans would work to knowing they work because we test them continuously.”

Lessons Learned: Best Practices for AWS FIS Implementation

Through our implementation with FinSecure and subsequent projects with other clients, we’ve developed several best practices for AWS FIS:

  1. Start small and expand gradually: Begin with non-critical workloads and simple failure scenarios before advancing to critical systems.

  2. Make safety a priority: Always implement stop conditions and rollback procedures for experiments.

  3. Measure everything: Define clear metrics before experiments to quantify impact and improvement.

  4. Run experiments regularly: Resilience testing should be continuous, not a one-time exercise.

  5. Use tagging strategically: Develop a consistent tagging strategy to precisely target resources for experiments.

  6. Involve the right stakeholders: Security, operations, development, and business teams should all participate in planning.

  7. Document everything: Build a knowledge base of failure scenarios, their impact, and mitigation strategies.

Looking Ahead: Game Days and Automated Resilience Testing

Building on our success with AWS FIS, we’ve now implemented quarterly “Game Days” for FinSecure, where cross-functional teams respond to increasingly complex failure scenarios. These exercises have proven invaluable for building the muscle memory needed during real incidents.

Next, we’re working toward fully automated resilience testing that intelligently targets newly deployed resources and services, so that resilience does not quietly degrade as the system evolves.

Is AWS FIS Right for Your Organization?

If you’re running critical workloads on AWS, the answer is likely yes. While there are other chaos engineering tools available (like Chaos Monkey or Gremlin), AWS FIS provides native integration with AWS services and a managed experience that reduces the risk of unintended consequences.

We’ve found AWS FIS to be particularly valuable for:

  • Regulated industries with strict uptime requirements.
  • Organizations undergoing cloud migration or modernization.
  • Teams adopting microservices architectures.
  • Any business where downtime directly impacts revenue or reputation.

At ExclCloud Solutions, we believe that resilience is not a feature you add to systems; it’s a property that emerges from disciplined testing and continuous improvement. AWS FIS has become an essential tool in our arsenal for helping clients achieve true cloud resilience.

Interested in learning more about how AWS FIS might fit into your resilience strategy? Contact our team for a no-obligation assessment of your current AWS environment.
