By Ayush Raj Jha, Senior Software Engineer, Oracle Health | Former Software Engineer, Motorola Solutions
Learn how to automate CloudWatch alerts, Kubernetes remediation, and incident reporting using multi-agent AI workflows with the AWS Strands Agents SDK.
The SRE Incident Response Agent is a multi-agent sample that ships with the AWS Strands Agents SDK. It automatically discovers active CloudWatch alarms, performs AI-powered root cause analysis using Claude Sonnet 4 on Amazon Bedrock, proposes Kubernetes or Helm remediations, and posts a structured incident report to Slack.
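At a high level, the sample chains four agents: alarm discovery, root cause analysis, remediation, and reporting. The control flow can be sketched in plain Python with stub data; the function names and canned values below are illustrative stand-ins, not the Strands SDK's actual API:

```python
# Illustrative sketch of the four-agent pipeline. These are plain
# functions standing in for the Strands agents; the real sample wires
# them up with the Strands Agents SDK and live AWS/Bedrock calls.

def cloudwatch_agent(trigger):
    # Would call DescribeAlarms / GetMetricStatistics / FilterLogEvents.
    return {"alarm": "my-api-HighCPU", "avg_cpu": 91.3, "oom_events": 14}

def rca_agent(evidence):
    # Would send the gathered evidence to Claude Sonnet 4 on Bedrock.
    return {"root_cause": "memory leak", "severity": "P2",
            "fix": "rolling restart"}

def remediation_agent(rca, dry_run=True):
    cmd = "kubectl rollout restart deployment/my-api -n prod"
    return f"[DRY-RUN] {cmd}" if dry_run else cmd

def report_agent(evidence, rca, action):
    return (f"[{rca['severity']}] {evidence['alarm']}: "
            f"{rca['root_cause']} -> {action}")

def respond(trigger=None, dry_run=True):
    evidence = cloudwatch_agent(trigger)
    rca = rca_agent(evidence)
    action = remediation_agent(rca, dry_run)
    return report_agent(evidence, rca, action)

print(respond("High CPU alarm fired on ECS service my-api"))
```

The point of the pattern is that each stage consumes the previous stage's structured output, so any one agent can be swapped out without touching the others.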
This guide covers everything you need to clone the repo and run it yourself.
Before you begin, make sure the following are in place:
- AWS credentials configured (via aws configure or an active IAM role)
- kubectl and helm v3 installed; only required if you plan to run live remediations, since dry-run mode works without them

The sample lives inside the strands-agents/samples open source repository. Clone it and navigate to the SRE agent directory:
git clone https://github.com/strands-agents/samples.git
cd samples/02-samples/sre-incident-response-agent
The directory contains the following files:
sre-incident-response-agent/
├── sre_agent.py # Main agent: 4 agents + 8 tools
├── test_sre_agent.py # Pytest unit tests (12 tests, mocked AWS)
├── requirements.txt
├── .env.example
└── README.md
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -r requirements.txt
The requirements.txt lists the core dependencies (as lower bounds, not exact pins):
strands-agents>=0.1.0
strands-agents-tools>=0.1.0
boto3>=1.38.0
botocore>=1.38.0
Copy .env.example to .env and fill in your values:
cp .env.example .env
Open .env and set the following:
# AWS region where your CloudWatch alarms live
AWS_REGION=us-east-1
# Amazon Bedrock model ID (Claude Sonnet 4 is the default)
BEDROCK_MODEL_ID=us.anthropic.claude-sonnet-4-20250514-v1:0
# DRY_RUN=true means kubectl/helm commands are printed, not executed.
# Set to false only when you are ready for live remediations.
DRY_RUN=true
# Optional: post the incident report to Slack.
# Leave blank to print to stdout instead.
SLACK_WEBHOOK_URL=
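One subtlety if you wire these settings into your own tooling: environment variables arrive as strings, and bool("false") is truthy in Python, so DRY_RUN needs explicit parsing. A minimal sketch; the env_flag helper is my own, not part of the sample:

```python
import os

def env_flag(name, default=True):
    # os.environ values are strings; "false"/"0"/"no" must all disable
    # the flag, which a naive bool() conversion would not do.
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in ("1", "true", "yes")

AWS_REGION = os.environ.get("AWS_REGION", "us-east-1")
BEDROCK_MODEL_ID = os.environ.get(
    "BEDROCK_MODEL_ID", "us.anthropic.claude-sonnet-4-20250514-v1:0")
DRY_RUN = env_flag("DRY_RUN", default=True)   # default to the safe mode
SLACK_WEBHOOK_URL = os.environ.get("SLACK_WEBHOOK_URL") or None
```

Defaulting DRY_RUN to True when the variable is missing keeps a misconfigured environment on the safe side.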
The agent needs read-only access to CloudWatch alarms, metric statistics, and log events. No write permissions to CloudWatch are required. Attach the following policy to the IAM role or user running the agent:
{
"Version": "2012-10-17",
"Statement": [{
"Effect": "Allow",
"Action": [
"cloudwatch:DescribeAlarms",
"cloudwatch:GetMetricStatistics",
"logs:FilterLogEvents",
"logs:DescribeLogGroups"
],
"Resource": "*"
}]
}
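If you assemble the policy by hand, a quick script can confirm it is valid JSON and contains only read-only actions before you attach it. The read-only verb prefixes below are my own heuristic, not an AWS-defined list:

```python
import json

# The same policy document as above, embedded for the check.
POLICY = """
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": [
      "cloudwatch:DescribeAlarms",
      "cloudwatch:GetMetricStatistics",
      "logs:FilterLogEvents",
      "logs:DescribeLogGroups"
    ],
    "Resource": "*"
  }]
}
"""

# Heuristic: these verb prefixes cover the read-only actions used here.
READ_ONLY_PREFIXES = ("Describe", "Get", "List", "Filter")

policy = json.loads(POLICY)
for stmt in policy["Statement"]:
    for action in stmt["Action"]:
        verb = action.split(":", 1)[1]
        assert verb.startswith(READ_ONLY_PREFIXES), f"write action: {action}"
print("policy OK: valid JSON, read-only actions only")
```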
There are two ways to trigger the agent.
Let the agent discover all active CloudWatch alarms on its own. This is the recommended mode for a real on-call scenario:
python sre_agent.py
Pass a natural-language description of the triggering event. The agent will focus its investigation on the service and symptom you describe:
python sre_agent.py "High CPU alarm fired on ECS service my-api in prod namespace"
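The two modes differ only in whether a trigger string is supplied on the command line. A plausible sketch of that dispatch, not copied from sre_agent.py:

```python
import sys

def parse_trigger(argv):
    # Join all CLI words into one natural-language trigger; None means
    # no trigger was given, i.e. autonomous alarm-discovery mode.
    trigger = " ".join(argv[1:]).strip()
    return trigger or None

# python sre_agent.py                  -> autonomous mode (trigger is None)
# python sre_agent.py "High CPU ..."   -> targeted mode
trigger = parse_trigger(sys.argv)
```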
Running the targeted trigger above produces output similar to the following:
Starting SRE Incident Response
Trigger: High CPU alarm fired on ECS service my-api in prod namespace
[cloudwatch_agent] Fetching active alarms...
Found alarm: my-api-HighCPU (CPUUtilization > 85% for 5m)
Metric stats: avg 91.3%, max 97.8% over last 30 min
Log events: 14 OOMKilled events in /ecs/my-api
[rca_agent] Performing root cause analysis...
Root cause: Memory leak causing CPU spike as GC thrashes
Severity: P2 - single service, <5% of users affected
Recommended fix: Rolling restart to clear heap; monitor for recurrence
[remediation_agent] Applying remediation...
[DRY-RUN] kubectl rollout restart deployment/my-api -n prod
================================================================
*[P2] SRE Incident Report - 2025-10-14 09:31 UTC*
What happened: CloudWatch alarm my-api-HighCPU fired at 09:18 UTC.
CPU reached 97.8% (threshold 85%). 14 OOMKilled events in 15 min.
Root cause: Memory leak in application heap leading to aggressive GC,
causing CPU saturation. Likely introduced in the last deployment.
Remediation: Rolling restart of deployment/my-api in namespace prod
initiated (dry-run). All pods will be replaced with fresh instances.
Follow-up:
- Monitor CPUUtilization for next 30 min
- Review recent commits for memory allocation changes
- Consider setting memory limits in the Helm chart
================================================================
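The [DRY-RUN] line in the output reflects a standard gating pattern: build the remediation command as a string, then either print it or execute it depending on the flag. A sketch using subprocess; the function name is illustrative:

```python
import shlex
import subprocess

def run_remediation(cmd, dry_run=True):
    """Print the command in dry-run mode; execute it otherwise."""
    if dry_run:
        print(f"[DRY-RUN] {cmd}")
        return None
    # shlex.split avoids shell=True and its quoting surprises.
    return subprocess.run(shlex.split(cmd), check=True,
                          capture_output=True, text=True)

run_remediation("kubectl rollout restart deployment/my-api -n prod")
```

Because the default is dry_run=True, a caller has to opt in explicitly before anything touches the cluster.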
The sample ships with 12 pytest unit tests that mock boto3 entirely. You can run the full test suite in any environment, including CI, without any AWS credentials:
pip install pytest pytest-mock
pytest test_sre_agent.py -v
# Expected: 12 passed
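The same mocking approach extends to tools you write yourself: accept the client as a parameter and inject a fake in tests. A minimal example with the standard library's unittest.mock; get_active_alarms here is a stand-in, not the sample's actual tool:

```python
from unittest.mock import Mock

def get_active_alarms(cw_client):
    # A tool like this would normally receive a real boto3 CloudWatch
    # client; in tests we pass a Mock with a canned response instead.
    resp = cw_client.describe_alarms(StateValue="ALARM")
    return [a["AlarmName"] for a in resp["MetricAlarms"]]

fake = Mock()
fake.describe_alarms.return_value = {
    "MetricAlarms": [{"AlarmName": "my-api-HighCPU"}]
}
assert get_active_alarms(fake) == ["my-api-HighCPU"]
fake.describe_alarms.assert_called_once_with(StateValue="ALARM")
```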
Once you have validated the agent's behaviour in dry-run mode and are satisfied with the decisions it makes, you can enable live kubectl and helm execution by setting DRY_RUN=false in your .env file:
DRY_RUN=false
In under five minutes of setup, the AWS Strands Agents SDK gives you a working multi-agent incident response loop: alarm discovery, AI-powered root cause analysis, Kubernetes remediation, and a structured incident report, all driven by a single python sre_agent.py command. The dry-run default means there is no risk in running it against a real environment while you evaluate its reasoning.
From here, the natural next steps are connecting a Slack webhook for team notifications, adding a PagerDuty tool for incident tracking, or extending the RCA agent with a vector store of past postmortems. All of that is a tool definition away.
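As an illustration of "a tool definition away": posting the finished report to a Slack incoming webhook takes only the standard library. The {"text": ...} payload is Slack's documented webhook format; the functions themselves are a sketch:

```python
import json
import urllib.request

def build_slack_payload(report_text):
    # Slack incoming webhooks accept a JSON body with a "text" field.
    return json.dumps({"text": report_text}).encode("utf-8")

def post_to_slack(webhook_url, report_text):
    req = urllib.request.Request(
        webhook_url,
        data=build_slack_payload(report_text),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:  # network call
        return resp.status  # Slack returns 200 on success
```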
Ayush Raj Jha is a Senior Software Engineer at Oracle Health and former Software Engineer at Motorola Solutions, where he built real-time optimization and device intelligence systems for public safety at scale. He holds IEEE Senior Member status and publishes peer-reviewed research on healthcare, IoT, cloud infrastructure, and mission-critical systems at IEEE-affiliated venues.