By Ayush Raj Jha, Senior Software Engineer, Oracle Health | Former Software Engineer, Motorola Solutions
Learn how to automate CloudWatch alerts, Kubernetes remediation, and incident reporting using multi-agent AI workflows with the AWS Strands Agents SDK.
The SRE Incident Response Agent is a multi-agent sample that ships with the AWS Strands Agents SDK. It automatically discovers active CloudWatch alarms, performs AI-powered root cause analysis using Claude Sonnet 4 on Amazon Bedrock, proposes Kubernetes or Helm remediations, and posts a structured incident report to Slack.
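At a high level, the sample chains four agents: alarm discovery, root cause analysis, remediation, and reporting. The control flow can be sketched in plain Python with stub data; the function names and canned values below are illustrative stand-ins, not the Strands SDK's actual API:

```python
# Illustrative sketch of the four-agent pipeline. These are plain
# functions standing in for the Strands agents; the real sample wires
# them up with the Strands Agents SDK and live AWS/Bedrock calls.

def cloudwatch_agent(trigger):
    # Would call DescribeAlarms / GetMetricStatistics / FilterLogEvents.
    return {"alarm": "my-api-HighCPU", "avg_cpu": 91.3, "oom_events": 14}

def rca_agent(evidence):
    # Would send the gathered evidence to Claude Sonnet 4 on Bedrock.
    return {"root_cause": "memory leak", "severity": "P2",
            "fix": "rolling restart"}

def remediation_agent(rca, dry_run=True):
    cmd = "kubectl rollout restart deployment/my-api -n prod"
    return f"[DRY-RUN] {cmd}" if dry_run else cmd

def report_agent(evidence, rca, action):
    return (f"[{rca['severity']}] {evidence['alarm']}: "
            f"{rca['root_cause']} -> {action}")

def respond(trigger=None, dry_run=True):
    evidence = cloudwatch_agent(trigger)
    rca = rca_agent(evidence)
    action = remediation_agent(rca, dry_run)
    return report_agent(evidence, rca, action)

print(respond("High CPU alarm fired on ECS service my-api"))
```

The point of the pattern is that each stage consumes the previous stage's structured output, so any one agent can be swapped out without touching the others.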
This guide covers everything you need to clone the repo and run it yourself.
Before you begin, make sure the following are in place:
- AWS credentials configured (via aws configure or an active IAM role)
- kubectl and helm v3 installed; only required if you plan to run live remediations, since dry-run mode works without them

The sample lives inside the strands-agents/samples open source repository. Clone it and navigate to the SRE agent directory:
git clone https://github.com/strands-agents/samples.git
cd samples/02-samples/sre-incident-response-agent
The directory contains the following files:
sre-incident-response-agent/
├── sre_agent.py # Main agent: 4 agents + 8 tools
├── test_sre_agent.py # Pytest unit tests (12 tests, mocked AWS)
├── requirements.txt
├── .env.example
└── README.md
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -r requirements.txt
The requirements.txt lists the core dependencies (as lower bounds, not exact pins):
strands-agents>=0.1.0
strands-agents-tools>=0.1.0
boto3>=1.38.0
botocore>=1.38.0
Copy .env.example to .env and fill in your values:
cp .env.example .env
Open .env and set the following:
# AWS region where your CloudWatch alarms live
AWS_REGION=us-east-1
# Amazon Bedrock model ID (Claude Sonnet 4 is the default)
BEDROCK_MODEL_ID=us.anthropic.claude-sonnet-4-20250514-v1:0
# DRY_RUN=true means kubectl/helm commands are printed, not executed.
# Set to false only when you are ready for live remediations.
DRY_RUN=true
# Optional: post the incident report to Slack.
# Leave blank to print to stdout instead.
SLACK_WEBHOOK_URL=
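One subtlety if you wire these settings into your own tooling: environment variables arrive as strings, and bool("false") is truthy in Python, so DRY_RUN needs explicit parsing. A minimal sketch; the env_flag helper is my own, not part of the sample:

```python
import os

def env_flag(name, default=True):
    # os.environ values are strings; "false"/"0"/"no" must all disable
    # the flag, which a naive bool() conversion would not do.
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in ("1", "true", "yes")

AWS_REGION = os.environ.get("AWS_REGION", "us-east-1")
BEDROCK_MODEL_ID = os.environ.get(
    "BEDROCK_MODEL_ID", "us.anthropic.claude-sonnet-4-20250514-v1:0")
DRY_RUN = env_flag("DRY_RUN", default=True)   # default to the safe mode
SLACK_WEBHOOK_URL = os.environ.get("SLACK_WEBHOOK_URL") or None
```

Defaulting DRY_RUN to True when the variable is missing keeps a misconfigured environment on the safe side.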
The agent needs read-only access to CloudWatch alarms, metric statistics, and log events. No write permissions to CloudWatch are required. Attach the following policy to the IAM role or user running the agent:
{
"Version": "2012-10-17",
"Statement": [{
"Effect": "Allow",
"Action": [
"cloudwatch:DescribeAlarms",
"cloudwatch:GetMetricStatistics",
"logs:FilterLogEvents",
"logs:DescribeLogGroups"
],
"Resource": "*"
}]
}
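If you assemble the policy by hand, a quick script can confirm it is valid JSON and contains only read-only actions before you attach it. The read-only verb prefixes below are my own heuristic, not an AWS-defined list:

```python
import json

# The same policy document as above, embedded for the check.
POLICY = """
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": [
      "cloudwatch:DescribeAlarms",
      "cloudwatch:GetMetricStatistics",
      "logs:FilterLogEvents",
      "logs:DescribeLogGroups"
    ],
    "Resource": "*"
  }]
}
"""

# Heuristic: these verb prefixes cover the read-only actions used here.
READ_ONLY_PREFIXES = ("Describe", "Get", "List", "Filter")

policy = json.loads(POLICY)
for stmt in policy["Statement"]:
    for action in stmt["Action"]:
        verb = action.split(":", 1)[1]
        assert verb.startswith(READ_ONLY_PREFIXES), f"write action: {action}"
print("policy OK: valid JSON, read-only actions only")
```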
There are two ways to trigger the agent.
Let the agent discover all active CloudWatch alarms on its own. This is the recommended mode for a real on-call scenario:
python sre_agent.py
Pass a natural-language description of the triggering event. The agent will focus its investigation on the service and symptom you describe:
python sre_agent.py "High CPU alarm fired on ECS service my-api in prod namespace"
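The two modes differ only in whether a trigger string is supplied on the command line. A plausible sketch of that dispatch, not copied from sre_agent.py:

```python
import sys

def parse_trigger(argv):
    # Join all CLI words into one natural-language trigger; None means
    # no trigger was given, i.e. autonomous alarm-discovery mode.
    trigger = " ".join(argv[1:]).strip()
    return trigger or None

# python sre_agent.py                  -> autonomous mode (trigger is None)
# python sre_agent.py "High CPU ..."   -> targeted mode
trigger = parse_trigger(sys.argv)
```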
Running the targeted trigger above produces output similar to the following:
Starting SRE Incident Response
Trigger: High CPU alarm fired on ECS service my-api in prod namespace
[cloudwatch_agent] Fetching active alarms...
Found alarm: my-api-HighCPU (CPUUtilization > 85% for 5m)
Metric stats: avg 91.3%, max 97.8% over last 30 min
Log events: 14 OOMKilled events in /ecs/my-api
[rca_agent] Performing root cause analysis...
Root cause: Memory leak causing CPU spike as GC thrashes
Severity: P2 - single service, <5% of users affected
Recommended fix: Rolling restart to clear heap; monitor for recurrence
[remediation_agent] Applying remediation...
[DRY-RUN] kubectl rollout restart deployment/my-api -n prod
================================================================
*[P2] SRE Incident Report - 2025-10-14 09:31 UTC*
What happened: CloudWatch alarm my-api-HighCPU fired at 09:18 UTC.
CPU reached 97.8% (threshold 85%). 14 OOMKilled events in 15 min.
Root cause: Memory leak in application heap leading to aggressive GC,
causing CPU saturation. Likely introduced in the last deployment.
Remediation: Rolling restart of deployment/my-api in namespace prod
initiated (dry-run). All pods will be replaced with fresh instances.
Follow-up:
- Monitor CPUUtilization for next 30 min
- Review recent commits for memory allocation changes
- Consider setting memory limits in the Helm chart
================================================================
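The [DRY-RUN] line in the output reflects a standard gating pattern: build the remediation command as a string, then either print it or execute it depending on the flag. A sketch using subprocess; the function name is illustrative:

```python
import shlex
import subprocess

def run_remediation(cmd, dry_run=True):
    """Print the command in dry-run mode; execute it otherwise."""
    if dry_run:
        print(f"[DRY-RUN] {cmd}")
        return None
    # shlex.split avoids shell=True and its quoting surprises.
    return subprocess.run(shlex.split(cmd), check=True,
                          capture_output=True, text=True)

run_remediation("kubectl rollout restart deployment/my-api -n prod")
```

Because the default is dry_run=True, a caller has to opt in explicitly before anything touches the cluster.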
The sample ships with 12 pytest unit tests that mock boto3 entirely. You can run the full test suite in any environment, including CI, without any AWS credentials:
pip install pytest pytest-mock
pytest test_sre_agent.py -v
# Expected: 12 passed
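The same mocking approach extends to tools you write yourself: accept the client as a parameter and inject a fake in tests. A minimal example with the standard library's unittest.mock; get_active_alarms here is a stand-in, not the sample's actual tool:

```python
from unittest.mock import Mock

def get_active_alarms(cw_client):
    # A tool like this would normally receive a real boto3 CloudWatch
    # client; in tests we pass a Mock with a canned response instead.
    resp = cw_client.describe_alarms(StateValue="ALARM")
    return [a["AlarmName"] for a in resp["MetricAlarms"]]

fake = Mock()
fake.describe_alarms.return_value = {
    "MetricAlarms": [{"AlarmName": "my-api-HighCPU"}]
}
assert get_active_alarms(fake) == ["my-api-HighCPU"]
fake.describe_alarms.assert_called_once_with(StateValue="ALARM")
```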
Once you have validated the agent's behaviour in dry-run mode and are satisfied with the decisions it makes, you can enable live kubectl and helm execution by setting DRY_RUN=false in your .env file:
DRY_RUN=false
In under five minutes of setup, the AWS Strands Agents SDK gives you a working multi-agent incident response loop: alarm discovery, AI-powered root cause analysis, Kubernetes remediation, and a structured incident report, all driven by a single python sre_agent.py command. The dry-run default means there is no risk in running it against a real environment while you evaluate its reasoning.
From here, the natural next steps are connecting a Slack webhook for team notifications, adding a PagerDuty tool for incident tracking, or extending the RCA agent with a vector store of past postmortems. All of that is a tool definition away.
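As an illustration of "a tool definition away": posting the finished report to a Slack incoming webhook takes only the standard library. The {"text": ...} payload is Slack's documented webhook format; the functions themselves are a sketch:

```python
import json
import urllib.request

def build_slack_payload(report_text):
    # Slack incoming webhooks accept a JSON body with a "text" field.
    return json.dumps({"text": report_text}).encode("utf-8")

def post_to_slack(webhook_url, report_text):
    req = urllib.request.Request(
        webhook_url,
        data=build_slack_payload(report_text),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:  # network call
        return resp.status  # Slack returns 200 on success
```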
Ayush Raj Jha is a Senior Software Engineer at Oracle Health and former Software Engineer at Motorola Solutions, where he built real-time optimization and device intelligence systems for public safety at scale. He holds IEEE Senior Member status and publishes peer-reviewed research on healthcare, IoT, cloud infrastructure, and mission-critical systems at IEEE-affiliated venues.