AWS DevOps Agent: AI-Powered Incident Investigation in Seconds
Stop spending 30 minutes investigating incidents. Let AI do it in seconds. Here is a hands-on demo you can practice in 15 minutes.
The Problem
3 AM. Production is down. You are doing this:
- Open CloudWatch → Check metrics
- Open Datadog → Review traces
- Open Splunk → Search logs
- Check GitHub → Find recent deployments
- Correlate everything manually → Find root cause
Time: 20-40 minutes of context switching and log correlation.
What if AI could do all of this in seconds?
The Solution: AWS DevOps Agent
Announced at AWS re:Invent 2025, AWS DevOps Agent is an AI service that automatically investigates incidents by:
- Analyzing logs, metrics, and traces across multiple tools
- Mapping infrastructure dependencies automatically
- Recommending fixes to prevent future incidents
- Integrating with your existing DevOps stack
Status: Public preview (us-east-1). Free during preview.
Who Should Use This?
Perfect For
- On-call engineers who spend hours investigating incidents
- SREs managing complex distributed systems
- Platform teams running multi-account AWS environments
- DevOps engineers correlating deployments with failures
Skip If
- You run simple applications with clear failure modes
- You rarely experience incidents
- You do not rely heavily on AWS services
My Test: Real Results
I deployed a Lambda function with an intentional error and let the AI investigate.
Setup
- Lambda function with division-by-zero error
- CloudWatch alarm monitoring failures
- 3 error-generating invocations
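For reference, a function of this kind might look like the sketch below. It is a hypothetical reconstruction based only on the setup described here (a handler that divides by zero); the handler name and the log message are assumptions, and the actual lambda_test.py in the repo may differ.

```python
import json


def lambda_handler(event, context):
    """Hypothetical test handler: always fails with ZeroDivisionError."""
    # Log the incoming event so the failure has some context in CloudWatch Logs.
    print(json.dumps({"message": "intentional failure test", "event": event}))

    # Intentional bug for the demo: dividing by zero raises ZeroDivisionError.
    result = 1 / 0

    # Never reached; shown only to illustrate a normal return shape.
    return {"statusCode": 200, "body": json.dumps({"result": result})}
```

Because the exception is unhandled, every invocation counts toward the Lambda Errors metric, which is what the CloudWatch alarm watches.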
Results
What the AI found in seconds:
“The Lambda function contains intentional test code that throws ZeroDivisionError at line 9 in lambda_test.py with the literal expression ‘result = 1 / 0’. This is not a production bug but an expected test behavior.”
What impressed me:
- Context-aware: Understood it was test code, not a bug
- Complete timeline: Linked deployment time to first error
- Exact location: Found the error on line 9
- Impact analysis: Calculated 100% failure rate
- Fast: AI analysis took seconds; the whole investigation took about 4 minutes
Before vs After
| Task | Manual | AI Agent | Savings |
|---|---|---|---|
| Check metrics | 2-3 min | Auto | 100% |
| Review logs | 3-5 min | Auto | 100% |
| Check deployments | 5-10 min | Auto | 100% |
| Correlate timeline | 5-10 min | Auto | 100% |
| Root cause | 5-10 min | Seconds | 90% |
| Total | 20-40 min | ~4 min | 80-90% |
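The 80-90% figure in the Total row follows directly from the two totals. A minimal sketch of the arithmetic, assuming the rounded values from the table (20-40 minutes manual, about 4 minutes with the agent); the function name is only illustrative:

```python
def savings_percent(manual_minutes: float, agent_minutes: float) -> float:
    """Percentage of time saved relative to the manual baseline."""
    return (1 - agent_minutes / manual_minutes) * 100


# Totals from the table: 20-40 minutes manual vs. roughly 4 minutes with the agent.
low = savings_percent(20, 4)   # fastest manual investigation
high = savings_percent(40, 4)  # slowest manual investigation
print(f"{low:.0f}% to {high:.0f}%")  # prints "80% to 90%"
```

These numbers come from a single test with a trivial failure, so treat them as an illustration rather than a benchmark.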
Three Core Features
1. AI Investigation
Auto-triggers from:
- ServiceNow tickets
- PagerDuty alerts
- Datadog/Dynatrace/Splunk webhooks
- Slack commands
What it analyzes:
- CloudWatch metrics, logs, alarms
- Third-party observability data
- Deployment history from GitHub/GitLab
- Infrastructure topology
- Historical incident patterns
Delivers:
- Root cause with reasoning
- Event timeline
- Blast radius analysis
- Mitigation steps
2. Topology Discovery
Automatically maps your AWS infrastructure:
- Resources across all accounts
- Service dependencies
- Links to source code
- Deployment history
Use it to:
- Understand blast radius during incidents
- See cascading failure patterns
- Assess change impact
3. Incident Prevention
After analyzing multiple incidents, the AI recommends:
- Observability: “Add alarm for Lambda cold starts”
- Testing: “Add load testing to pipeline”
- Code: “Implement retry logic for API calls”
- Infrastructure: “Enable Multi-AZ for RDS”
Integrations
Works with your existing tools:
Observability: CloudWatch • Datadog • Dynatrace • New Relic • Splunk
CI/CD: GitHub • GitLab
Ticketing: ServiceNow • PagerDuty
Chat: Slack
Kubernetes: Amazon EKS
Custom: MCP servers for proprietary tools
Try It: 15-Minute Demo
A hands-on demo using Terraform for infrastructure and manual Agent Space setup through the AWS Console.
Prerequisites
- AWS account with admin access
- AWS CLI v2 + Terraform installed
- Region: us-east-1
Quick Start
1. Clone & Deploy Infrastructure
git clone https://github.com/sprider/aws-devops-agent-demo.git
cd aws-devops-agent-demo
chmod +x lambda-test.sh
./lambda-test.sh deploy
This automatically creates:
- Lambda function with intentional error
- CloudWatch alarm
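The repo defines these resources in Terraform. To show what an error alarm on a Lambda function involves, here is a rough Python equivalent using the CloudWatch put_metric_alarm API. The alarm and function names are the ones that appear later in this guide; the period, threshold, and missing-data settings are assumptions and may not match lambda-test.tf.

```python
def build_error_alarm_params(function_name: str, alarm_name: str) -> dict:
    """Keyword arguments for CloudWatch put_metric_alarm (tuning values are assumed)."""
    return {
        "AlarmName": alarm_name,
        "Namespace": "AWS/Lambda",
        "MetricName": "Errors",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": "Sum",
        "Period": 60,             # assumption: evaluate one-minute windows
        "EvaluationPeriods": 1,   # assumption: alarm on the first breaching window
        "Threshold": 1,           # assumption: a single error is enough
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "notBreaching",
    }


def create_alarm(params: dict) -> None:
    """Create the alarm in us-east-1; requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above works without boto3

    boto3.client("cloudwatch", region_name="us-east-1").put_metric_alarm(**params)


params = build_error_alarm_params(
    "AWS-AIDevOps-test-lambda", "AWS-AIDevOps-Lambda-Error-Test"
)
```

Calling create_alarm(params) would create the alarm for real, so it is left out of the sketch.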

2. Create Agent Space (Manual - AWS Console)
The Agent Space must be created through the AWS Console to ensure proper Primary source configuration.
- Open the AWS DevOps Agent Console
- Click “Begin setup” or “Create Agent Space”
- Configure:
- Name: TestAgentSpace (or your preferred name)
- Description: Test Agent Space for Lambda error investigation demo
- Click “Create”

3. Configure Cloud Capabilities (Primary Source)
After Agent Space creation, configure AWS account access:
- In your Agent Space, go to “Settings” → “Cloud capabilities”
- Click “Add cloud capability”
- Select “AWS”
- Choose “Primary source” (not Secondary)
- Configuration:
- Account ID: Your AWS account (from terraform output aws_account_id)
- IAM Role: Use “Auto-create role” option
- Click “Add”

Note: The IAM roles required for the DevOps Agent are automatically created by AWS when you select “Auto-create role” - you do not need to create them manually. The Primary source configuration ensures the agent can properly access CloudWatch alarms, Lambda logs, and other AWS resources needed for investigations.
4. Generate Lambda Errors
./lambda-test.sh test
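The script invokes the function three times. If you prefer to do the same from Python, a sketch along these lines should work, assuming boto3 is installed and the function is named AWS-AIDevOps-test-lambda (inferred from the log group name used in the cleanup step). The helper name is hypothetical and this is not the script's actual implementation.

```python
def invoke_and_count_errors(lambda_client, function_name: str, attempts: int = 3) -> int:
    """Invoke the function `attempts` times and count the failed invocations."""
    errors = 0
    for _ in range(attempts):
        response = lambda_client.invoke(FunctionName=function_name, Payload=b"{}")
        # Lambda includes a "FunctionError" key when the handler raised an exception.
        if "FunctionError" in response:
            errors += 1
    return errors


# Usage (needs AWS credentials):
#   import boto3
#   client = boto3.client("lambda", region_name="us-east-1")
#   print(invoke_and_count_errors(client, "AWS-AIDevOps-test-lambda"))
```

With the intentionally broken handler, all three invocations should come back as errors.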

5. Wait for Alarm to Trigger
After generating errors, wait 1-2 minutes for the CloudWatch alarm to evaluate and enter ALARM state:
./lambda-test.sh status
Wait until you see AlarmState: ALARM before proceeding to the next step.
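Instead of re-running the status command by hand, you could poll the alarm until it flips. The sketch below uses the CloudWatch describe_alarms API; the helper name, timeout, and polling interval are assumptions for illustration, not part of the repo.

```python
import time


def wait_for_alarm(cloudwatch_client, alarm_name, timeout_s=300, interval_s=15,
                   sleep=time.sleep):
    """Poll until the alarm is in ALARM state; return False if the timeout passes."""
    waited = 0
    while True:
        response = cloudwatch_client.describe_alarms(AlarmNames=[alarm_name])
        alarms = response.get("MetricAlarms", [])
        if alarms and alarms[0]["StateValue"] == "ALARM":
            return True
        if waited >= timeout_s:
            return False
        sleep(interval_s)
        waited += interval_s


# Usage (needs AWS credentials):
#   import boto3
#   cw = boto3.client("cloudwatch", region_name="us-east-1")
#   wait_for_alarm(cw, "AWS-AIDevOps-Lambda-Error-Test")
```

The sleep function is a parameter only so the loop can be exercised without real waiting.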

6. Start Investigation
- In the AWS DevOps Agent Console, click on your Agent Space name (e.g., “TestAgentSpace”)
- Click the “Incident Response” tab
- In the “Start an investigation” text box, type: Lambda function throwing errors
- Click “Start investigation” button
- A modal will appear - fill in the investigation details:
- Investigation details: Keep “Lambda function throwing errors”
- Investigation starting point: CloudWatch alarm AWS-AIDevOps-Lambda-Error-Test
- Date and time of incident: Get current time with
date -u +"%Y-%m-%dT%H:%M:%SZ"
- Click “Start investigating…”
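If the date command is not available (for example on Windows), the same UTC timestamp format can be produced in Python. A small sketch; the function name is hypothetical:

```python
from datetime import datetime, timezone


def incident_timestamp(moment=None) -> str:
    """Format a time as UTC in the same shape as date -u +"%Y-%m-%dT%H:%M:%SZ"."""
    moment = moment or datetime.now(timezone.utc)
    # Convert to UTC first so the trailing "Z" is accurate for any aware datetime.
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


print(incident_timestamp())
```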

7. Watch AI Work
Watch the investigation in real-time. The AI will:
- Detect the alarm
- Pull Lambda logs
- Identify ZeroDivisionError
- Correlate deployment time
- Provide root cause

Investigation time: In seconds
8. Cleanup Everything
Important: Manually delete the Agent Space and auto-created resources from the AWS Console before destroying the infrastructure with ./lambda-test.sh destroy.
- Delete Agent Space:
- Go to AWS DevOps Agent Console
- Select your Agent Space
- Click “Actions” → “Delete Agent Space”
- Confirm deletion
- Note: This automatically removes the IAM roles created by the Agent Space
- Delete Lambda Log Group:
- Go to CloudWatch Console → Log groups
- Find /aws/lambda/AWS-AIDevOps-test-lambda
- Select it and click “Actions” → “Delete log group(s)”
- Confirm deletion
- Verify IAM Roles Cleanup (Optional):
- Go to IAM Console → Roles
- Search for roles created by the Agent Space (they usually have “DevOpsAgent” or “AIDevOps” in the name)
- These should be automatically deleted when the Agent Space is deleted
- If any remain, manually delete them
- Then run:
./lambda-test.sh destroy
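To double-check the optional IAM verification without clicking through the console, you could list role names and filter on the substrings mentioned above. This is a sketch: the markers come from this guide's description (“DevOpsAgent” or “AIDevOps”), the helper names are hypothetical, and the actual role names in your account may differ.

```python
def find_leftover_roles(role_names, markers=("DevOpsAgent", "AIDevOps")):
    """Return the role names that contain any of the marker substrings."""
    return [name for name in role_names if any(m in name for m in markers)]


def list_role_names(iam_client):
    """Collect every IAM role name in the account via the list_roles paginator."""
    names = []
    for page in iam_client.get_paginator("list_roles").paginate():
        names.extend(role["RoleName"] for role in page["Roles"])
    return names


# Usage (needs AWS credentials):
#   import boto3
#   print(find_leftover_roles(list_role_names(boto3.client("iam"))))
```

An empty list suggests nothing was left behind. Review any matches before deleting them, since a substring match can also catch unrelated roles.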

All Available Commands
./lambda-test.sh deploy # Deploy Lambda and CloudWatch alarm
./lambda-test.sh test # Generate Lambda errors (invoke 3 times)
./lambda-test.sh status # Check CloudWatch alarm status
./lambda-test.sh logs # View Lambda function logs
./lambda-test.sh destroy # Destroy all infrastructure
Cost
$0.00 in my test - the demo resources fit within the AWS Free Tier, and the agent itself is free during preview
Troubleshooting
Issue: “AWS account is not accessible” or “Monitor Association not found”
Error message in investigation:
Unable to investigate the Lambda function errors because AWS account XXX
is not accessible. The error 'Monitor Association with AgentSpace agentSpaceId
XXX not found' indicates this account is not associated with the monitoring system.
Root cause: Your AWS account is not configured as a Primary source in Cloud Capabilities.
Solution:
- Open your Agent Space in AWS Console
- Go to Settings → Cloud capabilities
- Check if your AWS account is listed under “Primary sources”
- If not listed or listed under “Secondary sources”:
- Click “Add cloud capability”
- Select “AWS”
- CRITICAL: Choose “Primary source” (NOT Secondary)
- Enter your AWS account ID (from terraform output aws_account_id)
- Use “Auto-create role” option
- Click “Add”
- Verify your account now appears under “Primary sources”
- Try the investigation again
Why this matters: Only Primary sources give the AI agent full access to CloudWatch alarms, Lambda logs, and other AWS resources needed for investigations.
Key Facts
What It Is
- AI layer that connects your existing tools
- Not a monitoring tool replacement
- Reduced investigation time by 80-90% in my test
Limitations (Preview)
- Region: us-east-1 only
- Quotas: 20 investigation hours/month, 10 prevention hours/month
- Pricing: Free now, pricing TBD at GA
Security
- Read-only permissions by default
- IAM-based access control
- Agent Space isolation
- AWS IAM Identity Center support
Common Questions
Q: Does it replace my observability tools? A: No. It sits on top of them, connecting data across tools.
Q: What if the AI is wrong? A: You are in control. Ask follow-up questions, steer investigations, or escalate to AWS Support.
Q: How secure is it? A: Very. Read-only by default, IAM-controlled, data stays in your account.
Q: Works with non-AWS tools? A: Yes. Integrates with Datadog, Dynatrace, New Relic, Splunk, GitHub, GitLab, ServiceNow, Slack.
Next Steps
After testing:
- Connect production - Create Agent Space for real environment
- Enable auto-triggers - Set up ServiceNow/PagerDuty webhooks
- Review recommendations - Implement prevention suggestions
- Expand scope - Connect multiple AWS accounts
Files in This Repo
aws-devops-agent-demo/
├── README.md # This guide
├── lambda-test.tf # Terraform: Lambda and CloudWatch alarm
├── lambda_test.py # Test Lambda function (division by zero)
├── lambda-test.sh # Automation script for deployment
├── .gitignore # Git ignore file
└── screenshots/ # Step-by-step screenshots of the demo
    ├── 01-terraform-deploy.png
    ├── 02-terraform-output.png
    ├── 03-devops-agent-console.png
    ├── 04-create-agent-space.png
    ├── 05-cloud-capabilities.png
    ├── 06-lambda-errors-generated.png
    ├── 07-cloudwatch-alarm-triggered.png
    ├── 08-incident-response-dashboard.png
    ├── 10-investigation-details-modal.png
    ├── 11-investigation-in-progress.png
    ├── 12-investigation-completed.png
    ├── 13-investigation-summary.png
    ├── 14-mitigation-plan.png
    └── 15-terraform-destroy.png
What is Automated vs Manual?
Automated via Terraform:
- Lambda function with intentional error
- CloudWatch alarm monitoring
Manual via AWS Console:
- Agent Space creation
- Cloud Capabilities configuration (Primary source setup + IAM role auto-creation)
- Agent Space deletion (which automatically removes auto-created IAM roles)
Why Manual? The Agent Space requires Primary source configuration through the console to ensure the AI agent can properly access AWS resources during investigations. The AWS CLI cannot currently configure this correctly. When you delete the Agent Space, AWS automatically cleans up the auto-created IAM roles.
About This Article
This article and accompanying automation scripts were developed with assistance from Claude Code (Anthropic). All code has been tested in my personal AWS environment and verified against the official AWS DevOps Agent User Guide.