AWS DevOps Agent: AI-Powered Incident Investigation in Seconds

Stop spending 30 minutes investigating incidents. Let AI do it in seconds. Here is a hands-on demo you can practice in 15 minutes.

The Problem

3 AM. Production is down. You are doing this:

Open CloudWatch → Check metrics
Open Datadog → Review traces
Open Splunk → Search logs
Check GitHub → Find recent deployments
Correlate everything manually → Find root cause

Time: 20-40 minutes of context switching and log correlation.

What if AI could do all of this in seconds?

The Solution: AWS DevOps Agent

Announced at AWS re:Invent 2025, AWS DevOps Agent is an AI service that automatically investigates incidents by:

Analyzing logs, metrics, and traces across multiple tools
Mapping infrastructure dependencies automatically
Recommending fixes to prevent future incidents
Integrating with your existing DevOps stack

Status: Public preview (us-east-1)

Free during preview

Who Should Use This?

Perfect For

On-call engineers who spend hours investigating incidents
SREs managing complex distributed systems
Platform teams running multi-account AWS environments
DevOps engineers correlating deployments with failures

Skip If

You run simple applications with clear failure modes
You rarely experience incidents
You do not rely heavily on AWS services

My Test: Real Results

I deployed a Lambda function with an intentional error and let the AI investigate.

Setup

Lambda function with division-by-zero error
CloudWatch alarm monitoring failures
3 error-generating invocations
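The repository's lambda_test.py is not reproduced in this article. As a rough sketch only, a handler that fails this way could look like the following. It assumes the conventional lambda_handler(event, context) entry point; apart from the expression result = 1 / 0, which the investigation output quotes, every detail here is an assumption rather than the repo's actual code.

```python
def lambda_handler(event, context):
    """Hypothetical stand-in for the demo handler: it always fails."""
    # Deliberate bug: dividing by zero raises ZeroDivisionError,
    # which Lambda reports as a failed invocation.
    result = 1 / 0
    return {"statusCode": 200, "body": str(result)}


if __name__ == "__main__":
    # Local dry run, no AWS needed: confirm the handler raises.
    try:
        lambda_handler({}, None)
    except ZeroDivisionError as exc:
        print(f"Invocation failed as intended: {exc}")
```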

Results

What the AI found in seconds:

“The Lambda function contains intentional test code that throws ZeroDivisionError at line 9 in lambda_test.py with the literal expression ‘result = 1 / 0’. This is not a production bug but an expected test behavior.”

What impressed me:

Context-aware: Understood it was test code, not a bug
Complete timeline: Linked deployment time to first error
Exact location: Found the error on line 9
Impact analysis: Calculated 100% failure rate
Fast: The AI analysis itself took seconds, and the whole investigation took about 4 minutes end to end

Before vs After

Task	Manual	AI Agent	Savings
Check metrics	2-3 min	Auto	100%
Review logs	3-5 min	Auto	100%
Check deployments	5-10 min	Auto	100%
Correlate timeline	5-10 min	Auto	100%
Root cause	5-10 min	Seconds	90%
Total	20-40 min	~4 min	80-90%
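The 80-90% figure in the Total row follows directly from the two totals. A small worked calculation, using only the numbers in the table (the function name is just for illustration):

```python
def savings_percent(manual_minutes: float, agent_minutes: float) -> float:
    """Share of the manual investigation time that is saved, in percent."""
    return (manual_minutes - agent_minutes) / manual_minutes * 100


AGENT_TOTAL_MIN = 4  # the "~4 min" total from the table

for manual in (20, 40):  # low and high end of the manual estimate
    saved = savings_percent(manual, AGENT_TOTAL_MIN)
    print(f"{manual} min manual -> {saved:.0f}% saved")

# Prints:
# 20 min manual -> 80% saved
# 40 min manual -> 90% saved
```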

Three Core Features

1. AI Investigation

Auto-triggers from:

ServiceNow tickets
PagerDuty alerts
Datadog/Dynatrace/Splunk webhooks
Slack commands

What it analyzes:

CloudWatch metrics, logs, alarms
Third-party observability data
Deployment history from GitHub/GitLab
Infrastructure topology
Historical incident patterns

Delivers:

Root cause with reasoning
Event timeline
Blast radius analysis
Mitigation steps

2. Topology Discovery

Automatically maps your AWS infrastructure:

Resources across all accounts
Service dependencies
Links to source code
Deployment history

Use it to:

Understand blast radius during incidents
See cascading failure patterns
Assess change impact
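To make "blast radius" concrete: given a dependency map, it is the set of resources that can be reached from the failed one. The sketch below walks a toy graph breadth-first. The resource names and the graph are invented for illustration, and this is not a description of how the agent computes topology internally.

```python
from collections import deque

# Hypothetical topology: each key is a resource, each value lists the
# resources that depend on it directly.
DEPENDENTS = {
    "rds-orders": ["orders-api"],
    "orders-api": ["checkout-web", "billing-worker"],
    "checkout-web": [],
    "billing-worker": ["invoice-mailer"],
    "invoice-mailer": [],
}


def blast_radius(graph, failed):
    """Return every resource that depends, directly or indirectly, on `failed`."""
    seen = set()
    queue = deque([failed])
    while queue:
        node = queue.popleft()
        for dependent in graph.get(node, []):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen


print(sorted(blast_radius(DEPENDENTS, "orders-api")))
# Prints: ['billing-worker', 'checkout-web', 'invoice-mailer']
```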

3. Incident Prevention

After analyzing multiple incidents, the AI recommends:

Observability: “Add alarm for Lambda cold starts”
Testing: “Add load testing to pipeline”
Code: “Implement retry logic for API calls”
Infrastructure: “Enable Multi-AZ for RDS”

Integrations

Works with your existing tools:

Observability: CloudWatch • Datadog • Dynatrace • New Relic • Splunk

CI/CD: GitHub • GitLab

Ticketing: ServiceNow • PagerDuty

Chat: Slack

Kubernetes: Amazon EKS

Custom: MCP servers for proprietary tools

Try It: 15-Minute Demo

A hands-on demo using Terraform for infrastructure and manual Agent Space setup through the AWS Console.

Prerequisites

AWS account with admin access
AWS CLI v2 + Terraform installed
Region: us-east-1

Quick Start

1. Clone & Deploy Infrastructure

git clone https://github.com/sprider/aws-devops-agent-demo.git
cd aws-devops-agent-demo
chmod +x lambda-test.sh
./lambda-test.sh deploy

This automatically creates:

Lambda function with intentional error
CloudWatch alarm

2. Create Agent Space (Manual - AWS Console)

The Agent Space must be created through the AWS Console to ensure proper Primary source configuration.

Open the AWS DevOps Agent Console
Click “Begin setup” or “Create Agent Space”
Configure:
- Name: TestAgentSpace (or your preferred name)
- Description: Test Agent Space for Lambda error investigation demo
Click “Create”

3. Configure Cloud Capabilities (Primary Source)

After Agent Space creation, configure AWS account access:

In your Agent Space, go to “Settings” → “Cloud capabilities”
Click “Add cloud capability”
Select “AWS”
Choose “Primary source” (not Secondary)
Configuration:
- Account ID: Your AWS account (from terraform output aws_account_id)
- IAM Role: Use “Auto-create role” option
Click “Add”

Note: The IAM roles required for the DevOps Agent are automatically created by AWS when you select “Auto-create role” - you do not need to create them manually. The Primary source configuration ensures the agent can properly access CloudWatch alarms, Lambda logs, and other AWS resources needed for investigations.

4. Generate Lambda Errors

./lambda-test.sh test
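If you would rather not use the script, the same effect can be approximated with boto3. This is a sketch, not the script's actual contents: the function name AWS-AIDevOps-test-lambda is inferred from the log group name used in the cleanup step, and the empty payload is an assumption. The client is passed in so the helper itself has no hard dependency on credentials.

```python
import json

FUNCTION_NAME = "AWS-AIDevOps-test-lambda"  # assumption: inferred from the log group name


def generate_errors(lambda_client, function_name=FUNCTION_NAME, count=3):
    """Invoke the function `count` times and return how many invocations failed."""
    failures = 0
    for _ in range(count):
        response = lambda_client.invoke(
            FunctionName=function_name,
            Payload=json.dumps({}).encode("utf-8"),
        )
        # A synchronous invoke includes "FunctionError" when the handler raised.
        if "FunctionError" in response:
            failures += 1
    return failures


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   client = boto3.client("lambda", region_name="us-east-1")
#   failed = generate_errors(client)
#   print(f"{failed}/3 invocations failed")
```

With the intentional division by zero in place, all three invocations should fail, which is where the 100% failure rate in the investigation comes from.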

5. Wait for Alarm to Trigger

After generating errors, wait 1-2 minutes for the CloudWatch alarm to evaluate and enter ALARM state:

./lambda-test.sh status

Wait until you see AlarmState: ALARM before proceeding to the next step.
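Instead of re-running the status command by hand, the wait can be scripted. A minimal polling sketch with boto3's CloudWatch describe_alarms call is shown below; the alarm name comes from the investigation step in this guide, while the timeout and interval values are arbitrary assumptions.

```python
import time

ALARM_NAME = "AWS-AIDevOps-Lambda-Error-Test"


def wait_for_alarm(cloudwatch, alarm_name=ALARM_NAME,
                   timeout_s=300, interval_s=15, sleep=time.sleep):
    """Poll until the alarm reaches ALARM. Return True, or False on timeout."""
    waited = 0
    while True:
        response = cloudwatch.describe_alarms(AlarmNames=[alarm_name])
        alarms = response.get("MetricAlarms", [])
        state = alarms[0]["StateValue"] if alarms else "NOT_FOUND"
        print(f"{alarm_name}: {state}")
        if state == "ALARM":
            return True
        if waited >= timeout_s:
            return False
        sleep(interval_s)
        waited += interval_s


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   cw = boto3.client("cloudwatch", region_name="us-east-1")
#   wait_for_alarm(cw)
```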

6. Start Investigation

In the AWS DevOps Agent Console, click on your Agent Space name (e.g., “TestAgentSpace”)
Click the “Incident Response” tab
In the “Start an investigation” text box, type: Lambda function throwing errors
Click “Start investigation” button
A modal will appear - fill in the investigation details:
- Investigation details: Keep “Lambda function throwing errors”
- Investigation starting point: CloudWatch alarm AWS-AIDevOps-Lambda-Error-Test
- Date and time of incident: Get current time with date -u +"%Y-%m-%dT%H:%M:%SZ"
Click “Start investigating…”
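If the date command is not available (for example on Windows), the same UTC timestamp format can be produced in Python. The helper name is illustrative only.

```python
from datetime import datetime, timezone


def incident_timestamp(moment=None):
    """Format a moment (default: now) as UTC in the form YYYY-MM-DDTHH:MM:SSZ."""
    moment = moment or datetime.now(timezone.utc)
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


print(incident_timestamp())
```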

7. Watch AI Work

Watch the investigation in real-time. The AI will:

Detect the alarm
Pull Lambda logs
Identify ZeroDivisionError
Correlate deployment time
Provide root cause
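To cross-check the agent's finding yourself, you can pull the matching log events directly. The sketch below uses the CloudWatch Logs filter_log_events call via boto3; the log group name is the one listed in the cleanup step, and the search term is simply the exception name. It illustrates a manual check, not how the agent itself reads logs.

```python
LOG_GROUP = "/aws/lambda/AWS-AIDevOps-test-lambda"


def find_error_events(logs_client, log_group=LOG_GROUP, pattern="ZeroDivisionError"):
    """Return (timestamp_ms, message) pairs for log events matching `pattern`."""
    matches = []
    kwargs = {"logGroupName": log_group, "filterPattern": pattern}
    while True:
        page = logs_client.filter_log_events(**kwargs)
        for event in page.get("events", []):
            matches.append((event["timestamp"], event["message"].strip()))
        token = page.get("nextToken")
        if not token:
            return matches
        kwargs["nextToken"] = token  # keep paging until no token is returned


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   logs = boto3.client("logs", region_name="us-east-1")
#   for ts, message in find_error_events(logs):
#       print(ts, message)
```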

Investigation time: The AI analysis completes in seconds; my full run took about 4 minutes

8. Cleanup Everything

Important: Manually delete the Agent Space and auto-created resources from the AWS Console first, then destroy the Terraform infrastructure.

Delete Agent Space:
- Go to AWS DevOps Agent Console
- Select your Agent Space
- Click “Actions” → “Delete Agent Space”
- Confirm deletion
- Note: This automatically removes the IAM roles created by the Agent Space
Delete Lambda Log Group:
- Go to CloudWatch Console → Log groups
- Find /aws/lambda/AWS-AIDevOps-test-lambda
- Select it and click “Actions” → “Delete log group(s)”
- Confirm deletion
Verify IAM Roles Cleanup (Optional):
- Go to IAM Console → Roles
- Search for roles created by the Agent Space (they usually have “DevOpsAgent” or “AIDevOps” in the name)
- These should be automatically deleted when the Agent Space is deleted
- If any remain, manually delete them
Then run: ./lambda-test.sh destroy
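The manual checks above can also be verified from a script before you run the destroy command. This read-only sketch uses boto3's IAM list_roles paginator and the CloudWatch Logs describe_log_groups call. The name fragments are the ones mentioned in the steps above. Note that a match on "AIDevOps" may also include roles managed by the demo's Terraform code, which is an assumption worth checking before deleting anything by hand.

```python
LOG_GROUP = "/aws/lambda/AWS-AIDevOps-test-lambda"
ROLE_HINTS = ("DevOpsAgent", "AIDevOps")  # name fragments from the steps above


def leftover_roles(iam_client, hints=ROLE_HINTS):
    """List IAM role names that contain any of the hint fragments."""
    names = []
    for page in iam_client.get_paginator("list_roles").paginate():
        for role in page["Roles"]:
            if any(hint in role["RoleName"] for hint in hints):
                names.append(role["RoleName"])
    return names


def log_group_exists(logs_client, name=LOG_GROUP):
    """Return True if a log group with exactly this name still exists."""
    response = logs_client.describe_log_groups(logGroupNamePrefix=name)
    return any(g["logGroupName"] == name for g in response.get("logGroups", []))


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   print(leftover_roles(boto3.client("iam")))
#   print(log_group_exists(boto3.client("logs", region_name="us-east-1")))
```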

All Available Commands

./lambda-test.sh deploy    # Deploy Lambda and CloudWatch alarm
./lambda-test.sh test      # Generate Lambda errors (invoke 3 times)
./lambda-test.sh status    # Check CloudWatch alarm status
./lambda-test.sh logs      # View Lambda function logs
./lambda-test.sh destroy   # Destroy all infrastructure

Cost

$0.00 - The DevOps Agent is free during preview, and the demo's Lambda invocations and CloudWatch alarm are covered by the AWS Free Tier

Troubleshooting

Issue: “AWS account is not accessible” or “Monitor Association not found”

Error message in investigation:

Unable to investigate the Lambda function errors because AWS account XXX
is not accessible. The error 'Monitor Association with AgentSpace agentSpaceId
XXX not found' indicates this account is not associated with the monitoring system.

Root cause: Your AWS account is not configured as a Primary source in Cloud Capabilities.

Solution:

Open your Agent Space in AWS Console
Go to Settings → Cloud capabilities
Check if your AWS account is listed under “Primary sources”
If not listed or listed under “Secondary sources”:
- Click “Add cloud capability”
- Select “AWS”
- CRITICAL: Choose “Primary source” (NOT Secondary)
- Enter your AWS account ID (from terraform output aws_account_id)
- Use “Auto-create role” option
- Click “Add”
Verify your account now appears under “Primary sources”
Try the investigation again

Why this matters: Only Primary sources give the AI agent full access to CloudWatch alarms, Lambda logs, and other AWS resources needed for investigations.
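One easy mistake here is entering an account ID that differs from the account your credentials actually point to. A small sketch using the STS get_caller_identity call via boto3 can confirm the match; the helper names are illustrative, and the configured ID is whatever you typed into Cloud capabilities.

```python
def current_account_id(sts_client):
    """Return the account ID of the credentials currently in use."""
    return sts_client.get_caller_identity()["Account"]


def matches_configured_account(sts_client, configured_account_id):
    """Compare the caller's account with the ID entered in Cloud capabilities."""
    return current_account_id(sts_client) == configured_account_id.strip()


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   sts = boto3.client("sts")
#   print(current_account_id(sts))
```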

Key Facts

What It Is

AI layer that connects your existing tools
Not a monitoring tool replacement
Reduces investigation time by 80-90%

Limitations (Preview)

Region: us-east-1 only
Quotas: 20 investigation hours/month, 10 prevention hours/month
Pricing: Free now, pricing TBD at GA

Security

Read-only permissions by default
IAM-based access control
Agent Space isolation
AWS IAM Identity Center support

Common Questions

Q: Does it replace my observability tools? A: No. It sits on top of them, connecting data across tools.

Q: What if the AI is wrong? A: You are in control. Ask follow-up questions, steer investigations, or escalate to AWS Support.

Q: How secure is it? A: It is read-only by default and IAM-controlled, and data stays in your account.

Q: Works with non-AWS tools? A: Yes. Integrates with Datadog, Dynatrace, New Relic, Splunk, GitHub, GitLab, ServiceNow, Slack.

Next Steps

After testing:

Connect production - Create Agent Space for real environment
Enable auto-triggers - Set up ServiceNow/PagerDuty webhooks
Review recommendations - Implement prevention suggestions
Expand scope - Connect multiple AWS accounts

Files in This Repo

aws-devops-agent-demo/
├── README.md                 # This guide
├── lambda-test.tf            # Terraform: Lambda and CloudWatch alarm
├── lambda_test.py            # Test Lambda function (division by zero)
├── lambda-test.sh            # Automation script for deployment
├── .gitignore                # Git ignore file
└── screenshots/              # Step-by-step screenshots of the demo
    ├── 01-terraform-deploy.png
    ├── 02-terraform-output.png
    ├── 03-devops-agent-console.png
    ├── 04-create-agent-space.png
    ├── 05-cloud-capabilities.png
    ├── 06-lambda-errors-generated.png
    ├── 07-cloudwatch-alarm-triggered.png
    ├── 08-incident-response-dashboard.png
    ├── 10-investigation-details-modal.png
    ├── 11-investigation-in-progress.png
    ├── 12-investigation-completed.png
    ├── 13-investigation-summary.png
    ├── 14-mitigation-plan.png
    └── 15-terraform-destroy.png

What is Automated vs Manual?

Automated via Terraform:

Lambda function with intentional error
CloudWatch alarm monitoring

Manual via AWS Console:

Agent Space creation
Cloud Capabilities configuration (Primary source setup + IAM role auto-creation)
Agent Space deletion (which automatically removes auto-created IAM roles)

Why Manual? The Agent Space requires Primary source configuration through the console to ensure the AI agent can properly access AWS resources during investigations. The AWS CLI cannot currently configure this correctly. When you delete the Agent Space, AWS automatically cleans up the auto-created IAM roles.

About This Article

This article and accompanying automation scripts were developed with assistance from Claude Code (Anthropic). All code has been tested in my personal AWS environment and verified against the official AWS DevOps Agent User Guide.

Published on: December 05, 2025

Home | Joseph Velliah