Episode 504 - Azure Reliability SRE
Sadaf Khan joins Evan and Russell to discuss Service Reliability Engineering (SRE) in the Azure engineering group.

Media file: https://azpodcast.blob.core.windows.net/episodes/Episode504.mp3
YouTube: https://www.youtube.com/watch?v=QNGdTnb1W90&t=1684s
- Public Preview: Customer managed planned failover for Azure Storage
- Public Preview: Instance Mix on Virtual Machine Scale Sets
- Generally Available: Workspaces in Azure API Management
- Generally Available: Azure NetApp Files storage with cool access for all service levels
- Generally Available: Larger Enterprise tier cache instances for Azure Cache for Redis
- Generally Available: Azure Red Hat OpenShift Now Supports Clusters Up to 250 Nodes
Key Topics:
- Azure Reliability SRE: Evan introduced the episode's focus on Azure reliability SRE and mentioned a special guest, Sadaf, who would provide insights on the topic. 0:19
- Azure Storage Public Preview Feature: Russell discussed a new public preview feature for Azure Storage that lets customers manage planned failovers themselves, enhancing the service's reliability. 1:10
- Virtual Machine Scale Sets Update: Russell highlighted Instance Mix, a public preview update to Virtual Machine Scale Sets that allows mixing different VM sizes in one scale set, improving flexibility and scalability. 1:38
- Azure API Management Workspaces: Russell covered the general availability of workspaces in Azure API Management, which give teams more autonomy in managing and publishing APIs. 2:08
- Azure NetApp Files Storage Update: Russell mentioned the general availability of cool access for Azure NetApp Files storage across all service levels, allowing for more cost-effective data storage based on access patterns. 2:40
- Azure Cache for Redis Update: Russell discussed the general availability of larger cache instances in the Enterprise tier of Azure Cache for Redis, offering increased memory and compute capacity. 3:02
- Azure Red Hat OpenShift Update: Russell shared an update on Azure Red Hat OpenShift, which now supports clusters of up to 250 nodes, significantly increasing scalability. 3:29
- SRE Role and Impact: Sadaf explained the role of SRE in improving service reliability and quality, detailing their engagement model with various Azure services. 4:52
- SRE Engagement and Resistance: Sadaf shared insights on the initial resistance faced from service teams during SRE engagements and how trust is built over time to allow for more impactful changes. 7:49
- SRE's Approach to Service Improvement: Sadaf outlined the SRE team's structured approach to service improvement, focusing on fundamentals, service health, operational efficiency, and scalability. 10:51
- AI Initiatives in SRE: Sadaf discussed how the SRE team is using AI to analyze incident data and generate insights, aiming to reduce the cognitive load on engineers. 30:27