
Use cases of Advanced Network Observability for your Azure Kubernetes Service clusters

Introduction 

Advanced Network Observability is the inaugural feature of the Advanced Container Networking Services (ACNS) suite, bringing the power of Hubble’s control plane to both Cilium and non-Cilium Linux data planes. It unlocks Hubble metrics, the Hubble command line interface (CLI), and the Hubble user interface (UI) on your AKS clusters, providing deep insights into your containerized workloads. Advanced Network Observability helps customers precisely detect and root-cause network-related issues in a Kubernetes cluster.

VamsiKalapala_0-1721421305197.png

 

Prerequisites 

This blog focuses on an Azure Kubernetes Service (AKS) cluster with ACNS, Azure Managed Prometheus, and Grafana enabled.

VamsiKalapala_1-1721421305200.png

Before setting up AKS, ensure that you have an Azure account and subscription with permissions that allow you to create resource groups and deploy AKS clusters. Follow the instructions in this guide to set up an AKS cluster and run the scenarios below.

High level steps: 

  1. Create an AKS cluster.
  2. Enable Advanced Container Networking Services on this cluster.
  3. Create and attach Azure Managed Prometheus and Grafana.
  4. Install the Hubble CLI on your local machine following these instructions.
  5. Deploy the Hubble UI component on this cluster following these instructions.

 

Concepts

 

Cilium: Cilium is an open source, cloud native solution for providing, securing, and observing network connectivity between workloads, powered by the Linux kernel technology eBPF.

Hubble: Hubble is a fully distributed networking and security observability platform. It is built on top of Cilium and eBPF to enable deep visibility into the communication and behavior of services as well as the networking infrastructure in a completely transparent manner. 

Retina: Retina is a cloud-agnostic, open-source, eBPF-based Kubernetes network observability platform. It is the technology behind Advanced Network Observability on non-Cilium Linux nodes.

 

Customer Scenario 1: Domain Name System (DNS) intermittent failures

Ruling out DNS issues is the first step in troubleshooting any major network problem. Detailed, pod-level visibility into DNS requests and responses enables faster incident resolution and cloud cost optimization. With Advanced Network Observability, customers can view requests and responses by type and fully qualified domain name (FQDN), and they can also see the error codes returned for requests, the IP addresses returned in the response to a given request, and much more.

Retina uses eBPF programs to examine every DNS request and response packet in the Linux kernel and pass the packet and its metadata to the user space program. Here, the metadata is further processed to extract returned IPs in DNS response packets. All this metadata is used to produce relevant metrics that show status both at node level and pod level.  
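
To make the aggregation step more concrete, here is a minimal Python sketch of how per-packet DNS metadata could be rolled up into pod-level counters. It is not Retina's implementation: the `DnsEvent` record, the counter names, and the sample data are all hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class DnsEvent:
    """One DNS packet as a user-space collector might see it (hypothetical shape)."""
    pod: str
    query_id: int
    fqdn: str
    is_response: bool
    rcode: Optional[str] = None   # e.g. "NOERROR" or "NXDOMAIN"; responses only
    ips: Tuple[str, ...] = ()     # addresses returned; responses only


def summarize(events):
    """Aggregate DNS events into per-pod request, response, and error counters."""
    stats = defaultdict(Counter)
    pending = set()  # (pod, query_id) pairs still waiting for a response
    for ev in events:
        key = (ev.pod, ev.query_id)
        if ev.is_response:
            stats[ev.pod]["responses"] += 1
            if ev.rcode != "NOERROR":
                stats[ev.pod][f"error_{ev.rcode}"] += 1
            pending.discard(key)
        else:
            stats[ev.pod]["requests"] += 1
            pending.add(key)
    # Anything still pending never received a response.
    for pod, _query_id in pending:
        stats[pod]["missing_responses"] += 1
    return {pod: dict(counter) for pod, counter in stats.items()}


# Illustrative sample data (192.0.2.0/24 is a documentation-only range).
events = [
    DnsEvent("web-1", 1, "example.com.", False),
    DnsEvent("web-1", 1, "example.com.", True, "NOERROR", ("192.0.2.10",)),
    DnsEvent("web-1", 2, "missing.example.com.", False),
    DnsEvent("web-1", 2, "missing.example.com.", True, "NXDOMAIN"),
    DnsEvent("web-2", 7, "example.org.", False),  # never answered
]
summary = summarize(events)
print(summary)
```

In this sketch, a request that is never matched by a response ends up in `missing_responses`, which is the kind of signal a "missing DNS responses" panel would surface per pod.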

As an example of how these metrics help, consider DNS latency, errors, and timeouts, which are hard to troubleshoot and can cause severe application issues. Our dashboards make it easier for DevOps engineers to detect and fix DNS problems. The dashboard panel below shows a sudden rise in missing DNS responses within the cluster, the most common DNS errors, and which nodes have the most errors.

VamsiKalapala_2-1721421305203.png

 

The dashboard shows a summary of all DNS activity in the cluster: which kinds of queries lack responses, the most common query, and the most common response. This information helps administrators spot potential usage and security problems early and act to reduce them.

VamsiKalapala_3-1721421305211.png

Customer Scenario 2: Network Policy Drops at Pod level  

Debugging network policies in large, intricate clusters with multiple namespaces can be a daunting task, especially when there are numerous network policies per namespace. To address this challenge, the network policy addon leverages eBPF in Linux to collect crucial information about dropped packets. By attaching kprobes at various critical locations in the Linux kernel, such as the netfilter drop function and the netfilter nat function, the network policy addon effectively determines if a packet is being dropped.  

When a dropped packet is detected, the associated eBPF programs generate an event that includes packet metadata, along with the drop reason and location. This event is then processed by a userspace program, which parses the data and converts it into Prometheus metrics. These metrics offer valuable insights into the dropped packets, aiding in the identification and resolution of network policy configuration issues.  
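
As a rough illustration of that last step, the sketch below turns a list of drop events into counter lines in the Prometheus text exposition format. It is a simplified stand-in, not the add-on's actual code: the event fields, the metric name `example_dropped_packets_total`, and the label names are assumptions made for this example.

```python
from collections import Counter


def drops_to_prometheus(events, metric="example_dropped_packets_total"):
    """Count drop events per (pod, reason, direction) and render them
    as Prometheus text-format counter samples."""
    counts = Counter((e["pod"], e["reason"], e["direction"]) for e in events)
    lines = [f"# TYPE {metric} counter"]
    for (pod, reason, direction), n in sorted(counts.items()):
        labels = f'direction="{direction}",pod="{pod}",reason="{reason}"'
        lines.append(f"{metric}{{{labels}}} {n}")
    return "\n".join(lines)


# Hypothetical drop events as a user-space program might receive them.
drop_events = [
    {"pod": "agnhost-a", "reason": "policy_denied", "direction": "egress"},
    {"pod": "agnhost-a", "reason": "policy_denied", "direction": "egress"},
    {"pod": "agnhost-b", "reason": "policy_denied", "direction": "ingress"},
]
exposition = drops_to_prometheus(drop_events)
print(exposition)
```

Keeping the drop reason and pod as labels is what lets a dashboard group drops by cause (for example `policy_denied`) and by the pod affected.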

Let’s walk through an example and see how pod-level metrics and flows can help debug packet drops in a cluster. Below is a snapshot of a workload running in an AKS cluster. The panel shows a heatmap of the pods running as part of a deployment, and the number of packets originating from those pods that are being dropped. This panel is very useful because it lets administrators know in real time that there is a problem and which pod is impacted. The panel also immediately shows the reason for the drop, “policy_denied”, indicating that the drops are caused by a network policy applied in the cluster.

 

VamsiKalapala_4-1721421305215.png

 

To dig deeper, we can use the Hubble CLI to inspect flows in real time. The snapshot below shows how we can filter traffic by namespace and type. The Hubble CLI shows the source and destination pods of the packets being dropped, helping us narrow down the offending policy even further.

VamsiKalapala_5-1721421305230.png
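
The same kind of filtering can be scripted once flows are available as JSON. The sketch below is only an illustration under stated assumptions: the record layout (a `flow` object with `verdict`, `source`, and `destination` fields) is modeled on Hubble's JSON flow output but should be checked against your Hubble version, and the sample records and the `dropped_pairs` helper are hypothetical.

```python
import json

# Illustrative JSON-lines input; the field layout is an assumption.
SAMPLE_FLOWS = """\
{"flow": {"verdict": "DROPPED", "source": {"namespace": "agnhost", "pod_name": "agnhost-a"}, "destination": {"namespace": "agnhost", "pod_name": "agnhost-b"}}}
{"flow": {"verdict": "FORWARDED", "source": {"namespace": "agnhost", "pod_name": "agnhost-a"}, "destination": {"namespace": "kube-system", "pod_name": "coredns-0"}}}
{"flow": {"verdict": "DROPPED", "source": {"namespace": "other", "pod_name": "client-0"}, "destination": {"namespace": "other", "pod_name": "server-0"}}}
"""


def endpoint_name(endpoint):
    """Render an endpoint as 'namespace/pod', tolerating missing fields."""
    return f"{endpoint.get('namespace', '?')}/{endpoint.get('pod_name', '?')}"


def dropped_pairs(lines, namespace):
    """Return (source, destination) pairs for dropped flows touching a namespace."""
    pairs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        flow = json.loads(line).get("flow", {})
        if flow.get("verdict") != "DROPPED":
            continue
        src = flow.get("source", {})
        dst = flow.get("destination", {})
        if namespace not in (src.get("namespace"), dst.get("namespace")):
            continue
        pairs.append((endpoint_name(src), endpoint_name(dst)))
    return pairs


pairs = dropped_pairs(SAMPLE_FLOWS.splitlines(), "agnhost")
print(pairs)
```

Collecting the source and destination of each dropped flow gives a short list of pod pairs to compare against the network policies in that namespace.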

Another tool you can use is the Hubble UI, which shows the traffic flows for a namespace. Below, we see the pods in the agnhost namespace interacting with other pods in the same namespace as well as with pods in different namespaces. They are also receiving packets from outside the cluster. The UI shows which packets are being dropped, and the details include source and destination pod names as well as pod and namespace labels. Using this information, we can dig through the network policies applied in the cluster and quickly identify the offending policy.

 

VamsiKalapala_6-1721421305237.png

 

Customer Scenario 3: Imbalance of traffic for pods within a workload  

Pods fronted by a service expect an even distribution of traffic when requests reach the service. However, that may not always be the case. Faulty settings can introduce subtle distribution bugs, which may only become apparent when application performance degrades even after the workload is scaled up.

Retina deploys eBPF programs that attach themselves at various interfaces in the Linux kernel and observe all TCP/UDP packets flowing through the node. This allows Retina to generate rich pod-level L4 metrics that can show, among other things, the traffic distribution among all the pods under a workload (a deployment, for example).

The panel below shows a heatmap of the incoming and outgoing traffic of pods under a workload. As is evident, one of the three pods is receiving a higher volume of traffic than the other two. Administrators can be proactive and mitigate this issue before application performance degrades and impacts end users.

VamsiKalapala_7-1721421305241.png
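
A check like the one an administrator performs by eye on the heatmap can also be expressed in a few lines. The sketch below assumes per-pod byte counts have already been read from the pod-level metrics; the function names, the `tolerance` threshold of 1.5 times the even share, and the sample numbers are all hypothetical.

```python
def traffic_shares(bytes_by_pod):
    """Return each pod's fraction of the total traffic."""
    total = sum(bytes_by_pod.values())
    if total == 0:
        return {pod: 0.0 for pod in bytes_by_pod}
    return {pod: count / total for pod, count in bytes_by_pod.items()}


def overloaded_pods(bytes_by_pod, tolerance=1.5):
    """List pods whose share exceeds `tolerance` times the even share (1/N)."""
    if not bytes_by_pod:
        return []
    even_share = 1 / len(bytes_by_pod)
    shares = traffic_shares(bytes_by_pod)
    return sorted(pod for pod, share in shares.items() if share > tolerance * even_share)


# Illustrative numbers: one of three pods receives most of the traffic.
sample = {"web-1": 600, "web-2": 200, "web-3": 200}
print(traffic_shares(sample))
print(overloaded_pods(sample))
```

Comparing each pod's share with the even share keeps the check independent of absolute traffic volume, so the same threshold can be reused as the workload scales.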

 

Conclusion

ACNS with Advanced Network Observability enables deep insights into container networks and enhances the operability of AKS. This blog has explored its capabilities through real-world customer scenarios, showing how it helps tackle common network challenges. We’d also love to hear how enhanced observability can make your deployment scenarios easier, so please leave a comment below.

 
