
GPU Monitoring using Azure Monitor


Overview

Today, many highly parallel HPC/AI applications use GPUs to improve run-time performance.

It is important to be able to monitor GPU metrics (e.g. GPU and memory utilization, tensor core activity, GPU temperature, etc.) to determine whether the GPUs are being used efficiently and to predict reliability issues.

Azure Monitor is an Azure service that provides a platform to ingest, analyze, query and monitor many types of data. The primary advantage of using Azure Monitor is simplicity: you do not need to deploy any additional resources or install any extra software to monitor your data.

Here we give an example of how to use Azure Monitor to collect various ND96asr_v4 (A100, Ubuntu-HPC 18.04) GPU metrics. Please note that the GPU monitoring procedures outlined in this post are very portable and can be used with other Azure GPU types (e.g. NDv2 and NC series).

 

Which GPU metrics to use?

NVIDIA Data Center GPU Manager (DCGM) is a framework that exposes several low-level GPU counters and metrics to give insight into the performance and health of the GPUs. In this example we will be monitoring counters/metrics provided by the dcgmi dmon feature. Each DCGM metric/counter is accessed by a specific field id. To see all available field ids:

 

dcgmi dmon -l
___________________________________________________________________________________
Long Name                       Short Name      Field Id
___________________________________________________________________________________
driver_version                  DRVER           1
nvml_version                    NVVER           2
process_name                    PRNAM           3
device_count                    DVCNT           4
cuda_driver_version             CDVER           5
name                            DVNAM           50
brand                           DVBRN           51
nvml_index                      NVIDX           52
serial_number                   SRNUM           53
uuid                            UUID#           54
minor_number                    MNNUM           55
oem_inforom_version             OEMVR           56
pci_busid                       PCBID           57
pci_combined_id                 PCCID           58
pci_subsys_id                   PCSID           59
etc

 

Note: The DCGM stand-alone executable dcgmi is pre-installed on the Ubuntu-HPC marketplace images.

 

Some useful DCGM field Ids

Field Id   GPU Metric
150        temperature (in C)
203        GPU utilization (0-100)
252        memory used (0-100)
1004       tensor core active (0-1)
1006       fp64 unit active (0-1)
1007       fp32 unit active (0-1)
1008       fp16 unit active (0-1)
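To give a feel for what the collector has to do with these fields, here is a minimal sketch of parsing one line of dcgmi dmon output into a Python dict. The column layout (entity label, GPU id, then one value per requested field) is an assumption based on typical dmon output for `-e 203,252,1004`; adjust the field-name mapping to match the field ids you actually request.

```python
# Hypothetical field names for dmon fields 203, 252 and 1004 (an assumption,
# not taken from the collector script itself).
FIELD_NAMES = ["gpu_utilization", "memory_used", "tensor_core_active"]

def parse_dmon_line(line):
    """Parse one dmon data line such as 'GPU 0  97  61  0.432' into a dict.

    Returns None for header/comment lines that do not start with 'GPU'.
    """
    parts = line.split()
    if not parts or parts[0] != "GPU":
        return None  # skip header and comment lines
    gpu_id = int(parts[1])
    values = [float(v) for v in parts[2:2 + len(FIELD_NAMES)]]
    record = {"gpu_id": gpu_id}
    record.update(zip(FIELD_NAMES, values))
    return record

sample = "GPU 0  97  61  0.432"
print(parse_dmon_line(sample))
```

A record in this shape maps naturally onto the JSON payload that is later sent to the Log Analytics workspace.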

 

 

How to create a custom GPU Azure Monitor collector

 

The Python script gpu_data_collector.py shows how to connect to your Log Analytics workspace, collect various DCGM dmon metrics (by selecting the field ids of interest) and send them to the workspace at a specified time interval.

 

./gpu_data_collector.py -h
usage: gpu_data_collector.py [-h] [-dfi DCGM_FIELD_IDS] [-nle NAME_LOG_EVENT]
                             [-fgm] [-uc] [-tis TIME_INTERVAL_SECONDS]

optional arguments:
  -h, --help            show this help message and exit
  -dfi DCGM_FIELD_IDS, --dcgm_field_ids DCGM_FIELD_IDS
                        Select the DCGM field ids you would like to monitor
                        (if multiple field ids are desired then separate by
                        commas) [string] (default: 203,252,1004)
  -nle NAME_LOG_EVENT, --name_log_event NAME_LOG_EVENT
                        Select a name for the log events you want to monitor
                        (default: MyGPUMonitor)
  -fgm, --force_gpu_monitoring
                        Forces data to be sent to log analytics WS even if no
                        SLURM job is running on the node (default: False)
  -uc, --use_crontab    This script will be started by the system crontab and
                        the time interval between each data collection will be
                        decided by the system crontab (if crontab is selected
                        then the -tis argument will be ignored). (default: False)
  -tis TIME_INTERVAL_SECONDS, --time_interval_seconds TIME_INTERVAL_SECONDS
                        The time interval in seconds between each data
                        collection (This option cannot be used with the -uc
                        argument) (default: 30)

 

Note: This script also collects the SLURM job id and the physical hostname (i.e. the physical host on which the VM is running). By default, data is only sent to the Log Analytics workspace if a SLURM job is running on the node (this can be overridden with the -fgm option).

 

This script can be started via a Linux crontab (see the -uc argument), stand-alone, or as a Linux systemd service.
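For the crontab route, a hypothetical entry might look like the following (the install path and log file location are assumptions; adjust them for your system):

```shell
# Collect GPU metrics every minute; -uc tells the script that crontab
# controls the collection interval, so -tis is ignored.
* * * * * /opt/gpu_monitor/gpu_data_collector.py -uc -dfi 203,252,1004 >> /var/log/gpu_monitor.log 2>&1
```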

To connect to the Log Analytics workspace, customer_id and shared_key need to be defined. (The customer ID (i.e. workspace ID) and shared key (primary or secondary key) can be found in the Azure portal --> Log Analytics workspace --> Agents management.)

You can either define customer_id and shared_key in the script or set them via environment variables.

 

export LOG_ANALYTICS_CUSTOMER_ID=<log_analytics_customer_id>
export LOG_ANALYTICS_SHARED_KEY=<log_analytics_shared_key>

 

 

Note: if customer_id or shared_key is defined in the script, then the LOG_ANALYTICS_CUSTOMER_ID and LOG_ANALYTICS_SHARED_KEY environment variables will be ignored.
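Under the hood, sending data to a workspace this way uses the Log Analytics HTTP Data Collector API, which requires a SharedKey authorization header built from the customer ID and shared key. The sketch below shows how such a header can be constructed (this is a minimal illustration of the API's signing scheme, not necessarily how gpu_data_collector.py implements it), falling back to the environment variables described above:

```python
import base64
import hashlib
import hmac
import os

def build_signature(customer_id, shared_key, date_rfc1123, content_length):
    """Build the SharedKey Authorization header for the HTTP Data Collector API.

    The string-to-sign covers the POST verb, body length, content type,
    the x-ms-date header and the /api/logs resource path.
    """
    string_to_sign = (
        f"POST\n{content_length}\napplication/json\n"
        f"x-ms-date:{date_rfc1123}\n/api/logs"
    )
    decoded_key = base64.b64decode(shared_key)  # shared key is base64-encoded
    digest = hmac.new(decoded_key, string_to_sign.encode("utf-8"),
                      hashlib.sha256).digest()
    signature = base64.b64encode(digest).decode("utf-8")
    return f"SharedKey {customer_id}:{signature}"

# Placeholder fallbacks so the sketch runs without a real workspace.
customer_id = os.environ.get("LOG_ANALYTICS_CUSTOMER_ID", "<workspace-id>")
shared_key = os.environ.get("LOG_ANALYTICS_SHARED_KEY",
                            base64.b64encode(b"dummy-key").decode())
print(build_signature(customer_id, shared_key,
                      "Mon, 07 Feb 2022 00:00:00 GMT", 128))
```

The resulting header is sent with the JSON payload to https://{customer_id}.ods.opinsights.azure.com/api/logs, together with a Log-Type header carrying the event name (MyGPUMonitor by default).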

 

Create GPU Monitor dashboard (with Azure Monitor)

You can go to the Log Analytics workspace you created in Azure and use Kusto queries to create the GPU metric charts you are interested in.

Here is a query that returns the average GPU utilization of a particular SLURM job running on a virtual machine with GPUs.

 

MyGPUMonitor_CL
| where gpu_id_d in (0,1,2,3,4,5,6,7) and slurm_jobid_d == 17
| summarize avg(gpu_utilization_d) by bin(TimeGenerated, 5m)
| render timechart
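A similar query charts per-GPU tensor-core activity for the same job (assuming the collector writes the field 1004 value to a column named tensor_core_active_d, following the same _d suffix convention):

```
MyGPUMonitor_CL
| where slurm_jobid_d == 17
| summarize avg(tensor_core_active_d) by gpu_id_d, bin(TimeGenerated, 5m)
| render timechart
```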

 

 

You can then pin these charts to your Azure dashboard to build a consolidated GPU monitoring dashboard.

 

Conclusion

 

It's important to provide GPU monitoring to gain insight into how effectively your application is using the GPUs.

Azure Monitor has some powerful monitoring capabilities and lets you provide GPU monitoring without having to deploy additional resources or install extra software. An example Python client is provided that collects and sends GPU metrics to Azure Monitor, which can then be used to create a custom GPU monitoring dashboard.

 

 

 
