Moneo: Distributed GPU System Monitoring for AI Workflows

Microsoft has introduced a new open-source GPU monitoring framework named Moneo (Latin for monitor). Moneo orchestrates metric collection (DCGMI + Prometheus DB) and visualization (Grafana) across multi-GPU, multi-node systems. This gives insight into how both the workflow and the underlying system behave.

 

GPUs are optimized for high-throughput, massively parallel workloads. Efficient use of GPUs depends on several factors, such as workload characteristics, system characteristics, and sometimes the physical environment. System-level GPU monitoring measures GPU utilization and facilitates workload characterization, which helps in exploring how to use the GPU more efficiently.

 

Monitoring GPU metrics on a single system over a period of time may be trivial. However, formatting and analyzing the raw metric data so that it provides intuitive insights can be tedious. Our goal is to pair system-level characterization with application metrics to determine how efficiently we are using the GPUs.

 

Since collecting and analyzing GPU metrics is already somewhat complex on a single system, it is unsurprising that scaling the same methodology to multiple systems is difficult.

 

For certain deep learning models, using distributed multi-GPU systems is the only feasible way to train in a reasonable time frame. Much of the application complexity is abstracted away by high-level AI frameworks, but users must still make configuration and design choices that ultimately affect the throughput and behavior of model training.

 

The system-level insights that Moneo provides can help guide these design choices toward efficient use of GPU systems.

 

Moneo Design


Figure 1: Design

Moneo monitors three categories of metrics:

  1. Device Counters
    1. Compute/Memory Utilization
    2. Streaming multiprocessor (SM) and Memory Clock frequency
    3. Temperature
    4. Power
    5. ECC Counts
  2. Profiling Counters
    1. SM Activity
    2. Memory DRAM Activity
    3. NVLink Activity
    4. PCIe Rate
  3. InfiniBand Network Counters
    1. IB TX/RX rate

Once Moneo has been launched, these metrics can be viewed from the Grafana portal. See Figures 2, 3, and 4 for snapshots of the different metric views.


Figure 2: Device Counter View

 


Figure 3: Profiling Counter View

 


Figure 4: IB Counter View
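
Because the collected metrics are stored in a Prometheus database, they can in principle also be read outside Grafana through the Prometheus HTTP query API. The sketch below only builds the request URL and does not contact any server. It assumes Prometheus listens on its default port 9090 on the master node, and the metric name dcgm_gpu_utilization is a hypothetical placeholder; check which names your deployment actually exports.

```python
from urllib.parse import urlencode


def build_query_url(master_host, promql, port=9090):
    """Build an instant-query URL for the Prometheus HTTP API.

    Assumption: Prometheus is reachable on the master node at `port`
    (9090 is the Prometheus default; your deployment may differ).
    """
    return f"http://{master_host}:{port}/api/v1/query?{urlencode({'query': promql})}"


def avg_by_instance(metric):
    """Wrap a metric name in a PromQL average grouped by scrape instance."""
    return f"avg by (instance) ({metric})"


if __name__ == "__main__":
    # "dcgm_gpu_utilization" is a hypothetical metric name used for illustration.
    print(build_query_url("master-ip-or-domain", avg_by_instance("dcgm_gpu_utilization")))
```

Opening the printed URL in a browser, or fetching it with any HTTP client, should return a JSON document if the assumptions about host, port, and metric name hold for your setup.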

 

 

Getting Started

Getting started with Moneo is easy. Clone the latest release from the Moneo repo, then either follow the README for detailed setup instructions or use the quick start guide. In a short time, you should be able to launch Moneo with a single command and log into the Grafana portal to start seeing results.

 

Moneo is also available on Azure HPC + AI Ubuntu images. Just navigate to “/opt/azurehpc/tools/Moneo”. The image has all the required dependencies installed, so all that remains is to configure and deploy Moneo.

 

Quick start instructions:

  1. Clone Moneo from GitHub and install Ansible.
    1. git clone https://github.com/Azure/Moneo.git
    2. cd Moneo
    3. python3 -m pip install ansible
  2. Next, create a host.ini config file.

    • Note: The master node can also be a worker node. The Grafana and Prometheus Docker containers are deployed to the master node.

    • Note: If you have already configured passwordless SSH, the [all:vars] section can be skipped.

    • Note: The master node must be able to SSH into itself.

  3. Now deploy Moneo.
    • ansible-playbook -i host.ini src/ansible/deploy.yaml
  4. Log into the portal by navigating to http://master-ip-or-domain:3000 and entering your credentials.
    • Note: By default, the username and password are both set to "azure". This can be changed in "src/master/grafana/grafana.env".
  5. Navigate the Moneo Grafana portal.
    1. The current view is labeled in the top left corner.
    2. The VM instance and GPU can be selected from the drop-down menus in the top left corner.
    3. Various actions, such as dashboard selection or data source configuration, are available from the menu on the left side of the screen.
    4. Metric groups are collapsible.
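
For the host.ini step, an inventory along the lines described in the notes could be generated as sketched below. This is a sketch under assumptions, not the project's own tooling: the group names [master] and [worker] are inferred from the notes above rather than confirmed, and the IP addresses, user name, and key path are placeholders. ansible_user and ansible_ssh_private_key_file are standard Ansible inventory variables. The helper names render_host_ini and deploy_command are hypothetical.

```python
def render_host_ini(master, workers, ssh_user=None, ssh_key_file=None):
    """Render an Ansible INI inventory as a string.

    Assumption: the inventory uses [master] and [worker] groups.
    The [all:vars] section is emitted only when SSH settings are given,
    mirroring the note that it can be skipped with passwordless SSH.
    """
    lines = ["[master]", master, "", "[worker]", *workers]
    if ssh_user or ssh_key_file:
        lines += ["", "[all:vars]"]
        if ssh_user:
            lines.append(f"ansible_user={ssh_user}")
        if ssh_key_file:
            lines.append(f"ansible_ssh_private_key_file={ssh_key_file}")
    return "\n".join(lines) + "\n"


def deploy_command(inventory="host.ini"):
    """Return the deploy command from the quick start as an argument list."""
    return ["ansible-playbook", "-i", inventory, "src/ansible/deploy.yaml"]


if __name__ == "__main__":
    # Placeholder addresses; the master is also listed as a worker here.
    print(render_host_ini("10.0.0.1", ["10.0.0.1", "10.0.0.2"], ssh_user="azureuser"), end="")
    print(" ".join(deploy_command()))
```

Writing the rendered string to host.ini in the Moneo directory and then running the printed command corresponds to the create-inventory and deploy steps of the quick start.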

 

 

 

 

 

 

 
