Moneo: Distributed GPU System Monitoring for AI Workflows
Microsoft has introduced a new open-source GPU monitoring framework named Moneo (Latin for monitor). Moneo orchestrates metric collection (DCGMI + Prometheus DB) and visualization (Grafana) across multi-GPU, multi-node systems. This provides useful insight into workflow-level and system-level characteristics.
GPUs are optimized for high-throughput, massively parallel workloads. Efficient use of GPUs depends on several factors, such as workload characteristics, system characteristics, and sometimes the physical environment. System-level GPU monitoring measures GPU utilization and facilitates workload characterization, which helps in exploring how to use the GPU more efficiently.
Monitoring GPU metrics on a single system over a period of time may be trivial. However, formatting and analyzing the raw metric data so that it provides intuitive insights can be tedious. Our goal is to pair system-level characterization with application metrics to determine how efficiently we are using the GPUs.
Because collecting and analyzing GPU metrics is already complex on a single system, scaling the same methodology to multiple systems is, unsurprisingly, difficult.
For certain deep learning models, distributed multi-GPU systems are the only feasible way to train in a reasonable time frame. Much of the application complexity is abstracted away by high-level AI frameworks, but users must still make configuration and design choices that ultimately affect the throughput and behavior of model training.
The system-level insights that Moneo provides can help guide design choices toward efficient use of GPU systems.
Moneo Design
Figure 1: Design
Moneo monitors three categories of metrics:
- Device Counters
  - Compute/Memory Utilization
  - Streaming multiprocessor (SM) and Memory Clock frequency
  - Temperature
  - Power
  - ECC Counts
- Profiling Counters
  - SM Activity
  - Memory DRAM Activity
  - NVLink Activity
  - PCIe Rate
- InfiniBand Network Counters
  - IB TX/RX rate
Once Moneo has been launched, these metrics can be viewed from the Grafana portal. See Figures 2, 3, and 4 for snapshots of the different metric views.
Figure 2: Device Counter View
Figure 3: Profiling Counter View
Figure 4: IB Counter View
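Because the collected metrics are stored in Prometheus, they can in principle also be read programmatically rather than only through Grafana. The following is a minimal sketch, not Moneo's own tooling: it builds an instant-query URL for the standard Prometheus HTTP API and parses a response in that API's documented JSON shape. The port (9090, the Prometheus default), the metric name `example_gpu_utilization`, and the `instance`/`gpu_id` labels are all assumptions for illustration; check your deployment for the actual port and metric names.

```python
import json
from urllib.parse import urlencode


def build_query_url(master_host: str, promql: str, port: int = 9090) -> str:
    """Build an instant-query URL for the Prometheus HTTP API.

    The port is an assumption (Prometheus default); adjust it to your deployment.
    """
    return f"http://{master_host}:{port}/api/v1/query?{urlencode({'query': promql})}"


def parse_instant_vector(payload: str) -> dict:
    """Map each series' sorted label tuple to its float value."""
    body = json.loads(payload)
    if body.get("status") != "success":
        raise ValueError(f"query failed: {body.get('error', 'unknown error')}")
    results = {}
    for series in body["data"]["result"]:
        labels = tuple(sorted(series["metric"].items()))
        _timestamp, value = series["value"]  # value arrives as a string
        results[labels] = float(value)
    return results


# Hand-written sample in the Prometheus response format.
# The metric name and labels are hypothetical, not taken from Moneo.
SAMPLE_RESPONSE = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "example_gpu_utilization",
                        "instance": "node-a", "gpu_id": "0"},
             "value": [1700000000.0, "87.5"]},
            {"metric": {"__name__": "example_gpu_utilization",
                        "instance": "node-a", "gpu_id": "1"},
             "value": [1700000000.0, "42.0"]},
        ],
    },
})

url = build_query_url("master-ip-or-domain", "example_gpu_utilization")
values = parse_instant_vector(SAMPLE_RESPONSE)
for labels, value in values.items():
    print(dict(labels), value)
```

In a live setup, the URL would be fetched (for example with `urllib.request.urlopen`) and the response body passed to `parse_instant_vector`; the sample payload here only stands in for that response.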
Getting Started
Starting with Moneo is easy. Just clone the latest release from the Moneo Repo and follow the README for detailed setup instructions or take a look at the quick start guide. In a short period, you should be able to launch Moneo with a single command and log into the Grafana portal to start seeing results!
Moneo is also available on Azure HPC + AI Ubuntu images under “/opt/azurehpc/tools/Moneo”. The image has all the required dependencies installed, so all that’s necessary is configuring and deploying Moneo.
Quick start instructions:
- Clone Moneo from GitHub and install Ansible.
  - git clone https://github.com/Azure/Moneo.git
  - cd Moneo
  - python3 -m pip install ansible
- Next, create a host.ini config file.
  - Note: The master node can also be a worker node. The Grafana and Prometheus Docker containers are deployed to the master node.
  - Note: If you have already configured passwordless SSH, the [all:vars] section can be skipped.
  - Note: The master node must be able to SSH into itself.
- Now deploy Moneo.
  - ansible-playbook -i host.ini src/ansible/deploy.yaml
- Log in to the portal by navigating to http://master-ip-or-domain:3000 and entering your credentials.
  - Note: By default, the username and password are both set to "azure". This can be changed in "src/master/grafana/grafana.env".
- Navigating Moneo Grafana Portal
  - The current view is labeled in the top left corner.
  - VM instance and GPU can be selected from the drop-down menus in the top left corner.
  - Various actions such as dashboard selection or data source configuration can be achieved using the left screen menu.
  - Metric groups are collapsible.
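For the host.ini step in the quick start above, the file is an Ansible INI inventory. The following is a minimal sketch that renders one from a list of hosts. The group names `[master]` and `[worker]` are assumptions inferred from the notes about master and worker nodes, and the IP addresses, user name, and key path are placeholders; consult the Moneo README for the exact layout it expects. `ansible_user` and `ansible_ssh_private_key_file` are standard Ansible inventory variables.

```python
from typing import Optional


def render_inventory(master: str,
                     workers: list[str],
                     ssh_vars: Optional[dict[str, str]] = None) -> str:
    """Render an Ansible INI inventory as a string.

    Group names are assumed; ssh_vars fills the optional [all:vars] section,
    which can be omitted when passwordless SSH is already configured.
    """
    lines = ["[master]", master, "", "[worker]", *workers]
    if ssh_vars:
        lines += ["", "[all:vars]"]
        lines += [f"{key}={value}" for key, value in ssh_vars.items()]
    return "\n".join(lines) + "\n"


# Placeholder hosts; the master is listed as a worker too, which the notes allow.
inventory = render_inventory(
    master="192.0.2.10",
    workers=["192.0.2.10", "192.0.2.11"],
    ssh_vars={
        "ansible_user": "azureuser",  # placeholder user name
        "ansible_ssh_private_key_file": "~/.ssh/id_rsa",  # placeholder key path
    },
)
print(inventory)
```

Writing the returned string to host.ini (for example with `pathlib.Path("host.ini").write_text(inventory)`) produces a file that can be passed to `ansible-playbook -i host.ini`.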