
A quick start guide to benchmarking AI models in Azure: MLPerf Training v2.0

By: Sonal Doomra, Program Manager 2, Hugo Affaticati, Program Manager, and Daramfon Akpan, Program Manager

 

Useful resources

Information on the NC A100 v4-series

Information on the NDm A100 v4-series

 

 

MLCommons® provides a distributed AI training benchmark suite: MLPerf™ Training. Here is how to run the MLPerf™ Training v2.0 benchmarks on NC A100 v4 and NDm A100 v4 virtual machines.

 

1- Select and set up the virtual machine (NC96ads A100 v4 or ND96amsr A100 v4) using the information given in this document.

 

2- Clone the MLCommons® repository:

cd /mnt/resource_nvme
git clone https://github.com/mlcommons/training_results_v2.0.git

 

3- Give your user ownership of the cloned repository:

sudo chown -R $USER:$USER training_results_v2.0/

4- Navigate into the benchmark directory:

cd training_results_v2.0/Azure/benchmarks/<BENCHMARK_NAME>/implementations/ND96amsr_A100_v4/
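The value of <BENCHMARK_NAME> has to match a directory in the cloned repository. A minimal sketch for listing the candidates, assuming the repository layout used in the path above (the helper name list_benchmarks is hypothetical and not part of the repository):

```shell
# Hypothetical helper: print the name of each subdirectory of a benchmarks folder.
list_benchmarks() {
    benchmarks_dir=$1
    for entry in "$benchmarks_dir"/*/; do
        # Skip the unexpanded pattern when the folder has no subdirectories.
        [ -d "$entry" ] || continue
        basename "$entry"
    done
}

# Example call, using the clone location from step 2:
# list_benchmarks /mnt/resource_nvme/training_results_v2.0/Azure/benchmarks
```

Any name printed by the helper can be substituted for <BENCHMARK_NAME> in the cd command.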

5- Update the NUMA bindings in azure.sh:

vi azure.sh

a. For NC A100 v4-series, paste the following lines in the file.

bind_cpu_cores=([0]="0-23" [1]="24-47" [2]="48-71" [3]="72-95")
bind_mem=([0]="0" [1]="1" [2]="2" [3]="3")

b. For NDm A100 v4-series, paste the following lines in the file.

bind_cpu_cores=([0]="24-47" [1]="24-47" [2]="0-23" [3]="0-23" [4]="72-95" [5]="72-95" [6]="48-71" [7]="48-71")
bind_mem=([0]="1" [1]="1" [2]="0" [3]="0" [4]="3" [5]="3" [6]="2" [7]="2")
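A stray space or missing quote in these arrays is easy to miss. The sketch below prints the NDm A100 v4 bindings one entry per line so that an empty field stands out. It assumes each array index corresponds to a local GPU rank, which is an interpretation of the arrays rather than something stated in this guide, and the loop itself is illustrative and not part of azure.sh.

```shell
#!/usr/bin/env bash
# NDm A100 v4 bindings from the step above.
bind_cpu_cores=([0]="24-47" [1]="24-47" [2]="0-23" [3]="0-23" [4]="72-95" [5]="72-95" [6]="48-71" [7]="48-71")
bind_mem=([0]="1" [1]="1" [2]="0" [3]="0" [4]="3" [5]="3" [6]="2" [7]="2")

# Print one line per index (assumed to be the local GPU rank).
# An empty cores or NUMA field would point to a typo in the arrays.
for gpu in "${!bind_cpu_cores[@]}"; do
    echo "GPU ${gpu}: cores ${bind_cpu_cores[$gpu]}, NUMA node ${bind_mem[$gpu]}"
done
```

If numactl is installed on the VM, the output of numactl --hardware can be compared against these ranges to confirm that each CPU range belongs to the NUMA node it is paired with.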

 

6- Edit run_and_time.sh so that it points to the correct path of azure.sh (around line 125):

vi run_and_time.sh

Replace the line with the following.

 

CMD=( '/bm_utils/bind.sh' '--cpu=/bm_utils/azure.sh' '--mem=/bm_utils/azure.sh' '--ib=single' '--cluster=${cluster}' '--' ${NSYSCMD} 'python' '-u')  
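Because "around line 125" can shift from one benchmark to another, searching for the bind.sh invocation is one way to find the line to replace. A small sketch; the function name find_bind_lines is illustrative:

```shell
# Print "line_number:line" for every line of a script that mentions bind.sh.
find_bind_lines() {
    grep -n 'bind\.sh' "$1"
}

# Example call from the benchmark directory:
# find_bind_lines run_and_time.sh
```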

7- Edit run_with_docker.sh so that it points to the correct path of the mounted run_and_time.sh (around line 170):

docker exec -it "${_config_env[@]}" "${CONT_NAME}" \
${TORCH_RUN} --nproc_per_node=${DGXNGPU} /bm_utils/run_and_time.sh
) |& tee "${LOG_FILE_BASE}_${_experiment_index}.log"

 

8- Update the config file to account for hyperthreads and the number of GPUs:

a. For NC A100 v4-series, edit the following file.

vi config_DGXA100_4gpu_common.sh

First, replace the existing values with the following.

export DGXNGPU=4
export DGXSOCKETCORES=48
export DGXNSOCKET=2
export DGXHT=1

Then, add the following variables.

export UCX_TLS=tcp
export UCX_NET_DEVICES=eth0
export NCCL_SOCKET_IFNAME=eth0
export NCCL_DEBUG=INFO
export NCCL_TOPO_FILE=/opt/microsoft/ncv4/topo.xml
export NCCL_GRAPH_FILE=/opt/microsoft/ncv4/graph.xml
export NCCL_ALGO=Tree
export NCCL_SHM_USE_CUDA_MEMCPY=1
export CUDA_DEVICE_MAX_CONNECTIONS=32
export NCCL_CREATE_THREAD_CONTEXT=1
export NCCL_DEBUG_SUBSYS=ENV
export NCCL_IB_PCI_RELAXED_ORDERING=1
export CUDA_DEVICE_ORDER=PCI_BUS_ID

b. For NDm A100 v4-series, only the values below must be updated:

vi config_DGXA100_1x8x56x1.sh
export DGXNGPU=8
export DGXSOCKETCORES=48
export DGXNSOCKET=2
export DGXHT=1
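One way to sanity-check these values is to compare their product with the number of logical CPUs the VM reports. This sketch assumes DGXSOCKETCORES is the number of cores per socket and DGXHT is the number of hardware threads per core; that reading comes from the variable names and is not stated in this guide.

```shell
export DGXSOCKETCORES=48
export DGXNSOCKET=2
export DGXHT=1

# Logical CPUs implied by the config
# (assumed meaning: cores per socket x sockets x threads per core).
expected_cpus=$((DGXSOCKETCORES * DGXNSOCKET * DGXHT))
echo "config implies ${expected_cpus} logical CPUs"

# nproc (GNU coreutils) reports the CPUs available to this process.
if command -v nproc >/dev/null 2>&1; then
    echo "VM reports $(nproc) logical CPUs"
fi
```

With the values above the product is 96, which matches the CPU ranges 0-95 used in the NUMA bindings of step 5.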

9- Edit mounts.txt (for BERT) or run_with_docker.sh (for the other benchmarks) so that these changes are mounted inside the container:

a. For NC A100 v4-series

For BERT benchmark:

vi mounts.txt
/opt/microsoft/ncv4/topo.xml:/opt/microsoft/ncv4/topo.xml
/opt/microsoft/ncv4/graph.xml:/opt/microsoft/ncv4/graph.xml
/usr/lib/x86_64-linux-gnu/libnccl.so:/usr/lib/x86_64-linux-gnu/libnccl.so
/usr/include/nccl.h:/usr/include/nccl.h
${PWD}/config_DGXA100_1x4x56x2.sh:/workspace/bert/config_DGXA100_1x4x56x2.sh

For the other benchmarks:

vi run_with_docker.sh
_cont_mounts+=",/opt/microsoft/ncv4/topo.xml:/opt/microsoft/ncv4/topo.xml"
_cont_mounts+=",/opt/microsoft/ncv4/graph.xml:/opt/microsoft/ncv4/graph.xml"
_cont_mounts+=",/usr/lib/x86_64-linux-gnu/libnccl.so:/usr/lib/x86_64-linux-gnu/libnccl.so"
_cont_mounts+=",/usr/include/nccl.h:/usr/include/nccl.h"

b. For NDm A100 v4-series

No change is needed for this step.
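For the NC A100 v4 mounts listed in this step, a missing host file is easier to catch before the container starts. The sketch below reads host_path:container_path entries and reports host paths that do not exist; the helper name check_mount_sources is hypothetical.

```shell
# Hypothetical helper: read "host_path:container_path" lines on stdin and
# report every host path that does not exist.
# Returns non-zero if at least one host path is missing.
check_mount_sources() {
    missing=0
    while IFS= read -r line; do
        # Ignore blank lines.
        [ -n "$line" ] || continue
        # Keep the part before the first colon, which is the host path.
        host_path=${line%%:*}
        if [ ! -e "$host_path" ]; then
            echo "missing on host: $host_path" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example with the NC A100 v4 entries from this step:
# check_mount_sources <<'EOF'
# /opt/microsoft/ncv4/topo.xml:/opt/microsoft/ncv4/topo.xml
# /opt/microsoft/ncv4/graph.xml:/opt/microsoft/ncv4/graph.xml
# /usr/lib/x86_64-linux-gnu/libnccl.so:/usr/lib/x86_64-linux-gnu/libnccl.so
# /usr/include/nccl.h:/usr/include/nccl.h
# EOF
```

Entries that rely on shell expansion, such as the ${PWD} line in mounts.txt, are read literally by this helper and would need to be expanded before being passed in.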

 

10- Run the command to source the config file:

a. For NC A100 v4-series

source ./config_DGXA100_1x4x56x2.sh

b. For NDm A100 v4-series

source ./config_DGXA100_1x8x56x1.sh
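After sourcing, it can help to confirm that the values from step 8 are actually in the environment. A minimal sketch, with an illustrative function name:

```shell
# Print the DGX* variables currently exported in the environment, sorted by name.
show_dgx_vars() {
    env | grep '^DGX' | sort
}

# Example call after sourcing the config file:
# show_dgx_vars
```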

The next steps are different for each benchmark.

 

11- Follow the benchmark's Readme.txt to download and prepare the data.

Note: While downloading the data, make sure you have enough space. Tip: Use the /mnt/resource_nvme directory to store the data.
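A quick check of the available space before starting a download could look like the sketch below. The helper name has_free_space is hypothetical, and the threshold in the example is a placeholder because this guide does not state dataset sizes.

```shell
# Hypothetical helper: succeed if DIR has at least MIN_GB gigabytes available.
has_free_space() {
    dir=$1
    min_gb=$2
    # df -Pk prints POSIX-format output in 1024-byte blocks;
    # field 4 of the second line is the available space.
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge $((min_gb * 1024 * 1024)) ]
}

# Example (the 500 GB threshold is a placeholder, not a figure from this guide):
# has_free_space /mnt/resource_nvme 500 || echo "not enough space for the dataset"
```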

 

12- Run the following command to get the docker image name and tag.

docker images

Note the image name and tag associated with the benchmark you are running. In the next step, <CONTAINER_NAME> is <REPOSITORY>:<TAG>.
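To avoid assembling the name by hand, docker images can print each image directly in the <REPOSITORY>:<TAG> form through its --format option. A small sketch; the wrapper function name is illustrative:

```shell
# Print every local image as <REPOSITORY>:<TAG>,
# the form used for <CONTAINER_NAME> in the next step.
list_image_refs() {
    docker images --format '{{.Repository}}:{{.Tag}}'
}

# Example call:
# list_image_refs
```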

 

13- Run the benchmark with the command listed for it below. Each benchmark has its own environment variables to set before the run; read the explanation of each variable to choose its value.

 

Run the command below to set the number of experiments to run:

export NEXP=10

 

BERT

CONT=<CONTAINER_NAME> DATADIR=<path/to/4320_shards_varlength/dir> DATADIR_PHASE2=<path/to/4320_shards_varlength/dir> EVALDIR=<path/to/eval_varlength/dir> CHECKPOINTDIR=<path/to/result/checkpointdir> CHECKPOINTDIR_PHASE1=<path/to/pytorch/ckpt/dir> ./run_with_docker.sh

The variables in the above command refer to the directory structure created by the Data download and preprocessing steps.

DATADIR: Point this to the 4320_shards_varlength folder downloaded with the training dataset.
DATADIR_PHASE2: Point this to the 4320_shards_varlength folder downloaded with the training dataset.
EVALDIR: Point this to the eval_varlength folder downloaded with the validation dataset.
CHECKPOINTDIR: Point this to a new results folder under bert data directory.
CHECKPOINTDIR_PHASE1: Point this to the phase1 folder within the bert data directory.
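A wrong path in any of these variables typically shows up only after the container has started, so checking the directories first can save a launch. The sketch below takes NAME=path pairs and reports the ones that are not directories. The helper name require_dirs is hypothetical, and the example assumes the variables were exported in the shell beforehand rather than set inline on the command.

```shell
# Hypothetical helper: each argument is NAME=path.
# Reports every pair whose path is not an existing directory
# and returns non-zero if there is at least one.
require_dirs() {
    status=0
    for pair in "$@"; do
        name=${pair%%=*}
        path=${pair#*=}
        if [ ! -d "$path" ]; then
            echo "$name does not point to a directory: $path" >&2
            status=1
        fi
    done
    return "$status"
}

# Example, assuming the BERT variables are already exported:
# require_dirs "DATADIR=$DATADIR" "DATADIR_PHASE2=$DATADIR_PHASE2" "EVALDIR=$EVALDIR" \
#     "CHECKPOINTDIR=$CHECKPOINTDIR" "CHECKPOINTDIR_PHASE1=$CHECKPOINTDIR_PHASE1"
```

The same helper applies to the DATADIR, METADATA_DIR, and SENTENCEPIECES_DIR variables of the other benchmarks.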

 

RNNT

CONT=<CONTAINER_NAME> DATADIR=</path/to/rnnt/datasets/dir> METADATA_DIR=</path/to/tokenized/folder/under/data/dir> SENTENCEPIECES_DIR=</path/to/sentencepieces/folder/under/data/dir> LOGDIR=./results ./run_with_docker.sh

DATADIR: Point this to the directory where RNNT data is downloaded.
METADATA_DIR: Point this to the folder called “tokenized” within the downloaded RNNT data.
SENTENCEPIECES_DIR: Point this to the folder called “sentencepieces” within the downloaded RNNT data.

 

ResNet50

CONT=<CONTAINER_NAME> DATADIR=/path/to/resnet_data/prep_data/ LOGDIR=./results ./run_with_docker.sh

DATADIR: Point this to the folder called “prep_data” inside the downloaded ResNet data.

 

Minigo

CONT=<CONTAINER_NAME> DATADIR=/path/to/minigo_data/ ./run_with_docker.sh

DLRM

CONT=<CONTAINER_NAME> DATADIR=/path/to/dlrm_data/ LOGDIR=./results ./run_with_docker.sh

SSD

CONT=<CONTAINER_NAME> DATADIR=/path/to/ssd_data TORCH_HOME=/torch-home LOGDIR=./results ./run_with_docker.sh

TORCH_HOME: Create a new folder with mkdir /torch-home, then point this variable to the newly created /torch-home directory.

 

Mask R-CNN

CONT=<CONTAINER_NAME> DATADIR=/path/to/maskrcnn_data/ LOGDIR=./results ./run_with_docker.sh

 
