Azure Compute Blog articles

https://techcommunity.microsoft.com/t5/azure-compute-blog/bg-p/AzureCompute

Benchmarking the NC A100 v4, NCsv3, and NCas_T4_v3 series with NVIDIA Deep Learning Examples

By Hugo Affaticati, Program Manager

 

Useful resources

BERT from NVIDIA Deep Learning Examples: BERT

ResNet50 from NVIDIA Deep Learning Examples: ResNet50

SSD from NVIDIA Deep Learning Examples: SSD

 

The steps below benchmark the NC A100 v4, NCsv3, and NCas_T4_v3 series on Azure with NVIDIA Deep Learning Examples.

 

Pre-requisites

Deploy and set up a virtual machine by following “Getting started with the NC A100 v4 series” or “Getting started with the NCsv3 series and the NCas_T4_v3 series”.

 

Path

Set Data_path to the disk that was mounted during the pre-requisites. The mount point depends on the deployed VM series.

 

NC A100 v4 series

Data_path=/mnt/resource_nvme

 

NCsv3-series and the NCas_T4_v3 series

Data_path=/mnt/resource_mdisk
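
The choice between the two mount points can also be made by checking which directory exists. The following is a minimal sketch, not part of the original guide: pick_data_path is a hypothetical helper, and it assumes only one of the two mount points is present on a given VM.

```shell
# Print the first candidate directory that exists; fail if none does.
# pick_data_path is a hypothetical helper, not part of any Azure tooling.
pick_data_path() {
    for candidate in "$@"; do
        if [ -d "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    echo "no data disk found among: $*" >&2
    return 1
}
```

On a deployed VM it could be used as Data_path=$(pick_data_path /mnt/resource_nvme /mnt/resource_mdisk).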

 

BERT

Clone the repository

mkdir $Data_path/BERT && cd $Data_path/BERT
git clone https://github.com/NVIDIA/DeepLearningExamples.git
cd $Data_path/BERT/DeepLearningExamples/PyTorch/LanguageModeling/BERT

 

Set up the environment

Get the checkpoints for both models, SQuAD and GLUE:

wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/bert_pyt_ckpt_large_qa_squad11_amp/versions/19.09.0/zip -O bert_pyt_ckpt_large_qa_squad11_amp_19.09.0.zip

 

wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/dle/bert_pyt_ckpt_large_ft_sst2_amp/versions/21.11.0/zip -O bert_pyt_ckpt_large_ft_sst2_amp_21.11.0.zip

 

Unzip and place the checkpoints in the checkpoints folder

unzip bert_pyt_ckpt_large_qa_squad11_amp_19.09.0.zip

unzip bert_pyt_ckpt_large_ft_sst2_amp_21.11.0.zip

mv pytorch_model.bin checkpoints/ && mv bert_large_qa.pt checkpoints/

 

Build and launch docker in two steps:

bash scripts/docker/build.sh

bash scripts/docker/launch.sh

 

Finally, obtain the datasets by running the following line inside the container. This step is the bottleneck of the benchmark: it downloads 19.6 GB of data and takes approximately two hours.

/workspace/bert/data/create_datasets_from_start.sh
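
Because the download is long, it may be worth confirming that the target disk has enough room before starting it. The sketch below is an illustration only: has_free_gb is a hypothetical helper, and a threshold of 20 GiB is an assumption derived from the 19.6 GB figure above (the extracted data may need more).

```shell
# Succeed if the filesystem holding DIR has at least GB gibibytes free.
# has_free_gb is a hypothetical helper written for this example.
has_free_gb() {
    dir=$1 need_gb=$2
    # With -P and -k, df prints one line per filesystem and
    # column 4 is the available space in KiB.
    free_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "$free_kb" -ge $((need_gb * 1024 * 1024)) ]
}
```

For example, has_free_gb . 20 || echo "not enough space" reports a problem before the two-hour download begins.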

 

Run inference benchmark GLUE

This benchmark takes less than one minute per batch size. Start by opening and modifying the configuration file.

vi scripts/run_glue.sh

 

Modify the following parameters

init_checkpoint=${1:-"/workspace/bert/checkpoints/pytorch_model.bin"}

num_gpu=${7:-"1"}

batch_size=${8:-"1"}

precision=${14:-"fp32"}

mode=${16:-"eval"}

 

Run the benchmark

bash scripts/run_glue.sh

 

Then, increase only the batch size step by step and run the previous command again after each change to obtain more data points.
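
Editing the file by hand for every batch size can be scripted. The following is a rough sketch under stated assumptions: set_default and sweep_batch_sizes are hypothetical helpers, the script is assumed to keep its defaults on lines shaped like name=${N:-"value"} as shown above, and the batch_N.log file names are arbitrary.

```shell
# Rewrite the default value on a line shaped like  name=${N:-"value"}.
# Usage: set_default FILE NAME VALUE   (hypothetical helper)
set_default() {
    file=$1 name=$2 value=$3
    tmp=$(mktemp) || return 1
    sed -E 's|^('"$name"'=\$\{[0-9]+:-")[^"]*("\})|\1'"$value"'\2|' "$file" > "$tmp" &&
        cat "$tmp" > "$file"   # overwrite in place so file permissions are kept
    status=$?
    rm -f "$tmp"
    return $status
}

# Run SCRIPT once per batch size and keep one log file per run.
# Set DRY_RUN=1 to print the command instead of executing it.
sweep_batch_sizes() {
    script=$1; shift
    for bs in "$@"; do
        set_default "$script" batch_size "$bs" || return 1
        ${DRY_RUN:+echo} bash "$script" > "batch_${bs}.log" 2>&1 || return 1
    done
}
```

Inside the container, sweep_batch_sizes scripts/run_glue.sh 1 2 4 8 would then produce batch_1.log through batch_8.log. The batch sizes listed are examples, not values from the original benchmark.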

 

Run inference benchmark SQuAD

Repeat the previous steps for SQuAD. This benchmark takes approximately five minutes per batch size. Start by opening and modifying the configuration file.

vi scripts/run_squad.sh

 

Modify the following parameters

init_checkpoint=${1:-"/workspace/bert/checkpoints/bert_large_qa.pt"}

batch_size=${3:-"1"}

precision=${6:-"fp32"}

num_gpu=${7:-"1"}

mode=${12:-"eval"}

 

Run the benchmark

bash scripts/run_squad.sh

 

Finally, increase only the batch size step by step and run the previous command again after each change to obtain more data points.

 

Run training benchmarks

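Repeat the previous steps for both SQuAD and GLUE after changing the mode to run the training benchmarks. As in the parameter lists above, the mode is positional parameter 12 in scripts/run_squad.sh and positional parameter 16 in scripts/run_glue.sh.

mode=${12:-"train"}

mode=${16:-"train"}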

 

ResNet50

Clone the repository

mkdir $Data_path/resnet && cd $Data_path/resnet
git clone https://github.com/NVIDIA/DeepLearningExamples
cd $Data_path/resnet/DeepLearningExamples/PyTorch/Classification/

 

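Download the ImageNet (ILSVRC2012) training and validation archives, ILSVRC2012_img_train.tar and ILSVRC2012_img_val.tar, into this directory.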

 

Set up the environment

Start with the training data:

mkdir train && mv ILSVRC2012_img_train.tar train/ && cd train

tar -xvf ILSVRC2012_img_train.tar && rm -f ILSVRC2012_img_train.tar

find . -name "*.tar" | while read NAME ; do mkdir -p "${NAME%.tar}"; tar -xvf "${NAME}" -C "${NAME%.tar}"; rm -f "${NAME}"; done

cd ..
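
The extraction above should leave one folder per class, and ILSVRC2012 has 1000 classes, so counting the folders is a quick sanity check. This is a minimal sketch: count_class_dirs and check_class_count are hypothetical helpers, and the find options used are the GNU/BSD ones available on the Ubuntu images assumed here.

```shell
# Count the immediate subdirectories of DIR (one per class after extraction).
count_class_dirs() {
    find "$1" -mindepth 1 -maxdepth 1 -type d | wc -l | tr -d ' '
}

# Succeed only if DIR holds exactly EXPECTED class folders (default 1000).
check_class_count() {
    dir=$1 expected=${2:-1000}
    actual=$(count_class_dirs "$dir")
    if [ "$actual" -ne "$expected" ]; then
        echo "$dir: found $actual class folders, expected $expected" >&2
        return 1
    fi
}
```

Running check_class_count train at this point flags an incomplete extraction before any benchmark time is spent. The same check could be applied to the val folder once it has been sorted into class folders.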

 

Continue with the validation data used for inference:

mkdir val && mv ILSVRC2012_img_val.tar val/ && cd val && tar -xvf ILSVRC2012_img_val.tar

wget -qO- https://raw.githubusercontent.com/soumith/imagenetloader.torch/master/valprep.sh | bash

cd ../ConvNets

 

Create and launch the container

docker build . -t nvidia_resnet50

nvidia-docker run --rm -it -v $Data_path/resnet/DeepLearningExamples:/imagenet --ipc=host nvidia_resnet50

 

Run the benchmark

Get the pretrained weights from NGC:

wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/resnet50_pyt_amp/versions/20.06.0/zip -O resnet50_pyt_amp_20.06.0.zip
unzip resnet50_pyt_amp_20.06.0.zip

 

Finally, before running the benchmarks, update the config file with the desired batch size:

vi configs.yml

 

Run inference benchmark

python ./launch.py --model resnet50 --precision TF32 --mode benchmark_inference --platform DGXA100 /imagenet/PyTorch/Classification/ --raport-file benchmark.json --epochs 1 --prof 100

 

Run training benchmark

python ./launch.py --model resnet50 --precision TF32 --mode benchmark_training --platform DGXA100 /imagenet/PyTorch/Classification/ --raport-file benchmark.json --epochs 1 --prof 100

 

Read the summary to get the values you need. Modify the config file before running the benchmark again to get data points for different batch sizes.

 

SSD

Clone the repository

Clone the NVIDIA repository.

sudo chmod 1777 /mnt
mkdir $Data_path/SSD && cd $Data_path/SSD
git clone https://github.com/NVIDIA/DeepLearningExamples

 

Set up the environment

Get the datasets

mkdir $Data_path/coco && cd $Data_path/coco

sudo apt install unzip

wget http://images.cocodataset.org/zips/train2017.zip && unzip train2017.zip

wget http://images.cocodataset.org/zips/val2017.zip && unzip val2017.zip

wget http://images.cocodataset.org/annotations/annotations_trainval2017.zip && unzip annotations_trainval2017.zip
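
After the three archives are unpacked, the coco folder should contain train2017, val2017, and annotations. The sketch below checks for them before the container is built. check_coco_layout is a hypothetical helper written for this illustration.

```shell
# Report any of the expected COCO 2017 folders that are missing under ROOT.
# Returns 0 when all three are present, 1 otherwise.
check_coco_layout() {
    root=$1 missing=0
    for sub in train2017 val2017 annotations; do
        if [ ! -d "$root/$sub" ]; then
            echo "missing: $root/$sub" >&2
            missing=1
        fi
    done
    return $missing
}
```

For example, check_coco_layout $Data_path/coco prints the name of any folder that failed to download or unzip.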

 

Build and launch the Docker container in three steps:

 

cd $Data_path/SSD/DeepLearningExamples/PyTorch/Detection/SSD

 

docker build . -t nvidia_ssd

 

docker run --rm -it --gpus=all --ipc=host -v $Data_path:/coco nvidia_ssd

 

Run the benchmarks

The inference benchmark takes less than one minute per batch size. Change only the --eval-batch-size value and run the command again to obtain more data points.

python main.py --data /coco/coco --eval-batch-size 1 --mode benchmark-inference

 

Then run the training benchmark with the following command. Again, change only the --batch-size value and run the command again to obtain more data points.

python main.py --data /coco/coco --batch-size 2 --mode benchmark-training
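
Since only one flag changes between runs, the two commands above can be wrapped in a loop. This is a sketch under assumptions rather than part of the original procedure: ssd_sweep is a hypothetical helper, it reuses the flags and the /coco/coco path exactly as shown above, and the batch sizes passed to it are up to the reader.

```shell
# Run one SSD benchmark per batch size.
# MODE is "inference" or "training"; set DRY_RUN=1 to print the commands only.
ssd_sweep() {
    mode=$1; shift
    case $mode in
        inference) flag=--eval-batch-size ;;
        training)  flag=--batch-size ;;
        *) echo "unknown mode: $mode" >&2; return 1 ;;
    esac
    for bs in "$@"; do
        ${DRY_RUN:+echo} python main.py --data /coco/coco "$flag" "$bs" --mode "benchmark-$mode" || return 1
    done
}
```

Inside the container, DRY_RUN=1 ssd_sweep inference 1 2 4 lists the commands that would run, and dropping DRY_RUN executes them one after another.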

 
