Install Slurm in a Custom Image for CycleCloud
Published Thursday, July 21, 2022
OVERVIEW
Azure CycleCloud (CC) is a High Performance Computing (HPC) orchestration tool for creating and autoscaling HPC clusters in Azure using traditional schedulers (e.g. Slurm, GridEngine, PBS). By default, CC downloads and installs the scheduler packages on each node at boot, which increases boot time, particularly for compute nodes. Creating a custom image with the scheduler packages preinstalled can cut boot time by up to half. This blog demonstrates how to install the Slurm packages in a custom image to be deployed by CC.
PREREQUISITES
- working CC install (mine is currently 8.2.2-1902)
- CC Slurm cluster-init version 2.5+ (mine is running 2.6.2)
- Azure CLI installed (or use Cloud Shell)
- CycleCloud CLI installed
- (optional) Azure Image Builder configured
- (optional) Azure Compute Gallery
SOLUTION
The following steps create an Azure VM, install the Slurm packages, and capture the VM as an image for deployment via CycleCloud (CC):
1. The CC Product Group provides specific versions of Slurm with a Job Submit Plugin used by Slurm to communicate with CC. The latest Slurm version provided by CC is 20.11.7-1 and is available on the GitHub repo. The Slurm packages and Job Submit Plugin are built for the Linux OS and version in use.
NOTE: The CC Slurm repo includes scripts needed to build Slurm RPMs and the CC Job Submit Plugin for other versions of Slurm (e.g. 20.11.9).
The scripts can be found in the cyclecloud-slurm repo; sample instructions:
## Slurm 20.11.9:
sudo -i
cd $HOME
# Clone the CycleCloud Slurm project
git clone https://github.com/Azure/cyclecloud-slurm.git
# Stage the Job Submit Plugin sources where the build script expects them
mkdir /source
cp -a cyclecloud-slurm/specs/default/cluster-init/files/JobSubmitPlugin /source
# Bump the target Slurm version from 20.11.7 to 20.11.9
sed -i 's/20.11.7/20.11.9/g' /source/JobSubmitPlugin/job_submit_cyclecloud_test.py
sed -i 's/20.11.7/20.11.9/g' ~/cyclecloud-slurm/specs/default/cluster-init/files/00-build-slurm.sh
# Build the Slurm RPMs and the Job Submit Plugin
bash ~/cyclecloud-slurm/specs/default/cluster-init/files/00-build-slurm.sh
## RPMs and Job Submit Plugin located in ~/rpmbuild/RPMS/x86_64/
2. Create a VM in the Azure portal or with the Azure CLI using the same OS and version as your Slurm cluster (e.g. CentOS 7.9, AlmaLinux 8.6). Here is an example Azure CLI command to create the VM without a Public IP (NOTE: to configure a Public IP, remove the --public-ip-address "" parameter):
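A minimal sketch, assuming a resource group named slurm-rg, a VM named slurm-image-vm, and an AlmaLinux 8 marketplace image URN (substitute your own names, image, and size):

```shell
# Create the build VM with no Public IP; remove the --public-ip-address ""
# line to get a Public IP instead
az vm create \
  --resource-group slurm-rg \
  --name slurm-image-vm \
  --image almalinux:almalinux:8-gen2:latest \
  --size Standard_D4s_v3 \
  --admin-username azureuser \
  --generate-ssh-keys \
  --public-ip-address ""
```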
3. SSH into the newly created VM using the credentials provisioned while creating the VM.
4. Copy the slurm-install.sh script to the VM and run it (NOTE: the script defaults to Slurm 20.11.7 on AlmaLinux 8):
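A sketch of copying and running the script; the 10.0.0.4 private IP and azureuser login are placeholders for your environment:

```shell
# Copy the install script to the VM
scp slurm-install.sh azureuser@10.0.0.4:~
# Run it with root privileges
ssh azureuser@10.0.0.4 'sudo bash ~/slurm-install.sh'
```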
5. Verify the installation:
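One way to verify, assuming an RPM-based distro such as AlmaLinux 8:

```shell
# List the installed Slurm packages
rpm -qa | grep -i slurm
# Print the installed Slurm version (no running cluster required)
sinfo -V
```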
6. Deprovision the VM & exit the SSH session:
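The standard Azure deprovision sequence, run inside the VM:

```shell
# Remove machine- and user-specific data so the VM can be generalized
sudo waagent -deprovision+user -force
exit
```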
7. Deallocate, generalize & capture the VM as a managed image (or capture it to an Azure Compute Gallery):
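Using the same placeholder resource group and VM names as above:

```shell
# Stop and release the VM's compute resources
az vm deallocate --resource-group slurm-rg --name slurm-image-vm
# Mark the VM as generalized (it cannot be restarted afterwards)
az vm generalize --resource-group slurm-rg --name slurm-image-vm
# Capture a managed image from the generalized VM
az image create --resource-group slurm-rg --name slurm-custom-image \
  --source slurm-image-vm
```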
8. Find and copy the ResourceID of the captured image using Azure CLI or Azure Portal (CLI shown) for use with CC:
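For a managed image, the ResourceID can be queried directly (resource group and image names are the placeholders used above):

```shell
az image show --resource-group slurm-rg --name slurm-custom-image \
  --query id --output tsv
```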
9. Update the cluster settings in the CC Portal to use the newly captured image:
- Click the "Edit" link to modify your cluster settings
- In the popup window, select "Advanced Settings" on the left vertical menu
- Check the "Custom image" checkbox for all of the options and paste in the output from Step 8 (the image ResourceID)
- Save the settings
10. Add slurm.install = false to your CC cluster template file and re-import the cluster so CC will not download/install the Slurm packages:
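A sketch of the relevant template fragment; the section name follows the stock CC Slurm template, so adjust it to match your own template:

```ini
[[node defaults]]
    [[[configuration]]]
    slurm.install = false
```

Then re-import with the CycleCloud CLI, e.g. cyclecloud import_cluster <cluster-name> -f <template-file> --force.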
11. The compute nodes can be updated to use the new image without terminating/restarting the cluster. SSH into the scheduler node and run the following commands:
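One approach uses standard Slurm power management; the hpc-[1-4] node names are placeholders. With CC autoscaling, a powered-down node is terminated, and the next power-up boots it from the new image:

```shell
# Power down idle compute nodes so they are re-created from the new image
sudo scontrol update NodeName=hpc-[1-4] State=POWER_DOWN
```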
12. Start a compute node (e.g. srun --pty bash) and verify Slurm is working correctly.
CONCLUSION
The boot time of Slurm compute nodes can be decreased by up to half when Slurm is installed in a custom image and deployed by CC. The custom image can be published to an Azure Compute Gallery for replication to other regions and for improved performance when concurrently scaling many compute nodes. This process can also be combined with Azure Image Builder and Azure DevOps to make it repeatable.
LEARN MORE
Learn more about Azure CycleCloud
Read more about Azure HPC + AI
Take the Azure HPC learning path