How to Set Up SLURM and ParallelCluster in AWS#
This guide will help you set up your AWS environment and install the AWS CLI and AWS ParallelCluster.
Note that the versions of the tools that work as of writing are below:
awscli 1.32.93
aws-parallelcluster 3.9.1
Some breaking updates may render current commands invalid.
Setting Up Identity and Access Management (IAM) Role#
Before deploying AWS ParallelCluster, it’s essential to configure an IAM role with appropriate permissions. This role ensures that AWS ParallelCluster has the necessary permissions to manage AWS resources on your behalf, such as EC2 instances, networking components, and storage systems.
The rough steps are as follows (CLI equivalents are sketched after the list):
1. Navigate to the IAM console in AWS.
2. Create a new IAM user or use an existing one.
3. Ensure the user has an IAM role with AdministratorAccess, or attach the AdministratorAccess policy directly. While convenient, AdministratorAccess can pose security risks if not managed carefully, so it’s recommended to scope the permissions down to least-privilege principles if you are working in a team or production environment. For educational purposes, it’s fine to use AdministratorAccess.
4. Generate an AWS Access Key ID and AWS Secret Access Key for command-line access - you can find them in the user’s security credentials tab in the IAM console.
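If you prefer the command line over the console, the steps can be sketched as below. The user name my-pcluster-admin is a hypothetical placeholder, and this assumes your current credentials already have permission to manage IAM.
# Create a dedicated IAM user (the name is a placeholder).
aws iam create-user --user-name my-pcluster-admin

# Attach the AdministratorAccess managed policy (broad; scope down in production).
aws iam attach-user-policy \
    --user-name my-pcluster-admin \
    --policy-arn arn:aws:iam::aws:policy/AdministratorAccess

# Generate an access key for CLI access; the secret is shown only once.
aws iam create-access-key --user-name my-pcluster-admin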
Configure AWS CLI#
First, we install the Python package awscli using pip.
❯ pip install --upgrade awscli
Then we configure the AWS CLI:
❯ aws configure
AWS Access Key ID [None]: <AWS_ACCESS_KEY>
AWS Secret Access Key [None]: <AWS_SECRET_KEY>
Default region name [None]: ap-southeast-1
Default output format [None]: json
Replace <AWS_ACCESS_KEY> and <AWS_SECRET_KEY> with your actual AWS credentials. Since we want a region close to home, we set ap-southeast-1 as the default region; it is recommended to pick a region close to you to reduce latency.
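To confirm that the credentials are picked up correctly, you can query the caller identity; the values below are placeholders:
❯ aws sts get-caller-identity
{
    "UserId": "AIDAXXXXXXXXXXXXXXXXX",
    "Account": "123456789012",
    "Arn": "arn:aws:iam::123456789012:user/my-pcluster-admin"
}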
Create EC2 Key Pair#
To set up the SLURM cluster, we need to create an EC2 key pair. This key pair provides secure SSH access to the cluster’s head and compute nodes, where authorized users can log in to train models, run experiments, and manage the cluster.
❯ aws ec2 create-key-pair --key-name <YourKeyName> --query 'KeyMaterial' --output text > ~/.ssh/<YourKeyName>.pem
We also set the permissions of the key pair file to 600 to ensure that only the owner can read and write to the file.
❯ chmod 600 ~/.ssh/<YourKeyName>.pem
Configure AWS ParallelCluster#
First, we install the aws-parallelcluster package using pip so that we can use the pcluster command.
❯ pip install -U aws-parallelcluster
❯ pcluster configure --config config.yaml
After walking through the interactive prompts, you will end up with a configuration like the one below:
Region: ap-southeast-1
Image:
  Os: ubuntu2004
SharedStorage:
  - MountDir: /shared
    Name: gaohn-efs
    StorageType: Efs
    EfsSettings:
      PerformanceMode: generalPurpose
      ThroughputMode: bursting
HeadNode:
  InstanceType: t2.small
  Networking:
    SubnetId: subnet-xxxxxxxxxxxxxxxxx
  Ssh:
    KeyName: awsec2
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: distributed-queue
      ComputeResources:
        - Name: g4dn12xlarge
          Instances:
            - InstanceType: g4dn.12xlarge
          MinCount: 2
          MaxCount: 2
          Efa:
            Enabled: true
      Networking:
        PlacementGroup:
          Enabled: true
        SubnetIds:
          - subnet-xxxxxxxxxxxxxxxxx
1. Review the Configuration File#
Before proceeding, it’s a good idea to double-check your config.yaml file. Make sure that all settings such as instance types, network configurations, placement groups, and other parameters accurately meet your needs. Adjust any settings if necessary. For example, I expected the HeadNode to be t2.small and indeed it was.
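ParallelCluster 3 also supports a dry run that validates the configuration without creating any resources, which surfaces configuration errors and warnings up front:
❯ pcluster create-cluster --cluster-configuration config.yaml --cluster-name dry-run-check --dryrun true --region <REGION>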
2. Create the Cluster#
Run the following command to create the cluster, replacing <YOUR-CLUSTER-NAME> with your desired name for the cluster:
❯ pcluster create-cluster --cluster-configuration config.yaml --cluster-name <YOUR-CLUSTER-NAME> --region <REGION>
This command specifies the cluster configuration file (config.yaml), the name of the cluster, and the AWS region where you want the cluster to be deployed.
3. Monitor the Cluster Creation#
The cluster creation process may take some time. You can monitor the progress by running the following command:
❯ pcluster describe-cluster --cluster-name <YOUR-CLUSTER-NAME> --region <REGION>
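Rather than re-running the command by hand, you can poll the clusterStatus field until it leaves CREATE_IN_PROGRESS; a minimal sketch, assuming jq is installed:
#!/usr/bin/env sh
# Poll the cluster status every 30 seconds until creation completes or fails.
while true; do
    STATUS=$(pcluster describe-cluster --cluster-name <YOUR-CLUSTER-NAME> --region <REGION> | jq -r '.clusterStatus')
    echo "clusterStatus: $STATUS"
    [ "$STATUS" != "CREATE_IN_PROGRESS" ] && break
    sleep 30
done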
Logging into the Head Node#
During creation of the cluster, you can ssh into the head node using the following command:
❯ pcluster ssh --cluster-name <YOUR-CLUSTER-NAME> -i <path-to-your-key.pem> --region <REGION>
Alternatively, you can use plain ssh to log in to the head node, which is useful if you are using VS Code Remote. First, get the public DNS of the head node (you can get the instance ID from the AWS console and then query the public DNS):
❯ aws ec2 describe-instances --instance-ids <INSTANCE-ID> --query "Reservations[*].Instances[*].PublicDnsName" --output text
The output of this command is the <public-dns> used below.
Then the ssh command is:
❯ ssh -i </path/to/your-key.pem> <username>@<public-dns>
For username, it defaults to ubuntu for Ubuntu AMIs.
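For VS Code Remote, it is convenient to add a host entry to ~/.ssh/config so you can connect by alias; pcluster-head is a hypothetical alias:
cat >> ~/.ssh/config <<'EOF'
Host pcluster-head
    HostName <public-dns>
    User ubuntu
    IdentityFile ~/.ssh/<YourKeyName>.pem
EOF
You can then connect with ssh pcluster-head, or pick pcluster-head from the VS Code Remote-SSH host list.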
Compute Node#
The compute nodes are what you will use to run your distributed training. We set the instance type to g4dn.12xlarge earlier; the MinCount and MaxCount fields specify the minimum and maximum number of instances of this type that can be launched. Since both are set to 2, exactly 2 instances of g4dn.12xlarge will be used, and because MinCount equals MaxCount they are launched as static nodes (the st in the node names below).
SLURM Status#
We can check the status of the SLURM cluster:
❯ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
distributed-queue* up infinite 2 idle distributed-queue-st-g4dn12xlarge-[1-2]
And we can see the nodes are up and running. We can use srun to check if the nodes are working:
❯ export NUM_NODES=2
❯ srun -N${NUM_NODES} hostname
distributed-queue-st-g4dn12xlarge-1
distributed-queue-st-g4dn12xlarge-2
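You can run a similar check to confirm that each node actually sees its GPUs; each g4dn.12xlarge node should report four Tesla T4s:
❯ srun -N${NUM_NODES} nvidia-smi -L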
Setup Python Environment#
Virtual Environment#
We can do so via a virtual environment, created under /shared so that it lives on the EFS mount and is visible to every node:
#!/usr/bin/env sh
sudo apt-get update
sudo apt-get install -y python3-venv
python3 -m venv /shared/venv/
source /shared/venv/bin/activate
pip install wheel
echo 'source /shared/venv/bin/activate' >> ~/.bashrc
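Since the environment sits on the shared mount, a quick sanity check from the head node confirms that the compute nodes see the same interpreter:
❯ srun -N2 /shared/venv/bin/python --version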
Miniconda#
Or with conda:
sudo wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
export PATH="</path/to/miniconda3>/bin:$PATH"
To add Miniconda as an environment module:
cd /usr/share/modules/modulefiles
sudo nano miniconda3
Add the following to the file:
#%Module1.0
proc ModulesHelp { } {
    puts stderr "Adds Miniconda3 to your environment variables."
}
module-whatis "Loads Miniconda3 Environment."
set root /home/ubuntu/miniconda3
prepend-path PATH $root/bin
If you are not loading the module inside a script, you can just do the following in your shell:
source /etc/profile.d/modules.sh
module use /usr/share/modules/modulefiles
module load miniconda3
conda create -n ddp python=3.9
source ~/.bashrc # so no need to run conda init
conda activate ddp
Sample Run#
We use the minGPT-ddp example from PyTorch’s examples repository. But note very carefully: you likely need to use their config template and not the one I provided, as theirs includes things like FSx and other resources that are not in the default config.yaml.
cd /shared  # clone into the shared mount so that all nodes can see the code
git clone https://github.com/pytorch/examples.git
cd /shared/examples/distributed/minGPT-ddp/mingpt/slurm
nano sbatch_run.sh
We want to modify some things.
#!/bin/bash

#SBATCH --job-name=multinode-example
#SBATCH --nodes=2
#SBATCH --ntasks=2
#SBATCH --gpus-per-task=1
#SBATCH --cpus-per-task=4
#SBATCH --output=train_gpt_%j.out
#SBATCH --error=train_gpt_%j.err

# Activate the conda environment on every node.
source $(conda info --base)/etc/profile.d/conda.sh
conda activate ddp

# Use the first allocated node as the rendezvous host for torchrun.
nodes=( $( scontrol show hostnames $SLURM_JOB_NODELIST ) )
nodes_array=($nodes)
head_node=${nodes_array[0]}
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)

echo Node IP: $head_node_ip
export LOGLEVEL=INFO

srun torchrun \
    --nnodes 2 \
    --nproc_per_node 1 \
    --rdzv_id $RANDOM \
    --rdzv_backend c10d \
    --rdzv_endpoint $head_node_ip:29500 \
    /shared/examples/distributed/minGPT-ddp/mingpt/main.py
We change the last line from /shared/examples/mingpt/main.py to /shared/examples/distributed/minGPT-ddp/mingpt/main.py. The rest of the changes are minor additions such as activating the conda environment.
Now you can submit the job with the sbatch command, which is typically done from the head node.
❯ sbatch sbatch_run.sh
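After submission, you can watch the queue and tail the output file; the %j in the #SBATCH directives expands to the job ID:
❯ squeue -u $USER                # job state: PD (pending) or R (running)
❯ tail -f train_gpt_<JOB-ID>.out # stdout from the training run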
Delete Cluster#
Delete ParallelCluster#
pcluster delete-cluster --cluster-name <YOUR-CLUSTER-NAME> --region <REGION>
and verify deletion:
pcluster list-clusters --region <REGION>
Delete Network Resources#
You also need to delete lingering network resources such as VPCs and subnets. First, we identify all the network resources associated with the cluster, including the VPC and subnets; you can review your existing resources in the AWS VPC console.
First, get your subnet ID from config.yaml. Then we delete the NAT gateway:
aws ec2 describe-subnets --subnet-ids <subnet-XXX> --query 'Subnets[0].VpcId' --output text # get the VPC ID
aws ec2 describe-nat-gateways --filter "Name=vpc-id,Values=<vpc-XXX>" # get nat id in the form of nat-xxx
aws ec2 delete-nat-gateway --nat-gateway-id <nat-XXX> # delete the nat gateway
aws ec2 describe-nat-gateways --nat-gateway-ids <nat-XXX> # check if it is deleted
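NAT gateway deletion is asynchronous and can take a minute or two, so a small polling loop is handy; a sketch:
# Wait until the NAT gateway reports the "deleted" state.
while true; do
    STATE=$(aws ec2 describe-nat-gateways --nat-gateway-ids <nat-XXX> --query 'NatGateways[0].State' --output text)
    echo "NAT gateway state: $STATE"
    [ "$STATE" = "deleted" ] && break
    sleep 15
done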
Then we detach and delete network interfaces.
# Detach network interfaces
aws ec2 describe-network-interfaces \
--filters "Name=vpc-id,Values=<vpc-XXX>" \
--query 'NetworkInterfaces[*].[NetworkInterfaceId,Attachment.AttachmentId]' \
--output text | while read -r interface_id attachment_id; do
if [ ! -z "$attachment_id" ]; then
aws ec2 detach-network-interface --attachment-id $attachment_id
fi
done
# Delete network interfaces
aws ec2 describe-network-interfaces \
--filters "Name=vpc-id,Values=<vpc-XXX>" \
--query 'NetworkInterfaces[*].NetworkInterfaceId' \
--output text | xargs -I {} aws ec2 delete-network-interface --network-interface-id {}
# Run again to see if deleted or not
aws ec2 describe-network-interfaces --filters "Name=vpc-id,Values=<vpc-XXX>"
Next, delete the subnets:
aws ec2 describe-subnets --filters "Name=vpc-id,Values=<vpc-XXX>" --query 'Subnets[*].SubnetId' --output text | xargs -n 1 -I {} aws ec2 delete-subnet --subnet-id {}
# Check if deleted
aws ec2 describe-subnets --filters "Name=vpc-id,Values=<vpc-XXX>" --query 'Subnets[*].SubnetId' --output text
Next, delete the route tables that have no associations:
aws ec2 describe-route-tables --filters "Name=vpc-id,Values=<vpc-XXX>" --query 'RouteTables[?Associations==`[]`].RouteTableId' --output text | xargs -n 1 -I {} aws ec2 delete-route-table --route-table-id {}
Now detach and delete the internet gateway:
# list the internet gateways
aws ec2 describe-internet-gateways --filters "Name=attachment.vpc-id,Values=<vpc-XXX>" --query 'InternetGateways[*].InternetGatewayId' --output text
aws ec2 detach-internet-gateway --internet-gateway-id <igw-XXX> --vpc-id <vpc-XXX>
aws ec2 delete-internet-gateway --internet-gateway-id <igw-XXX>
Lastly, delete the VPC:
aws ec2 delete-vpc --vpc-id <vpc-XXX>
# check if deleted
aws ec2 describe-vpcs --vpc-ids <vpc-XXX>
Consolidated Script#
#!/bin/bash
# Set variables
CLUSTER_NAME="distributed-training"
REGION="ap-southeast-1"
SUBNET_ID="subnet-02cc9a3a21eecdc77" # find from config.yaml
VPC_ID="vpc-00635da5926ac5242" # aws ec2 describe-subnets --subnet-ids "subnet-02cc9a3a21eecdc77" --query 'Subnets[0].VpcId' --output text # get the VPC ID
# Delete ParallelCluster
echo "Deleting AWS ParallelCluster..."
pcluster delete-cluster --cluster-name $CLUSTER_NAME --region $REGION
# Wait and verify deletion
echo "Listing all clusters to verify deletion..."
pcluster list-clusters --region $REGION
# Describe NAT Gateway
echo "Fetching NAT Gateway ID..."
NAT_ID=$(aws ec2 describe-nat-gateways --filter "Name=vpc-id,Values=$VPC_ID" --query 'NatGateways[0].NatGatewayId' --output text)
echo "Deleting NAT Gateway..."
aws ec2 delete-nat-gateway --nat-gateway-id $NAT_ID
sleep 20
echo "Verifying NAT Gateway deletion..."
aws ec2 describe-nat-gateways --nat-gateway-ids $NAT_ID
# Detach and delete network interfaces
echo "Detaching and deleting network interfaces..."
aws ec2 describe-network-interfaces \
--filters "Name=vpc-id,Values=$VPC_ID" \
--query 'NetworkInterfaces[*].[NetworkInterfaceId,Attachment.AttachmentId]' \
--output text | while read -r interface_id attachment_id; do
if [ ! -z "$attachment_id" ]; then
aws ec2 detach-network-interface --attachment-id $attachment_id
fi
aws ec2 delete-network-interface --network-interface-id $interface_id
done
# Delete subnets
echo "Deleting subnets..."
aws ec2 describe-subnets --filters "Name=vpc-id,Values=$VPC_ID" --query 'Subnets[*].SubnetId' --output text | xargs -n 1 -I {} aws ec2 delete-subnet --subnet-id {}
sleep 20
echo "Verifying subnet deletion..."
aws ec2 describe-subnets --filters "Name=vpc-id,Values=$VPC_ID"
# Delete route tables
echo "Deleting route tables..."
aws ec2 describe-route-tables --filters "Name=vpc-id,Values=$VPC_ID" --query 'RouteTables[?Associations==`[]`].RouteTableId' --output text | xargs -n 1 -I {} aws ec2 delete-route-table --route-table-id {}
sleep 10
# Delete internet gateway
echo "Deleting internet gateway..."
IGW_ID=$(aws ec2 describe-internet-gateways --filters "Name=attachment.vpc-id,Values=$VPC_ID" --query 'InternetGateways[*].InternetGatewayId' --output text)
aws ec2 detach-internet-gateway --internet-gateway-id $IGW_ID --vpc-id $VPC_ID
aws ec2 delete-internet-gateway --internet-gateway-id $IGW_ID
# Delete the VPC
echo "Deleting VPC..."
aws ec2 delete-vpc --vpc-id $VPC_ID
sleep 20
echo "Verifying VPC deletion..."
aws ec2 describe-vpcs --vpc-ids $VPC_ID
echo "All resources have been deleted successfully."
Troubleshooting#
If you face a cluster creation failure at this stage, you need to inspect the logs. During creation, you can ssh into the head node and run ls /var/log to see the wide range of logs available. Or you can follow AWS’s documentation to see which logs they recommend; they have a comprehensive guide on troubleshooting.
Check the system log from the head node (can be done in the AWS console):
aws ec2 get-console-output --instance-id <instance-id> --region ap-southeast-1 --output text
You can also check relevant logs like:
cat /var/log/cloud-init.log
cat /var/log/cloud-init-output.log
cat /var/log/cfn-init.log
cat /var/log/chef-client.log
cat /var/log/slurmctld.log
cat /var/log/parallelcluster/clusterstatusmgtd
See a sample log that a user reported here.
failureCode is HeadNodeBootstrapFailure with failureReason Cluster creation timed out#
For example, I faced the failureCode HeadNodeBootstrapFailure with failureReason Cluster creation timed out, but AWS has a good guide to troubleshoot this here.
Sometimes it is simply because your AWS account does not have enough quota. For me, the concrete error for this can be found via sudo cat /var/log/parallelcluster/clustermgtd.events.
{"datetime": "2024-05-05T07:59:58.279+00:00", "version": 0, "scheduler": "slurm", "cluster-name": "distributed-training", "node-role": "HeadNode", "component": "clustermgtd", "level": "WARNING", "instance-id": "i-08d41f68bf12ca54b", "event-type": "node-launch-failure-count", "message": "Number of static nodes that failed to launch a backing instance after node maintenance", "detail": {"failure-type": "vcpu-limit-failures", "count": 1, "error-details": {"VcpuLimitExceeded": {"count": 1, "nodes": [{"name": "distributed-queue-st-g4dn12xlarge-2"}]}}}}
Some Useful Commands#
Slurm Commands#
sinfo # show information about nodes
squeue # show the queue
A useful command to check if the nodes are working is to have each return its hostname:
srun -N2 hostname # run a command on 2 nodes to see if they are working, returns the hostname
How Many GPUs?#
scontrol show nodes # show more detailed information about nodes
sinfo -N -o "%N %G" # show the number of GPUs
Find the Instance ID#
aws ec2 describe-instances --query "Reservations[*].Instances[*].{InstanceID: InstanceId, PublicDNS: PublicDnsName, State: State.Name, Tags: Tags}" --output table
If the table format gives an error like list index out of range, you can replace table with json and filter the output incrementally.
Stop EC2 Instances#
aws ec2 stop-instances --instance-ids i-1234567890abcdef0 i-abcdef1234567890
Singularity#
To learn how to set up Singularity in an HPC environment using SLURM for distributed training, refer to this guide.