SLURM#

Getting started#

There is plenty of Slurm documentation available online.

Submit a job with sbatch#

Below is a very basic example that runs a 1-hour job on 2 GPUs. You can uncomment some lines to activate your conda environment.

cat ~/slurm_batch_example

#!/bin/bash
#SBATCH --time=1:00:00
#SBATCH --gres=gpu:2

source ~/.bashrc
#cd /mnt/beegfs/home/YOUR_LOGIN/(...)
#conda activate YOUR_CONDA_ENV
python --version
#python YOUR_PYTHON_FILE.py
#conda deactivate
#Then print Hello and the nodename
echo "Hello from $HOSTNAME"
sleep 10

To submit this script to Slurm, use

sbatch slurm_batch_example
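On success, sbatch replies with the job ID, with output along the lines of

Submitted batch job 123456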

To check the status of your jobs in the queue, type

squeue

Results are written to slurm-JOB_ID.out in the directory from which the job was submitted.
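For example, to list only your own jobs and then follow the output of a running job (123456 stands for the job ID printed by sbatch):

squeue -u $USER
tail -f slurm-123456.out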

Interactive mode: log in to a (specific) node#

Use the 'sgpu' command to check whether a node is available (here n5), then type

srun --gres=gpu:1 --nodelist=n5 --time=1:00:00 --pty bash
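Once the interactive shell is open, you can check which node and GPU you were allocated, for example (assuming the NVIDIA tools are available on the node):

hostname
nvidia-smi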

Enter the 'exit' command to log out.


Quality of Service#

There are 5 different QoS on Lab-IA:

Name      Priority   Max jobs per user   Max GPU per user   Max duration
default   1000       6                   6                  24h
preempt   500        6                   6                  24h
debug     2000       1                   2                  30min
nvlink    1000       1                   4                  24h
pcie      1000       1                   4                  24h
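A QoS is requested with the --qos option, for example on the command line

sbatch --qos=debug slurm_batch_example

or as a directive inside the batch script

#SBATCH --qos=debug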

Default#

This QoS allows a user to run up to 6 jobs with up to 6 GPUs for up to 24 hours. Jobs running on this QoS are uninterruptible, meaning that the requested resources are assigned to the user for the duration of the job. If a job exceeds 24 hours, Slurm kills all of its processes to reclaim the resources. If a job ends earlier, the resources are freed.

Preempt#

This QoS works the same way as default. The only difference is that jobs running on preempt are interruptible: a job submitted on default or testing may stop a job running on preempt. This QoS is intended for running extra jobs when Lab-IA is underused.

Debug#

This QoS allows a user to run 1 job with up to 2 GPUs for up to 30 minutes. It is intended for testing purposes only. Please use this QoS if you need to check that a job can run on a node before running it on other partitions.
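For example, a short interactive test session on the debug QoS could look like this (a sketch only; the testing partition is described below):

srun --qos=debug --partition=testing --gres=gpu:1 --time=00:30:00 --pty bash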

PCIE#

This QoS allows a user to run a single job with up to 4 GPUs on the pcie partition.

NVLink#

This QoS allows a user to run a single job with up to 4 GPUs on the nvlink partition.


Partitions#

There are 4 different partitions on Lab-IA:

Name      Nodes                  Default   QoS
all       n[1-5,51-55,101-102]   Yes       default, preempt
testing   n[1,51,101]            No        debug
pcie      n[1-5,51-55]           No        pcie
nvlink    n[101-102]             No        nvlink
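The current state of the partitions and their nodes can be checked with

sinfo

and a partition is selected with the --partition option, for example with a line such as

#SBATCH --partition=testing

in a batch script.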

All#

This is the default partition. It allows any user to access all nodes.

Testing#

This is the testing partition. It allows any user to test their code on every type of node.

PCIE#

This is an exclusive partition. It allows a user to access all resources on a single node (CPU and memory) whose GPUs are connected with PCI Express. This partition must be used for multi-GPU jobs. Since using this partition prevents any other user from accessing the node, please use it wisely.

NVLink#

This is an exclusive partition. It allows a user to access all resources on a single node (CPU and memory) whose GPUs are connected with NVLink. This partition must be used for multi-GPU jobs. Since using this partition prevents any other user from accessing the node, please use it wisely.
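As a sketch, a multi-GPU batch job on one of the exclusive partitions could extend the example script from the top of this page as follows (the nvlink partition and QoS are used here; replace them with pcie if that is what you need, and YOUR_MULTI_GPU_SCRIPT.py is only a placeholder):

#!/bin/bash
#SBATCH --partition=nvlink
#SBATCH --qos=nvlink
#SBATCH --gres=gpu:4
#SBATCH --time=24:00:00

source ~/.bashrc
#conda activate YOUR_CONDA_ENV
#python YOUR_MULTI_GPU_SCRIPT.py
echo "Running on $HOSTNAME"
nvidia-smi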