SLURM#
Getting started#
There is plenty of SLURM documentation available online:
- Official SLURM documentation (en)
- SLURM Job Scheduler (for users) - LRI classes by Corentin Tallec & Diviyan Kalainathan (fr)
- SLURM Introduction at DKRZ
- Commandes SLURM at AMU
- Ateliers Sequenceur SLURM - exercises at LRI
- INRIA's Titanic cluster doc (en)
- …
Submit a job with sbatch#
Below is a very basic example that runs a 1-hour job on 2 GPUs. You can uncomment some of the lines to activate your conda environment.
cat ~/slurm_batch_example
#!/bin/bash
# Request a 1-hour time limit on 2 GPUs
#SBATCH --time=1:00:00
#SBATCH --gres=gpu:2
# Load your shell configuration
source ~/.bashrc
#cd /mnt/beegfs/home/YOUR_LOGIN/(...)
#conda activate YOUR_CONDA_ENV
python --version
#python YOUR_PYTHON_FILE.py
#conda deactivate
# Then print Hello and the node name
echo "Hello from $HOSTNAME"
sleep 10
To submit this script to SLURM, use
sbatch ~/slurm_batch_example
To check the status of your jobs in the queue, type
squeue
Results are written to slurm-JOB_ID.out.
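To list only your own jobs, you can filter squeue by user:
squeue -u $USER
You can also name the job and its output file with the standard sbatch directives below; my_job is a placeholder name, and %j is replaced by the job ID:
#SBATCH --job-name=my_job
#SBATCH --output=my_job-%j.out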
Interactive mode: log in on a (specific) node#
Use the 'sgpu' command to check whether a node is available (here n5), then type
srun --gres=gpu:1 --nodelist=n5 --time=1:00:00 --pty bash
Type 'exit' to log out.
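If any node will do, you can omit --nodelist and let SLURM pick an available one:
srun --gres=gpu:1 --time=1:00:00 --pty bash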
Quality of Service#
There are 5 different QoS on Lab-IA:
Name | Priority | Max jobs per user | Max GPUs per user | Max duration |
---|---|---|---|---|
default | 1000 | 6 | 6 | 24h |
preempt | 500 | 6 | 6 | 24h |
debug | 2000 | 1 | 2 | 30min |
nvlink | 1000 | 1 | 4 | 24h |
pcie | 1000 | 1 | 4 | 24h |
Default#
This QoS allows a user to run up to 6 jobs with up to 6 GPUs for up to 24 hours. Jobs running on this QoS are uninterruptible, meaning that the requested resources are assigned to the user for the duration of the job. If a job exceeds 24 hours, SLURM kills all of its processes to reclaim the resources. If a job ends earlier, the resources are freed.
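To submit a job under a specific QoS, pass --qos to sbatch (or add the equivalent #SBATCH --qos=... directive to your script), for example:
sbatch --qos=default ~/slurm_batch_example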
Preempt#
This QoS works the same way as default. The only difference is that jobs running on preempt are interruptible: a job submitted on default or testing may stop a job running on preempt. This QoS is intended for running extra jobs when Lab-IA is underused.
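Since a preempt job can be stopped at any time, it may be worth asking SLURM to requeue it automatically with the standard --requeue option (whether preempted jobs are actually requeued also depends on the cluster's preemption configuration):
sbatch --qos=preempt --requeue ~/slurm_batch_example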
Debug#
This QoS allows a user to run 1 job with up to 2 GPUs for up to 30 minutes. It is intended for testing purposes only. Please use this QoS to check that a job can run on a node before submitting it to other partitions.
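For a quick interactive test, you can combine the debug QoS with the testing partition (see the partition table below):
srun --partition=testing --qos=debug --gres=gpu:1 --time=30:00 --pty bash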
PCIE#
This QoS allows a user to run a single job with up to 4 GPUs on the pcie partition.
NVLink#
This QoS allows a user to run a single job with up to 4 GPUs on the nvlink partition.
Partitions#
There are 4 different partitions on Lab-IA:
Name | Nodes | Default | QoS |
---|---|---|---|
all | n[1-5,51-55,101-102] | Yes | default, preempt |
testing | n[1,51,101] | No | debug |
pcie | n[1-5,51-55] | No | pcie |
nvlink | n[101-102] | No | nvlink |
All#
This is the default partition. It allows any user to access every node.
Testing#
This is the testing partition. It allows any user to test their code on every type of node.
PCIE#
This is an exclusive partition. It gives a user access to all resources (CPU and memory) on a single node whose GPUs are connected with PCI Express. Use this partition when a job needs multiple GPUs. Since using this partition prevents any other user from accessing the node, please use it wisely.
NVLink#
This is an exclusive partition. It gives a user access to all resources (CPU and memory) on a single node whose GPUs are connected with NVLink. Use this partition when a job needs multiple GPUs. Since using this partition prevents any other user from accessing the node, please use it wisely.
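Putting it together, here is a minimal sketch of a batch script for a 4-GPU job on the nvlink partition; the conda environment and Python script names are placeholders:
#!/bin/bash
#SBATCH --partition=nvlink
#SBATCH --qos=nvlink
#SBATCH --gres=gpu:4
#SBATCH --time=24:00:00
source ~/.bashrc
conda activate YOUR_CONDA_ENV
python YOUR_MULTI_GPU_SCRIPT.py
conda deactivate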