SLURM#
Getting started#
There is plenty of SLURM documentation available online:
- Official SLURM documentation (en)
- SLURM Job Scheduler (for users) - LRI classes by Corentin Tallec & Diviyan Kalainathan (fr)
- SLURM Introduction at DKRZ
- Commandes SLURM at AMU
- Ateliers Sequenceur SLURM - exercises at LRI
- INRIA's Titanic cluster doc (en)
- …
Submit a job with sbatch#
Below is a very basic example that runs a 1-hour job on 2 GPUs. You can uncomment some of the lines to activate your conda environment.
cat ~/slurm_batch_example
#!/bin/bash
# Request a 1-hour time limit on 2 GPUs
#SBATCH --time=1:00:00
#SBATCH --gres=gpu:2
# Load your shell configuration
source ~/.bashrc
#cd /mnt/beegfs/home/YOUR_LOGIN/(...)
#conda activate YOUR_CONDA_ENV
python --version
#python YOUR_PYTHON_FILE.py
#conda deactivate
# Then print Hello and the node name
echo "Hello from $HOSTNAME"
sleep 10
To submit this script to SLURM, use
sbatch ~/slurm_batch_example
To check the status of your jobs in the queue, type
squeue
Results are written to slurm-JOB_ID.out.
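To list only your own jobs, you can filter squeue by user:
squeue -u $USER
You can also name the job and its output file with the standard sbatch directives below; my_job is a placeholder name, and %j is replaced by the job ID:
#SBATCH --job-name=my_job
#SBATCH --output=my_job-%j.out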
Interactive mode: log in on a (specific) node#
Use the 'sgpu' command to check whether a node is available (here n5), then type
srun --gres=gpu:1 --nodelist=n5 --time=1:00:00 --pty bash
Type 'exit' to log out.
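If any node will do, you can omit --nodelist and let SLURM pick an available one:
srun --gres=gpu:1 --time=1:00:00 --pty bash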
Quality of Service#
There are 5 different QoS on Lab-IA:
Name | Priority | Max jobs per user | Max GPUs per user | Max duration |
---|---|---|---|---|
default | 1000 | 6 | 6 | 24h |
preempt | 500 | 6 | 6 | 24h |
debug | 2000 | 1 | 2 | 30min |
nvlink | 1000 | 1 | 4 | 24h |
pcie | 1000 | 1 | 4 | 24h |
Default#
This QoS allows a user to run up to 6 jobs with up to 6 GPUs for up to 24 hours. Jobs running on this QoS are uninterruptible, meaning that the requested resources are assigned to the user for the duration of the job. If a job exceeds 24 hours, SLURM kills all of its processes to reclaim the resources. If a job ends earlier, the resources are freed.
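To submit a job under a specific QoS, pass --qos to sbatch (or add the equivalent #SBATCH --qos=... directive to your script), for example:
sbatch --qos=default ~/slurm_batch_example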
Preempt#
This QoS works the same way as default. The only difference is that jobs running on preempt are interruptible: a job submitted on default or testing may stop a job running on preempt. This QoS is intended for running extra jobs when Lab-IA is underused.
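Since a preempt job can be stopped at any time, it may be worth asking SLURM to requeue it automatically with the standard --requeue option (whether preempted jobs are actually requeued also depends on the cluster's preemption configuration):
sbatch --qos=preempt --requeue ~/slurm_batch_example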
Debug#
This QoS allows a user to run 1 job with up to 2 GPUs for up to 30 minutes. It is intended for testing purposes only. Please use this QoS to check that a job can run on a node before submitting it to other partitions.
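For a quick interactive test, you can combine the debug QoS with the testing partition (see the partition table below):
srun --partition=testing --qos=debug --gres=gpu:1 --time=30:00 --pty bash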
PCIE#
This QoS allows a user to run a single job with up to 4 GPUs on the pcie partition.
NVLink#
This QoS allows a user to run a single job with up to 4 GPUs on the nvlink partition.
Partitions#
There are 4 different partitions on Lab-IA:
Name | Nodes | Default | QoS |
---|---|---|---|
all | n[1-5,51-55,101-102] | Yes | default, preempt |
testing | n[1,51,101] | No | debug |
pcie | n[1-5,51-55] | No | pcie |
nvlink | n[101-102] | No | nvlink |
All#
This is the default partition. It allows any user to access every node.
Testing#
This is the testing partition. It allows any user to test their code on every type of node.
PCIE#
This is an exclusive partition. It gives a user access to all resources (CPU and memory) on a single node whose GPUs are connected with PCI Express. Use this partition when a job needs multiple GPUs. Since using this partition prevents any other user from accessing the node, please use it wisely.
NVLink#
This is an exclusive partition. It gives a user access to all resources (CPU and memory) on a single node whose GPUs are connected with NVLink. Use this partition when a job needs multiple GPUs. Since using this partition prevents any other user from accessing the node, please use it wisely.
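Putting it together, here is a minimal sketch of a batch script for a 4-GPU job on the nvlink partition; the conda environment and Python script names are placeholders:
#!/bin/bash
#SBATCH --partition=nvlink
#SBATCH --qos=nvlink
#SBATCH --gres=gpu:4
#SBATCH --time=24:00:00
source ~/.bashrc
conda activate YOUR_CONDA_ENV
python YOUR_MULTI_GPU_SCRIPT.py
conda deactivate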