SLURM#

Getting started#

Plenty of Slurm documentation is available online.


Quality of Service#

There are 5 different QoS on Lab-IA:

Name     Priority  Max jobs per user  Max GPUs per user  Max duration
default  1000      6                  6                  24h
preempt  500       6                  6                  24h
debug    2000      1                  2                  30min
nvlink   1000      1                  4                  24h
pcie     1000      1                  4                  24h
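
As a minimal sketch, the GPU limits in the table can be encoded in a small shell helper that checks a request before it is submitted. The function names `qos_max_gpus` and `check_gpu_request` are hypothetical and are not part of Slurm or Lab-IA; Slurm enforces the real limits itself at submission time.

```shell
#!/bin/bash
# Per-user GPU limit for each QoS, copied from the table above.
qos_max_gpus() {
  case "$1" in
    default|preempt) echo 6 ;;
    debug)           echo 2 ;;
    nvlink|pcie)     echo 4 ;;
    *) echo "unknown QoS: $1" >&2; return 1 ;;
  esac
}

# Succeeds only if the requested GPU count fits the QoS limit.
check_gpu_request() {
  qos="$1"; gpus="$2"
  max="$(qos_max_gpus "$qos")" || return 1
  [ "$gpus" -le "$max" ]
}

check_gpu_request debug 2 && echo "debug with 2 GPUs: ok"
check_gpu_request debug 3 || echo "debug with 3 GPUs: over the limit"
```

A check like this only catches obvious mistakes early; it does not account for GPUs already used by the user's other running jobs.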

Default#

This QoS allows a user to run up to 6 jobs with up to 6 GPUs for up to 24 hours. Jobs running on this QoS are uninterruptible, meaning that the requested resources stay assigned to the user for the duration of the job. If a job exceeds 24 hours, Slurm kills all of its processes to reclaim the resources. If a job ends earlier, the resources are freed.
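
A batch script for this QoS might look like the sketch below. The job name, the GPU count, and the `job_banner` helper are hypothetical, and requesting GPUs with `--gres=gpu:N` is an assumption about how Lab-IA exposes them. Outside of Slurm the `#SBATCH` lines are plain comments, so the script can also be run directly for a quick check.

```shell
#!/bin/bash
# Hypothetical job name, used in the log file name below.
#SBATCH --job-name=train-default
#SBATCH --partition=all
#SBATCH --qos=default
# Assumption: GPUs are requested as a generic resource.
#SBATCH --gres=gpu:2
# 24 hours is the QoS maximum; Slurm kills the job beyond its time limit.
#SBATCH --time=24:00:00
# %x is the job name, %j the job ID.
#SBATCH --output=%x-%j.out

# Slurm sets SLURM_JOB_ID and SLURMD_NODENAME inside a job;
# fall back to placeholders when the script runs elsewhere.
job_banner() {
  echo "job ${SLURM_JOB_ID:-none} on ${SLURMD_NODENAME:-$(uname -n)}"
}

job_banner
# Placeholder: the real workload (for example a training command) goes here.
```

The script would be submitted with `sbatch`, followed by the script's file name.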

Preempt#

This QoS works the same way as default, except that jobs running on preempt are interruptible. A job submitted with the default QoS or on the testing partition might stop a job running on preempt. This QoS is intended for running extra jobs when Lab-IA is underused.
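
Because a preempt job can be stopped at any time, it helps to save progress regularly. The sketch below assumes that the job receives SIGTERM shortly before it is stopped and that the cluster allows requeuing; both depend on the Slurm configuration, which this page does not describe. The checkpoint file name and the `save_checkpoint` helper are hypothetical.

```shell
#!/bin/bash
#SBATCH --partition=all
#SBATCH --qos=preempt
#SBATCH --gres=gpu:1
#SBATCH --time=12:00:00
# Make the job eligible for requeuing after it is interrupted.
#SBATCH --requeue

# Hypothetical checkpoint location; override by exporting CKPT_FILE.
CKPT_FILE="${CKPT_FILE:-checkpoint.txt}"
STEP=0

# Record the last step reached so that a requeued job can resume from it.
save_checkpoint() {
  echo "last_step=${STEP}" > "$CKPT_FILE"
}

# Assumption: SIGTERM is delivered before the job is stopped.
trap 'save_checkpoint; exit 143' TERM

# Placeholder workload. Each step runs in the background so that the
# shell can handle the signal while it waits.
for STEP in 1 2 3; do
  ( echo "working on step ${STEP}" ) &
  wait $!
done
save_checkpoint
```

A real job would read the checkpoint file at startup and skip the steps that are already done.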

Debug#

This QoS allows a user to run 1 job with up to 2 GPUs for up to 30 minutes. It is intended for testing purposes only. Please use this QoS to check that a job can run on a node before submitting it to other partitions.
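
A short interactive session is one way to run such a test. The helper below only prints a candidate `srun` command so that it can be reviewed before use; the function name `debug_session_cmd` is hypothetical, and the `--gres=gpu:N` syntax is an assumption about how GPUs are requested on Lab-IA.

```shell
#!/bin/bash
# Print an srun command for an interactive test session.
# $1 is the number of GPUs (the debug QoS allows at most 2).
debug_session_cmd() {
  gpus="${1:-1}"
  echo "srun --partition=testing --qos=debug --gres=gpu:${gpus} --time=00:30:00 --pty bash"
}

debug_session_cmd 1
```

Running the printed command on a login node should open a shell on one of the testing nodes for at most 30 minutes.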

PCIE#

This QoS allows a user to run a single job with up to 4 GPUs on the pcie partition.

NVLink#

This QoS allows a user to run a single job with up to 4 GPUs on the nvlink partition.


Partitions#

There are 4 different partitions on Lab-IA:

Name     Nodes                 Default  QoS
all      n[1-5,51-55,101-102]  Yes      default, preempt
testing  n[1,51,101]           No       debug
pcie     n[1-5,51-55]          No       pcie
nvlink   n[101-102]            No       nvlink
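
Each partition accepts only the QoS listed in its row. As a rough sketch, the pairing can be checked with a small lookup; the function names `partition_qos` and `valid_pair` are hypothetical and not part of Slurm.

```shell
#!/bin/bash
# QoS accepted by each partition, copied from the table above.
partition_qos() {
  case "$1" in
    all)     echo "default preempt" ;;
    testing) echo "debug" ;;
    pcie)    echo "pcie" ;;
    nvlink)  echo "nvlink" ;;
    *) echo "unknown partition: $1" >&2; return 1 ;;
  esac
}

# Succeeds only if the QoS ($2) is listed for the partition ($1).
valid_pair() {
  allowed="$(partition_qos "$1")" || return 1
  for q in $allowed; do
    [ "$q" = "$2" ] && return 0
  done
  return 1
}

valid_pair all preempt && echo "all + preempt: ok"
valid_pair testing default || echo "testing + default: rejected"
```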

All#

This is the default partition. It allows any user to access every node.

Testing#

This is the testing partition. It allows any user to test their code on every type of node.

PCIE#

This is an exclusive partition. It gives a user access to all the resources (CPU and memory) of a single node whose GPUs are connected with PCI Express. This partition must be used for multi-GPU jobs. Since a job on this partition prevents any other user from accessing the node, please use it wisely.

NVLink#

This is an exclusive partition. It gives a user access to all the resources (CPU and memory) of a single node whose GPUs are connected with NVLink. This partition must be used for multi-GPU jobs. Since a job on this partition prevents any other user from accessing the node, please use it wisely.
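
A multi-GPU batch script for one of the exclusive partitions might look like the sketch below, shown here for nvlink; for pcie, both the partition and the QoS would be `pcie`. The `visible_gpu_count` helper is hypothetical, and the script assumes that GPUs are requested with `--gres=gpu:N` and that Slurm exports `CUDA_VISIBLE_DEVICES` for the job, which is typical but depends on the cluster configuration.

```shell
#!/bin/bash
#SBATCH --partition=nvlink
#SBATCH --qos=nvlink
# 4 GPUs is the maximum for this QoS.
#SBATCH --gres=gpu:4
#SBATCH --time=24:00:00

# Count the GPUs exposed to the job through CUDA_VISIBLE_DEVICES,
# a comma-separated list of device indices.
visible_gpu_count() {
  if [ -z "${CUDA_VISIBLE_DEVICES:-}" ]; then
    echo 0
  else
    echo "$CUDA_VISIBLE_DEVICES" | tr ',' '\n' | wc -l | tr -d ' '
  fi
}

echo "GPUs visible to this job: $(visible_gpu_count)"
# Placeholder: the multi-GPU workload goes here.
```

Printing the GPU count at the start of the job makes it easy to spot a request that did not get the expected devices before the node is held for hours.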