Using the job resource manager Slurm

Slurm is an open source cluster management and job scheduling system for Linux. It allocates access to resources and provides a framework for job management.

The Slurm system allows users to run applications in interactive or batch mode. Upon job submission, Slurm returns a job ID that allows users to monitor and interact with the job.

In interactive mode you get access to an interactive shell on the first node reserved for you by Slurm. In batch mode you provide a shell script that will be executed on your behalf on the reserved resources.

You can tune various parameters of your jobs using Slurm options, such as: job duration, number of nodes or cores, amount of allocated memory, job name, names of the files storing the job output, etc.

For more information about Slurm, don't hesitate to read the documentation.

The available partitions at the mesocentre:

Hardware resources are grouped into partitions. Each partition aggregates machines with common characteristics; the same resource can belong to several partitions.

Partition name | Node names        | Machine type    | Scratch filesystems
skylake        | skylake[001-158]  | PowerEdge C6420 | /scratch, /scratchfast
dev            | dev[01-02]        | PowerEdge C6420 | /scratch, /scratchfast
smp-opa        | smp005            | PowerEdge R940  | /scratch, /scratchfast
kepler         | gpu[004-010]      | PowerEdge C4130 | /scratch, /scratchfast
pascal         | gpu[011-012]      | PowerEdge C4130 | /scratch, /scratchfast
volta          | gpu[013-017]      | PowerEdge C4140 | /scratch, /scratchfast
visu           | visu001           | PowerEdge R740  | /scratch, /scratchfast

The main Slurm variables are:

  • $SLURM_JOBID: The ID of the job allocation.
  • $SLURM_SUBMIT_DIR: The directory from which sbatch was invoked.
  • $SLURM_NODELIST: List of nodes allocated to the job. It can be used, for example, to build a machinefile for mpirun (see the note at the end of this page) or to connect to a node allocated to your interactive job.
  • $SLURM_JOB_NAME: Name of the job.

The complete list of variables used by Slurm is available at this page.
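
As a minimal illustration, the short batch script below (using the skylake partition and the b001 project name that appear in the examples further down; adapt them to your own project) simply prints these variables into the job output:

#!/bin/sh
#SBATCH -J print_slurm_vars
#SBATCH -p skylake
#SBATCH -A b001
#SBATCH -n 1
#SBATCH -t 0:05:00
# print the main Slurm variables into the job output file
echo "Job ID:            $SLURM_JOBID"
echo "Job name:          $SLURM_JOB_NAME"
echo "Submit directory:  $SLURM_SUBMIT_DIR"
echo "Allocated nodes:   $SLURM_NODELIST"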

The basic Slurm commands:

  • sinfo: Shows all available partitions and their load.
  • srun: runs an interactive job. By default Slurm reserves one core for 30 minutes. Type exit to return to the submission shell. Ex: srun -p partition_name --time=2:30:0 -N 2 --ntasks-per-node=4 --pty bash -i
  • sbatch: submits a batch job to the Slurm scheduler. In the script submitted with this command you can define the environment needed for successful job execution.
  • squeue: shows the status of jobs. To monitor only your own jobs, use this command with the "-u" option. Ex: squeue -u login_name
  • scontrol: used to view or modify a running job.
  • sacct: shows the history of your jobs. (For more information on this command and the output format, please follow this link.)
  • scancel: cancels a pending or running job. Ex: scancel JOB_ID
  • sacctmgr: used to view Slurm accounting information.

The complete list of Slurm commands is available at this page.
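
As an illustration, a typical session could chain these commands as follows (my_job.slurm is a placeholder for your own batch script, and JOB_ID stands for the ID returned by sbatch):

sbatch my_job.slurm          # submit the batch script; Slurm prints the job ID
squeue -u $USER              # monitor your own jobs
scontrol show job JOB_ID     # detailed information about a specific job
scancel JOB_ID               # cancel the job if needed
sacct -j JOB_ID              # accounting history once the job has finished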

Main Slurm directives:

  • #SBATCH -J JOB_NAME: defines a name for the batch job.
  • #SBATCH -p PARTITION: defines the partition to use.
  • #SBATCH -N N: set number of nodes to allocate.
  • #SBATCH -n N: set number of cores to allocate.
  • #SBATCH --ntasks-per-node=N: set number of cores per node to allocate.
  • #SBATCH -t DD-HH:MM:SS: sets the walltime of the job. The maximum duration of a job is 7 days. If the walltime is not set, it defaults to 30 minutes. Accepted time formats are: "minutes", "minutes:seconds", "hours:minutes:seconds", "days-hours", "days-hours:minutes" or "days-hours:minutes:seconds".
  • #SBATCH -A PROJECT_NAME: set the name of the project to use.
  • #SBATCH -o OUTPUT_FILE: specifies the file containing the stdout.
  • #SBATCH -e ERROR_FILE: specifies the file containing the stderr.
  • #SBATCH --mail-type=BEGIN,END: specifies the events you want to be notified about.
  • #SBATCH --mail-user=your@mail.address: specifies the e-mail address for receiving notifications.
  • #SBATCH --requeue: automatically requeues your job if it was killed or failed due to a node-related problem.
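
Note that the same options can also be given directly on the sbatch command line, in which case they take precedence over the #SBATCH directives written in the script. For example (my_job.slurm being a placeholder for your own script):

sbatch -p skylake -N 1 -n 16 -t 1-00 -A b001 my_job.slurm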

Examples of Slurm scripts:

Example of a job allocating six cores for two hours on a machine with a westmere CPU:

#!/bin/sh
#SBATCH -J Job_westmere
#SBATCH -p westmere
#SBATCH -n 6
#SBATCH -A b001
#SBATCH -t 2:00:00
#SBATCH -o ./%j.%x.out
#SBATCH -e ./%j.%x.err
#SBATCH --mail-type=BEGIN,END
#SBATCH --mail-user=your@mail.address
# load module python 3.6.3
module purge
module load userspace/all
module load python3/3.6.3
# moving to the working directory
cd /scratchw/$SLURM_JOB_USER/
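# launch your application here, for example: python3 my_script.py (placeholder name)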

Example of a job allocating two nodes for two days and 12 hours on machines with skylake CPUs:

#!/bin/sh
#SBATCH -J Job_skylake
#SBATCH -p skylake
#SBATCH -N 2
#SBATCH -n 32
#SBATCH -A b001
#SBATCH -t 2-12
#SBATCH -o ./%j.out
#SBATCH -e ./%j.err
#SBATCH --mail-type=BEGIN,END
#SBATCH --mail-user=your@mail.address

# load modules
module purge
module load userspace/all
module load openmpi/2.1.2/2018
# moving to the working directory
cd /scratch/$SLURM_JOB_USER/
echo "Running on: $SLURM_NODELIST"
mpirun my_program

Example of a job using two kepler GPU cards and ten CPU cores for ten hours:

#!/bin/sh
#SBATCH -J Job_gpu
#SBATCH -p kepler
#SBATCH --gres=gpu:2
#SBATCH --gres-flags=enforce-binding # activates CPU:GPU affinity
#SBATCH -n 10
#SBATCH -A b001
#SBATCH -t 10:00:00
#SBATCH -o %j.out
#SBATCH -e %j.err
# load modules
module purge
module load userspace/all
module load cuda/9.1
# moving to the working directory
cd /scratch/$SLURM_JOB_USER/
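# launch your GPU application here, for example: ./my_cuda_program (placeholder name)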

Example of using /scratchfast on a skylake node:

#!/bin/sh
#SBATCH -J Job_scratchfast
#SBATCH -p skylake
#SBATCH -N 1
#SBATCH -L scratchfast:10 # 10 GB
#SBATCH -A b001
#SBATCH -t 10:00:00
#SBATCH -o %j.out
#SBATCH -e %j.err
# load modules
module purge
module load userspace/all
module load …
# moving to the working directory
cd /scratchfast/$SLURM_JOB_USER/$SLURM_JOB_ID/
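# launch your application here (placeholder; replace with your own commands)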

Example of submitting ten parametric jobs:

#!/bin/sh
#SBATCH -J Parametric_Jobs
#SBATCH -p skylake
#SBATCH -N 1
#SBATCH -A b001
#SBATCH --ntasks-per-node=1
#SBATCH -t 10:00:00
#SBATCH --array=0-9
#SBATCH -o %j.out
#SBATCH -e %j.err
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=your@mail.address
# moving to the working directory
cd /scratch/$SLURM_JOB_USER/
mpirun my_program $SLURM_ARRAY_TASK_ID

Example of submitting ten parametric jobs with a specific argument for each job, with at most five jobs running simultaneously.
Pay attention: the number of parametric arguments has to correspond to the number of jobs in the array.

#!/bin/sh
#SBATCH -J Parametric_Jobs
#SBATCH -A b001
#SBATCH -p skylake
#SBATCH -N 1
#SBATCH --ntasks-per-node=1
#SBATCH -t 10:00:00
#SBATCH --array=0-9%5
# moving to the working directory
cd /scratch/$SLURM_JOB_USER/
VALUES=(0 1 1 2 3 5 8 13 21 34)
mpirun my_program ${VALUES[$SLURM_ARRAY_TASK_ID]}

Accounting of consumed CPU time:

The mesocentre allocates a certain amount of CPU time per project. A user with the same login name can participate in several projects.

The rheticus_info command shows the list of projects a user belongs to, as well as the CPU time consumed and the CPU time limit for each of the user's projects.

rheticus_info 

Recent jobs:
[2018-09-12 16:25:28] 235851 'CDensL1.5' b032/skylake (23:42:54)
[2018-09-12 16:23:28] 235850 'CDensL2Fast' b032/skylake (23:44:54)
[2018-09-12 16:21:29] 235848 'CylDensL3' b073/skylake (23:46:53)
[2018-09-11 14:45:15] 235733 'SXRTA005' b073/skylake (2-01:23:07)
[2018-09-11 14:19:35] 235726 'S100RT15' b031/skylake (2-01:48:47)

Relevant projects:
b002: 30261 hours have been consumed (Used 24.2% of 125021 hours)
b031: 2016 hours have been consumed (Used 2.7% of 73335 hours)
b032: 944 hours have been consumed (Used 1.7% of 55723 hours)
b073: 8072 hours have been consumed (Used 13.4% of 60185 hours)
h111: 686 hours have been consumed (Used 3.4% of 20000 hours)

Transition guide from OAR to SLURM.

Command                        | OAR                   | SLURM
Submit a passive/batch job     | oarsub -S [script]    | sbatch [script]
Start an interactive job       | oarsub -I             | srun -p skylake --pty bash -i
Queue status                   | oarstat               | squeue
User job status                | oarstat -u [user]     | squeue -u [user]
Specific job status (detailed) | oarstat -f -j [jobid] | scontrol show job [jobid]
Delete (running/waiting) job   | oardel [jobid]        | scancel [jobid]
Hold job                       | oarhold [jobid]       | scontrol hold [jobid]
Resume held job                | oarresume [jobid]     | scontrol release [jobid]
Node list and properties       | oarnodes              | scontrol show nodes

Specification          | OAR                             | SLURM
Script directive       | #OAR                            | #SBATCH
Nodes request          | -l nodes=[count]                | -N [min[-max]]
Cores request          | -l core=[count]                 | -n [count]
Cores-per-node request | -l nodes=[ncount]/core=[ccount] | -N [ncount] --ntasks-per-node=[ccount] -c 1 OR -N [ncount] --ntasks-per-node=1 -c [ccount]
Walltime request       | -l [...],walltime=hh:mm:ss      | -t [min] OR -t [days-hh:mm:ss]
Job array              | --array [count]                 | --array [specification]
Job name               | -n [name]                       | -J [name]
Job dependency         | -a [jobid]                      | -d [specification]
Property request       | -p "[property]='[value]'"       | -C [specification]
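
As an illustration of the correspondence above, an OAR submission requesting two nodes with four cores each for ten hours translates roughly as follows (job.sh is a placeholder script name):

# OAR
oarsub -l nodes=2/core=4,walltime=10:00:00 -S ./job.sh
# SLURM
sbatch -N 2 --ntasks-per-node=4 -t 10:00:00 job.sh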

Environment variable            | OAR                    | SLURM
Job ID                          | $OAR_JOB_ID            | $SLURM_JOB_ID
Resource list                   | $OAR_NODEFILE          | $SLURM_NODELIST (a list, not a file; see the note below)
Job name                        | $OAR_JOB_NAME          | $SLURM_JOB_NAME
Submitting user name            | $OAR_USER              | $SLURM_JOB_USER
Task ID within job array        | $OAR_ARRAY_INDEX       | $SLURM_ARRAY_TASK_ID
Working directory at submission | $OAR_WORKING_DIRECTORY | $SLURM_SUBMIT_DIR

Note: you can easily create an OAR-style nodefile from a Slurm job with srun hostname | sort -n > hostfile.
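
A minimal sketch of how such a hostfile could then be used inside a batch script, assuming Open MPI's mpirun (which accepts a -machinefile option) and my_program as a placeholder application:

srun hostname | sort -n > hostfile
mpirun -machinefile hostfile my_program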

The transition guide is based on the excellent guide created by the University of Luxembourg HPC Team.


Last updated: 2 March 2023 - mesocentre-techn@univ-amu.fr
