Using the job resource manager Slurm

Slurm is an open source cluster management and job scheduling system for Linux. It allocates access to resources and provides a framework for job management.

The Slurm system allows users to run applications in interactive or batch mode. Upon job submission, Slurm returns a job ID that allows users to monitor and interact with the job.

In interactive mode you get access to an interactive shell on the first node reserved for you by Slurm. In batch mode you provide a shell script that will be executed on your behalf on the reserved resources.

You can tune various parameters of your jobs using Slurm options, such as: job duration, number of nodes or cores, amount of allocated memory, job name, names of the files storing the job output, etc.

For more information about Slurm, don't hesitate to read the documentation.

The available partitions at the mesocentre:

Hardware resources are grouped into partitions. Each partition aggregates machines with common characteristics; the same resource can belong to several partitions.

Partition name | Node names        | Machine type    | Scratch filesystems
skylake        | skylake[001-158]  | PowerEdge C6420 | /scratch, /scratchfast
dev            | dev[01-02]        | PowerEdge C6420 | /scratch, /scratchfast
smp-opa        | smp005            | PowerEdge R940  | /scratch, /scratchfast
kepler         | gpu[004-010]      | PowerEdge C4130 | /scratch, /scratchfast
pascal         | gpu[011-012]      | PowerEdge C4130 | /scratch, /scratchfast
volta          | gpu[013-017]      | PowerEdge C4140 | /scratch, /scratchfast
visu           | visu001           | PowerEdge R740  | /scratch, /scratchfast

The main Slurm variables are:

  • $SLURM_JOBID: The ID of the job allocation.
  • $SLURM_SUBMIT_DIR: The directory from which sbatch was invoked.
  • $SLURM_NODELIST: List of nodes allocated to the job. It can be used, for example, to build a machinefile for mpirun (see the note at the end of this page) or to connect to a node allocated to your interactive job.
  • $SLURM_JOB_NAME: Name of the job.

The complete list of variables used by Slurm is available at this page.
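
As a minimal illustration, the short batch script below (using the skylake partition and the b001 project name that appear in the examples further down; adapt them to your own project) simply prints these variables into the job output:

#!/bin/sh
#SBATCH -J print_slurm_vars
#SBATCH -p skylake
#SBATCH -A b001
#SBATCH -n 1
#SBATCH -t 0:05:00
# print the main Slurm variables into the job output file
echo "Job ID:            $SLURM_JOBID"
echo "Job name:          $SLURM_JOB_NAME"
echo "Submit directory:  $SLURM_SUBMIT_DIR"
echo "Allocated nodes:   $SLURM_NODELIST"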

The basic Slurm commands:

  • sinfo: Shows all available partitions and their load.
  • srun: runs an interactive job. By default Slurm reserves one core for 30 minutes. Type exit to return to the submission shell. Ex: srun -p partition_name --time=2:30:0 -N 2 --ntasks-per-node=4 --pty bash -i
  • sbatch: submits a batch job to the Slurm scheduler. In the script submitted with this command you can define the environment needed for successful job execution.
  • squeue: shows the status of jobs. To monitor only your own jobs, use this command with the "-u" option. Ex: squeue -u login_name
  • scontrol: used to view or modify a running job.
  • sacct: shows the history of your jobs. (For more information on this command and the output format, please follow this link.)
  • scancel: cancels a pending or running job. Ex: scancel JOB_ID
  • sacctmgr: used to view Slurm accounting information.

The complete list of Slurm commands is available at this page.
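
As an illustration, a typical session could chain these commands as follows (my_job.slurm is a placeholder for your own batch script, and JOB_ID stands for the ID returned by sbatch):

sbatch my_job.slurm          # submit the batch script; Slurm prints the job ID
squeue -u $USER              # monitor your own jobs
scontrol show job JOB_ID     # detailed information about a specific job
scancel JOB_ID               # cancel the job if needed
sacct -j JOB_ID              # accounting history once the job has finished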

Main Slurm directives:

  • #SBATCH -J JOB_NAME: defines a name for the batch job.
  • #SBATCH -p PARTITION: defines the partition to use.
  • #SBATCH -N N: set number of nodes to allocate.
  • #SBATCH -n N: set number of cores to allocate.
  • #SBATCH --ntasks-per-node=N: set number of cores per node to allocate.
  • #SBATCH -t DD-HH:MM:SS: sets the walltime of the job. The maximum duration of a job is 7 days. If the walltime is not set, it defaults to 30 minutes. Accepted time formats are: "minutes", "minutes:seconds", "hours:minutes:seconds", "days-hours", "days-hours:minutes" or "days-hours:minutes:seconds".
  • #SBATCH -A PROJECT_NAME: set the name of the project to use.
  • #SBATCH -o OUTPUT_FILE: specifies the file containing the stdout.
  • #SBATCH -e ERROR_FILE: specifies the file containing the stderr.
  • #SBATCH --mail-type=BEGIN,END: specifies the events you want to be notified about.
  • #SBATCH --mail-user=your@mail.address: specifies the e-mail address for receiving notifications.
  • #SBATCH --requeue: automatically requeues your job if it was killed or failed due to a node-related problem.
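
Note that the same options can also be given directly on the sbatch command line, in which case they take precedence over the #SBATCH directives written in the script. For example (my_job.slurm being a placeholder for your own script):

sbatch -p skylake -N 1 -n 16 -t 1-00 -A b001 my_job.slurm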

Examples of Slurm scripts:

Example of a job allocating six cores for two hours on a machine with a westmere CPU:

#!/bin/sh
#SBATCH -J Job_westmere
#SBATCH -p westmere
#SBATCH -n 6
#SBATCH -A b001
#SBATCH -t 2:00:00
#SBATCH -o ./%j.%x.out
#SBATCH -e ./%j.%x.err
#SBATCH --mail-type=BEGIN,END
#SBATCH --mail-user=your@mail.address
# load module python 3.6.3
module purge
module load userspace/all
module load python3/3.6.3
# moving to the working directory
cd /scratchw/$SLURM_JOB_USER/
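# launch your application here, for example: python3 my_script.py (placeholder name)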

Example of a job allocating two nodes for two days and 12 hours on machines with skylake CPUs:

#!/bin/sh
#SBATCH -J Job_skylake
#SBATCH -p skylake
#SBATCH -N 2
#SBATCH -n 32
#SBATCH -A b001
#SBATCH -t 2-12
#SBATCH -o ./%j.out
#SBATCH -e ./%j.err
#SBATCH --mail-type=BEGIN,END
#SBATCH --mail-user=your@mail.address

# load modules
module purge
module load userspace/all
module load openmpi/2.1.2/2018
# moving to the working directory
cd /scratch/$SLURM_JOB_USER/
echo "Running on: $SLURM_NODELIST"
mpirun my_program

Example of a job using two kepler GPU cards and ten CPU cores for ten hours:

#!/bin/sh
#SBATCH -J Job_gpu
#SBATCH -p kepler
#SBATCH --gres=gpu:2
#SBATCH --gres-flags=enforce-binding # activates CPU:GPU affinity
#SBATCH -n 10
#SBATCH -A b001
#SBATCH -t 10:00:00
#SBATCH -o %j.out
#SBATCH -e %j.err
# load modules
module purge
module load userspace/all
module load cuda/9.1
# moving to the working directory
cd /scratch/$SLURM_JOB_USER/
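# launch your GPU application here, for example: ./my_cuda_program (placeholder name)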

Example of using /scratchfast on a skylake node:

#!/bin/sh
#SBATCH -J Job_scratchfast
#SBATCH -p skylake
#SBATCH -N 1
#SBATCH -L scratchfast:10 # 10 GB
#SBATCH -A b001
#SBATCH -t 10:00:00
#SBATCH -o %j.out
#SBATCH -e %j.err
# load modules
module purge
module load userspace/all
module load …
# moving to the working directory
cd /scratchfast/$SLURM_JOB_USER/$SLURM_JOB_ID/
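# launch your application here (placeholder; replace with your own commands)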

Example of submitting ten parametric jobs:

#!/bin/sh
#SBATCH -J Parametric_Jobs
#SBATCH -p skylake
#SBATCH -N 1
#SBATCH -A b001
#SBATCH --ntasks-per-node=1
#SBATCH -t 10:00:00
#SBATCH --array=0-9
#SBATCH -o %j.out
#SBATCH -e %j.err
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=your@mail.address
# moving to the working directory
cd /scratch/$SLURM_JOB_USER/
mpirun my_program $SLURM_ARRAY_TASK_ID

Example of submitting ten parametric jobs with a specific argument for each job, with at most five jobs running simultaneously.
Pay attention: the number of parametric arguments has to correspond to the number of jobs in the array.

#!/bin/sh
#SBATCH -J Parametric_Jobs
#SBATCH -A b001
#SBATCH -p skylake
#SBATCH -N 1
#SBATCH --ntasks-per-node=1
#SBATCH -t 10:00:00
#SBATCH --array=0-9%5
# moving to the working directory
cd /scratch/$SLURM_JOB_USER/
VALUES=(0 1 1 2 3 5 8 13 21 34)
mpirun my_program ${VALUES[$SLURM_ARRAY_TASK_ID]}

Accounting of consumed CPU time:

The mesocentre allocates a certain amount of CPU time per project. A user with the same login name can participate in several projects.

The rheticus_info command shows the list of projects a user belongs to, as well as the CPU time consumed and the CPU time limit for each of the user's projects.

rheticus_info 

Recent jobs:
[2018-09-12 16:25:28] 235851 'CDensL1.5' b032/skylake (23:42:54)
[2018-09-12 16:23:28] 235850 'CDensL2Fast' b032/skylake (23:44:54)
[2018-09-12 16:21:29] 235848 'CylDensL3' b073/skylake (23:46:53)
[2018-09-11 14:45:15] 235733 'SXRTA005' b073/skylake (2-01:23:07)
[2018-09-11 14:19:35] 235726 'S100RT15' b031/skylake (2-01:48:47)

Relevant projects:
b002: 30261 hours have been consumed (Used 24.2% of 125021 hours)
b031: 2016 hours have been consumed (Used 2.7% of 73335 hours)
b032: 944 hours have been consumed (Used 1.7% of 55723 hours)
b073: 8072 hours have been consumed (Used 13.4% of 60185 hours)
h111: 686 hours have been consumed (Used 3.4% of 20000 hours)

Transition guide from OAR to SLURM.

Command                        | OAR                   | SLURM
Submit a passive/batch job     | oarsub -S [script]    | sbatch [script]
Start an interactive job       | oarsub -I             | srun -p skylake --pty bash -i
Queue status                   | oarstat               | squeue
User job status                | oarstat -u [user]     | squeue -u [user]
Specific job status (detailed) | oarstat -f -j [jobid] | scontrol show job [jobid]
Delete (running/waiting) job   | oardel [jobid]        | scancel [jobid]
Hold job                       | oarhold [jobid]       | scontrol hold [jobid]
Resume held job                | oarresume [jobid]     | scontrol release [jobid]
Node list and properties       | oarnodes              | scontrol show nodes

Specification          | OAR                             | SLURM
Script directive       | #OAR                            | #SBATCH
Nodes request          | -l nodes=[count]                | -N [min[-max]]
Cores request          | -l core=[count]                 | -n [count]
Cores-per-node request | -l nodes=[ncount]/core=[ccount] | -N [ncount] --ntasks-per-node=[ccount] -c 1 OR -N [ncount] --ntasks-per-node=1 -c [ccount]
Walltime request       | -l [...],walltime=hh:mm:ss      | -t [min] OR -t [days-hh:mm:ss]
Job array              | --array [count]                 | --array [specification]
Job name               | -n [name]                       | -J [name]
Job dependency         | -a [jobid]                      | -d [specification]
Property request       | -p "[property]='[value]'"       | -C [specification]
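
As an illustration of the correspondence above, an OAR submission requesting two nodes with four cores each for ten hours translates roughly as follows (job.sh is a placeholder script name):

# OAR
oarsub -l nodes=2/core=4,walltime=10:00:00 -S ./job.sh
# SLURM
sbatch -N 2 --ntasks-per-node=4 -t 10:00:00 job.sh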

Environment variable            | OAR                    | SLURM
Job ID                          | $OAR_JOB_ID            | $SLURM_JOB_ID
Resource list                   | $OAR_NODEFILE          | $SLURM_NODELIST (a list, not a file; see the note below)
Job name                        | $OAR_JOB_NAME          | $SLURM_JOB_NAME
Submitting user name            | $OAR_USER              | $SLURM_JOB_USER
Task ID within job array        | $OAR_ARRAY_INDEX       | $SLURM_ARRAY_TASK_ID
Working directory at submission | $OAR_WORKING_DIRECTORY | $SLURM_SUBMIT_DIR

Note: you can easily create an OAR-style nodefile from a Slurm job with srun hostname | sort -n > hostfile.
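
A minimal sketch of how such a hostfile could then be used inside a batch script, assuming Open MPI's mpirun (which accepts a -machinefile option) and my_program as a placeholder application:

srun hostname | sort -n > hostfile
mpirun -machinefile hostfile my_program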

The transition guide is based on the excellent guide created by the University of Luxembourg HPC Team.


Last updated: 2 March 2023 - mesocentre-techn@univ-amu.fr
