Using the GPU nodes with Slurm
Several nodes of the mesocentre are equipped with NVIDIA GPU cards suitable for GPU computing.
To submit a job via SLURM to one of the machines equipped with a GPU card, you have to specify the name of a partition dedicated to GPU computing.
Here is the list of available partitions dedicated to GPU computing:
Partition name | Associated node(s) | Memory per CPU (MB)
---|---|---
kepler | gpu[004-010] | 13430
pascal | gpu[011-012] | 11512
volta | gpu[013-017] | 11500
You also have to specify the number of GPU cards you want to allocate for your job. This is done with an option such as --gres=gpu:2, which in this example allocates 2 GPU cards for the job.
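For example, in the header of a batch script this could look as follows (the partition kepler here is only an illustration; pick one from the table above):

```
#SBATCH -p kepler
#SBATCH --gres=gpu:2
```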
On a given GPU node, each GPU card has its own ID number, starting from 0 and going up to 3 (or 4 on a few nodes). By default, a typical program will only use the GPU accelerator with GPU_ID 0. The operating system does not restrict which GPU a user's job can access, so when several users run jobs on the same node, the card with GPU_ID 0 may end up 100% occupied and shared between jobs while the other GPUs on the node stay idle. It is therefore essential to select the correct GPU_ID in your program. To do so, use the environment variable CUDA_VISIBLE_DEVICES, which contains the list of GPU IDs attributed to your job.
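As a quick check, you can print this variable from inside a job to see which card(s) were attributed to it (the values in the comment are only examples):

```
echo $CUDA_VISIBLE_DEVICES    # e.g. "2" for one allocated card, "0,1" for two
```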
Here is an example for a Fortran code: replace cudaSetDevice(0) in the .cuf file by the following:

```
#ifdef GPUID
      cudaSetDevice(GPUID)
#else
      cudaSetDevice(0)
#endif
```
Here is an example of a SLURM batch script using part of the code above:
```
#!/bin/bash
#SBATCH -p kepler
#SBATCH --gres=gpu:2
module load PGI/14.9
pgf90 -Mpreprocess -DGPUID=$CUDA_VISIBLE_DEVICES -fast -o exec exec.cuf
./exec
```
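Once saved in a file, the script is submitted in the usual way (the file name job_gpu.slurm is just an example):

```
sbatch job_gpu.slurm
```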
Here is an example of an interactive job on a GPU-capable node:

```
srun -p volta --gres=gpu:1 --pty bash -i
```
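Once the interactive shell is open on the GPU node, you can check which card(s) were attributed, for example with:

```
echo $CUDA_VISIBLE_DEVICES    # id(s) of the GPU(s) attributed to this job
nvidia-smi                    # overview of the GPUs present on the node
```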