Introduction

Slurm Workload Manager

Slurm is the workload manager software on borg. It enables highly efficient use of the cluster by multiple individuals and organizations by dynamically scheduling jobs anywhere on the cluster where the required CPU, memory, and disk resources exist. This minimizes competition between jobs for resources and makes usage more efficient for everyone.

Slurm uses the idea of a job to run an application on the cluster. A job contains the details of the processing to carry out, including the name and version of the application, and the standard output and error of the process. Additionally, a job allows resource specification, such as the number of CPU cores required, memory requirements, or execution time needed. Submitting a job places it in the job scheduler's queue, and jobs then run non-interactively in batch fashion.

Jobs are managed by the job scheduler or workload manager, which is responsible for allocating the job onto a node (or nodes), running the job, and reporting back the standard output and error. Jobs are scheduled in a FIFO (First In, First Out) queue.

borg runs two queues for job submission: nodes and gpus. The nodes queue contains the twenty compute nodes, while the gpus queue contains the GPU node. For more information on the nodes, including resources available, run the following command:

sinfo -N -l

This guide covers the basics of Slurm job submission. More resources are available online (see the "More Resources" section).

Job Submission

Preparing the Job Submission Script

The easiest way of submitting a job through Slurm is to use a job submission script. This script allows you to specify the resources that are needed, along with any details about how the job should be executed. The job submission script is just a simple shell script whose #SBATCH comment lines set the options that control how the job is submitted.

test.script
#!/bin/bash
# # # # # # # # # # # # # # # #

# Basic job submission: 
# One compute node, default of 1 CPU core per node
#SBATCH -N 1
#
# Job name:
#SBATCH -J hostname
# 
# What program should I run:
srun /bin/hostname  

Run your sample script, by running the command:

$ sbatch test.script 
Submitted batch job 5024 

Slurm queues your job and gives you the job id number, which in this case is 5024. Notice that we added the 'srun' command in front of the program that we want to execute. This instructs the Slurm scheduler to use the resources that we have reserved to execute our program. While not technically necessary if we reserve one full host to run a job, it may be necessary for multi-host programs, so it is in your best interest to get used to adding it to your script.

Assuming the system is not full, this program should run in under five seconds. When the job completes, Slurm will create an output file with all the STDOUT created by your programs run in the script. You can view the text file using your favorite text editor, or in this case, we'll just 'cat' it:

$ cat slurm-5024.out
borg-node01.calvin.edu
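
When scripting around sbatch, it can help to capture the job id rather than reading it off the screen. The sketch below assumes the success message has the form shown above ("Submitted batch job <id>") and that the output file keeps its default slurm-<id>.out name. The helper names job_id_from_message and default_output_file are made up for this illustration and are not part of Slurm.

```shell
#!/bin/bash
# Hypothetical helpers, not part of Slurm.

# Print the job id from sbatch's success message,
# e.g. "Submitted batch job 5024" -> "5024".
job_id_from_message() {
    # The id is the last whitespace-separated word of the message.
    echo "$1" | awk '{ print $NF }'
}

# Print the default output file name for a job id: slurm-<id>.out
default_output_file() {
    echo "slurm-$1.out"
}
```

On the cluster this could be used as msg=$(sbatch test.script), followed by cat "$(default_output_file "$(job_id_from_message "$msg")")" once the job has finished.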

Variations on the Job Submission Script

Job submission scripts can take a wide variety of parameters. For full documentation, see https://slurm.schedmd.com/sbatch.html, or run the 'man sbatch' command.

The following code shows various options you can set; you do not need to specify all of them!


# Set number of nodes - max is 20!
#SBATCH -N 20
# Set number of cpus-per-task - max is 16!
#SBATCH -c 16
# Set working directory
#SBATCH -D /home/jcalvin/myProject
# Set job name
#SBATCH -J myComplexProject
# Set max execution time; accepted formats include minutes, minutes:seconds, and hours:minutes:seconds
#SBATCH -t 1:00
# Do not exit sbatch until the submitted job terminates
#SBATCH -W
# Set the node list manually
#SBATCH -w borg-node01,borg-node02,borg-node03 
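
The -t option accepts, among other forms, hours:minutes:seconds. As a small illustration, the hypothetical helper below (slurm_time is not a Slurm command) converts a number of seconds into that form, so a wrapper script can compute a time limit instead of hard-coding it.

```shell
#!/bin/bash
# Hypothetical helper: convert seconds to hours:minutes:seconds for -t.
slurm_time() {
    total="$1"
    hours=$((total / 3600))
    minutes=$(((total % 3600) / 60))
    seconds=$((total % 60))
    printf '%d:%02d:%02d\n' "$hours" "$minutes" "$seconds"
}

slurm_time 60      # prints 0:01:00, the same limit as "-t 1:00"
slurm_time 5400    # prints 1:30:00
```

The result can be passed on the command line, for example sbatch -t "$(slurm_time 5400)" test.script. Options given on the sbatch command line take precedence over #SBATCH lines in the script.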

Variables in sbatch Submission

Passing variables to your sbatch script makes it easy to schedule multiple iterations of a job. Since the sbatch script is really just a bash script, it can accept variables that are put into $1, $2, $3, etc., just as a bash script would.

The following script shows a way that you can pass in a variable:

testvars.script
#!/bin/bash
# # # # # # # # # # # # # # # #

# Passing variables to sbatch script
# Usage:  sbatch testvars.script $1 [$2 $3 $4 ...]
# One compute node, default of 1 CPU core per node
#SBATCH -N 1
#
# Job name:
#SBATCH -J testvars
# 
# What program should I run:
srun /usr/bin/echo $1 $2  

Run the testvars script, by running the command:

$ sbatch testvars.script hello world
Submitted batch job 5179 

Slurm queues your job, and gives you the job id number, which in this case is 5179. sbatch will automatically convert any "leftover" variables to the appropriate $1, $2, $3, $4 variables in our batch script, which we can then reference to call our own program with variables.

Examining our output via 'cat' gives us:

$ cat slurm-5179.out
hello world

There are other ways of submitting multiple random draws (e.g. Monte-Carlo simulations), using a parameter sweep. You can see an example of this here.
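
One simple form of parameter sweep is a loop that submits testvars.script once per value. This is only a sketch: the function name submit_sweep and the values are illustrative, and it assumes testvars.script is in the current directory.

```shell
#!/bin/bash
# Submit testvars.script once for each argument given.
# submit_sweep is a hypothetical name used for this sketch.
submit_sweep() {
    for value in "$@"; do
        # Each value arrives in the job script as $1; "sweep" arrives as $2.
        sbatch testvars.script "$value" sweep
    done
}
```

Calling submit_sweep 0.1 0.5 0.9 would queue three jobs, each printing its own value followed by the word sweep.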

Interactive Jobs

Sometimes you just want to reserve a node (or more) for interactive programming sessions. If you need an interactive Bash session on a compute node, run the command:

$ srun --pty bash

Doing this will give you a bash session on one of the compute nodes, defaulting to 1 CPU, default memory, and default job duration.

If you need to reserve multiple CPU cores, extra memory, extra nodes, or any other combination of resources, you can use the 'salloc' command. The 'salloc' command accepts many of the same parameters as the 'sbatch' command. For more information, run 'man salloc'.

Job Monitoring

Monitoring Job Submission and Execution

Two main programs exist for monitoring your jobs. The sstat command monitors the resources used by an active job, while the squeue command shows the jobs that are queued or running across the cluster. To view the resource utilization of a job, for example, job 5348, run the command:

$ sstat -j 5348
       JobID  MaxVMSize  MaxVMSizeNode  MaxVMSizeTask  AveVMSize     MaxRSS MaxRSSNode MaxRSSTask     AveRSS MaxPages MaxPagesNode   MaxPagesTask   AvePages     MinCPU MinCPUNode MinCPUTask     AveCPU   NTasks AveCPUFreq ReqCPUFreqMin ReqCPUFreqMax ReqCPUFreqGov ConsumedEnergy  MaxDiskRead MaxDiskReadNode MaxDiskReadTask  AveDiskRead MaxDiskWrite MaxDiskWriteNode MaxDiskWriteTask AveDiskWrite 
------------ ---------- -------------- -------------- ---------- ---------- ---------- ---------- ---------- -------- ------------ -------------- ---------- ---------- ---------- ---------- ---------- -------- ---------- ------------- ------------- ------------- -------------- ------------ --------------- --------------- ------------ ------------ ---------------- ---------------- ------------ 
5348.0         1488188K    borg-node20              0   1488188K      7512K borg-node+          0      7512K        0  borg-node20              0          0  52:39.000 borg-node+          0  52:39.000        1    469.65M       Unknown       Unknown       Unknown              0        0.71M     borg-node20               0        0.71M        0.00M      borg-node20                0        0.00M

You may want to limit the fields that are output; run "man sstat" for more information.
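
For instance, the --format option of sstat takes a comma-separated list of the column names shown above. The wrapper name job_usage below is hypothetical, and the choice of fields is only one reasonable selection.

```shell
#!/bin/bash
# Show a few columns of sstat output for one running job.
# job_usage is a hypothetical wrapper name, not a Slurm command.
job_usage() {
    sstat -j "$1" --format=JobID,MaxRSS,AveCPU,NTasks
}
```

Running job_usage 5348 would then print only those four columns for job 5348.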

The squeue command can show you the state of all your jobs. If you queue more jobs than there are resources to simultaneously run the jobs, then your jobs may be queued up to be run later.

$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              5467     nodes   test   jcalvin PD       0:00      1 (Priority)
              5468     nodes   test   jcalvin PD       0:00      1 (Priority)
              5469     nodes   test   jcalvin PD       0:00      1 (Priority)
              5471      gpus   test   jcalvin PD       0:00      1 (Resources)
              5472      gpus   test   jcalvin PD       0:00      1 (Priority)
              5473      gpus   test   jcalvin PD       0:00      1 (Priority)
              5474      gpus   test   jcalvin PD       0:00      1 (Priority)
              5475      gpus   test   jcalvin PD       0:00      1 (Priority)
              5476      gpus   test   jcalvin PD       0:00      1 (Priority)
              5477      gpus   test   jcalvin PD       0:00      1 (Priority)
              5329     nodes   test   jcalvin  R       6:19      1 borg-node01
              5330     nodes   test   jcalvin  R       6:19      1 borg-node02
              5331     nodes   test   jcalvin  R       6:19      1 borg-node03
              5332     nodes   test   jcalvin  R       6:19      1 borg-node04
              5333     nodes   test   jcalvin  R       6:19      1 borg-node05
              5334     nodes   test   jcalvin  R       6:19      1 borg-node06
              5335     nodes   test   jcalvin  R       6:19      1 borg-node07
              5336     nodes   test   jcalvin  R       6:19      1 borg-node08
              5337     nodes   test   jcalvin  R       6:19      1 borg-node09
              5338     nodes   test   jcalvin  R       6:19      1 borg-node10
              5339     nodes   test   jcalvin  R       6:19      1 borg-node11
              5340     nodes   test   jcalvin  R       6:19      1 borg-node12
              5341     nodes   test   jcalvin  R       6:19      1 borg-node13
              5342     nodes   test   jcalvin  R       6:19      1 borg-node14
              5343     nodes   test   jcalvin  R       6:19      1 borg-node15
              5344     nodes   test   jcalvin  R       6:19      1 borg-node16
              5345     nodes   test   jcalvin  R       6:19      1 borg-node17
              5346     nodes   test   jcalvin  R       6:19      1 borg-node18
              5347     nodes   test   jcalvin  R       6:19      1 borg-node19
              5348     nodes   test   jcalvin  R       6:19      1 borg-node20
              5470      gpus   test   jcalvin  R       5:54      1 borg-gpu

In this case, twenty active jobs (State=R) are running on the "nodes" partition, and one active job is running on the "gpus" partition. The "gpus" partition has seven pending jobs (State=PD), while the "nodes" partition has three pending jobs.
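
Counts like these can be produced from the squeue output rather than by reading the list. The sketch below assumes the default column layout shown above (partition in the second column, state in the fifth); count_by_state is a hypothetical helper name.

```shell
#!/bin/bash
# Read squeue output on standard input and print one line per
# partition/state pair with the number of jobs in it.
# The first line is skipped because it is the column header.
count_by_state() {
    awk 'NR > 1 { count[$2 " " $5]++ }
         END { for (key in count) print key, count[key] }' | sort
}
```

On the cluster this would be run as squeue | count_by_state. For the listing above it would print gpus PD 7, gpus R 1, nodes PD 3, and nodes R 20, one per line.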

More information on each of these tools can be found by performing the "man sstat" and "man squeue" commands.

Cancelling jobs

The scancel command can be used to stop any job you have created, whether it is currently running or pending. At minimum, you must provide either a job id or some other identifier to select the jobs to cancel.

$ scancel jobid

To cancel all jobs by your userid, for example, user jcalvin:

$ scancel -u jcalvin

To cancel all jobs in a given state, provide one of RUNNING, PENDING, or SUSPENDED:

$ scancel -t PENDING

Job Accounting

The sacct command can be used to view statistics for all jobs that you have submitted. For more information, run 'man sacct'.
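
sacct accepts a --format option listing the columns to print. A minimal sketch, assuming job accounting is enabled on the cluster; job_summary is a hypothetical wrapper name and the field list is just one possible selection.

```shell
#!/bin/bash
# Print a short accounting summary for one job id.
# job_summary is a hypothetical wrapper name, not a Slurm command.
job_summary() {
    sacct -j "$1" --format=JobID,JobName,State,Elapsed,MaxRSS,ExitCode
}
```

Running job_summary 5024 would show the name, final state, elapsed time, peak memory, and exit code recorded for job 5024.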

Example

Full Example

As an example, let's queue up a hundred instances of calculating Pi.

calcpi.script
#!/bin/bash

#SBATCH -N 1
#SBATCH -J calcpi

# Pipe the expression into bc, and run bc (the actual computation) under srun
echo "scale=10000; 4*a(1)" | srun bc -l

queue-calcpi.sh
#!/bin/bash
for i in {1..100}
do
   echo "Queueing job #$i"
   sbatch calcpi.script
done

Make sure that queue-calcpi.sh is executable (chmod 755 queue-calcpi.sh), and when ready, run:

$ ./queue-calcpi.sh
Queueing job #1
Submitted batch job 5579
Queueing job #2
...
Queueing job #99
Submitted batch job 5677
Queueing job #100
Submitted batch job 5678

Monitor the job progress with 'squeue', and after all the jobs have completed, you will find their output in the current working directory.
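
With a hundred jobs, it is easy to miss one that produced no output. The sketch below counts the non-empty output files in a directory, assuming the default slurm-<jobid>.out naming; count_outputs is a hypothetical helper name.

```shell
#!/bin/bash
# Count non-empty Slurm output files (slurm-<jobid>.out) in a directory.
# With no argument, the current directory is checked.
count_outputs() {
    dir="${1:-.}"
    count=0
    for f in "$dir"/slurm-*.out; do
        # -s is true only for existing files larger than zero bytes;
        # this also skips the unexpanded pattern when nothing matches.
        if [ -s "$f" ]; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}
```

After the run above, count_outputs would be expected to print 100 if every job wrote a result and no other Slurm output files are present in the directory.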

MPI/OpenMP

MPI

borg has several MPI libraries available for use with Slurm. The libraries available are:

$ module avail
impi                 mvapich2-2.2/gcc     openmpi-1.8/gcc
intel                mvapich2-2.2-psm/gcc openmpi-2.0/gcc
mpich/gcc            openmpi-1.6/gcc

This example will use the 'openmpi-2.0/gcc' module to compile a simple single-program, multiple-data (SPMD) "hello" program. Source can be found here: spmd.c. This example assumes you are logged into the head node of the cluster.

$ module load openmpi-2.0/gcc
$ mpicc spmd.c -o spmd

Create our sbatch script file as follows:

mpi-spmd.script
#!/bin/bash
# Example with 20 nodes 16 cores each = 320 processes
#SBATCH -A mpitest
#
# Number of nodes
#SBATCH -N 20
#
# Number of processes total across all nodes
#SBATCH -n 320

# Load the compiler and MPI library
module load openmpi-2.0/gcc
mpirun ./spmd

This script will request use of all 20 nodes, at 16 CPU cores (processes) per node (20*16 = 320). Run the command with 'sbatch':

$ sbatch mpi-spmd.script
Submitted batch job 5844
$ cat slurm-5844.out
Greetings from process 0 of 320 on borg-node01.calvin.edu
Greetings from process 234 of 320 on borg-node15.calvin.edu
Greetings from process 80 of 320 on borg-node06.calvin.edu
Greetings from process 28 of 320 on borg-node02.calvin.edu
Greetings from process 91 of 320 on borg-node06.calvin.edu
...
Greetings from process 29 of 320 on borg-node02.calvin.edu
Greetings from process 287 of 320 on borg-node18.calvin.edu
Greetings from process 31 of 320 on borg-node02.calvin.edu
Greetings from process 130 of 320 on borg-node09.calvin.edu
Greetings from process 142 of 320 on borg-node09.calvin.edu
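
The task count passed to -n is simply nodes times cores per node. If the node count changes often, the arithmetic can be done in the shell and passed on the sbatch command line, where it takes precedence over the #SBATCH lines in the script. total_tasks is a hypothetical helper name.

```shell
#!/bin/bash
# Print the total number of MPI tasks for a given node count ($1)
# and number of cores per node ($2).
total_tasks() {
    echo $(($1 * $2))
}

total_tasks 20 16   # prints 320
```

For example, sbatch -N 10 -n "$(total_tasks 10 16)" mpi-spmd.script would request 160 tasks on 10 nodes.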

OpenMP

OpenMP can be used for shared-memory parallelization. This example will compile a small OpenMP single program multiple data "hello" program. Source can be found here: spmd-openmp.c. This example assumes you are logged into the head node of the cluster.

$ gcc -fopenmp spmd-openmp.c -o spmd-openmp
$ ./spmd-openmp 
Program launch host: borg-head1.calvin.edu
Hello from thread 0 of 8
Hello from thread 2 of 8
Hello from thread 4 of 8
Hello from thread 7 of 8
Hello from thread 3 of 8
Hello from thread 5 of 8
Hello from thread 6 of 8
Hello from thread 1 of 8
There are 8 cores on borg-head1.calvin.edu

We can create an sbatch script that sets the number of threads dynamically, based on the number of processors Slurm has allocated.

openmp-spmd.script
#!/bin/bash
#Example with 1 node, 10 cores = 10 processes
#SBATCH -A openmptest
#
#Number of nodes
#SBATCH -N 1
#
#Number of processes total across all nodes
#SBATCH -n 10
./spmd-openmp $SLURM_CPUS_ON_NODE

This script will request use of one node, at 10 CPU cores (processes) per node. Run the command with 'sbatch':

$ sbatch openmp-spmd.script
Submitted batch job 5847
$ cat slurm-5847.out
Running spmd-openmp with 10 threads
Program launch host: borg-node01.calvin.edu
Hello from thread 4 of 10
Hello from thread 1 of 10
Hello from thread 2 of 10
Hello from thread 9 of 10
Hello from thread 0 of 10
Hello from thread 7 of 10
Hello from thread 6 of 10
Hello from thread 8 of 10
Hello from thread 5 of 10
Hello from thread 3 of 10
There are 10 cores on borg-node01.calvin.edu

Adjusting the total number of processes on the node will adjust the output from the program. At most, 16 threads can be used per node.
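
When the same script is tested outside of Slurm, SLURM_CPUS_ON_NODE is not set. A minimal sketch of a fallback, assuming one thread is an acceptable default; thread_count is a hypothetical helper name. OMP_NUM_THREADS is the standard OpenMP environment variable, exported here for programs that read it, although the example program above takes its thread count as an argument.

```shell
#!/bin/bash
# Print the thread count to use: the number of CPUs Slurm allocated on
# this node, or 1 when the variable is not set (e.g. outside a job).
thread_count() {
    echo "${SLURM_CPUS_ON_NODE:-1}"
}

# Standard OpenMP variable, for programs that read it.
export OMP_NUM_THREADS="$(thread_count)"
```

In the job script, the last line could then be written as ./spmd-openmp "$(thread_count)".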

GPU / Hi-mem Node

borg has one GPU node with four NVIDIA Titan-V GPUs, and 768GB of RAM. This node is excluded from the regular queue of nodes for compute. However, you can specifically request the node by making the following modification to your sbatch script.

Since we only have one GPU node, please request only the number of GPUs that you need, and make sure to release the GPUs when you are not using them.

Reserve one GPU, 4 CPUs, and 64GB of RAM

#!/bin/bash
#
# Run on the GPU node with one GPU, 4 CPUS, and 64GB of RAM
# Reserve one GPU
#SBATCH --gres=gpu:1
#
# Reserve four CPU cores
#SBATCH -c 4
#
# Reserve 64GB of RAM
#SBATCH --mem=64G
#
# Run in the GPUs queue
#SBATCH -p gpus

...

Interactive session on GPU node with two GPUs, 4 CPUs, and 64GB of RAM

srun --gres=gpu:2 -c 4 --mem=64G -p gpus --pty bash
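
Inside a GPU job it can be useful to confirm how many GPUs were actually granted. The sketch below assumes that Slurm's GPU support on this cluster exports CUDA_VISIBLE_DEVICES as a comma-separated list of device indices, which is common but depends on the site configuration; gpu_count is a hypothetical helper name.

```shell
#!/bin/bash
# Print the number of GPUs listed in CUDA_VISIBLE_DEVICES.
# Assumption: the variable holds comma-separated device indices,
# e.g. "0,1". Prints 0 when it is unset or empty.
gpu_count() {
    devices="${CUDA_VISIBLE_DEVICES:-}"
    if [ -z "$devices" ]; then
        echo 0
        return
    fi
    # The number of comma-separated fields is the number of devices.
    echo "$devices" | awk -F',' '{ print NF }'
}
```

If the assumption holds, running gpu_count in the interactive session above would print 2.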

Reserve the whole GPU/hi-mem node

#!/bin/bash
#
# Run on the GPU and hi-mem node
#SBATCH -p gpus
#
#Number of nodes
#SBATCH -N 1
#
#Job name
#SBATCH -J gpujob

Singularity

Singularity can easily be integrated into Slurm, especially if your Singularity container contains a default run action.

More information coming soon...