Introduction
Slurm Workload Manager
Slurm is the workload manager software that powers borg. Slurm enables highly efficient use of the cluster by multiple individuals and organizations: jobs are dynamically scheduled anywhere on the cluster where the required CPU, memory, and disk resources exist. This minimizes competition for resources and provides more efficient usage for everyone.
Slurm uses the concept of a job to run an application on the cluster. A job contains the details of the processing to carry out, including the name and version of the application, and the standard output and error of the process. Additionally, a job allows resource specification, such as the number of CPU cores required, memory requirements, or execution time needed. Submitting a job queues it into the job scheduler, and jobs then run non-interactively in a batch fashion.
Jobs are managed by the job scheduler or workload manager, which is responsible for allocating the job onto a node (or nodes), running the job, and reporting back the standard output and error. Jobs are scheduled in a FIFO (First In, First Out) queue.
borg runs two queues (partitions) for job submission: nodes and gpus. The nodes partition contains the twenty compute nodes, while the gpus partition contains the GPU node. For more information on the nodes, including the resources available, run the following command:
sinfo -N -l
This guide covers the basics of Slurm job submission. More resources are available online (see the "More Resources" section).
Job Submission
Preparing the Job Submission Script
The easiest way to submit a job through Slurm is to use a job submission script. This script allows you to specify the resources that are needed, along with any details about how the job should be executed. The job submission script is really just a simple shell script containing #SBATCH directives that control how the job is submitted.
#!/bin/bash
# # # # # # # # # # # # # # # #
# Basic job submission:
# One compute node, default of 1 CPU core per node
#SBATCH -N 1
#
# Job name:
#SBATCH -J hostname
#
# What program should I run:
srun /bin/hostname
Save your sample script as test.script, then submit it by running the command:
$ sbatch test.script
Submitted batch job 5024
Slurm queues your job and gives you the job id number, which in this case is 5024. Notice that we added the 'srun' command in front of the program that we want to execute. This instructs the Slurm scheduler to use the resources that we have reserved to execute our program. While not strictly necessary when we reserve one full host to run a job, it is necessary for multi-node programs, so it is in your best interest to get used to adding it to your script.
Assuming the system is not full, this program should run in under five seconds. When the job completes, Slurm will create an output file with all the STDOUT produced by your program's run in the script. You can view the text file using your favorite text editor, or, in this case, we'll just 'cat' it:
$ cat slurm-5024.out
borg-node01.calvin.edu
Variations on the Job Submission Script
Job submission scripts can take a wide variety of parameters. For full documentation, see https://slurm.schedmd.com/sbatch.html, or run the 'man sbatch' command.
The following code shows various options you can set; you do not need to specify all of them!
# Set number of nodes - max is 20!
#SBATCH -N 20
# Set number of cpus-per-task - max is 16!
#SBATCH -c 16
# Set memory reservation - max is 95G
#SBATCH --mem=16G
# Set working directory
#SBATCH -D /home/jcalvin/myProject
# Set job name
#SBATCH -J myComplexProject
# Set max execution time; accepted formats include minutes, minutes:seconds, and hours:minutes:seconds
#SBATCH -t 1:00
# Do not exit sbatch until the submitted job terminates
#SBATCH -W
# Set the node list manually
#SBATCH -w borg-node01,borg-node02,borg-node03
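As an illustration, here is a sketch of a submission script that combines several of the options above. The resource numbers are arbitrary, and the job name and working directory simply reuse the examples from the list, so adjust everything to your own work:
#!/bin/bash
# Hypothetical example combining several of the options above:
# two nodes, four CPU cores per task, 16GB of RAM, and a 30-minute limit
#SBATCH -N 2
#SBATCH -c 4
#SBATCH --mem=16G
#SBATCH -t 30:00
#SBATCH -J myComplexProject
#SBATCH -D /home/jcalvin/myProject
# Report which hosts the job landed on
srun /bin/hostname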
Variables in sbatch Submission
Passing variables to your sbatch script makes it easy to schedule multiple iterations of a job. Since the sbatch script is really just a bash script, it accepts arguments in $1, $2, $3, etc., just as any bash script would.
The following script shows a way that you can pass in a variable:
#!/bin/bash
# # # # # # # # # # # # # # # #
# Passing variables to sbatch script
# Usage: sbatch testvars.script $1 [$2 $3 $4 ...]
# One compute node, default of 1 CPU core per node
#SBATCH -N 1
#
# Job name:
#SBATCH -J testvars
#
# What program should I run:
srun /usr/bin/echo $1 $2
Submit the testvars script by running the command:
$ sbatch testvars.script hello world
Submitted batch job 5179
Slurm queues your job and gives you the job id number, which in this case is 5179. sbatch automatically passes any "leftover" command-line arguments into the $1, $2, $3, $4 variables in our batch script, which we can then reference to call our own program with those values.
Examining our output via 'cat' gives us:
$ cat slurm-5179.out
hello world
There are other ways of submitting multiple random draws (e.g. Monte-Carlo simulations), such as using a parameter sweep. You can see an example of this here.
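For instance, a minimal parameter-sweep sketch can reuse the variable-passing approach shown above: a plain bash loop that submits one job per parameter value. The parameter name and values below are made up purely for illustration; Slurm job arrays (see the --array option in 'man sbatch') are another way to accomplish the same thing.
#!/bin/bash
# Hypothetical parameter sweep: submit one job per parameter value.
# The parameter name and values are illustrative; substitute your own.
for temperature in 100 200 300 400 500
do
    echo "Queueing run with temperature=$temperature"
    sbatch testvars.script simulation $temperature
done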
Interactive Jobs
Sometimes you just want to reserve a node (or more) for interactive programming sessions. If you need an interactive Bash session on a compute node, run the command:
$ srun --pty bash
Doing this will give you a bash session on one of the compute nodes, defaulting to 1 CPU core, the default memory allocation, and the default job duration; the session opens as soon as the resources are allocated.
If you need to reserve multiple CPU cores, extra memory, extra nodes, or any other combination of resources, you can use the 'salloc' command. The 'salloc' command accepts many of the same parameters as the 'sbatch' command. For more information, run 'man salloc'.
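For example, a session reserving eight CPU cores and 32GB of RAM on one node might look like the following sketch (the resource numbers are arbitrary). salloc holds the allocation and drops you into a shell; 'exit' releases the allocation when you are done:
$ salloc -N 1 -c 8 --mem=32G
$ srun hostname
$ exit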
Job Monitoring
Monitoring Job Submission and Execution
Two main programs exist for monitoring your jobs. The sstat command reports the resource usage of an active job, while the squeue command shows the state of jobs queued across the cluster. To view the resource utilization of a job, for example job 5348, run the command:
$ sstat -j 5348
JobID MaxVMSize MaxVMSizeNode MaxVMSizeTask AveVMSize MaxRSS MaxRSSNode MaxRSSTask AveRSS MaxPages MaxPagesNode MaxPagesTask AvePages MinCPU MinCPUNode MinCPUTask AveCPU NTasks AveCPUFreq ReqCPUFreqMin ReqCPUFreqMax ReqCPUFreqGov ConsumedEnergy MaxDiskRead MaxDiskReadNode MaxDiskReadTask AveDiskRead MaxDiskWrite MaxDiskWriteNode MaxDiskWriteTask AveDiskWrite
------------ ---------- -------------- -------------- ---------- ---------- ---------- ---------- ---------- -------- ------------ -------------- ---------- ---------- ---------- ---------- ---------- -------- ---------- ------------- ------------- ------------- -------------- ------------ --------------- --------------- ------------ ------------ ---------------- ---------------- ------------
5348.0 1488188K borg-node20 0 1488188K 7512K borg-node+ 0 7512K 0 borg-node20 0 0 52:39.000 borg-node+ 0 52:39.000 1 469.65M Unknown Unknown Unknown 0 0.71M borg-node20 0 0.71M 0.00M borg-node20 0 0.00M
You may want to limit the fields that are output; run "man sstat" for more information.
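For example, a shorter report can be requested with sstat's --format option, selecting only the fields of interest (these field names appear in the full output above):
$ sstat -j 5348 --format=JobID,MaxRSS,AveCPU,MaxDiskRead,NTasks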
The squeue command shows you the state of all your jobs. If you queue more jobs than there are resources to run simultaneously, some of your jobs will wait in the queue to run later.
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
5467 nodes test jcalvin PD 0:00 1 (Priority)
5468 nodes test jcalvin PD 0:00 1 (Priority)
5469 nodes test jcalvin PD 0:00 1 (Priority)
5471 gpus test jcalvin PD 0:00 1 (Resources)
5472 gpus test jcalvin PD 0:00 1 (Priority)
5473 gpus test jcalvin PD 0:00 1 (Priority)
5474 gpus test jcalvin PD 0:00 1 (Priority)
5475 gpus test jcalvin PD 0:00 1 (Priority)
5476 gpus test jcalvin PD 0:00 1 (Priority)
5477 gpus test jcalvin PD 0:00 1 (Priority)
5329 nodes test jcalvin R 6:19 1 borg-node01
5330 nodes test jcalvin R 6:19 1 borg-node02
5331 nodes test jcalvin R 6:19 1 borg-node03
5332 nodes test jcalvin R 6:19 1 borg-node04
5333 nodes test jcalvin R 6:19 1 borg-node05
5334 nodes test jcalvin R 6:19 1 borg-node06
5335 nodes test jcalvin R 6:19 1 borg-node07
5336 nodes test jcalvin R 6:19 1 borg-node08
5337 nodes test jcalvin R 6:19 1 borg-node09
5338 nodes test jcalvin R 6:19 1 borg-node10
5339 nodes test jcalvin R 6:19 1 borg-node11
5340 nodes test jcalvin R 6:19 1 borg-node12
5341 nodes test jcalvin R 6:19 1 borg-node13
5342 nodes test jcalvin R 6:19 1 borg-node14
5343 nodes test jcalvin R 6:19 1 borg-node15
5344 nodes test jcalvin R 6:19 1 borg-node16
5345 nodes test jcalvin R 6:19 1 borg-node17
5346 nodes test jcalvin R 6:19 1 borg-node18
5347 nodes test jcalvin R 6:19 1 borg-node19
5348 nodes test jcalvin R 6:19 1 borg-node20
5470 gpus test jcalvin R 5:54 1 borg-gpu
In this case, twenty active jobs (State=R) are running on the "nodes" partition, and one active job is running on the "gpus" partition. The "gpus" partition has seven pending jobs (State=PD), while the "nodes" partition has three pending jobs.
More information on each of these tools can be found by running the "man sstat" and "man squeue" commands.
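squeue also accepts filters, which can be handy on a busy cluster. For example, to show only your own jobs, only pending jobs, or only jobs in the gpus partition:
$ squeue -u jcalvin
$ squeue -t PENDING
$ squeue -p gpus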
Cancelling jobs
The scancel command can be used to stop any job you have created, whether it is currently running or pending. At minimum, you must provide either a job id or some other identifier for the jobs to cancel.
$ scancel jobid
To cancel all jobs by your userid, for example, user jcalvin:
$ scancel -u jcalvin
To cancel all of your jobs in a given state, provide one of RUNNING, PENDING, or SUSPENDED:
$ scancel -t PENDING
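These filters can also be combined; for example, to cancel only your pending jobs in the gpus partition:
$ scancel -u jcalvin -t PENDING -p gpus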
Job Accounting
The sacct command can be used to view statistics for all jobs that you have submitted. For more information, run 'man sacct'.
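For example, a compact accounting report for a single finished job, or for all of your jobs since a given date, can be requested like this (the job id, username, and date are placeholders):
$ sacct -j 5348 --format=JobID,JobName,Partition,Elapsed,MaxRSS,State
$ sacct -u jcalvin -S 2024-01-01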
Example
Full Example
As an example, let's queue up one hundred instances of calculating Pi. First, create the job script itself, calcpi.script:
#!/bin/bash
#SBATCH -N 1
#SBATCH -J calcpi
srun echo "scale=10000; 4*a(1)" | bc -l
Next, create a small script, queue-calcpi.sh, that submits 100 copies of the job:
#!/bin/bash
for i in {1..100}
do
    echo "Queueing job #$i"
    sbatch calcpi.script
done
Make sure that queue-calcpi.sh is executable (chmod 755 queue-calcpi.sh), and when ready, run:
$ ./queue-calcpi.sh
Queueing job #1
Submitted batch job 5579
Queueing job #2
...
Queueing job #99
Submitted batch job 5677
Queueing job #100
Submitted batch job 5678
Monitor the job progress with 'squeue', and after all the jobs have completed, you will find their output in the current working directory.
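Since each job writes its own slurm-<jobid>.out file, a quick way to gather the hundred results into a single file is something like the following; note that the glob picks up every slurm-*.out file in the directory, so run it somewhere dedicated to this example:
$ cat slurm-*.out > pi-results.txt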
MPI/OpenMP
MPI
borg has several MPI libraries available for use with Slurm. The libraries available are:
$ module avail
openmpi-3.1.6 openmpi-4.1.6 openmpi-5.0.2
This example will use the 'openmpi-3.1.6' module to compile a simple single program multiple data "hello" program. Source can be found here: spmd.c. This example assumes you are logged into the head node of the cluster.
$ module load openmpi-3.1.6
$ mpicc spmd.c -o spmd
Create the sbatch script file, mpi-spmd.script, as follows:
#!/bin/bash
# Example with 20 nodes, 16 cores each = 320 processes
#SBATCH -A mpitest
#
# Number of nodes
#SBATCH -N 20
#
# Number of processes total across all nodes
#SBATCH -n 320
# Load the compiler and MPI library
module load openmpi-3.1.6
mpirun ./spmd
This script will request all 20 nodes, with 16 CPU cores (processes) per node (20*16 = 320). Submit the script with 'sbatch':
$ sbatch mpi-spmd.script
Submitted batch job 5844
$ cat slurm-5844.out
Greetings from process 0 of 320 on borg-node01.calvin.edu
Greetings from process 234 of 320 on borg-node15.calvin.edu
Greetings from process 80 of 320 on borg-node06.calvin.edu
Greetings from process 28 of 320 on borg-node02.calvin.edu
Greetings from process 91 of 320 on borg-node06.calvin.edu
...
Greetings from process 29 of 320 on borg-node02.calvin.edu
Greetings from process 287 of 320 on borg-node18.calvin.edu
Greetings from process 31 of 320 on borg-node02.calvin.edu
Greetings from process 130 of 320 on borg-node09.calvin.edu
Greetings from process 142 of 320 on borg-node09.calvin.edu
OpenMP
OpenMP can be used for shared-memory parallelization. This example will compile a small OpenMP single program multiple data "hello" program. Source can be found here: spmd-openmp.c. This example assumes you are logged into the head node of the cluster.
$ gcc -fopenmp spmd-openmp.c -o spmd-openmp
$ ./spmd-openmp
Program launch host: borg-head1.calvin.edu
Hello from thread 0 of 8
Hello from thread 2 of 8
Hello from thread 4 of 8
Hello from thread 7 of 8
Hello from thread 3 of 8
Hello from thread 5 of 8
Hello from thread 6 of 8
Hello from thread 1 of 8
There are 8 cores on borg-head1.calvin.edu
We can create an sbatch script, openmp-spmd.script, that dynamically sets the number of threads based on the number of processors Slurm allocates.
#!/bin/bash
# Example with 1 node, 10 cores = 10 processes
#SBATCH -A openmptest
#
# Number of nodes
#SBATCH -N 1
#
# Number of processes total across all nodes
#SBATCH -n 10
./spmd-openmp $SLURM_CPUS_ON_NODE
This script will request one node, with 10 CPU cores (processes) on that node. Submit the script with 'sbatch':
$ sbatch openmp-spmd.script
Submitted batch job 5847
$ cat slurm-5847.out
Running spmd-openmp with 10 threads
Program launch host: borg-node01.calvin.edu
Hello from thread 4 of 10
Hello from thread 1 of 10
Hello from thread 2 of 10
Hello from thread 9 of 10
Hello from thread 0 of 10
Hello from thread 7 of 10
Hello from thread 6 of 10
Hello from thread 8 of 10
Hello from thread 5 of 10
Hello from thread 3 of 10
There are 10 cores on borg-node01.calvin.edu
Adjusting the total number of processes requested on the node will adjust the output from the program. At most, 16 threads can be used per node.
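For example, a variant of the script above that requests only 8 processes should report 8 threads instead of 10; this sketch simply drops the job-name line and changes -n:
#!/bin/bash
# Same idea as the script above, but requesting 8 processes on the node;
# $SLURM_CPUS_ON_NODE will then be 8
#SBATCH -N 1
#SBATCH -n 8
./spmd-openmp $SLURM_CPUS_ON_NODE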
GPU / Hi-mem Node
borg has one GPU node with four NVIDIA Titan V GPUs and 768GB of RAM. This node is excluded from the regular nodes compute queue. However, you can specifically request it by making the following modifications to your sbatch script.
Since there is only one GPU node, please request only the number of GPUs that you need, and make sure to release them when you are not using them.
Reserve one GPU, 4 CPUs, and 64GB of RAM
#!/bin/bash
#
# Run on the GPU node with one GPU, 4 CPUs, and 64GB of RAM
#
# Reserve one GPU
#SBATCH --gres=gpu:1
#
# Reserve four CPU cores
#SBATCH -c 4
#
# Reserve 64GB of RAM
#SBATCH --mem=64G
#
# Run in the gpus queue
#SBATCH -p gpus
...
Interactive session on GPU node with two GPUs, 4 CPUs, and 64GB of RAM
srun --gres=gpu:2 -c 4 --mem=64G -p gpus --pty bash
Reserve the whole GPU/hi-mem node
#!/bin/bash
#
# Run on the GPU and hi-mem node
#SBATCH -p gpus
#
# Number of nodes
#SBATCH -N 1
#
# Reserve all four GPUs and the node's full resources
#SBATCH --gres=gpu:4
#SBATCH --exclusive
#
# Job name
#SBATCH -J gpujob
Tensorflow Virtual Environment
To provide the greatest flexibility for users, no system-wide Tensorflow is installed. This allows you to use whatever Tensorflow version you may need. You can use these directions to create a Python3 virtual environment with the latest Tensorflow v2.x Python modules.
- Create the initial Python3 virtual environment named 'tensorflowvenv':
python3 -m venv --system-site-packages tensorflowvenv
- Enter into our 'tensorflowvenv' virtual environment:
source tensorflowvenv/bin/activate
You should end up with a prompt that looks like:
(tensorflowvenv) [username@borg-head1 ~]$
- Upgrade pip:
pip install -U pip
- Install tensorflow via pip:
pip install tensorflow
Note: this will take a long time.
- Exit our virtual environment:
deactivate
Once we have created our virtual environment, we can obtain an interactive GPU session to test it out.
- Create an interactive session to borg-gpu (1 GPU, 8 CPU cores, 64GB RAM):
srun --gres=gpu:1 -c 8 --mem=64G -p gpus --pty bash
- Re-enter our tensorflowvenv virtual environment:
source tensorflowvenv/bin/activate
- Verify Tensorflow GPU functionality:
python
Paste in the following Python code:
from __future__ import absolute_import, division, print_function, unicode_literals
import tensorflow as tf
print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))
tf.config.experimental.list_physical_devices('GPU')
quit()
You should get output that looks like:
>>> from __future__ import absolute_import, division, print_function, unicode_literals
>>>
>>> import tensorflow as tf
2020-02-12 08:08:57.392347: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer.so.6
2020-02-12 08:08:57.394283: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer_plugin.so.6
>>> print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))
2020-02-12 08:08:59.611851: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-02-12 08:08:59.666907: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:1a:00.0 name: TITAN V computeCapability: 7.0
coreClock: 1.455GHz coreCount: 80 deviceMemorySize: 11.78GiB deviceMemoryBandwidth: 607.97GiB/s
2020-02-12 08:08:59.668135: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-02-12 08:08:59.668191: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-02-12 08:08:59.671361: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-02-12 08:08:59.672786: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-02-12 08:08:59.676575: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-02-12 08:08:59.678897: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-02-12 08:08:59.678944: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-02-12 08:08:59.680734: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
Num GPUs Available: 1
>>> tf.config.experimental.list_physical_devices('GPU')
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
>>> quit()
- To exit the virtual environment session, type:
deactivate
- To exit the interactive session, type:
exit
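Once the virtual environment works interactively, the same setup can be used from a batch job. The following is only a sketch: the training script name (train.py) is a placeholder, the resource numbers are arbitrary, and it assumes the tensorflowvenv directory sits in your home directory as created above.
#!/bin/bash
# Hypothetical batch job that uses the tensorflowvenv created above
#SBATCH -p gpus
#SBATCH --gres=gpu:1
#SBATCH -c 8
#SBATCH --mem=64G
#SBATCH -J tfbatch
# Activate the virtual environment, run the (placeholder) training script, then clean up
source $HOME/tensorflowvenv/bin/activate
srun python train.py
deactivate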
Singularity
Singularity can easily be integrated into Slurm, especially if your Singularity container contains a default run action.
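For example, a batch job that runs a container's default action might look like the sketch below. The image name mycontainer.sif is a placeholder for whatever container you have built or pulled, and the --nv flag (which exposes the host's NVIDIA GPUs to the container) is only needed for GPU work on the gpus partition.
#!/bin/bash
# Hypothetical batch job that runs a Singularity container's default run action
#SBATCH -p gpus
#SBATCH --gres=gpu:1
#SBATCH -c 4
#SBATCH --mem=32G
#SBATCH -J singularityjob
# 'singularity run' invokes the container's default run action;
# --nv makes the host's NVIDIA GPUs visible inside the container
srun singularity run --nv mycontainer.sif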
Tensorflow GPU interactive demo
This demo will use Singularity and the borg-gpu node with one of the four Titan Vs. We will pull down the NVIDIA NGC Tensorflow GPU container image and execute it with Singularity. (NVidia NGC Container Catalog) In this example, we will use the Tensorflow v2.x Python3 container (24.03-tf2-py3). Check the catalog link for other container images. More information can be found at 'How to Run NGC Deep Learning Containers with Singularity'.
- Create an interactive session to borg-gpu (1 GPU, 8 CPU cores, 64GB RAM):
srun --gres=gpu:1 -c 8 --mem=64G -p gpus --pty bash
- Use Singularity to pull the NVIDIA NGC Tensorflow v2.x GPU container image for Python3:
singularity pull docker://nvcr.io/nvidia/tensorflow:24.03-tf2-py3
(Note: This will take a bit of time.)
- Use Singularity to execute the container, running Python:
singularity exec --nv docker://nvcr.io/nvidia/tensorflow:24.03-tf2-py3 python
- Verify Tensorflow GPU functionality:
Paste in the following Python code:
from __future__ import absolute_import, division, print_function, unicode_literals
import tensorflow as tf
print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))
tf.config.experimental.list_physical_devices('GPU')
quit()
You should get output that looks like:
>>> from __future__ import absolute_import, division, print_function, unicode_literals
>>>
>>> import tensorflow as tf
2020-02-12 07:46:40.188161: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.2
>>> print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))>
2020-02-12 07:46:42.254167: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-02-12 07:46:42.319217: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: TITAN V major: 7 minor: 0 memoryClockRate(GHz): 1.455
pciBusID: 0000:1a:00.0
2020-02-12 07:46:42.319307: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.2
2020-02-12 07:46:42.404847: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-02-12 07:46:42.438590: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-02-12 07:46:42.450408: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-02-12 07:46:42.513820: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-02-12 07:46:42.524081: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-02-12 07:46:42.604837: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-02-12 07:46:42.606833: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
Num GPUs Available: 1
>>> tf.config.experimental.list_physical_devices('GPU')
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
>>> quit()
- To exit the interactive session, type:
exit