Published on November 06, 2024
In our previous article, Sean introduced Slurm as a powerful HPC scheduler for batch workloads. That post serves as an excellent jumping-off point for those new to Slurm, and I highly recommend reading it before this one if you're unfamiliar with the basics. Today, we're going to take a more in-depth look at Slurm configuration, provisioning, and management so that you can build and manage your own clusters. Slurm has gained significant traction in AI workloads recently, but we'll stick to CPU-based workloads in this article to keep things simple. We'll explore using Slurm for GPU training with PyTorch, as well as other AI applications, in a future post.
Building your cluster
Before we dive into advanced configurations and management techniques, let's start with setting up a basic Slurm cluster. A great resource for this is the Slurm for Dummies GitHub repository, which I've found useful when working with providers that don't offer managed Slurm solutions. I'll summarize the basics here. We're going to assume you have a controller node plus some worker nodes that have Ubuntu installed and can communicate with each other over SSH. The controller node is just a Linux VM (let's assume Ubuntu); it schedules jobs rather than running workloads itself, so it can typically be smaller than the worker nodes. Worker nodes are also just Linux VMs, but they have the resources necessary to run the workload. This may mean more CPUs and memory, or it may mean specialized accelerators like GPUs.
Setup Munge
Munge is used for authentication between nodes. We'll configure the controller node first. Install the packages with:
sudo apt-get install munge libmunge2 libmunge-dev
You should now see a key installed at /etc/munge/munge.key. If not, run the following command to create one:
sudo /usr/sbin/mungekey
At this point, munge should have created a munge user, and you're almost there. All that's left is to give that user the correct file permissions, which can be done by running:
sudo chown -R munge: /etc/munge/ /var/log/munge/ /var/lib/munge/ /run/munge/
sudo chmod 0700 /etc/munge/ /var/log/munge/ /var/lib/munge/
sudo chmod 0755 /run/munge/
sudo chmod 0700 /etc/munge/munge.key
sudo chown -R munge: /etc/munge/munge.key
Then to configure the munge service to run at startup:
sudo systemctl enable munge
sudo systemctl restart munge
Now, for the worker nodes, follow the same procedure, except copy the munge key at /etc/munge/munge.key from the controller instead of using the generated one. Make sure to do this before running the file permission commands, and your workers should also be good to go. You can test this by running:
munge -n | ssh <CONTROLLER_NODE_HOSTNAME> unmunge
from the worker nodes.
Setup Slurm
To start, on all nodes run:
sudo apt-get update
sudo apt-get install -y slurm-wlm
Next, you can use Slurm's handy configuration file generator, located at /usr/share/doc/slurmctld/slurm-wlm-configurator.html (open the file with your browser), to create your configuration file. You can learn all about the configuration options here, but you only need to configure the following to get started:
- ClusterName: whatever name you'd like for your cluster; it needs to be lowercase and must be 40 characters or less.
- SlurmctldHost: the hostname of the machine where the Slurm control daemon runs (you can find this by running hostname -s on that machine). The hostname is optionally followed by either the IP address or another name by which the address can be identified, enclosed in parentheses, e.g. SlurmctldHost=slurmctl-primary(12.34.56.78).
- NodeName: the output of hostname -s again, but for the worker nodes. Ideally they are numbered and you can refer to them as <hostname-prefix>[1-4]; otherwise, you can have multiple entries, one per worker node. The values for CPUs, Sockets, CoresPerSocket, and ThreadsPerCore are based on the results of running lscpu on a worker node.
- ProctrackType: LinuxProc, unless you've installed proctrack/cgroup, in which case it will be used by default if you don't set this option.
Save the tool’s text output to: /etc/slurm/slurm.conf
and copy it to the same path on every worker node.
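For reference, the file you end up with might look roughly like this (the hostname, IP, and hardware values below are placeholders, so substitute the ones from your own nodes; the generator also fills in a number of other defaults such as state and log file paths):
ClusterName=mycluster
SlurmctldHost=slurmctl-primary(12.34.56.78)
ProctrackType=proctrack/linuxproc
NodeName=worker[1-4] CPUs=8 Sockets=1 CoresPerSocket=4 ThreadsPerCore=2 State=UNKNOWN
PartitionName=main Nodes=worker[1-4] Default=YES MaxTime=INFINITE State=UP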
You can then enable the Slurm controller service to run on startup with the following:
sudo systemctl enable slurmctld
sudo systemctl restart slurmctld
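The worker nodes run the slurmd daemon instead of slurmctld, so enable it on each worker as well:
sudo systemctl enable slurmd
sudo systemctl restart slurmd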
At this point, you can check if the cluster is set up correctly by running:
srun hostname
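You can also run sinfo on the controller to confirm that the worker nodes have registered (once healthy, they should show up in an idle state):
sinfo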
Configuration Deep Dive
Now that we have a basic Slurm cluster up and running, let's explore some more advanced configuration options and common use cases. For full reference docs, refer to this page.
Queue and Workload Management
Slurm's queuing system, known as partitions, allows you to organize and prioritize jobs efficiently. Here's an example of defining partitions in your slurm.conf:
PartitionName=debug Nodes=node[1-4] Default=YES MaxTime=01:00:00 State=UP
PartitionName=batch Nodes=node[5-20] MaxTime=08:00:00 State=UP
This configuration creates two partitions:
- A “debug” partition for short-running jobs (max 1 hour) on nodes 1-4.
- A “batch” partition for longer-running jobs (max 8 hours) on nodes 5-20.
You can further customize these partitions with options like the following (combined in the example just after this list):
- PriorityTier: Set priority levels for partitions.
- PreemptMode: Configure how jobs can be preempted.
- OverSubscribe: Allow multiple jobs to run on a single node simultaneously.
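For example, a higher-priority partition that can preempt and oversubscribe might be declared like this (the values are illustrative, and partition-level PreemptMode also assumes a compatible PreemptType is set cluster-wide):
PartitionName=priority Nodes=node[5-20] MaxTime=08:00:00 State=UP PriorityTier=10 PreemptMode=REQUEUE OverSubscribe=FORCE:2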
Handling Node Failures
Slurm provides robust tools for managing node states and handling failures:
- Draining Nodes: When you need to perform maintenance on a node, you can drain it:
scontrol update NodeName=node5 State=DRAIN Reason="Scheduled maintenance"
This prevents new jobs from being scheduled on the node while allowing current jobs to complete.
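Once maintenance is finished, you can return the node to service with:
scontrol update NodeName=node5 State=RESUME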
- Automatic Node Failure Detection: Configure the SlurmdTimeout option in slurm.conf to automatically mark nodes as down if they don't respond:
SlurmdTimeout=300
- ResumeProgram and SuspendProgram: These scripts can automatically handle node power management:
ResumeProgram=/usr/local/bin/slurm_resume.sh
SuspendProgram=/usr/local/bin/slurm_suspend.sh
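These work together with Slurm's power-saving options (SuspendTime, SuspendTimeout, ResumeTimeout, and friends). As a rough sketch, a resume script receives the affected nodes as a hostlist expression in its first argument and could power them on with your provider's tooling; the my-cloud-cli command below is purely a placeholder:
#!/bin/bash
# Hypothetical ResumeProgram sketch: expand the hostlist Slurm passes in
# and start each node with your cloud provider's CLI (placeholder command).
for node in $(scontrol show hostnames "$1"); do
    my-cloud-cli start-instance "$node"
done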
Useful Plugins
Slurm’s plugin architecture allows for extensive customization. Here are a few particularly useful plugins:
- job_submit/lua: Allows you to write custom job submission filters and modifications in Lua.
- proctrack/cgroup: Provides better process tracking and resource management using Linux cgroups.
- select/cons_tres: Enables Trackable RESources (TRES) for more granular resource allocation.
To enable a plugin, add it to the appropriate line in your slurm.conf, for example:
JobSubmitPlugins=lua
ProctrackType=proctrack/cgroup
SelectType=select/cons_tres
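If you do switch to proctrack/cgroup, it is typically paired with a cgroup.conf file that lives next to slurm.conf; a minimal sketch (these constraint options are just examples) might be:
ConstrainCores=yes
ConstrainRAMSpace=yes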
Essential Tools for Managing Slurm
sinfo
sinfo is your go-to command for getting an overview of your cluster's state. Some useful options include:
- sinfo -Nel: Provides a detailed node-oriented view.
- sinfo -t idle,mix,alloc: Shows nodes in specific states.
- sinfo -o "%n %c %m %t": Customizes output to show node name, CPUs, memory, and state.
scontrol
scontrol is a powerful tool for viewing and modifying Slurm's configuration. Some common uses:
- scontrol show job <job_id>: Displays detailed information about a specific job.
- scontrol update JobId=<job_id> TimeLimit=02:00:00: Modifies a running job's time limit.
- scontrol reconfigure: Reloads the Slurm configuration without restarting services.
srun/sbatch
These commands are the primary ways to submit jobs to your Slurm cluster. While srun is used for interactive jobs, sbatch handles batch job submissions.
Some examples of running interactive jobs with srun:
# Basic interactive job
srun --pty bash
# Request specific resources
srun --cpus-per-task=4 --mem=8G --time=2:00:00 --pty bash
# Run a specific command across multiple nodes
srun --nodes=2 hostname
Here are some examples of running a batch job with sbatch. Note that you can use the special #SBATCH comments to set command-line arguments, or you can pass these to the sbatch command at runtime, depending on your use case. I've also included some echo statements to print some useful metadata:
#!/bin/bash
#SBATCH --job-name=my_job
#SBATCH --output=output_%j.log
#SBATCH --error=error_%j.log
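# %j in the log file names above expands to the job ID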
#SBATCH --time=01:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
echo "Date start = $(date)"
echo "Initiating Host = $(hostname)"
echo "Working Directory = $(pwd)"
echo ""
echo "Number of Nodes Allocated = ${SLURM_JOB_NUM_NODES}"
echo "Number of Tasks Allocated = ${SLURM_NTASKS}"
echo ""
python my_script.py
RETURN=${?}
echo ""
echo "Exit code = ${RETURN}"
echo "Date end = $(date)"
echo ""
Check out the mpi-ping-pong.py script from our previous article for a more realistic example of a task to play around with.
Another cool feature you can take advantage of is job arrays, which are perfect for parameter sweeps or processing multiple datasets. Here's an example with sbatch:
#!/bin/bash
#SBATCH --array=0-15
#SBATCH --output=array_%A_%a.out
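# %A in the output file name expands to the parent job ID and %a to the array task index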
# $SLURM_ARRAY_TASK_ID contains the array index
python process.py --input-file=dataset_${SLURM_ARRAY_TASK_ID}.txt
You can also create workflows by introducing dependencies between jobs, for example:
# Wait for job completion
sbatch --dependency=afterok:12345 script.sh
# Wait for job start
sbatch --dependency=after:12345 script.sh
# Wait for multiple jobs
sbatch --dependency=afterany:12345:12346:12347 script.sh
You can read all about sbatch here and srun here.
Submitit
Submitit is a Python package that provides a user-friendly interface for submitting and managing Slurm jobs. It’s particularly useful for data scientists and researchers who prefer working in Python environments.
Here’s a simple example of using submitit:
import itertools
import submitit

def train_model(learning_rate, batch_size):
    accuracy = 0.0  # placeholder: your training code here
    return accuracy

executor = submitit.AutoExecutor(folder="log_test")
executor.update_parameters(timeout_min=60, mem_gb=8, cpus_per_task=4)

# Build the full 3x3 grid of hyperparameters, one job per combination
learning_rates, batch_sizes = zip(*itertools.product(
    [0.01, 0.001, 0.0001],  # learning rates
    [32, 64, 128]))         # batch sizes

jobs = executor.map_array(train_model, learning_rates, batch_sizes)
results = [job.result() for job in jobs]
This script submits 9 jobs (3x3 grid of hyperparameters) to Slurm, each with 4 CPUs and 8GB of memory and a 60-minute time limit.
slurm-exporter
The slurm-exporter allows you to export Slurm metrics to Prometheus, enabling advanced monitoring and alerting capabilities.
To set it up:
- Install and configure Prometheus.
- Install the slurm-exporter:
go get github.com/vpenso/prometheus-slurm-exporter
- Run the exporter:
prometheus-slurm-exporter
- Add the following to your Prometheus scrape configuration:
scrape_configs:
  - job_name: 'slurm'
    static_configs:
      - targets: ['localhost:8080']
With this setup, you can create detailed dashboards in Grafana to visualize your cluster’s performance and utilization.
In Conclusion
In this article, we've covered how to set up your own simple Slurm cluster, walked through some useful configurations to make things more robust, and finally talked about the tools you'll need to actually manage the cluster. Now you're ready to start running your jobs on your shiny new cluster! In future articles, we'll explore topics like using Slurm for distributed PyTorch training, optimizing GPU utilization, and integrating Slurm with Docker. For now though, happy Slurming!