Slurm
Slurm (also referred to as Slurm Workload Manager or slurm-llnl) is an open-source workload manager designed for Linux clusters of all sizes, used by many of the world's supercomputers and computer clusters. It provides three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (typically a parallel job) on a set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.
Installation
Install the slurm-llnl package (or slurm-llnl-git from the AUR). It pulls in munge, an authentication service, as a dependency. munge is started as a requirement of slurmd's systemd service and authenticates communication between the various hosts. Therefore, make sure that all nodes in your cluster have the same key in /etc/munge/munge.key. Then start and enable munge.service.
The package has many more optional dependencies, but Slurm has to be recompiled after they have been installed in order to make use of them.
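One way to confirm that two nodes hold the same key is to compare checksums instead of copying the key around. The following is a minimal sketch, assuming sha256sum from coreutils is available; key_fingerprint and same_key are helper names introduced here for illustration, and node01 is a hypothetical host name.

```shell
#!/bin/sh
# Print the SHA-256 checksum of a key file
# (hypothetical helper, not part of munge).
key_fingerprint() {
    sha256sum "$1" | cut -d ' ' -f 1
}

# Succeed only if two key files have identical contents.
same_key() {
    [ "$(key_fingerprint "$1")" = "$(key_fingerprint "$2")" ]
}

# Example, run as root on the head node (node01 is a placeholder):
#   local_sum=$(key_fingerprint /etc/munge/munge.key)
#   remote_sum=$(ssh node01 sha256sum /etc/munge/munge.key | cut -d ' ' -f 1)
#   [ "$local_sum" = "$remote_sum" ] && echo "node01: key matches"
```

Comparing checksums keeps the key itself from being sent over the network a second time once it has been distributed.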
Configuration
The configuration files for slurm-llnl reside under /etc/slurm-llnl. Before starting any Slurm services, create a configuration file at /etc/slurm-llnl/slurm.conf. Client and server may use the same configuration file, which can either be generated at the official website or created by copying /etc/slurm-llnl/slurm.conf.example to /etc/slurm-llnl/slurm.conf and adapting it to one's liking.
By default, the slurm user created during installation has 64030 as both UID and GID, which simplifies the setup on multiple systems. These IDs match the ones used in Debian, so the two distributions may be used side by side, but remember that binaries are not located in the same directories on every distribution.
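To give an idea of what such a file contains, the sketch below writes a minimal slurm.conf to a staging file for review. It is not a complete configuration: the cluster name, the host names head01 and node[01-02], the CPU count, the partition name and the spool directories are all assumptions made for illustration, and the shipped slurm.conf.example remains the better starting point.

```shell
#!/bin/sh
# Write a minimal slurm.conf sketch to a staging file for review.
# All host names, CPU counts and paths below are placeholders.
SLURM_CONF_OUT="${SLURM_CONF_OUT:-./slurm.conf.sketch}"

cat > "$SLURM_CONF_OUT" <<'EOF'
# Cluster identity and controller host
# (older releases use ControlMachine instead of SlurmctldHost)
ClusterName=example
SlurmctldHost=head01

# Run slurmctld as the dedicated user and authenticate via munge
SlurmUser=slurm
AuthType=auth/munge

# State directories (assumed locations)
StateSaveLocation=/var/spool/slurm-llnl/slurmctld
SlurmdSpoolDir=/var/spool/slurm-llnl/slurmd

# Two compute nodes and one default partition containing them
NodeName=node[01-02] CPUs=4 State=UNKNOWN
PartitionName=debug Nodes=node[01-02] Default=YES MaxTime=INFINITE State=UP
EOF

echo "wrote $SLURM_CONF_OUT"
```

After review, the file would be copied to /etc/slurm-llnl/slurm.conf on the head node and on every compute node.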
Client (compute node) configuration
On the client-side one may now safely start/enable slurmd.service.
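Each client also reads a cgroup.conf file. As a rough illustration of what it can hold, the sketch below stages a two-line file that confines jobs to their allocated cores and memory. The output path is a placeholder, and the assumption here is that slurm.conf enables the task/cgroup plugin (TaskPlugin=task/cgroup), since these options are applied by that plugin; consult the manual page for the options supported by the installed release.

```shell
#!/bin/sh
# Stage a minimal cgroup.conf sketch for review (path is a placeholder).
CGROUP_CONF_OUT="${CGROUP_CONF_OUT:-./cgroup.conf.sketch}"

cat > "$CGROUP_CONF_OUT" <<'EOF'
# Confine each job to the CPU cores it was allocated
ConstrainCores=yes
# Confine each job to the memory it was allocated
ConstrainRAMSpace=yes
EOF

echo "wrote $CGROUP_CONF_OUT"
```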
A cgroup.conf configuration file must be created on each client. See the cgroup.conf manual page for configuration details.
Server (head node) configuration
Start/enable slurmctld.service.
Additionally, you may want to start/enable slurmdbd.service, which maintains an SQL database that logs essential job and process information and makes the cluster easier to manage.
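Using slurmdbd involves two pieces of configuration: slurm.conf has to point at the daemon, and slurmdbd needs its own slurmdbd.conf describing the database. The sketch below stages both fragments for review. The host name head01, the database name and the credentials are placeholders, and a MySQL or MariaDB backend is assumed.

```shell
#!/bin/sh
# Stage accounting configuration fragments for review.
# Host names, database name and credentials are placeholders.
ACCT_STAGE="${ACCT_STAGE:-./slurmdbd-staging}"
mkdir -p "$ACCT_STAGE"

# Lines to merge into slurm.conf so slurmctld reports to slurmdbd
cat > "$ACCT_STAGE/slurm.conf.accounting" <<'EOF'
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=head01
EOF

# Configuration for the slurmdbd daemon itself
cat > "$ACCT_STAGE/slurmdbd.conf" <<'EOF'
AuthType=auth/munge
DbdHost=head01
SlurmUser=slurm
StorageType=accounting_storage/mysql
StorageHost=localhost
StorageUser=slurm
StoragePass=change-me
StorageLoc=slurm_acct_db
EOF

# The file holds a database password, so keep it private.
chmod 600 "$ACCT_STAGE/slurmdbd.conf"
echo "staged files in $ACCT_STAGE"
```

Because slurmdbd.conf contains the database password, the installed copy should be readable only by the user the daemon runs as.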
Command-line arguments for the daemons can be set in /etc/default/slurm-llnl, while still utilizing the power of systemd. This file is used as the environment file for the various services, and any arguments set in it are simply passed on to the program.
Troubleshooting
Services fail to start on boot
If slurmd.service or slurmctld.service fails to start at boot but works fine when started manually, the service may be trying to start before a network connection has been established. To verify this, add the lines below that correspond to the failing service to the slurm.conf file:
slurm.conf
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
Then, check the associated log file. If the fatal error mentions "Address family not supported by protocol", you may want to extend the unit so that it waits for a valid network connection via network-online.target.
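Extending the unit is usually done with a systemd drop-in file. The sketch below stages such a drop-in for slurmd.service in a local directory so it can be reviewed first; the staging path is a placeholder, and the same content can be entered interactively with systemctl edit slurmd.service instead. It assumes that a wait-online service matching the network manager in use (for example systemd-networkd-wait-online.service or NetworkManager-wait-online.service) is enabled, since network-online.target has no effect otherwise.

```shell
#!/bin/sh
# Stage a drop-in that delays a Slurm unit until the network is online.
# Set UNIT=slurmctld.service to do the same for the controller.
UNIT="${UNIT:-slurmd.service}"
STAGE="${STAGE:-./systemd-staging}"

mkdir -p "$STAGE/$UNIT.d"
cat > "$STAGE/$UNIT.d/override.conf" <<'EOF'
[Unit]
Wants=network-online.target
After=network-online.target
EOF

echo "review, then copy $STAGE/$UNIT.d to /etc/systemd/system/"
echo "and run: systemctl daemon-reload"
```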
Tips and tricks
Running RHEL based nodes side-by-side
On Red Hat based distributions, Slurm runs as root by default. [2]
To add these nodes to the cluster, first create a slurm user with UID and GID both equal to 64030, matching the ones used in Arch Linux, then switch Slurm to that user with the command slurm-setuser -u slurm -g slurm.
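The steps above can be sketched as a small check that prints the commands to run instead of executing them. ids_match is a helper name introduced here for illustration, and the groupadd and useradd options shown are one reasonable choice, not the only one. If a slurm account already exists with different IDs, it would need to be modified rather than created.

```shell
#!/bin/sh
# Succeed only if user $1 exists with UID $2 and primary GID $3.
ids_match() {
    [ "$(id -u "$1" 2>/dev/null)" = "$2" ] &&
        [ "$(id -g "$1" 2>/dev/null)" = "$3" ]
}

if ids_match slurm 64030 64030; then
    echo "slurm already has UID and GID 64030"
else
    # Printed, not executed: review and run as root on the RHEL node.
    echo "groupadd --gid 64030 slurm"
    echo "useradd --uid 64030 --gid 64030 --system --no-create-home slurm"
    echo "slurm-setuser -u slurm -g slurm"
fi
```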
See also
- Slurm tutorials — Introduction to the Slurm Workload Manager for users and system administrators, plus some material for Slurm programmers
- Quick Start Administrator Guide — Getting started guide
- Slurm to manage jobs — Convenient Slurm Commands
- Running Jobs — How Slurm is used at Harvard university