4 Slurm Plugin

The Slurm Job Launcher Plugin provides the capability to launch executables on a Slurm cluster.

4.1 Configuration

/etc/rstudio/launcher.slurm.conf

server-user (optional; default: rstudio-server)
    User to run the executable as. The plugin should be started as root and will lower its privilege to this user for normal execution. It is recommended not to change the default value, as it is populated by the Launcher service itself.

thread-pool-size (optional; default: number of CPUs * 2)
    Size of the thread pool used by the plugin. It is recommended not to change the default value, as it is populated by the Launcher service itself.

enable-debug-logging (optional; default: 0)
    Enables/disables verbose debug logging. Can be 1 (enabled) or 0 (disabled).

scratch-path (optional; default: /var/lib/rstudio-launcher)
    Scratch directory where the plugin writes temporary state.

logging-dir (optional; default: /var/log/rstudio/launcher)
    Specifies the path where debug logs should be written.

job-expiry-hours (optional; default: 24)
    Number of hours before completed jobs are removed from the system.

profile-config (optional; default: /etc/rstudio/launcher.slurm.profiles.conf)
    Path to the user and group profiles configuration file (explained in more detail below).

slurm-service-user (required)
    The user to run Slurm service commands as. This user must have privileges to query the Slurm cluster.

slurm-bin-path (optional; default: "")
    The installation location of the Slurm command line utilities (e.g. sbatch, scontrol). If left blank, the command line utilities must be available on the default path.

user-storage-path (optional; default: ~/slurm-data)
    The default location to store Slurm job output. Can be templated with {HOME} or {USER}. Users must have write access to the configured location. Paths beginning with ~ will be correctly evaluated.

max-output-stream-seconds (optional; default: 30)
    The maximum amount of time, in seconds, to keep the job output stream open after the job has completed. Because job output may be buffered, the output stream stays open until it sees an end-of-stream notifier or has waited the configured number of seconds. Setting this option too low may cause job output to appear truncated; reloading the job output window should resolve that. A value of 0 causes the output stream to close immediately when the job finishes.

max-output-file-wait-seconds (optional; default: 30)
    The maximum amount of time, in seconds, to wait for the output files to be created after the output stream is started. This can be useful if job output is buffered for a long period of time, or if the file system is particularly slow. Setting this too low may cause job-output-not-found errors when attempting to retrieve output shortly after the job starts.

rsandbox-path (optional; default: /usr/lib/rstudio-server/bin/rsandbox)
    Location of the rsandbox executable.

unprivileged (optional; default: 0)
    Runs the Launcher in unprivileged mode. Child processes will not require root permissions. If the plugin cannot acquire root permissions, it will run without root and will not change users or perform any impersonation.

enable-gpus (optional; default: 0)
    Whether to allow users to request GPU resources when submitting jobs. The types of GPUs available for request can be controlled via the gpu-types option.

enable-gres (optional; default: 0)
    Whether to allow users to request arbitrary GRES resources when submitting jobs. The value of this field is passed directly to the --gres option of sbatch.

gpu-types (optional; default: empty)
    A comma-separated list of GPU types that are available. If GPUs are enabled and this field is empty, users will be able to request generic GPUs (i.e. sbatch --gpus=<n>); otherwise, users will be able to request GPUs by type (i.e. sbatch --gpus=<type>:<n>[,<type>:<n>]).

allow-requeue (optional; default: 0)
    Whether to allow jobs submitted through the Slurm Launcher Plugin to be requeued by Slurm. This option requires special consideration when used with RStudio Launcher Sessions; see Using the Slurm Launcher Plugin with RSP for more details.

4.1.1 User Storage Directory

If user-storage-path is left as the default value (~/slurm-data) or the {HOME} variable is used in its definition, the server-user must have read access to each user's home directory in order to look up the real value. The Slurm Launcher Plugin will then attempt to automatically create the configured user-storage-path. This action is taken as the user starting the job, so the server-user does not need write access to the user's home directory.
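
For example, templating the path with {USER} over a shared file system avoids relying on home directories entirely (the path below is illustrative; substitute a shared location that exists on your cluster):

/etc/rstudio/launcher.slurm.conf

user-storage-path=/mnt/shared-scratch/{USER}/slurm-data

Because this example does not use {HOME}, the server-user does not need read access to users' home directories to resolve it.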

4.1.2 User and Group Profiles

The Slurm plugin also allows you to specify user and group configuration profiles, similar to RStudio Workbench’s profiles, in the configuration file /etc/rstudio/launcher.slurm.profiles.conf (or an arbitrary file as specified in profile-config within the main configuration; see above). These are entirely optional.

Profiles are divided into sections of three different types:

Global ([*])

Per-group ([@groupname])

Per-user ([username])

Here’s an example profiles file that illustrates each of these types:

/etc/rstudio/launcher.slurm.profiles.conf

[*]
default-cpus=1
default-mem-mb=512
max-cpus=2
max-mem-mb=1024
max-gpus-v100=1
max-gpus-tesla=1

[@rstudio-power-users]
default-cpus=4
default-mem-mb=4096
max-cpus=20
max-mem-mb=20480
max-gpus-tesla=10
default-gpus-tesla=2
max-gpus-v100=8
default-gpus-v100=0

[jsmith]
max-cpus=3

This configuration specifies that by default users will be allowed to launch jobs with a maximum of 1024 MB of memory and 2 CPUs. It also specifies that members of the rstudio-power-users group will be allowed to use considerably more resources.

Note that the profiles file is processed from top to bottom (i.e. settings matching the current user that occur later in the file always override ones that appeared prior). The settings available in the file are described in more depth in the table below. Also note that if the Slurm cluster has been configured to have a maximum and/or default memory value, these values will be returned whenever a maximum or default value is not configured for a user.
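
The top-to-bottom precedence described above can be sketched as follows. This is a minimal illustration, not the plugin's actual implementation; the section names and values are taken from the example file above:

```python
# Sketch of profile precedence resolution (illustrative only, not the
# plugin's actual code). Sections are visited in file order; a section
# applies when it is the global section ([*]), a group section
# ([@group]) for a group the user belongs to, or the user's own
# section ([username]). Later matches override earlier ones.
def resolve_profile(sections, user, groups):
    settings = {}
    for name, values in sections:  # sections in file order
        applies = (
            name == "*"
            or (name.startswith("@") and name[1:] in groups)
            or name == user
        )
        if applies:
            settings.update(values)  # later sections override earlier ones
    return settings

# With the example file above, jsmith (a member of rstudio-power-users)
# ends up with max-cpus=3 from [jsmith] but max-mem-mb=20480 from the
# [@rstudio-power-users] section.
sections = [
    ("*", {"max-cpus": 2, "max-mem-mb": 1024}),
    ("@rstudio-power-users", {"max-cpus": 20, "max-mem-mb": 20480}),
    ("jsmith", {"max-cpus": 3}),
]
print(resolve_profile(sections, "jsmith", {"rstudio-power-users"}))
# → {'max-cpus': 3, 'max-mem-mb': 20480}
```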

/etc/rstudio/launcher.slurm.profiles.conf

default-cpus (optional; default: 0.0, i.e. infinite - managed by Slurm)
    Number of CPUs available to a job by default if not specified by the job.

default-mem-mb (optional; default: 0.0, i.e. infinite - managed by Slurm)
    Number of MB of RAM available to a job by default if not specified by the job.

max-cpus (optional; default: 0.0, i.e. infinite - managed by Slurm)
    Maximum number of CPUs available to a job. Setting this to a negative value disables setting CPUs on a job; in that case, the value of default-cpus is always used.

max-mem-mb (optional; default: 0.0, i.e. infinite - managed by Slurm)
    Maximum number of MB of RAM available to a job. Setting this to a negative value disables setting memory on a job; in that case, the value of default-mem-mb is always used.

max-gpus (optional; default: 0.0, i.e. infinite - managed by Slurm)
    Only valid if enable-gpus is set to 1 and gpu-types is empty in launcher.slurm.conf. Maximum number of GPUs that can be requested per job.

default-gpus (optional; default: 0.0, i.e. infinite - managed by Slurm)
    Only valid if enable-gpus is set to 1 and gpu-types is empty in launcher.slurm.conf. Number of GPUs available to a job by default if not specified by the job.

max-gpus-<type> (optional; default: 0.0, i.e. infinite - managed by Slurm)
    Only valid if enable-gpus is set to 1 and type is included in the gpu-types field in launcher.slurm.conf. Maximum number of GPUs of the specified type that can be requested per job.

default-gpus-<type> (optional; default: 0.0, i.e. infinite - managed by Slurm)
    Only valid if enable-gpus is set to 1 and type is included in the gpu-types field in launcher.slurm.conf. Number of GPUs of the specified type available to a job by default if not specified by the job.

4.2 Slurm Cluster Requirements

In order for the Slurm plugin to run correctly, the following assumptions about the Slurm cluster must be true:

  • The Slurm service account (specified in the main configuration file) must have full cluster-admin privileges.
  • The Slurm control machine (the one running slurmctld), the RStudio Launcher host machine, and all Slurm nodes must have a shared home directory.
  • The RStudio Launcher host machine must have the following properties:
    • the Slurm 19.05 command line executables installed (e.g. sinfo, scontrol, etc.). If another version of Slurm is installed, you may experience unexpected behavior.
    • the same slurm.conf file as the desired Slurm cluster
    • network connectivity to the machine running slurmctld (i.e. the RStudio Launcher host machine can resolve the IP or hostname of the Slurm control machine and connect via the slurmctld port configured in slurm.conf)
    • properly configured and running Slurm plugins, as required (e.g. if using MUNGE as an authentication service, munged must be running under the same user on all machines connected to the Slurm cluster)
    • properly configured users and groups (i.e. all users with the same name have the same UID, groups, and group IDs on all machines connected to the cluster)

For more information about configuring and running a Slurm cluster, please see the Slurm documentation. Information about available Slurm plugins can also be found in the relevant sections of the Slurm documentation; for example, the Slurm Accounting documentation includes information about the available accounting plugins and how to use them.

Below is an example of a launcher configuration which might be used in this scenario:

/etc/rstudio/launcher.conf

[server]
address=127.0.0.1
port=5559
server-user=rstudio-server
admin-group=rstudio-server
enable-debug-logging=1

[cluster]
name=Slurm
type=Slurm

/etc/rstudio/launcher.slurm.conf

slurm-service-user=slurm
job-expiry-hours=48
user-storage-path=~/slurm-data
max-output-stream-seconds=15
slurm-bin-path=/slurm/bin
enable-gpus=1
gpu-types=tesla,v100

4.3 Using the Slurm Launcher Plugin with RSP

To support launching RSP R Sessions via the Slurm Launcher plugin, the following must be true in addition to the requirements listed in the Slurm Cluster Requirements section:

  • The RSP host must have network access, on any TCP port, to every Slurm node that may run an R Session
  • Slurm nodes must have network access to the RSP host via the launcher-sessions-callback-address in order to support launcher jobs via the session, as described in the RSP Launcher Configuration documentation
  • To incorporate R Session configurations, rsession.conf must be accessible by all Slurm nodes that may run R Sessions. The default expected location can be changed by adding rsession-config-file=<path/to/rsession.conf> to /etc/rstudio/rserver.conf
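
For example, if rsession.conf is kept on a shared file system visible to all Slurm nodes (the path below is illustrative), the override would look like:

/etc/rstudio/rserver.conf

rsession-config-file=/mnt/shared/rstudio/rsession.conf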

If the allow-requeue option in launcher.slurm.conf is enabled (i.e. allow-requeue=1) and RStudio R Sessions may be preempted by higher priority jobs, it is advisable to set the Slurm preemption mode to SUSPEND rather than REQUEUE to avoid any loss of data in the Session. For more details, please see the Slurm Preemption Documentation.
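
As one possible sketch, a slurm.conf fragment for suspend-based preemption might look like the following. PreemptType=preempt/partition_prio is just one of the available preemption types, and PreemptMode=SUSPEND requires gang scheduling (hence the GANG flag); consult the Slurm preemption documentation for the options appropriate to your cluster:

slurm.conf

PreemptType=preempt/partition_prio
PreemptMode=SUSPEND,GANG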

4.4 Multiple Versions of R and Module Loading

As described in the R Versions section of the RSP administrator guide, it is possible to use multiple versions of R and load environment modules per R Version with R sessions launched via the Slurm Launcher Plugin by configuring the /etc/rstudio/r-versions file. In order to properly support this feature the following must be true:

  • R must be installed on all Slurm nodes in the same location.
  • The modules in question must be installed on all Slurm nodes.
  • The file /var/lib/rstudio-server/r-versions must be reachable by all Slurm nodes. Note that this file is generated by RSP, and that its location may be changed by setting r-versions-path=<shared directory>/r-versions in rserver.conf.
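
For example, to place the generated file on a shared file system (the path below is illustrative):

/etc/rstudio/rserver.conf

r-versions-path=/mnt/shared/rstudio/r-versions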

4.5 Load Balancing Considerations

When using the Slurm Launcher Plugin with a load balanced RSP, it is recommended that job-expiry-hours be set to the same value in every copy of launcher.slurm.conf, and that MinJobAge in slurm.conf be at least as long as the configured job-expiry-hours value. Note that MinJobAge is specified in seconds, rather than hours.
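
For example (illustrative values), with job-expiry-hours=48, MinJobAge must be at least 48 * 3600 = 172800 seconds:

/etc/rstudio/launcher.slurm.conf (every load balanced node)

job-expiry-hours=48

slurm.conf

MinJobAge=172800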

4.6 Additional Considerations

This section lists notable considerations related to the use of the Slurm Plugin.

  • Slurm does not provide a time zone for any time values it returns. All times related to Slurm jobs returned from the Launcher will be in the same time zone as the configured Slurm cluster.