7 Load Balancing

7.1 Overview

RStudio Server can be configured to load balance R sessions across two or more nodes within a cluster. This provides both increased capacity and higher availability.

Load balancing with RStudio Server always operates in an active-active fashion where all nodes are equally equipped to serve users. All nodes have a primary role.

Note that load balancing for RStudio Server has some particular “stickiness” requirements stemming from the fact that users must always return to the same R session where their work resides (i.e. their traffic can’t be handled by more than one node). As a result, it’s not enough to simply place multiple RStudio Servers behind a conventional hardware or software load balancer; additional intelligence and routing are required.

Key characteristics of the RStudio Server load balancer include:

  1. Multiple primary nodes for high availability - all nodes can balance traffic to all other nodes.

  2. Support for several load balancing strategies including least busy server (by active sessions or system load), even distribution by user, or a custom strategy based on an external script.

  3. The ability to add and remove nodes while the cluster is running.

  4. Works standalone or can be integrated with other front-end load balancing environments.

Note: The standalone load balancing and high availability capabilities of RStudio Server are an exception among RStudio products. RStudio Connect and Shiny Server require a front-end load balancer under the same scenarios. An external load balancer can still be beneficial in a failover setup. See External Load Balancers below for details.

7.2 Load Balancing vs. Job Launcher

The Job Launcher is another method supported by RStudio Server to achieve increased capacity by allowing sessions to run using a compatible computing infrastructure (e.g. Kubernetes). However, the Job Launcher does not aim to provide higher availability. At least two RStudio Server nodes in a load balancing configuration are still required to provide service continuity in failover scenarios.

Note: The Job Launcher itself can have its own load balancing strategy in place. See the load balancing section in the Job Launcher documentation for more details.

7.3 Configuration

7.3.1 Requirements

There are several requirements for nodes within RStudio clusters:

  1. All nodes must run the same version of RStudio Server Pro.

  2. Server configurations (i.e. contents of the /etc/rstudio directory) must be identical.

  3. User accounts must be accessible from each node and usernames and user ids must be identical on all nodes. The same applies for any groups used by RStudio users, and also to the rstudio service user account.

  4. The clocks on all nodes must be synchronized.

  5. User home directories must be accessible via shared storage (e.g. all nodes mounting the same NFS volume).

    Note: Due to high latencies, use of EFS (Elastic File System) for home directories within AWS is strongly discouraged. If EFS is used, RStudio Server will experience highly degraded performance. We recommend using a traditional NFSv3 or NFSv4 mount instead.

  6. An explicit server-wide shared storage path also must be defined. See the Shared Storage section for additional details.

  7. RStudio must be configured to use a PostgreSQL database, and an empty database must be present for RStudio Server to write important cross-node state. If you have previously run RSP with a SQLite database, it is strongly advised that you execute the database Migration to the PostgreSQL database first. For more information, see Database.

7.3.2 Defining Nodes

To define a cluster node, two configuration files need to be provided:

/etc/rstudio/load-balancer
/etc/rstudio/secure-cookie-key

The first of these defines the available nodes and load balancing strategy. The second defines a shared key used for signing cookies. In single node configurations this key is generated automatically; with multiple nodes, however, explicit coordination is required, and the same secure-cookie-key value must be used on each node.

For example, to define a cluster with two nodes that load balances based on the number of actively running R sessions you could use the following configuration:

/etc/rstudio/load-balancer

[config]

balancer = sessions

[nodes]

server1.example.com
server2.example.com

/etc/rstudio/secure-cookie-key

a55e5dc0-d6ae-11e3-9334-000c29635f71

The secure cookie key file above is only an example; you need to generate your own unique key to share among the nodes in your cluster.

7.3.3 Key File Requirements

The following are the requirements for the secure cookie key file:

  • The key value must have a minimum length of 128 bits (16 bytes/characters). RStudio won’t start if the key is too weak.
  • The key file must have restrictive permissions (i.e. 0600) to protect its contents from other users.
  • The key file must be identical on all nodes in a load-balanced cluster, so that the nodes can communicate with each other.
  • The key must have a secret value that cannot be guessed. Randomly generating the value is recommended; see below for one mechanism for doing so.

7.3.4 Generating a Key

You can create a secure cookie key using the uuid utility as follows:

sudo sh -c "echo `uuid` > /etc/rstudio/secure-cookie-key"
sudo chmod 0600 /etc/rstudio/secure-cookie-key

This is the recommended method, but any mechanism that generates a unique, random value will work.

You do not need to generate a secure-cookie-key file on each server; generate it once, and copy it to each node along with the rest of the /etc/rstudio directory.
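Before copying the key to the other nodes, the length and permission requirements listed under Key File Requirements can be checked with a small helper. This is only a sketch: check_cookie_key is a hypothetical function name, and the use of GNU stat (the -c option) assumes a Linux host. It does not check that the key is random or that it is identical across nodes.

```shell
# Hypothetical helper: succeeds only if the key file is at least
# 16 characters long and has 0600 permissions.
check_cookie_key() {
  key_file=$1
  # Key length in bytes, ignoring the trailing newline.
  key_len=$(( $(tr -d '\n' < "$key_file" | wc -c) ))
  # Octal permission bits (GNU stat syntax; assumes Linux).
  perms=$(stat -c '%a' "$key_file")
  [ "$key_len" -ge 16 ] && [ "$perms" = "600" ]
}

# Example usage (run as a user that can read the key file):
# check_cookie_key /etc/rstudio/secure-cookie-key && echo "key file looks OK"
```

The function returns a non-zero status when either check fails, so it can be used in a deployment script to stop before the configuration is copied to other nodes.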

7.3.4.1 Key File Location

You may optionally change the path of the secure-cookie-key by changing the secure-cookie-key-file setting in rserver.conf, though it is not necessary. Changing the path in this manner is only recommended in very specific circumstances when running the launcher with both RSP and Package Manager simultaneously. For example:

/etc/rstudio/rserver.conf

secure-cookie-key-file=/mnt/rstudio/secure-cookie-key

In addition, an explicit server-wide shared storage path must be defined (this is used for inter-node synchronization). This path is defined in the /etc/rstudio/rserver.conf file. For example:

/etc/rstudio/rserver.conf

server-shared-storage-path=/shared/rstudio-server/shared-storage

For convenience, this path will often be located on the same volume used for shared home directory storage (e.g. at path /home/rstudio-server/shared-storage).

7.3.5 Launcher Considerations

If you are running RStudio Server load balancing in addition to using Launcher sessions, you will need to ensure that the /etc/rstudio/launcher.pub and /etc/rstudio/launcher.pem files match on all RSP nodes in the cluster. Failure to do so will prevent users from being able to connect to their sessions from RSP nodes other than where their sessions were initiated.

For more information, see RStudio Server Pro Integration.

7.3.6 File Locking

In order to synchronize the creation of sessions across multiple nodes RStudio Server uses a cross-node locking scheme. This scheme relies on the clocks on all nodes being synchronized. RStudio Server includes a locktester utility which you can use to verify that file locking is working correctly. To use the locktester you should log in (e.g. via SSH or telnet) to at least two nodes using the same user account and then invoke the utility from both sessions as follows:

$ /usr/lib/rstudio-server/bin/locktester

The first node you execute the utility from should print the following message:

*** File Lock Acquired ***

After the message is printed the process will pause so that it can retain the lock (you can cause it to release the lock by interrupting it e.g. via Ctrl+C).

The second and subsequent nodes you execute the utility from should print the following message:

Unable to Acquire File Lock

If you interrupt the first node (e.g. via Ctrl+C) the lock will be released and you can then acquire it from the other nodes.

If either of the following occurs then there is an issue with file locking capabilities (or configuration) that should be addressed prior to using load balancing:

  1. All nodes successfully acquire the file lock (i.e. more than one node can hold it concurrently).
  2. No nodes are able to acquire the file lock.

If either of the above conditions holds then RStudio won’t be able to correctly synchronize the creation of R sessions throughout the cluster (potentially resulting in duplicate sessions and lost data due to sessions overwriting each other’s state).

7.3.6.1 Lock Configuration

RStudio’s file locking scheme can be configured using a file at /etc/rstudio/file-locks. Valid entries are:

  • lock-type=[linkbased|advisory]
  • refresh-rate=[seconds]
  • timeout-interval=[seconds]
  • enable-logging=[0|1]
  • log-file=[path]

The default locking scheme, linkbased, uses a file locking scheme whereby locks are considered acquired when the process successfully hardlinks a dummy file to a location within the folder RStudio uses for client state (typically ~/.rstudio). This scheme is generally more robust with older network file systems, and the locks should survive temporary filesystem mounts / unmounts.

The timeout-interval and refresh-rate options can be used to configure how often the locks generated in the linkbased locking scheme are refreshed and reaped. By default, a process refreshes any locks it owns every 20 seconds, and scans for stale locks every 30 seconds. If an rsession process crashes, it can leave behind stale lock files; those lock files will be cleaned up after they expire by any newly-launched rsession processes.

advisory can be selected to use advisory file locks (using e.g. fcntl() or flock()). These locks are robust, but are not supported by all network file systems.

If you are having issues with file locking, you can set enable-logging=1, and set the log-file option to a path where output should be written. When logging is enabled, RStudio will report its attempts to acquire and release locks to the log file specified by log-file. When log-file is unset, log entries will be emitted to the system logfile, typically located at /var/log/messages or /var/log/syslog.
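As an illustration, the options above could be combined as in the following sketch. The function name sample_file_locks_config and the log file path are hypothetical. The values 20 and 30 are the intervals quoted above as defaults; mapping them to refresh-rate and timeout-interval respectively is an assumption.

```shell
# Print a sample file-locks configuration with lock logging enabled.
# The log-file path is a placeholder; pick a location writable by RStudio.
sample_file_locks_config() {
  cat <<'EOF'
lock-type=linkbased
refresh-rate=20
timeout-interval=30
enable-logging=1
log-file=/var/log/rstudio-file-locks.log
EOF
}

# To install it on a node:
# sample_file_locks_config | sudo tee /etc/rstudio/file-locks
```

As with the other files in /etc/rstudio, the resulting file should be identical on every node in the cluster.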

7.3.7 Managing Nodes

7.3.7.1 Starting Up

After creating your configuration files you should ensure that these files (along with all other configuration defined in /etc/rstudio) are copied to all nodes in the cluster. Assuming that the server is already installed and running on each node, you can then apply the load balancing configuration by restarting the server:

sudo rstudio-server restart

7.3.7.2 Current Status

Once the cluster is running you can inspect its state (which sessions are running where) using the load balancing status HTTP endpoint. For example, when running the server on the default port (8787):

curl http://localhost:8787/load-balancer/status

Note that the status endpoint is accessed using localhost rather than an external IP address. This is because the endpoint is IP restricted to only be accessible within the cluster, so it must be accessed directly from one of the nodes.

The status endpoint will return output similar to the following:

192.168.55.101:8787  Load: 0.45, 0.66, 0.32
   12108 - jdoe
   12202 - kmccurdy

192.168.55.102:8787  Load: 1, 0.75, 0.31
   3404 - bdylan

192.168.55.103:8787 (unreachable)  Load: 0, 0, 0

192.168.55.104:8787 (offline)  Load: 0.033, 0.38, 0.24

This output will show all of the nodes in the cluster. Each node is indicated by its address and an optional status indicating whether the node is unreachable or offline. If the node does not indicate a status, then it is healthy and servicing requests. Following the node address is its CPU Load, indicated by three decimal values indicating the last known 1-minute, 5-minute, and 15-minute load averages, represented as a fraction of total CPU load. On subsequent output lines, each RStudio IDE session that is running on that particular node is listed along with its process ID and running user.

An unreachable node indicates an issue connecting to it via the network. In most cases, this indicates that the rstudio-server service is not running on the node; troubleshoot it by checking the system logs on that node for startup issues (see Diagnostics if the service is running and healthy). An offline node is one that was specifically put into offline mode via the command sudo rstudio-server offline, which causes it to stop servicing new sessions.
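For monitoring purposes, the status output can be filtered for nodes that are not healthy. The following is a minimal sketch that assumes the output format matches the sample shown above; unhealthy_nodes is a hypothetical function name, and the format is not a documented, stable interface.

```shell
# Read load balancer status output on stdin and print each node that is
# marked unreachable or offline, as "address status".
unhealthy_nodes() {
  awk '
    # Node lines start in the first column; session lines are indented.
    /^[^[:space:]]/ && ($2 == "(unreachable)" || $2 == "(offline)") {
      gsub(/[()]/, "", $2)   # strip the parentheses around the status
      print $1, $2
    }'
}

# Example usage, run on one of the cluster nodes:
# curl -s http://localhost:8787/load-balancer/status | unhealthy_nodes
```

An empty result means every node listed in the status output is healthy and servicing requests.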

7.3.7.3 Adding and Removing Nodes

To temporarily remove a node from the cluster you can simply stop it:

sudo rstudio-server stop

R sessions running on that node will be automatically moved to another active node. Note that only the session state is moved, not the running processes. To restore the node you can simply start it back up again:

sudo rstudio-server start

Note that removing nodes does not require changing the list of defined nodes in /etc/rstudio/load-balancer (traffic is automatically routed around nodes not currently running).

To add new nodes, simply add the nodes to the /etc/rstudio/load-balancer configuration file and send the SIGHUP signal to the rserver process. It is recommended that this be done on all nodes in the cluster, but depending on your routing configuration, at a minimum the node selected as the “main” node MUST be updated and signaled to start routing traffic to the new node(s).

When removing nodes from the configuration file via the SIGHUP signal, you must ensure that any removed nodes have their processes stopped by running sudo rstudio-server stop on that node. Alternatively, you can suspend any actively running sessions by running sudo rstudio-server suspend-all on the node to be removed. Failure to do this will cause existing sessions running on the removed node to be stuck in an inaccessible state, and users will not be able to connect to those sessions.

Reloading the load balancer configuration will also cause the rserver-http proxy configuration to be updated, which affects RStudio’s running HTTP server. It is recommended that you do not make any other HTTP-related changes when updating the load balancer configuration unless you are aware of the potential side-effects!

7.3.8 Troubleshooting

If users are having difficulty accessing RStudio Server in a load balanced configuration it’s likely due to one of the load balancing requirements not being satisfied. This section describes several scenarios where a failure due to unsatisfied requirements might occur.

7.3.8.1 Node network instability

Some scenarios may cause RStudio to wait a long time for a node to respond due to network instability. You can limit how long this waiting period lasts with the timeout option, which is set to 10 seconds by default. To disable this timeout and use the system defaults, set it to zero.

/etc/rstudio/load-balancer

[config]

balancer = sessions
timeout = 5
...

7.3.8.2 SSL

If one of the nodes is temporarily using a self-signed or otherwise functional but invalid certificate the load balancer may fail to use that node. You can skip SSL certificate verification by disabling the option verify-ssl-certs, which is only applicable if connecting over HTTPS. For production use, you should always leave verification enabled (the default), but it can be disabled for testing purposes.

/etc/rstudio/load-balancer

[config]

balancer = sessions
verify-ssl-certs = 0
...

7.3.8.3 User Accounts Not Synchronized

One of the load balancing requirements is that user accounts must be accessible from each node and usernames and user ids must be identical on all nodes. If a user has the same username but different user ids on different nodes then permissions problems will result when the same user attempts to access shared storage using different user ids.

You can determine the ID for a given username via the id command. For example:

id -u jsmith
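To compare the result across nodes, the ids can be collected and checked in one pass. This is a sketch only: uids_consistent is a hypothetical helper, and the usage example assumes SSH access to each node under the host names used in the earlier configuration example.

```shell
# Read "host uid" pairs on stdin; succeed only if every host reports
# the same, non-empty set of exactly one uid.
uids_consistent() {
  awk '
    { seen[$2] = 1 }          # record each distinct uid (empty if lookup failed)
    END {
      n = 0
      for (u in seen) n++
      exit (n == 1 ? 0 : 1)   # exactly one distinct value means consistent
    }'
}

# Example usage (assumes SSH access to each node):
# for host in server1.example.com server2.example.com; do
#   printf '%s %s\n' "$host" "$(ssh "$host" id -u jsmith)"
# done | uids_consistent && echo "uids match"
```

A host on which the lookup fails contributes an empty id, so the check also fails when the account is missing from one of the nodes.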

7.3.8.4 NFS Volume Mounting Problems

If NFS volumes containing shared storage are unmounted during an RStudio session that session will become unreachable. Furthermore, unmounting can cause loss or corruption of file locks (see section below). If you are having problems related to accessing user directories then fully resetting the connections between RStudio nodes and NFS will often resolve them. To perform a full reset:

  1. Stop RStudio Server on all nodes (sudo rstudio-server stop).

  2. Fully unmount the NFS volume from all nodes.

  3. Remount the NFS volume on all nodes.

  4. Restart RStudio Server on all nodes (sudo rstudio-server start).

7.3.8.5 File Locking Problems

Shared user storage (e.g. NFS) must support file locking so that RStudio Server can synchronize access to sessions across the various nodes in the cluster. File locking will not work correctly if the clocks on all nodes in the cluster are not synchronized. This condition may be surfaced as 502 HTTP errors. You can verify that file locking is working correctly by following the instructions in the File Locking section above.

7.4 Access and Availability

Once you’ve defined a cluster and brought it online you’ll need to decide how the cluster should be addressed by end users. There are two distinct approaches to this:

  1. Single Node Routing. Provide users with the address of one of the nodes. This node will automatically route traffic and sessions as required to the other nodes. This has the benefit of simplicity (no additional software or hardware required) but also results in a single point of failure.

  2. Multiple Node Routing. Put the nodes behind some type of system that routes traffic to them (e.g. dynamic DNS or a software or hardware load balancer). While this requires additional configuration it also enables all of the nodes to serve as points of failover for each other.

Both of these options are described in detail below.

7.4.1 Single Node Routing

In a Single Node Routing configuration, you designate one of the nodes in the cluster as the main one and provide end users with the address of this node as their point of access. For example:

[nodes]
rstudio.example.com
rstudio2.example.com
rstudio3.example.com

Users would access the cluster using http://rstudio.example.com. This node would in turn route traffic and sessions both to itself and the other nodes in the cluster in accordance with the active load balancing strategy.

Note that in this configuration the rstudio2.example.com and rstudio3.example.com nodes can either fail or be removed from the cluster at any time and service will continue to users. However, if the main node fails or is removed then the cluster is effectively down.

7.4.2 Multiple Node Routing

In a Multiple Node Routing configuration all of the nodes in the cluster are peers and provide failover for each other. This requires that some external system (dynamic DNS or a load balancer) route traffic to the nodes; see below for examples and caveats. In this scenario any of the nodes can fail and service will continue, so long as the external router can respond intelligently to a node being unreachable.

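For example, here’s an Nginx reverse-proxy configuration that you could use with a three-node cluster like the one defined above, with the nodes named rstudio1.example.com, rstudio2.example.com, and rstudio3.example.com: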

http {
  upstream rstudio-server {
    server rstudio1.example.com;
    server rstudio2.example.com backup;
    server rstudio3.example.com backup;
  }
  server {
    listen 80;
    location / {
      proxy_pass http://rstudio-server;
      proxy_redirect http://rstudio-server/ $scheme://$host/;
    }
  }
}

In this scenario the Nginx software load balancer would be running on rstudio.example.com and reverse proxy traffic to rstudio1.example.com, rstudio2.example.com, etc. Note that one node is designated by convention as the main one so traffic is routed there by default. However, if that node fails then Nginx automatically makes use of the backup nodes.

This is merely one example as there are many ways to route traffic to multiple servers—RStudio Server load balancing is designed to be compatible with all of them.

7.4.3 External Load Balancers

When using an external load balancer with a Multiple Node Routing configuration, the external load balancer may be configured as active/active or active/passive.

RStudio Server load balances all requests internally in an active/active way, deciding where new sessions will be started, and routing requests to existing sessions, regardless of which RStudio Server node received the initial request from the external load balancer. The RStudio Server node that receives the request will re-route the request appropriately. Therefore, the external load balancer does not determine which RStudio Server node will respond to the request.

  • External load balancer configured as active/passive: All requests are routed by the external load balancer to a single RStudio Server node. If that node becomes unavailable or unresponsive, the external load balancer will select a different RStudio Server node. The RStudio Server node may route the request to another node to handle the request. The external load balancer provides failover / high availability, while RStudio Server’s load balancer provides scalability across nodes.

  • External load balancer configured as active/active: Per above, the RStudio Server’s internal load balancer may re-route the request to another node. Consequently, having the external load balancer select different nodes per request will not actually help balance the session load. Again, the external load balancer provides high availability, while scalability is still provided by the internal load balancer.

7.4.4 Using SSL

If you are running RStudio Server on a public facing network then using SSL encryption is strongly recommended. Without this all user session data is sent in the clear and can be intercepted by malicious parties.

The recommended SSL configuration depends on which access topology you’ve deployed:

  1. For a Single Node Routing deployment, you would configure each node of the cluster to use SSL as described in the Secure Sockets (SSL) section. The nodes will then use SSL for both external and inter-node communication.

  2. For a Multiple Node Routing deployment, you would configure SSL within the external routing layer (e.g. the Nginx server in the example above) and use standard unencrypted HTTP for the individual nodes. You can optionally configure the RStudio nodes to use SSL as well, but this is not strictly required if all communication with outside networks is done via the external routing layer.

7.5 Balancing Methods

There are four methods available for balancing R sessions across a cluster. The most appropriate method is installation specific and depends on the number of users and type of workloads they create.

7.5.1 Sessions

The default balancing method is sessions, which attempts to evenly distribute R sessions across the nodes of the cluster:

[config]
balancer = sessions

This method allocates new R sessions to the node with the fewest active R sessions. This is a good choice if you expect that users will for the most part have similar resource requirements.
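The selection rule can be illustrated against the output of the status endpoint described under Current Status. The sketch below is not RStudio Server's implementation; least_busy_node is a hypothetical function, it assumes the status output format shown in that section, and it skips nodes marked unreachable or offline. How ties are broken is left unspecified.

```shell
# Read load balancer status output on stdin and print the healthy node
# with the fewest listed sessions.
least_busy_node() {
  awk '
    # Node line: remember the node and whether it is healthy
    # (a healthy node has no status marker, so field 2 is "Load:").
    /^[^[:space:]]/ {
      node = $1
      healthy = ($2 == "Load:")
      if (healthy) count[node] = 0
      next
    }
    # Session line: "   <pid> - <user>" belongs to the most recent node.
    /^[[:space:]]+[0-9]+ - / { if (healthy) count[node]++ }
    END {
      best = ""
      for (n in count)
        if (best == "" || count[n] < count[best]) best = n
      if (best != "") print best
    }'
}

# Example usage, run on one of the cluster nodes:
# curl -s http://localhost:8787/load-balancer/status | least_busy_node
```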

7.5.2 System Load

The system-load balancing method distributes sessions based on the active workload of available nodes:

[config]
balancer = system-load

The metric used to establish active workload is the 5-minute load average, divided by the number of cores on the machine. This is a good choice if you expect widely disparate CPU workloads and want to ensure that machines with high CPU utilization don’t receive new sessions.
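To get a feel for the values this metric produces, it can be approximated on a single machine. This is a sketch of the calculation described above, not the code RStudio Server runs; normalized_load is a hypothetical function name, and the usage example assumes a Linux host with /proc/loadavg and nproc available.

```shell
# Divide a load average by a core count and print the result
# with three decimal places.
normalized_load() {
  awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.3f\n", load / cores }'
}

# Example usage on the local machine: the second field of /proc/loadavg
# is the 5-minute load average.
# normalized_load "$(cut -d ' ' -f 2 /proc/loadavg)" "$(nproc)"
```

For instance, a 5-minute load average of 1.50 on a 4-core machine gives 0.375, while the same load average on a single-core machine gives 1.500.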

7.5.3 User Hash

The user-hash balancing method attempts to distribute load evenly and consistently across nodes by hashing the username of clients:

[config]
balancer = user-hash

The hashing algorithm used is CityHash, which will produce a relatively even distribution of users to nodes. This is a good choice if you want the assignment of users/sessions to nodes to be stable.

7.5.4 Custom

The custom balancing method calls out to an external script to make load balancing decisions:

[config]
balancer = custom

When custom is specified, RStudio Server will execute the following script when it needs to make a choice about which node to start a new session on:

/usr/lib/rstudio-server/bin/rserver-balancer

This script will be passed two environment variables:

RSTUDIO_USERNAME: The user on behalf of which the new R session is being created.

RSTUDIO_NODES: Comma separated list of the IP address and port of available nodes.

The script should return the node to start the new session on using its standard output. Note that the format of the returned node should be identical to its format as passed to the script (i.e. include the IP address and port).
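A minimal sketch of such a script follows, relying only on the two environment variables described above. The pick_node helper is a hypothetical name, and cksum is used as a simple stand-in hash so that a given user is consistently mapped to the same entry of the node list; it is not CityHash and not what RStudio Server uses internally.

```shell
#!/bin/sh
# Hypothetical rserver-balancer sketch: pick a node by hashing the username.

# $1 = username, $2 = comma-separated list of "ip:port" entries.
pick_node() {
  # cksum prints "<crc> <byte count>"; keep the CRC as a stand-in hash.
  sum=$(printf '%s' "$1" | cksum | cut -d ' ' -f 1)
  # Select field (sum mod number-of-nodes) + 1 and print it unchanged,
  # so the output keeps the same ip:port format as the input.
  printf '%s\n' "$2" | awk -F, -v h="$sum" 'NF > 0 { print $((h % NF) + 1) }'
}

# When run by RStudio Server, both variables are set in the environment.
if [ -n "${RSTUDIO_NODES:-}" ]; then
  pick_node "${RSTUDIO_USERNAME:-}" "$RSTUDIO_NODES"
fi
```

Because the choice depends on the number and order of entries in RSTUDIO_NODES, the mapping for a user can change when nodes join or leave the list. A real script could just as well consult an external inventory or a per-user mapping file.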

7.6 Diagnostics

To troubleshoot more complicated load balancing issues, RStudio can output detailed diagnostic information about internal load balancing traffic and state. You can enable this by using the diagnostics setting as follows:

[config]
diagnostics = tmp

Set this on every server in the cluster, and restart the servers to apply the change. This will write a file /tmp/rstudio-load-balancer-diagnostics on each server containing the diagnostic information.

The value stderr can be used in place of tmp to send diagnostics from the rserver process to standard error instead of a file on disk; this is useful if your RStudio Server Pro instance runs non-daemonized.