Using Amazon EFS (Elastic File System) with RStudio Team#
This document explains how to set up Amazon Elastic File System (EFS) with RStudio Team and details best practices for ongoing usage of EFS. Additionally, this document covers how we came up with these recommendations.
The Amazon Elastic File System (EFS) has unique design characteristics that can make it challenging to use with RStudio Team. To be successful with EFS, please be sure to read through this full document and adhere to its guidance.
Amazon EFS is a managed shared file system that scales elastically with the amount of storage you use. Since it supports Network File System (NFS), it is possible to use EFS as the shared file system with RStudio products. However, in some situations relevant to RStudio Professional Products, EFS can suffer from slower performance relative to Elastic Block Store (EBS).
This slower performance is particularly prevalent in workloads that are sensitive to latency (as opposed to throughput). Specifically, EFS is not performant when reading and writing thousands of small files. When managing R workloads on a server, this can be problematic because some R packages that contain C++ code can contain a great many C++ header files. For example, the BH package on CRAN contains ~12K header files, so installation of BH can be very slow on EFS.
On some (fairly rare) occasions, this can also affect direct data science work in cases where the workflow requires reading a large number of files. For example, performance may be poor when training a deep neural network that processes image files in bulk.
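To see which installed packages are likely to be slow on EFS, it can help to count the files under each package directory. The following is a minimal sketch, not an RStudio tool; the function names and the threshold of 5,000 files are illustrative assumptions.

```python
import os
from collections import Counter


def count_files_per_package(library_path):
    """Count regular files under each immediate subdirectory of library_path."""
    counts = Counter()
    for entry in os.scandir(library_path):
        if not entry.is_dir(follow_symlinks=False):
            continue
        # Walk the whole package tree; header-heavy packages nest deeply.
        for _root, _dirs, files in os.walk(entry.path):
            counts[entry.name] += len(files)
    return counts


def heavy_packages(library_path, threshold=5000):
    """Return (package, file_count) pairs at or above threshold, largest first.

    The default threshold is an arbitrary illustrative value.
    """
    counts = count_files_per_package(library_path)
    flagged = [(name, n) for name, n in counts.items() if n >= threshold]
    return sorted(flagged, key=lambda item: item[1], reverse=True)
```

Pointing `heavy_packages` at an R library directory would list packages such as BH that are good candidates for preinstallation.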
- Users interacting with EFS will experience slowness in certain operations. For additional information, see the section on testing and benchmarking your configuration.
- EFS should be configured with specific settings, as described below.
- EFS is currently incompatible with Project Sharing in RStudio Workbench.
Product-specific limitations and recommendations#
This document is relevant to RStudio Workbench, RStudio Connect, and RStudio Package Manager usage. We only have a few product-specific limitations and recommendations.
For RStudio Workbench:
- You will not be able to use the Project Sharing feature of RStudio, due to the lack of support for access control lists (ACLs) on EFS.
- The default lock type of link-based won't work; the advisory type must be used instead.
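As a rough illustration of checking the lock type, the sketch below parses a simple key=value configuration file and reports whether advisory locks are selected. It assumes the setting is stored as a `lock-type` key in a file such as `/etc/rstudio/file-locks`; confirm the file name and key against the administration guide for your Workbench version.

```python
def parse_key_value_config(text):
    """Parse simple key=value lines, ignoring blank lines and # comments."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        key, separator, value = line.partition("=")
        if separator:
            settings[key.strip()] = value.strip()
    return settings


def uses_advisory_locks(config_text):
    """True if the (assumed) lock-type key is set to advisory."""
    return parse_key_value_config(config_text).get("lock-type") == "advisory"


# Example (path is an assumption, not verified against your install):
#   with open("/etc/rstudio/file-locks") as handle:
#       print(uses_advisory_locks(handle.read()))
```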
RStudio Workbench 1.4, or greater
Upgrade to version 2021-09.0
We highly recommend upgrading to version 2021-09.0 because this release includes several performance improvements for EFS.
For RStudio Connect:
- If you choose to configure Database.Dir, this also must point to the same shared location.
For RStudio Package Manager:
- Use the lookupcache=pos mount option to prevent long service delays due to attribute caching. See the NFS documentation for more information.
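One way to confirm that a mount option is active is to inspect the options the kernel reports for the mount. The sketch below parses text in the `/proc/mounts` format; the helper names are hypothetical, and the exact option string the kernel prints may vary between kernel versions.

```python
def mount_options(proc_mounts_text, mount_point):
    """Return the option list for mount_point, or None if it is not mounted.

    Each /proc/mounts line has the form:
    device mount_point fs_type options dump pass
    """
    for line in proc_mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mount_point:
            return fields[3].split(",")
    return None


def has_mount_option(proc_mounts_text, mount_point, option):
    """True if option appears verbatim in the mount's option list."""
    options = mount_options(proc_mounts_text, mount_point)
    return options is not None and option in options


# Example (the mount point is a placeholder):
#   with open("/proc/mounts") as handle:
#       print(has_mount_option(handle.read(), "/mnt/efs", "lookupcache=pos"))
```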
Recommended EFS configuration settings#
Our recommendations for the optimal EFS configuration are as follows. Details on each of the settings are provided below.
When using and configuring EFS:
Use the general purpose performance mode, rather than Max I/O mode.
In most cases you should use bursting behavior, rather than provisioned throughput.
Single Availability Zone EFS will perform much faster than Multiple Availability Zone EFS, but this should be a careful design choice.
Install and use the Amazon package efs-utils to mount the file system, since this can have a significant positive impact in some cases.
If mounting with EFS Access Points, be cautious as this may mount all files with the same UID/GID, which is typically not desired.
Your choice of EC2 instance type can also affect performance. We recommend provisioning instances that are memory or compute optimized (do not choose general purpose instances), and choosing the enhanced networking options, e.g.,
Use CloudWatch to monitor usage and identify system performance bottlenecks. In particular, watch out for situations where your burst credits run out, since performance will be dramatically worse without bursting available.
Operations that write many small files (thousands or more) will not perform well in most EFS settings.
We recommend preinstalling R packages so that users do not have to repeatedly install them.
Consider adopting code patterns that prefer reading large files over splitting data between many small files.
When using an EFS file system for many users, we recommend segmenting data files into user-specific directories as much as is possible. Since writing large files will block metadata operations in the same directory until the write operation is complete, keeping the users' data isolated in separate directories will minimize the impact of large file operations on other users.
Since EFS performance is largely based on individual usage patterns, this document should serve as a starting point rather than an absolute directive. Be aware that you will need to tune your EFS configuration after monitoring user behavior, and will likely need to adjust it over time to ensure long-term performance.
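The recommendation to prefer large files over many small files can be sketched as follows: bundle small payloads into a single archive once, then stream them back from that one file. This is only an illustration using Python's standard `tarfile` module; the function names are hypothetical, and the right container format depends on your workload.

```python
import io
import tarfile


def bundle_files(named_payloads, archive_path):
    """Write many small in-memory payloads into one uncompressed tar file."""
    with tarfile.open(archive_path, "w") as archive:
        for name, payload in named_payloads.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))


def read_bundle(archive_path):
    """Yield (name, bytes) pairs by streaming through the single archive."""
    with tarfile.open(archive_path, "r") as archive:
        for member in archive:
            if member.isfile():
                yield member.name, archive.extractfile(member).read()
```

Reading the bundle opens one file on the shared file system rather than one per record, which avoids most of the per-file metadata round trips.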
Details on recommended EFS configuration settings#
Max I/O vs. GeneralPurpose#
When creating a file system you must choose a performance mode. This must be done at EFS creation time, and cannot be altered later.
We strongly recommend using the General Purpose performance mode.
File systems in the Max I/O mode can scale to higher levels of aggregate throughput and operations per second. This scaling is done with a tradeoff of slightly higher latencies for file metadata operations. Highly parallelized applications and workloads, such as big data analysis, media processing, and genomic analysis, can benefit from this mode.
However, in RStudio's testing, we have found Max I/O mode to offer significantly worse performance because of the increased latency, especially in the "many small files" scenario.
Bursting vs. Provisioned Throughput#
EFS supports two throughput modes: Bursting Throughput and Provisioned Throughput.
From the AWS documentation page:
With Bursting Throughput mode, throughput on Amazon EFS scales as the size of your file system in the EFS Standard or One Zone storage class grows. [...] With Provisioned Throughput mode, you can instantly provision the throughput of your file system (in MiB/s) independent of the amount of data stored.
Most RStudio customers should start with the default bursting behavior until they understand the characteristics of their file access patterns and the costs they incur.
You will need to monitor your Burst Credit balance and permitted throughput via CloudWatch to ensure you are not surprised by throttling if you run out of burst credits. We highly recommend setting alarms based on these metrics.
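As a starting point for such an alarm, the sketch below builds the parameters for a CloudWatch alarm on the `BurstCreditBalance` metric in the `AWS/EFS` namespace. The alarm name, threshold, and evaluation settings are illustrative assumptions; the commented call shows how the dictionary could be passed to boto3's `put_metric_alarm`.

```python
def burst_credit_alarm_params(file_system_id, threshold_bytes,
                              period_seconds=300, evaluation_periods=3):
    """Build keyword arguments for a low-burst-credit CloudWatch alarm.

    The threshold and timing defaults are placeholders; tune them to
    your own usage patterns.
    """
    return {
        "AlarmName": f"efs-{file_system_id}-low-burst-credits",
        "Namespace": "AWS/EFS",
        "MetricName": "BurstCreditBalance",
        "Dimensions": [{"Name": "FileSystemId", "Value": file_system_id}],
        "Statistic": "Minimum",
        "Period": period_seconds,
        "EvaluationPeriods": evaluation_periods,
        "Threshold": float(threshold_bytes),  # BurstCreditBalance is in bytes
        "ComparisonOperator": "LessThanThreshold",
    }


# Example use (requires boto3 and AWS credentials; not executed here):
#   import boto3
#   params = burst_credit_alarm_params("fs-12345678", 1 * 1024**4)
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```

In practice you would also attach `AlarmActions` (for example, an SNS topic) so that someone is notified when the alarm fires.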
Throttling is remedied by either generating more Burst Credits (writing files to the file system or waiting for the Burst Credits to refresh) or converting to Provisioned Throughput mode. Large file systems (> 1TB) should theoretically be able to burst for 50% of the time. For smaller file systems, Provisioned Throughput can be set to maintain a constant performance level.
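The 50% figure follows from the ratio of baseline to burst throughput. The sketch below works through that arithmetic, assuming the rates AWS documented for Bursting Throughput at the time of writing (a baseline of 50 MiB/s per TiB stored, bursting to 100 MiB/s per TiB with a floor of 100 MiB/s). These constants are assumptions and should be checked against current AWS documentation.

```python
# Assumed rates from AWS documentation at the time of writing.
BASELINE_MIBPS_PER_TIB = 50.0
BURST_MIBPS_PER_TIB = 100.0
MIN_BURST_MIBPS = 100.0


def baseline_throughput_mibps(size_gib):
    """Baseline throughput earned by the amount of data stored."""
    return size_gib / 1024.0 * BASELINE_MIBPS_PER_TIB


def burst_throughput_mibps(size_gib):
    """Burst throughput, which never drops below the assumed floor."""
    return max(MIN_BURST_MIBPS, size_gib / 1024.0 * BURST_MIBPS_PER_TIB)


def sustainable_burst_fraction(size_gib):
    """Fraction of time the file system can burst without draining credits."""
    return baseline_throughput_mibps(size_gib) / burst_throughput_mibps(size_gib)
```

Under these assumptions, a 1 TiB file system can burst half of the time, while a 100 GiB file system earns credits for only about 5% of the time, which is why small file systems are the ones most likely to be throttled.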
Generating large files to bump into a larger tier of burst performance is both time consuming and expensive. Weigh these options carefully. For example, creating 1TB of data could cost hundreds of dollars per month in storage costs.
If migrating to EFS, Provisioned Throughput can help save time if you wish to move a lot of data. In our tests, moving from Bursting to 500MiB Provisioned improved speed by 5x and preserved Burst Credits.
If you choose One Zone Storage (see below), note that in our tests the differences between bursting and provisioned performance appeared to be minimal.
Multiple Availability Zone (default) vs One Zone Storage Classes#
AWS recently introduced Single Availability Zone (AZ) EFS with a different SLA than Multiple AZ instances. As of June 2021, Multiple AZ EFS instances support 99.99% uptime and Single AZ instances support 99.9% uptime. The Single AZ EFS is still durable, but there is no failover if the entire availability zone goes down.
Most organizations adopting EFS to load balance RStudio's Professional Products will prefer the Multiple AZ SLA. However, the Single AZ EFS performance is substantially better than Multiple AZ EFS, such that in our testing, its performance was comparable to a custom-configured NFS server.
This makes the Single AZ EFS server a strong option when configuring development environments.
Read more about storage classes.
Use efs-utils when mounting the file system#
We strongly recommend using efs-utils to mount the EFS file system. If this is not feasible, standard NFS client connections are possible, but there are mounting instructions and additional considerations to take into account.
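To make the two mounting approaches concrete, the sketch below generates `/etc/fstab` entries for each. The `efs` mount type and the NFS option list follow the forms shown in AWS's mounting documentation as best we recall them; the file system ID, region, and mount point are placeholders, and the helper names are hypothetical.

```python
# NFS options AWS recommends for mounting EFS without efs-utils (assumed
# from AWS documentation; verify before use).
RECOMMENDED_NFS_OPTIONS = [
    "nfsvers=4.1", "rsize=1048576", "wsize=1048576",
    "hard", "timeo=600", "retrans=2", "noresvport", "_netdev",
]


def efs_utils_fstab_entry(file_system_id, mount_point, use_tls=True):
    """fstab line that uses the efs mount helper installed by efs-utils."""
    options = ["_netdev"] + (["tls"] if use_tls else [])
    return f"{file_system_id}:/ {mount_point} efs {','.join(options)} 0 0"


def plain_nfs_fstab_entry(file_system_id, region, mount_point, extra_options=()):
    """fstab line for a standard NFSv4.1 client connection."""
    host = f"{file_system_id}.efs.{region}.amazonaws.com"
    options = RECOMMENDED_NFS_OPTIONS + list(extra_options)
    return f"{host}:/ {mount_point} nfs4 {','.join(options)} 0 0"


# Example:
#   print(efs_utils_fstab_entry("fs-12345678", "/mnt/efs"))
#   print(plain_nfs_fstab_entry("fs-12345678", "us-east-1", "/mnt/efs"))
```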
Set read_ahead_kb size to 15 MB#
From the AWS performance tips page:
Linux kernels (5.4.*) use a read_ahead_kb of 128 KB; however, the AWS docs recommend 15 MB.
The efs-utils package should set this correctly, but we recommend checking this value regardless to ensure that it is set to 15 MB as expected. Customers who wish to use only standard NFS utilities will need to set this value manually.
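A check along these lines can be scripted. The sketch below assumes the value is exposed at `/sys/class/bdi/<major>:<minor>/read_ahead_kb` for the mount's device, as in AWS's performance tips, and treats 15000 KB as the 15 MB target; both are assumptions to verify on your own systems.

```python
import os

# 15 MB expressed in KB, following the value used in AWS's guidance (assumed).
RECOMMENDED_READ_AHEAD_KB = 15000


def bdi_read_ahead_path(mount_point):
    """Build the sysfs path holding read_ahead_kb for the mount's device."""
    device = os.stat(mount_point).st_dev
    return f"/sys/class/bdi/{os.major(device)}:{os.minor(device)}/read_ahead_kb"


def read_ahead_kb(path):
    """Read the current read-ahead size (in KB) from a sysfs-style file."""
    with open(path) as handle:
        return int(handle.read().strip())


def is_read_ahead_ok(path, expected_kb=RECOMMENDED_READ_AHEAD_KB):
    """True if the configured read-ahead is at least the expected size."""
    return read_ahead_kb(path) >= expected_kb


# Example (the mount point is a placeholder):
#   print(is_read_ahead_ok(bdi_read_ahead_path("/mnt/efs")))
```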
EC2 instance types#
In general for EFS, AWS recommends preferring instance types with more CPU or memory depending on the workload. Prefer memory-optimized or compute-optimized over general purpose instance types.
In our benchmarking, we have observed performance gains by using memory-optimized instance types, e.g., r5. For UI-related tasks like installing the BH package, this could provide a better user experience.
For server installations that utilize many NFS client connections (e.g., Launcher), the enhanced networking might prove to be noticeably better. Consider using the n variants, e.g.,
EC2 instance sizes#
We have observed significant gains in going from
xlarge instance sizes, primarily in parallelized load. For servers with many users, we recommend increasing the instance size.
Do not attempt to use smaller instance types, e.g., c5.large with 4GB memory.
If using Bursting mode:
Be sure to monitor the BurstCreditBalance metric. If this begins to decrease substantially over time, consider adding data to bump the file system size into a larger tier with more burst credits, or moving to Provisioned Throughput to establish a consistent baseline. Note that this is likely to incur extra cost, so please weigh the tradeoffs carefully.
Using metric math, you can compare PermittedThroughput to know if you are using all of your available throughput. If you are, it might be an indication that you should move to Provisioned Throughput.
If using Provisioned Throughput:
PermittedThroughput can be used to determine whether or not your storage volume has bumped you above your designated throughput setting.
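The comparison against PermittedThroughput can be sketched as simple arithmetic. The example below assumes you have the sum of a metered-bytes metric (for example, `MeteredIOBytes`, which is our assumption and not named in this document) over a CloudWatch period, and that PermittedThroughput is reported in bytes per second.

```python
def throughput_utilization(metered_bytes_sum, period_seconds,
                           permitted_bytes_per_second):
    """Return average throughput as a fraction of permitted throughput.

    metered_bytes_sum: total bytes metered during the period (assumed metric)
    period_seconds: length of the CloudWatch period
    permitted_bytes_per_second: PermittedThroughput for the same period
    """
    if period_seconds <= 0 or permitted_bytes_per_second <= 0:
        raise ValueError("period and permitted throughput must be positive")
    average_bytes_per_second = metered_bytes_sum / period_seconds
    return average_bytes_per_second / permitted_bytes_per_second
```

A value that stays close to 1.0 suggests the file system is using all of its available throughput.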
Testing and benchmarking your configuration with fsbench#
If you want to collect data about file system performance on your own EFS installation, you can use RStudio's benchmarking tool, which runs a suite of file operations to help characterize system behavior and compare it against other known configurations.
For information on how to configure and run benchmark testing, please refer to the fsbench package documentation.
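For a quick sanity check before running a full benchmark suite, a rough timing of many small writes versus one large write can show whether a mount is latency-bound. The sketch below is independent of fsbench and does not reproduce its tests; the function names, file counts, and sizes are arbitrary assumptions.

```python
import os
import time


def time_small_files(directory, file_count=1000, file_size=1024):
    """Time writing file_count small files into directory; returns seconds."""
    payload = b"x" * file_size
    start = time.perf_counter()
    for index in range(file_count):
        with open(os.path.join(directory, f"small_{index}.bin"), "wb") as handle:
            handle.write(payload)
    return time.perf_counter() - start


def time_one_large_file(directory, total_size=1000 * 1024):
    """Time writing the same volume of data as a single file; returns seconds."""
    payload = b"x" * total_size
    start = time.perf_counter()
    with open(os.path.join(directory, "large.bin"), "wb") as handle:
        handle.write(payload)
        handle.flush()
        os.fsync(handle.fileno())  # make sure the data actually reaches storage
    return time.perf_counter() - start


# Example (directory is a placeholder on the EFS mount):
#   print(time_small_files("/mnt/efs/bench"), time_one_large_file("/mnt/efs/bench"))
```

Running both functions against a directory on EFS and again on local disk gives a coarse view of how much the "many small files" pattern costs in your environment.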