14 Repositories and Sources

R repositories contain package tar files and are the primary vehicle for organizing and distributing R packages. For more information on packages and repositories see B.

In RStudio Package Manager, repositories are created from one or more sources. The documentation in this chapter outlines repositories as well as the types and structure of sources.

14.1 Repository Structure

R package repositories have a specific structure that enables client commands like install.packages to query the repository’s contents and download packages.

A regular CRAN repository is just a set of files served from disk. RStudio Package Manager does not create repositories on disk. Instead, RStudio Package Manager maintains a single copy of each package source and uses a database and specialized web server to handle HTTP requests from R.

Some example requests that can be served by the RStudio Package Manager:

PACKAGES file

http://pkg-manager.example.com/repo/latest/src/contrib/PACKAGES

This serves a PACKAGES file. The PACKAGES file for a repository is human-readable and contains information on each package available in the repository. RStudio Package Manager can also serve requests for PACKAGES.gz and PACKAGES.rds.

Package Source

http://pkg-manager.example.com/repo/latest/src/contrib/package_2.1.0.tar.gz

This request downloads the package source to the client.

Archived Package Source

http://pkg-manager.example.com/repo/latest/src/contrib/archive/package/package_1.1.0.tar.gz

This request downloads the tar file for an older, archived version of the package.

Most importantly, an RStudio Package Manager repository is a CRAN-like repository, which means users can access and install packages using their regular R functions and tools: install.packages, available.packages, packrat, and devtools::install.
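The request patterns above can be sketched as a few URL helpers. This is only an illustration of the layout shown in the examples, not code shipped with RStudio Package Manager. The function names are hypothetical, and the server, repository, and package names are the placeholders used in this section.

```python
def contrib_url(base, repo, snapshot="latest"):
    """Return the src/contrib URL that R clients query for source packages."""
    return f"{base.rstrip('/')}/{repo}/{snapshot}/src/contrib"


def packages_index_url(base, repo, suffix=""):
    """URL of the package index; suffix may be '', '.gz', or '.rds'."""
    return f"{contrib_url(base, repo)}/PACKAGES{suffix}"


def package_source_url(base, repo, name, version, archived=False):
    """URL of a package tar file, either current or archived."""
    tarball = f"{name}_{version}.tar.gz"
    if archived:
        # Archived versions sit under archive/<package name>/ in the examples above.
        return f"{contrib_url(base, repo)}/archive/{name}/{tarball}"
    return f"{contrib_url(base, repo)}/{tarball}"


BASE = "http://pkg-manager.example.com"  # placeholder host from this section
print(packages_index_url(BASE, "repo"))
print(package_source_url(BASE, "repo", "package", "2.1.0"))
print(package_source_url(BASE, "repo", "package", "1.1.0", archived=True))
```

R clients are given the repository URL without the src/contrib suffix and append the rest themselves, for example install.packages("package", repos = "http://pkg-manager.example.com/repo/latest").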

14.2 Repository Versioning

RStudio Package Manager tracks every change to a repository (or source) and associates each change with a transaction id. Together, these transaction ids create a full versioned history of each repository. If a user wants to install packages from a prior point in the repository’s history they can do so by replacing the /latest component of the request URL with a transaction id. Transaction ids can be obtained in a repository’s “Activity” log. The current transaction id is available in the “Setup” page. For projects that require strict reproducibility, we recommend configuring R to use a repository URL with a transaction id. Versioning is available for all repository and source types.
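As a rough sketch of the URL substitution described above, the helper below swaps the latest component of a repository URL for a transaction id. The function name and the id value 25 are hypothetical; real ids come from the repository's Activity log or Setup page. The sketch also assumes the repository itself is not named "latest".

```python
from urllib.parse import urlsplit, urlunsplit


def pin_repo_url(url, transaction_id):
    """Replace the 'latest' path component of a repository URL with a transaction id."""
    parts = urlsplit(url)
    segments = parts.path.split("/")
    if "latest" not in segments:
        raise ValueError("URL has no /latest component to replace")
    # Only the first 'latest' segment is replaced (assumes the repo is not named 'latest').
    segments[segments.index("latest")] = str(transaction_id)
    return urlunsplit(parts._replace(path="/".join(segments)))


# 25 is a made-up transaction id used purely for illustration.
print(pin_repo_url("http://pkg-manager.example.com/repo/latest", 25))
```

The resulting URL can then be used wherever the /latest URL would be, for example as the repos value in an R project's options.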

14.3 Sources

14.3.1 About Sources

RStudio Package Manager repositories are composed of one or more sources. There are currently four types of sources:

  1. cran source - A single cran source is automatically created. This source contains metadata and packages from RStudio’s CRAN service. The source can be used directly in a repository to give users access to all CRAN packages, or it can be used indirectly by curated-cran sources.

While the cran source is created automatically, an administrator must use the CLI before any metadata or packages are downloaded to RStudio Package Manager. See the CLI section for more information on making CRAN available through RStudio Package Manager.

  2. curated-cran source - A curated CRAN source allows administrators to define specific sets of approved CRAN packages. Administrators can add or remove packages from the set, and they can also update the set. See 14.5 for more information.

  3. local source - A local source is used as a mechanism to distribute locally developed packages or other packages without native support in RStudio Package Manager. Administrators add packages to local sources by specifying a path to a package’s tar file.

  4. git source - A git source allows RStudio Package Manager to automatically make packages in Git available to R users through install.packages (without requiring devtools). Git sources work for internal packages as well as external sites such as GitHub. Packages can be automatically updated on each commit or when a new Git tag is pushed.

14.3.2 Repositories with Multiple Sources

A repository can have more than one source. If you wish to serve both local packages and CRAN packages from a single repository, you can create a single repository that subscribes to multiple sources. For example:

  • public (a repository)
    • internal (local source)
    • cran (CRAN source)

The “public” repository above gives users access to both local and CRAN packages, and its PACKAGES list could be accessed, for example, at http://pkg-manager.example.com/public/latest/src/contrib/PACKAGES. A repository subscribes to sources, which means that changes to a source will be reflected in the repository. For example, if an admin adds a new package to the internal source, users will automatically be able to access the new package via the public repository.

14.3.3 Package Conflicts Between Sources

If a repository has multiple sources and a package with the same name exists in both sources, RStudio Package Manager eliminates duplicates, giving preference in the order the sources are subscribed. In the example repository above, if a package named “plumber” exists in both the “cran” and “internal” sources, the “plumber” package from the “internal” source would be served and listed since it is the first source for the repository. The same conflict resolution occurs as sources change. For example, in the sample above, even if a new package is added to CRAN with the same name as an internal package, the internal package will continue to be served. The precedence is also maintained during updates. In the example above, the internal version of plumber will continue to be served even if the CRAN version of plumber is updated. The order of sources within a repository can be re-arranged using the reorder command.
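The precedence rule can be modeled in a few lines. This is a conceptual sketch rather than RStudio Package Manager's actual implementation: the function name is hypothetical and the package versions are invented for the example.

```python
def resolve_packages(sources):
    """Pick one entry per package name, preferring earlier sources.

    sources: list of (source_name, {package: version}) in subscription order.
    """
    served = {}
    for source_name, packages in sources:
        for package, version in packages.items():
            # setdefault keeps the entry from the earliest subscribed source.
            served.setdefault(package, (source_name, version))
    return served


# Hypothetical contents of the "public" repository from the example above.
public = [
    ("internal", {"plumber": "0.9.0"}),
    ("cran", {"plumber": "1.0.0", "ggplot2": "3.3.0"}),
]
for package, (source, version) in sorted(resolve_packages(public).items()):
    print(package, version, "from", source)
```

Because the result depends only on subscription order, updating plumber in the cran entry leaves the internal copy in place, while reversing the list would make the cran copy win.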

14.4 The CRAN Source

A primary use case for RStudio Package Manager is making packages in public repositories, like CRAN, available to enterprise users. Administrators can elect to make all of CRAN available, or to make only curated subsets of CRAN available.

14.4.1 What is RStudio’s Package Service?

RStudio Package Manager doesn’t download packages directly from CRAN. Instead, RStudio maintains a curated S3 bucket that contains metadata about CRAN and package tar files. The metadata is used to track CRAN’s day-to-day changes.

See D if your environment does not have access to the RStudio Package Service.

During a sync, the metadata is downloaded to RStudio Package Manager. The metadata is compared against the RStudio Package Manager database to determine what changes need to be applied. Package tarballs are then downloaded to the cache either eagerly or lazily depending on the sync mode.

Lazy sync mode is recommended.

Chapter 11.2 details the security measures in place for the RStudio Package Service.

14.4.2 Eager vs Lazy

The sync mode is configured using the CRAN.SyncMode property. The property defaults to “lazy”, but can be set to “eager” for eager package fetching. See A.1.

14.4.2.1 Eager

For eager fetching, RStudio Package Manager (RSPM) attempts to download any package sources that could be served to users. This means that RSPM will download packages under a number of scenarios:

  1. When a repository subscribes to the CRAN source.
  2. When a sync occurs while a repository is subscribed to the CRAN source, whether the sync is initiated with the CLI sync command or by the configured schedule.
  3. When a package is added to a curated CRAN source with the add command (and at least one repository has subscribed to the curated CRAN source).
  4. When a curated CRAN source is updated with the update command (and at least one repository has subscribed to the curated CRAN source).

All four of these actions happen automatically. Additionally, an admin can force RSPM to download packages using the CLI fetch command.

If a repository subscribes to the CRAN source, then all CRAN packages will be downloaded. Downloading all of CRAN during an initial sync can take a significant amount of time, bandwidth, and disk space. However, downloading packages does not block end users from accessing repositories that subscribe to the source. End users can request any package as soon as the metadata is synced. If a user requests a package that is not already downloaded, the package will be immediately downloaded and served to the client.

14.4.2.2 Lazy

If RStudio Package Manager is set up for lazy fetching, it downloads packages as they are requested by end users. RStudio Package Manager still downloads the CRAN metadata on the sync schedule to keep its database updated. The database serves as the source of truth for package availability. The benefit of lazy fetching is a smaller footprint in terms of network bandwidth and disk space.

14.4.2.3 Package Caching

In either mode, each version of a package is only downloaded once. RStudio Package Manager always checks the local cache to see if the required tar file is already available.
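The difference between the two modes, and the download-once behavior of the cache, can be illustrated with a toy model. This is a simplified sketch under stated assumptions, not the server's real logic: the class and its attributes are hypothetical, and a "download" is just a counter increment.

```python
class PackageCache:
    """Toy model: metadata lists what is available, the cache holds what was fetched."""

    def __init__(self, available, mode="lazy"):
        self.available = set(available)  # tar files known from synced metadata
        self.mode = mode                 # "lazy" or "eager"
        self.cached = set()
        self.downloads = 0

    def _download(self, tarball):
        # Each tar file is fetched at most once; later requests hit the cache.
        if tarball not in self.cached:
            self.cached.add(tarball)
            self.downloads += 1

    def sync(self):
        # Eager mode prefetches everything that could be served; lazy mode does not.
        if self.mode == "eager":
            for tarball in sorted(self.available):
                self._download(tarball)

    def request(self, tarball):
        # In either mode, a package missing from the cache is fetched on demand.
        if tarball not in self.available:
            raise KeyError(tarball)
        self._download(tarball)
        return tarball


lazy = PackageCache(["a_1.0.tar.gz", "b_2.0.tar.gz"], mode="lazy")
lazy.sync()
lazy.request("a_1.0.tar.gz")
lazy.request("a_1.0.tar.gz")
print(lazy.downloads)
```

In this model a lazy server has fetched only the one requested tar file after two identical requests, whereas an eager server would have fetched both tar files during the sync.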

14.4.2.4 Changing Modes

If you change the sync mode, the following occurs:

Lazy to Eager - Eager fetching is applied to future syncs. When you next synchronize, all packages that have not yet been downloaded will be downloaded.

Eager to Lazy - Lazy fetching is applied to all future syncs. A sync with in-progress downloads will be completed.

14.4.3 Updates from CRAN

The cran source is updated according to a schedule set using the CRAN.SyncSchedule property in the RStudio Package Manager configuration file. This property accepts a string in crontab format. See A.1.

By default, the configuration file includes a crontab that causes RStudio Package Manager to sync once a day at midnight (in the server’s timezone), provided at least one of three conditions holds: a repository subscribes to “cran”, a “curated-cran” source is used by any repository, or a manual sync has been run with the sync command. The sync schedule is not applied if none of these conditions is met. If you only want manual syncs, change the configuration file to have a blank value for CRAN.SyncSchedule:

;/etc/rstudio-pm/rstudio-pm.gcfg
[CRAN]
SyncSchedule = ""

The SyncSchedule property does not necessarily determine when a repository will make updated packages available to users. If the repository subscribes directly to the cran source, users will see updates according to the sync schedule. In contrast, if the repository subscribes to a curated CRAN source, an administrator must explicitly update the source in order for updates to become available.

In addition, updating the repository does not automatically push updated packages to R clients. A repository specifies what packages are available, but the R user is in control of when and how to update the packages used by a project.

See the section on managing change control for more information.

RStudio Package Manager keeps track of old versions of packages as well. Old versions of packages are available in the repository’s archive, and are listed in the RStudio Package Manager web UI. This allows users to roll back updates if necessary or install packages as they existed at a prior time.

14.5 Curated CRAN Sources

Curated CRAN sources allow administrators to create and update approved subsets of CRAN. The behavior is best explained in an example.

Assume that RStudio Package Manager has been configured to sync CRAN updates daily.

January 1st - An administrator creates a curated CRAN source and is given a list of desired packages.

January 2nd - The administrator can use the add command with the dryrun flag, supplying the list of desired packages. RStudio Package Manager will identify all of the required dependencies and create a proposal. The proposal includes the set of packages to be added as well as information about each package, such as license type. This information can be used to facilitate an external review process.

January 15th - The proposal is approved. The administrator returns to RStudio Package Manager and runs the add command again, replacing the dryrun flag with a transaction ID included in the proposal. The set of packages is added from CRAN as they existed on January 1st, the date the source was created.

January 20th - The administrator receives a request to add a new package to the set of approved packages. The admin uses the add command with the dryrun flag, supplying the new package as an argument. To ensure compatibility between the packages in the source, RStudio Package Manager creates the proposal from CRAN as it existed on January 1st, the day the source was created. As before, if the proposal is accepted, the admin can commit the changes.

January 30th - Now the administrator gets a request to update the approved packages. In order to keep all packages consistent, the entire set is updated at once using the update command. Like the add command, the update command supports a dryrun flag that will enumerate all the changes needed to update the set of packages from January 1st to January 30th.

February 1st - The proposal is approved and the administrator completes the update command by using the transaction ID included in the dry run update. The set of packages is now tied to CRAN on January 30th. Future add commands will use this pinned date, until another update sequence occurs.

To summarize, curated CRAN sources allow admins to create a subset of CRAN at a point in time. Administrators can add packages to the subset from the same frozen point in time. Administrators can also update the subset to a newer point in time. Each change supports a dry run that creates a proposal and a confirmation run that applies the proposal.
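The dry run and confirmation cycle in the timeline above can be pictured with a small state model. This is a conceptual sketch only: the class, its methods, and the counter-based transaction ids are hypothetical and do not reflect the product's data model or CLI. The year and package names are invented.

```python
from datetime import date


class CuratedSource:
    """Toy model of a curated CRAN source pinned to one CRAN snapshot date."""

    def __init__(self, created):
        self.pinned = created    # snapshot date that add commands draw from
        self.packages = set()    # approved package names
        self._proposals = {}     # transaction id -> (kind, payload)
        self._next_id = 1        # hypothetical id scheme: a simple counter

    def _propose(self, kind, payload):
        txn = self._next_id
        self._next_id += 1
        self._proposals[txn] = (kind, payload)
        return txn

    def add_dryrun(self, names):
        """Record a proposal to add packages; nothing changes yet."""
        return self._propose("add", frozenset(names))

    def update_dryrun(self, synced_through):
        """Record a proposal to move the pin to the latest synced date."""
        return self._propose("update", synced_through)

    def commit(self, txn):
        """Apply a previously proposed change by its transaction id."""
        kind, payload = self._proposals.pop(txn)
        if kind == "add":
            self.packages |= payload  # added as of the current pinned date
        else:
            self.pinned = payload     # the whole set moves to the new date


source = CuratedSource(created=date(2020, 1, 1))
proposal = source.add_dryrun({"ggplot2", "shiny"})  # January 2nd
source.commit(proposal)                              # January 15th
proposal = source.update_dryrun(date(2020, 1, 30))   # January 30th
source.commit(proposal)                              # February 1st
print(source.pinned, sorted(source.packages))
```

The point of the model is that a dry run never changes the source, and that the pinned date moves only when an update proposal is committed.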

Given a list of desired packages, RStudio Package Manager automatically determines the full set of dependencies and also tracks those dependencies over time. Admins can elect to include suggested dependencies or only required dependencies by using the include-suggests flag. During each update, older versions of packages are archived, ensuring that tools like packrat and RStudio Connect work seamlessly with the curated CRAN subset.
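Determining the full set of dependencies amounts to a graph traversal. The sketch below shows one way to compute it over a toy dependency table. The function name and the table are illustrative. Treating suggested packages as extra starting points for the requested packages only is an assumption, not documented behavior of the include-suggests flag.

```python
from collections import deque


def dependency_closure(requested, requires, suggests=None, include_suggests=False):
    """Return the requested packages plus everything they transitively require."""
    suggests = suggests or {}
    roots = set(requested)
    if include_suggests:
        # Assumption: only the requested packages' suggestions are pulled in.
        for package in requested:
            roots.update(suggests.get(package, ()))
    selected = set()
    queue = deque(sorted(roots))
    while queue:
        package = queue.popleft()
        if package in selected:
            continue
        selected.add(package)
        queue.extend(requires.get(package, ()))
    return selected


# Toy dependency table for illustration; not real CRAN metadata.
requires = {"shiny": ["httpuv", "htmltools"], "httpuv": ["Rcpp"]}
suggests = {"shiny": ["testthat"]}
print(sorted(dependency_closure(["shiny"], requires)))
print(sorted(dependency_closure(["shiny"], requires, suggests, include_suggests=True)))
```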

The update command will be impacted by the sync schedule defined on the server. If the server only syncs every few weeks, update will only reference the latest data from CRAN available on the server.

14.6 Git Sources

Git sources allow RStudio Package Manager to automatically expose R packages tracked in Git. Git sources work with internal packages as well as external sites such as GitHub.

Git sources require a configured R installation.

An admin follows these steps:

  1. Create a git source
  2. Add a Git endpoint to the source, specifying whether to watch for commits to a branch or tags. The endpoint can be HTTP or SSH (see below). See the add command for full details, e.g. how to track a specific branch.
  3. Based on the selection specified with the add command, RStudio Package Manager clones the Git endpoint and runs an R job to transform the Git clone into a package bundle. The package bundle is made available to any repositories subscribing to the source.
  4. RStudio Package Manager polls the Git endpoint to watch for either new commits or new tags (based on the selection specified with the add command). If an update is available, RStudio Package Manager automatically pulls the new changes and launches an R job. The R job creates a package bundle from the updated Git clone and updates the package available in the git source. Previous versions are archived.
  5. Users install the package from the repository via install.packages, not devtools.

See the quickstart guide for a specific example.

14.6.1 Access restricted Git endpoints using SSH keys

Many internal Git endpoints require authentication. RStudio Package Manager can be configured to use SSH keys to authenticate against the endpoint.

Begin by creating an SSH key and granting the SSH key access to the Git endpoint. The specific steps will depend on your Git provider. Once you have the path to the SSH key, use the import command to securely name and store the SSH key for later use by RStudio Package Manager. If desired, you can now remove the SSH key file. Multiple keys can be imported.

To use the newly imported SSH key, use the SSH identifier for the Git endpoint in the add command and reference the name of the key with the --git-ssh-key flag.

14.6.1.1 SSH key Security

RStudio Package Manager encrypts and stores imported SSH keys in the metadata database. Any person (by default, members of the rstudio-pm Unix group) with access to the admin CLI can:

  • Associate an imported key with a Git endpoint using the add command
  • List the names of available SSH keys

Users cannot access the contents of the key, nor is the key available for arbitrary actions. We recommend granting SSH keys imported to RStudio Package Manager limited read-only access to only the endpoints you wish to expose as R packages.

Currently, RStudio Package Manager requires that keys do not use a passphrase. Imported keys are encrypted at rest. However, during Git operations that require SSH (e.g. cloning a private repository), the keys are temporarily written to temporary files in the configured Git.BuilderDir location (see A.6) and then immediately removed. The key files use 0600 permissions and are owned by the server user.

14.6.2 Commits vs Tags

A package based on a Git endpoint can be configured to watch one of two types of changes:

  1. Commits - RStudio Package Manager will update the package any time new commits are discovered in a branch. In this mode, RStudio Package Manager automatically modifies the package’s version, assigning a unique version number to each build. The version number is created based on the commit timestamp and is designed to avoid conflicts with the version scheme used by the package author. For example, if the DESCRIPTION file for a package indicates a version of 1.1-3, the automatic version number would be: 1.1-3.0.0.0.1537204599. If the author updates the package with a new commit, but keeps the version in the DESCRIPTION file the same, the new automatic version number would reflect the new commit timestamp, e.g. 1.1-3.0.0.0.1537218677. This process ensures that users of the package always get the correct behavior from install.packages, with newer commits being associated with a semantically higher version number.

  2. Tags - RStudio Package Manager will update the package any time a new Git tag is discovered. In this mode, RStudio Package Manager retains the version specified in the package’s DESCRIPTION file. This mode is designed to work when a Git tag is used to indicate a package release. Note: The name of the tag must match the version in the DESCRIPTION file. For example, if your package’s DESCRIPTION file has Version: 5.4.2 your tag must be either 5.4.2 or v5.4.2. If two tags reference the same version, preference is given to the newer tag. If a newer tag references an older version than a prior tag, the new tag is built as an archived package. If a tag is removed from a Git endpoint, the package is deleted.
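Two of the rules above lend themselves to a quick check: automatic commit-mode versions sort after the author's version and in commit order, and a tag must equal the package version with an optional leading v. The helpers below are an illustrative sketch with hypothetical names. They compare version strings component by component, splitting on "." and "-", and do not reproduce how RStudio Package Manager generates or validates versions.

```python
import re


def version_key(version):
    """Split a version such as '1.1-3' on '.' and '-' into integers for comparison."""
    return tuple(int(part) for part in re.split(r"[.-]", version))


def tag_matches_version(tag, version):
    """A tag is accepted if it equals the version, optionally prefixed with 'v'."""
    return tag in (version, f"v{version}")


# Version strings taken from the commit-mode example above.
first_build = "1.1-3.0.0.0.1537204599"
second_build = "1.1-3.0.0.0.1537218677"
print(version_key("1.1-3") < version_key(first_build) < version_key(second_build))
print(version_key(second_build) < version_key("1.1-4"))
print(tag_matches_version("v5.4.2", "5.4.2"), tag_matches_version("release-5.4.2", "5.4.2"))
```

Under this comparison, every automatic build of 1.1-3 still sorts below the author's next release, such as 1.1-4, which is consistent with the stated goal of avoiding conflicts with the author's version scheme.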

Commit mode is recommended for bleeding edge repositories, whereas tag mode is suitable for exposing stable releases of packages.

A git source can support different packages with different modes. However, a given package can only have one mode in a source. If you would like to surface the same package in both commit and tag mode, you must create two git sources.

14.6.3 Managing Packages from Git

RStudio Package Manager automatically handles updating and archiving packages in git sources as the Git endpoints change. Additionally, the package artifacts themselves can be manually removed using the remove command.

Deleting a git source with the delete command will remove all the packages generated by the git source and will remove all the metadata about the Git endpoints.

Finally, it is possible to keep the package artifacts already created but stop RStudio Package Manager from tracking the Git endpoint. To do so, use:

rspm delete git-builder --name=[name of package] --source=[name of source]

To view information about the current Git endpoints that are being tracked, use:

rspm list git-builders

A git-builder is automatically created when a package is added to a Git source using the add command. Git builders do not need to be created manually.

14.6.4 Combining packages from Git(Hub) with other package sources

Local packages cannot be added manually to a git source, but a repository can surface packages from a git source alongside local packages and CRAN packages by subscribing to multiple sources. Take care when managing a repository’s subscriptions, as order is important; see 14.3.3.

14.6.5 Polling Frequency

You can control how frequently RStudio Package Manager checks for updates using the Git.PollInterval configuration field. If multiple commits occur between checks, RStudio Package Manager will create a single version representing all of the changes. If multiple tags are created or removed between checks, RStudio Package Manager will build each tag individually, automatically archiving tags representing older versions of the package.

Repository Versioning is identical in all source types, including git sources.

14.6.6 Tracking Changes and Errors

If a repository subscribes to a git source, you can view the git source’s history in the Activity Log. The Activity Log will identify each change to a package including the new version, and a message will indicate the associated Git tag or commit as appropriate. If an error is encountered attempting to clone, poll, or bundle a package, the Activity Log will record the attempt and include a message with the CLI command to be run to view a full error log.

RStudio Package Manager automatically tries to build updates from a Git source 3 times. If the build fails more than 3 times, the update causing the failure is ignored. New updates are still discovered and built.

To force RStudio Package Manager to retry building an update, use the rerun command:

rspm rerun git-builder   \
  --name=[package name]  \
  --source=[source name] \
  --tag=[tag to rebuild, only required if the build trigger is tags]