Integrating RStudio Workbench and Jupyter with PySpark
This documentation describes the steps to use RStudio Workbench, formerly RStudio Server Pro, to connect to a
Spark cluster using Jupyter Notebooks and PySpark.
In this example, Apache Hadoop YARN is used as a resource manager on the Spark cluster, and
you'll create interactive Python sessions that use PySpark.
The RStudio Workbench server must have access to the Spark cluster and the
underlying configuration files for YARN and the Hadoop Distributed File System (HDFS). This typically requires
you to install RStudio Workbench on an edge node (i.e., gateway node) of
the Spark cluster using a Hadoop administration tool such as Cloudera
Manager or Apache Ambari. You can also achieve this by copying configuration
files from the Spark cluster to the RStudio Workbench server.
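If you copy configuration files manually, a quick sanity check can confirm that the expected YARN and HDFS client configuration files are present on the RStudio Workbench server. The following is a minimal sketch; the directory default and the file list are assumptions based on a typical Hadoop client configuration:

```python
import os

# Configuration files that a typical Hadoop/YARN client setup provides.
# The exact set may differ on your cluster; treat this list as an example.
EXPECTED_FILES = ["core-site.xml", "hdfs-site.xml", "yarn-site.xml"]

def missing_configs(conf_dir):
    """Return the expected Hadoop config files that are absent from conf_dir."""
    return [f for f in EXPECTED_FILES
            if not os.path.isfile(os.path.join(conf_dir, f))]

# Example: check the directory pointed to by HADOOP_CONF_DIR, if set;
# /etc/hadoop/conf is a common default on Cloudera clusters.
conf_dir = os.environ.get("HADOOP_CONF_DIR", "/etc/hadoop/conf")
print("Missing config files:", missing_configs(conf_dir))
```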
Add RStudio Workbench as an edge/gateway node
This section describes the process to add a single node as a Spark client to
a Hadoop cluster. This step is typically performed by a Hadoop
administrator. In this example, we use Cloudera Manager, but these steps can
also be adapted to other Spark clusters such as Amazon EMR.
This process may vary depending on your version of the Cloudera CDH distribution,
your authentication setup, and other variables. Refer to the Cloudera Manager documentation for more information.
Step 1: Add a new host to the Hadoop cluster
Since you already have a server with RStudio Workbench installed, the first
step is to add the RStudio Workbench node to an existing Cloudera CDH cluster.
From Cloudera Manager, start the Add Hosts wizard, then click Classic Wizard on the first page.
Step 2: Specify the hostname for the RStudio Workbench node
Continue with the installation until it asks you to specify the hostname of the
node to add. Add the hostname of the RStudio Workbench node.
This hostname should be accessible from the Cloudera CDH cluster; Cloudera
Manager will verify this when you click Search.
When everything is verified, continue to the next step.
Step 3: Specify the credentials for the RStudio Workbench node
Continue with the installation wizard until it asks for the login credentials
for the RStudio Workbench node.
In this example, we are using an Amazon EC2 instance and authenticating with a
private key. You might use a different authentication/credential mechanism
depending on how you access the RStudio Workbench node.
Step 4: Wait for the Cloudera Manager agent and parcels to be installed
If the hostname and credentials are correct, Cloudera Manager installs the
Cloudera Manager Agent on the RStudio Workbench node.
The Cloudera CDH parcels will then be installed on the node (this might take a
couple of minutes).
Step 5: Verify that the new host with RStudio Workbench has been added
Continue with the installation, and Cloudera Manager will inspect the hosts. If
everything installed correctly, then the RStudio Workbench node will join the
cluster.
Verify that the RStudio Workbench node appears in the list of hosts. Initially,
this node will not have any roles, but you will add the necessary roles in the
next step.
Step 6: Add roles to the RStudio Workbench node
You can now add roles to the RStudio Workbench node. The roles that you will
need to add are listed as follows:
- HDFS Gateway
- YARN Gateway
- Hive Gateway
- Spark Gateway
- Spark2 Gateway (if your Cloudera CDH cluster has Spark2 installed)
The following steps show an example of how to add the
Spark Gateway role to
the RStudio Workbench node in Cloudera Manager. You can then repeat this
process for all of the necessary roles.
Navigate to the Cloudera Manager Home page, select the tab of the service that
you want to add and click Add Role Instances.
Under the Gateway option, click Select hosts.
Select the RStudio Workbench node from the list of nodes.
You should now see the RStudio Workbench node selected under the Gateway role.
Be sure to verify the hostname of the RStudio Workbench node.
Follow the steps in the wizard and then re-deploy the client configuration by
clicking the Deploy button.
You can then repeat this process to add all of the necessary roles listed above.
After you've added all of the necessary roles, the cluster roles for the RStudio
Workbench node should look similar to the following figure.
Step 7: Verify that users exist on the Hadoop cluster and HDFS
It's important that the same users that log into RStudio Workbench also
exist within the Hadoop cluster. This is because the RStudio/Jupyter sessions
will run as that user, and any Spark contexts will inherit the YARN and HDFS
permissions of that user.
Synchronizing users across your RStudio Workbench instance and your Hadoop
cluster can be accomplished using multiple approaches. For example, both
systems might be configured against the same identity provider via LDAP/AD.
Discuss the best approach with your Hadoop administrator.
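One quick, local spot-check is to confirm that a given login exists in the passwd database on the RStudio Workbench node; the same account must also exist on the cluster nodes. The helper below uses the standard pwd module and is only an illustrative check, not a substitute for proper LDAP/AD integration:

```python
import pwd

def user_exists(username):
    """Return True if the user exists in this host's passwd database."""
    try:
        pwd.getpwnam(username)
        return True
    except KeyError:
        return False

# Example: root exists on any Linux host; substitute your own login name.
print(user_exists("root"))
```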
Step 8: Verify network connectivity between RStudio Workbench and the Hadoop cluster
Ensure that the RStudio Workbench node has network access to the Cloudera CDH
cluster. On Amazon AWS, we recommend allowing all communication between the
Cloudera CDH security group and the RStudio Workbench security group.
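A simple TCP probe can confirm that key cluster ports are reachable from the RStudio Workbench node. The hostname below is a placeholder; 8032 is the default YARN ResourceManager port and 8020 the default HDFS NameNode port on most CDH clusters, but your cluster may differ:

```python
import socket

def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example usage (placeholder hostname; substitute a node from your cluster):
#   print(port_open("cdh-master.example.com", 8032))  # YARN ResourceManager
#   print(port_open("cdh-master.example.com", 8020))  # HDFS NameNode
```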
Using RStudio Workbench with Jupyter and PySpark
Now that RStudio Workbench is a member of the Hadoop/Spark cluster, you can
install and configure PySpark to work on RStudio Workbench Jupyter sessions.
This section describes the process for a user to work with RStudio Workbench and Jupyter Notebooks to connect to the Spark cluster via PySpark.
Step 1: Install PySpark in the Python environment
Install the pyspark package into the Python environment that Jupyter uses, for
example with pip install pyspark. The installed PySpark version should match
the Spark version running on your cluster.
Step 2: Configure environment variables for Spark
To configure the Spark environment variables for all Jupyter sessions, create a
script in /etc/profile.d/ that exports the required configuration variables,
such as SPARK_HOME and HADOOP_CONF_DIR.
Java versions and the JAVA_HOME variable
Ensure that you export a JAVA_HOME variable that matches the Java version
that PySpark was compiled with.
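From within a Jupyter session, you can confirm that the kernel actually sees the exported variables. The variable names below are typical for this setup but may differ in your environment:

```python
import os

def missing_vars(env, required):
    """Return the names in required that are absent from the env mapping."""
    return [name for name in required if name not in env]

# Variables commonly needed for PySpark on YARN (adjust for your cluster).
required = ["JAVA_HOME", "SPARK_HOME", "HADOOP_CONF_DIR"]
print("Missing variables:", missing_vars(os.environ, required))
```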
Step 3: Create a Spark session via PySpark
Now you are ready to create a Spark session and connect to Spark.
From the RStudio Workbench home page, create a new Jupyter Notebook or
JupyterLab session. Then, import pyspark and create a new Spark session that
uses YARN by running the following Python code in the notebook:
from pyspark import SparkConf
from pyspark import SparkContext

conf = SparkConf()
conf.setMaster("yarn")              # run the application on the YARN cluster
conf.setAppName("pyspark-example")  # example application name
sc = SparkContext(conf=conf)
Step 4: Verify that the Spark application is running in YARN
At this point, you should be able to see that the Spark application is running
in the YARN resource manager.
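If you prefer a programmatic check over the YARN web UI, the ResourceManager exposes a REST endpoint, /ws/v1/cluster/apps, that lists applications as JSON. The helper below filters a parsed response for running applications; the hostname in the usage comment is a placeholder and the sample data is illustrative:

```python
def running_apps(apps_response):
    """Given a parsed /ws/v1/cluster/apps response, return (id, name)
    pairs for applications in the RUNNING state."""
    apps = (apps_response.get("apps") or {}).get("app") or []
    return [(a["id"], a["name"]) for a in apps if a.get("state") == "RUNNING"]

# Illustrative response in the shape the ResourceManager returns:
sample = {"apps": {"app": [
    {"id": "application_1234_0001", "name": "pyspark-example", "state": "RUNNING"},
    {"id": "application_1234_0002", "name": "old-job", "state": "FINISHED"},
]}}
print(running_apps(sample))

# Example usage against a live ResourceManager (placeholder hostname):
#   import json
#   from urllib.request import urlopen
#   with urlopen("http://resourcemanager.example.com:8088/ws/v1/cluster/apps") as resp:
#       print(running_apps(json.load(resp)))
```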
Step 5: Run a sample computation
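A Monte Carlo estimate of pi is a common smoke test for a new Spark context. The snippet below defines the sampling function, shows in a comment how it would be distributed with the sc context from Step 3 (so it only runs against the live cluster), and runs the same logic locally as a sanity check:

```python
import random

def inside(_):
    """One Monte Carlo sample: is a random point in the unit square
    inside the quarter circle of radius 1?"""
    x, y = random.random(), random.random()
    return x * x + y * y < 1.0

# Distributed version, using the SparkContext `sc` created in Step 3
# (requires the running cluster session, so it is shown as a comment here):
#   n = 100_000
#   count = sc.parallelize(range(n)).filter(inside).count()
#   print("Pi is roughly", 4.0 * count / n)

# The same logic, run locally to sanity-check the sampling function:
n = 100_000
count = sum(1 for sample in range(n) if inside(sample))
print("Pi is roughly", 4.0 * count / n)
```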
Step 6: Verify read/write operations to HDFS
You can run the following sample code in the notebook to verify that writes to
HDFS are working as expected:
# Save a file to HDFS (the output path below is an example; adjust it for your cluster)
rdd = sc.parallelize(range(1, 4)).map(lambda x: (x, "a" * x))
rdd.saveAsSequenceFile("/tmp/sequence-example")
You can run the following sample code in the notebook to verify that reads from
HDFS are working as expected:
# Read the same file from HDFS
sorted(sc.sequenceFile("/tmp/sequence-example").collect())