Integrating RStudio Server Pro and Jupyter with PySpark
This documentation describes the steps to use RStudio Server Pro to connect to a Spark cluster using Jupyter Notebooks and PySpark.
In this example, YARN is used as a resource manager on the Spark cluster, and you'll create interactive Python sessions that use PySpark.
- RStudio Server Pro configured with Jupyter Notebooks on a Single Server
- Hadoop cluster configured with Spark and YARN
- Access from RStudio Server Pro to the Spark cluster
The RStudio Server Pro server must have access to the Spark cluster and to the underlying configuration files for YARN and HDFS. This typically requires you to add the RStudio Server Pro server as an edge node (i.e., gateway node) of the Spark cluster using a Hadoop administration tool such as Cloudera Manager or Apache Ambari. You can also achieve this by copying the configuration files from the Spark cluster to the RStudio Server Pro server.
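As a quick sanity check that the client configuration files are present on the RStudio Server Pro server, a minimal sketch like the following can help. The helper name missing_hadoop_configs is hypothetical, and the list of file names assumes the standard Hadoop client files; /etc/hadoop/conf is the location this document uses later for HADOOP_CONF_DIR and may differ on your cluster.

```python
import os

# Standard Hadoop client configuration file names (assumed to be what the
# gateway roles deploy; adjust the tuple if your distribution differs).
REQUIRED_FILES = ("core-site.xml", "hdfs-site.xml", "yarn-site.xml")


def missing_hadoop_configs(conf_dir, required=REQUIRED_FILES):
    """Return the required config files that are absent from conf_dir."""
    return [
        name for name in required
        if not os.path.isfile(os.path.join(conf_dir, name))
    ]


if __name__ == "__main__":
    # /etc/hadoop/conf is an assumed location; adjust it for your cluster.
    missing = missing_hadoop_configs("/etc/hadoop/conf")
    print("missing files: {}".format(", ".join(missing) if missing else "none"))
```

An empty result only shows that the files exist; it does not validate their contents.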
Add RStudio Server Pro as an edge/gateway node
This section describes the process to add a single node as a Spark client to a Hadoop cluster. This step is typically performed by a Hadoop administrator. In this example, we use Cloudera Manager, but these steps can also be adapted to other Spark clusters such as Amazon EMR.
This process may vary depending on the version of Cloudera CDH, the authentication method, and other variables. Refer to the Cloudera Manager documentation for more information.
1. Add a new host to the Hadoop cluster
Since you already have a server with RStudio Server Pro installed, the first step is to add the RStudio Server Pro node to an existing Cloudera CDH cluster.
From the Cloudera Manager dashboard, select the Add Hosts option from the menu and select the Classic Wizard on the first page.
2. Specify the hostname for the RStudio Server Pro node
Continue with the installation until it asks you to specify the hostname of the node to add. Add the hostname of the RStudio Server Pro node.
This hostname should be accessible from the Cloudera CDH cluster. Cloudera Manager will verify this when you click Search. Continue when everything is verified.
3. Specify the credentials for the RStudio Server Pro node
Continue with the installation wizard until it asks for the login credentials for the RStudio Server Pro node.
In this example, we are using an Amazon EC2 instance and authenticating with a username and a private key. You might be using a different authentication/credential mechanism depending on how you access the RStudio Server Pro node.
4. Wait for the Cloudera Manager agent and parcels to be installed
If the hostname and credentials are correct, Cloudera Manager will install the Cloudera Manager Agent on the RStudio Server Pro node.
The Cloudera CDH parcels will then be installed on the node (this might take a couple of minutes).
5. Verify that the new host with RStudio Server Pro has been added
Continue with the installation, and Cloudera Manager will inspect the hosts. If everything installed correctly, then the RStudio Server Pro node will join the Hadoop cluster.
Verify that the RStudio Server Pro node appears in the list of hosts. Initially, this node will not have any roles, but you will add the necessary roles in the following step.
6. Add roles to the RStudio Server Pro node
You can now add roles to the RStudio Server Pro node. The roles that you will need to add are listed as follows:
- HDFS Gateway
- YARN Gateway
- Hive Gateway
- Spark Gateway
- Spark2 Gateway (if your Cloudera CDH cluster has Spark2 installed)
The following steps show an example of how to add the Spark Gateway role to the RStudio Server Pro node in Cloudera Manager. You can then repeat this process for all of the necessary roles.
Navigate to the Cloudera Manager home page and select Add Role Instances under the service that you want to add.
Select hosts under the Gateway option, then select the RStudio Server Pro node from the list of nodes.
You should now see the RStudio Server Pro node selected under the Gateway option. Be sure to verify the hostname of the RStudio Server Pro node.
Follow the steps in the wizard and then re-deploy the client configuration.
You can then repeat this process to add all of the necessary roles that are listed above.
After you've added all of the necessary roles, the cluster roles for the RStudio Server Pro node should look similar to the following figure.
7. Verify that users exist on the Hadoop cluster and HDFS
It's important that the users you use to log in to RStudio Server Pro also exist within the Hadoop cluster. This is because the RStudio/Jupyter sessions will run as that user, and any Spark contexts will inherit the YARN and HDFS permissions of that user.
Synchronizing users across your RStudio Server Pro instance and your Hadoop cluster can be accomplished in several ways. For example, both systems might be configured to use the same identity provider via LDAP/AD. For more information, consult your Hadoop administrator.
To manually create a home directory for a user in HDFS, you can run the following commands (replace rstudio with the actual username):
$ hdfs dfs -mkdir /user/rstudio
$ hdfs dfs -chown rstudio:rstudio /user/rstudio/
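When several users need HDFS home directories, the two commands above can be generated per user. The following is a minimal sketch, not a provisioning tool: the helper name hdfs_home_commands and the default of using the username as the group are assumptions for illustration, and the sketch only builds the command lines rather than running them.

```python
def hdfs_home_commands(username, group=None):
    """Build the hdfs commands that create and assign a user's home directory.

    The group defaults to the username, mirroring the rstudio:rstudio example.
    """
    group = group or username
    home = "/user/{}".format(username)
    return [
        ["hdfs", "dfs", "-mkdir", home],
        ["hdfs", "dfs", "-chown", "{}:{}".format(username, group), home],
    ]


if __name__ == "__main__":
    # Hypothetical list of users; replace with your actual usernames.
    for user in ["rstudio"]:
        for command in hdfs_home_commands(user):
            print(" ".join(command))
```

Each inner list can be passed to subprocess.check_call on a node with the HDFS client installed. Changing ownership generally requires running as an HDFS superuser, so check with your Hadoop administrator before automating this.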
8. Verify network connectivity between RStudio Server Pro and the Hadoop cluster
Ensure that the RStudio Server Pro node has network access to the Cloudera CDH cluster. In Amazon AWS, we recommend allowing all communication between the Cloudera CDH security group and the RStudio Server Pro security group.
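One way to spot-check connectivity from the RStudio Server Pro node is to attempt TCP connections to the cluster services. The sketch below assumes hypothetical hostnames, and the ports shown in the usage comment (8032 for the YARN ResourceManager, 8020 for the HDFS NameNode) are common defaults that may not match your cluster; the function names are also illustrative.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except (socket.error, socket.timeout):
        return False
    sock.close()
    return True


def check_endpoints(endpoints, timeout=3.0):
    """Map 'host:port' to True/False for each (host, port) pair."""
    return {
        "{}:{}".format(host, port): can_connect(host, port, timeout)
        for host, port in endpoints
    }


# Example usage (hostnames and ports are assumptions, not taken from your cluster):
# check_endpoints([("resourcemanager.example.com", 8032),
#                  ("namenode.example.com", 8020)])
```

A successful connection only shows that the port is reachable. Spark on YARN also needs the cluster nodes to reach the driver on the RStudio Server Pro node, which is why allowing traffic in both directions is recommended.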
Using RStudio Server Pro with Jupyter and PySpark
This section describes the process for a user to work with RStudio Server Pro and Jupyter Notebooks to connect to the Spark cluster via PySpark.
Now that RStudio Server Pro is a member of the Hadoop/Spark cluster, you can install and configure PySpark to work in RStudio Server Pro Jupyter sessions.
1. Install PySpark in the Python environment
Install PySpark in the environments that are configured as Python kernels, for example:
sudo /opt/python/2.7.16/bin/pip install pyspark
Note on Python versions
PySpark cannot run when the driver and the workers use different minor versions of Python, so be sure to use the same version of Python on RStudio Server Pro and on the Spark cluster.
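The version rule above compares only the major and minor components, so 2.7.16 and 2.7.5 are compatible while 3.6 and 3.7 are not. A minimal sketch of that comparison follows; the function names are hypothetical, and the cluster's Python version is something you would look up yourself on a worker node.

```python
import sys


def major_minor(version):
    """Return (major, minor) from a version string such as '2.7.16'."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1])


def same_minor_version(driver_version, worker_version):
    """Return True if both versions share the same major and minor numbers."""
    return major_minor(driver_version) == major_minor(worker_version)


if __name__ == "__main__":
    local = "{}.{}.{}".format(*sys.version_info[:3])
    # "2.7.5" is a placeholder for the Python version found on the cluster.
    print("local {} compatible with 2.7.5: {}".format(
        local, same_minor_version(local, "2.7.5")))
```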
2. Configure environment variables for Spark
To configure the Spark environment variables for all Jupyter sessions, create a file under /etc/profile.d/ that exports the required configuration variables:
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-126.96.36.199.b10-1.el7_7.x86_64/jre
export HADOOP_CONF_DIR=/etc/hadoop/conf
Note on Java versions and the JAVA_HOME variable
Ensure that you export a JAVA_HOME variable that matches the Java version that PySpark was compiled with. In this example, we are using the Java 1.8.0 (OpenJDK) installation referenced in the JAVA_HOME path above.
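Before creating a Spark session, it can be useful to confirm from inside a notebook that the variables were picked up. The following is a minimal sketch under the assumption that both variables should point to existing directories; the helper name check_spark_env is hypothetical.

```python
import os


def check_spark_env(environ=None, names=("JAVA_HOME", "HADOOP_CONF_DIR")):
    """Map each problematic variable name to a short description of the problem."""
    environ = os.environ if environ is None else environ
    problems = {}
    for name in names:
        value = environ.get(name)
        if not value:
            problems[name] = "not set"
        elif not os.path.isdir(value):
            problems[name] = "not a directory: {}".format(value)
    return problems


if __name__ == "__main__":
    problems = check_spark_env()
    if problems:
        for name, problem in sorted(problems.items()):
            print("{}: {}".format(name, problem))
    else:
        print("JAVA_HOME and HADOOP_CONF_DIR look usable")
```

If a variable shows as not set in a notebook, the session may have been started before the profile script was created, so starting a new session is worth trying first.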
3. Create a Spark session via PySpark
Now you are ready to create a Spark session and connect to Spark.
From the RStudio Server Pro home page, create a new Jupyter Notebook or JupyterLab session.
Import pyspark and create a new Spark session that uses YARN by running the following Python code in the notebook:
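from pyspark import SparkConf
from pyspark import SparkContext

conf = SparkConf()
# 'yarn-client' is deprecated since Spark 2.0; use 'yarn' with client deploy mode
conf.setMaster('yarn')
conf.set('spark.submit.deployMode', 'client')
conf.setAppName('rstudio-pyspark')
sc = SparkContext(conf=conf)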
4. Verify that the Spark application is running in YARN
At this point, you should be able to see the Spark application (named rstudio-pyspark in this example) running in the YARN resource manager.
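To match the notebook session with its entry in YARN, the application ID can be read from the SparkContext. The sketch below uses the appName, applicationId, and master attributes of SparkContext; the helper name describe_spark_app is hypothetical and takes the sc object created in the previous step.

```python
def describe_spark_app(sc):
    """Summarize the identifiers needed to find the application in YARN."""
    return "{name} ({app_id}) on master '{master}'".format(
        name=sc.appName, app_id=sc.applicationId, master=sc.master)


# In the notebook, after creating the SparkContext as sc:
# print(describe_spark_app(sc))
```

The printed application ID should match the one listed in the YARN resource manager, or in the output of yarn application -list on a node with the YARN client installed.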
5. Run a sample computation
You can run the following sample code in the notebook to verify that the Spark connectivity is working as expected:
data = [1, 2, 3, 4, 5]
distData = sc.parallelize(data)
distData.mean()
6. Verify read/write operations to HDFS
You can run the following sample code in the notebook to verify that writes to HDFS are working as expected:
# Save a file to HDFS
rdd = sc.parallelize(range(1, 4)).map(lambda x: (x, "a" * x))
rdd.saveAsSequenceFile("saved_file")
You can run the following sample code in the notebook to verify that reads from HDFS are working as expected:
# Read the same file from HDFS
sorted(sc.sequenceFile("saved_file").collect())
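To confirm that the data read back matches what was written, the expected pairs can be rebuilt locally and compared with the result of the read. This is a minimal sketch with hypothetical helper names; it assumes the write example above was run unchanged, with keys 1 through 3.

```python
def expected_records(start=1, stop=4):
    """Rebuild locally the (key, value) pairs that the write example saves."""
    return sorted((x, "a" * x) for x in range(start, stop))


def roundtrip_matches(records):
    """Return True if records read back from HDFS match the expected pairs."""
    return sorted(records) == expected_records()


# In the notebook, after the write has completed:
# roundtrip_matches(sc.sequenceFile("saved_file").collect())
```

Rerunning the write typically fails if saved_file already exists in the user's HDFS home directory, so remove it or pick a new name before repeating the test.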