
SparkR with OpenShift

Let’s set up a data science workbench on OpenShift (Docker/Kubernetes). The components will be an RStudio Server issuing SparkR instructions to a remote Apache Spark instance, all hosted within a local OpenShift cluster. Examples include processing data hosted in an AWS S3 bucket from Spark. So saddle up and ride ole’ Roxy all the way! This lab is another in the OpenShift MiniLabs series.


Objectives

We will demonstrate executing SparkR workloads from an RStudio Server instance colocated with an Apache Spark instance within an OpenShift cluster. Our data science workbench will take the following form, which is then easily portable and reproducible across hosting form factors, such as from laptop to Cloud. We are also going to sneak in a Jupyter r-notebook, just because we can.

[Diagram: MLOps workbench architecture]

Setup

Initial Attempt

This tutorial assumes you have completed the OpenShift MiniLabs installation procedure. Make sure you launch your local OpenShift instance with the “--use-existing-config=true” flag. You can also use MiniShift or your own OpenShift cluster as you prefer. Then refresh before continuing.

For example, my start script to launch a local OpenShift cluster instance looks as follows. Clone down the suggested git repo, as it has the Dockerfiles to build the various assets required. Check that your oc version is 3.6 or later; otherwise download and install the latest release.

Give yourself loads-of-memory! I’ve tested this example on a MacBook (OS/X v10.12.6, 3.1 GHz Intel Core i7, 16 GB RAM), with 2 CPU and 8 GB RAM allocated to Docker (using Preferences…).

$ cd ~/MLOps
$ git clone https://github.com/StefanoPicozzi/MLOps.git
$ oc version
$ cat start.sh
export HOME=/Users/stefanopicozzi
export PROFILE=MLOps
echo $HOME
echo $PROFILE
oc cluster up --metrics \
  --public-hostname='127.0.0.1' \
  --host-data-dir=$HOME/.oc/profiles/$PROFILE/data \
  --host-config-dir=$HOME/.oc/profiles/$PROFILE/config \
  --use-existing-config=true                                  
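The oc version check mentioned above can be scripted. A minimal sketch, assuming the version line format shown; the OC_VERSION value here is a hypothetical sample, and on a real install you would capture it with `oc version | head -1`:

```shell
# Sketch: check that the oc client is 3.6 or later, as this lab requires.
# OC_VERSION is a hypothetical sample line; on a real install use:
#   OC_VERSION=$(oc version | head -1)
OC_VERSION='oc v3.6.0+c4dd4cf'

# Pull out the minor version number (the "6" in "v3.6.0")
MINOR=$(echo "$OC_VERSION" | sed -n 's/^oc v3\.\([0-9]*\).*/\1/p')

if [ "${MINOR:-0}" -ge 6 ]; then
  echo "oc client OK"
else
  echo "oc client too old - download and install the latest release" >&2
fi
```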

Repeat Attempt

To reset your environment to repeat this tutorial do the following. Create yourself a home location such as suggested.

$ mkdir ~/MLOps
$ cd ~/MLOps
$ rm -rf MLOps
$ git clone https://github.com/StefanoPicozzi/MLOps.git
$ oc login -u developer -p developer
$ oc delete project mlops

Instructions

Pause, take a breath, pay your respect to the Code Gods and then proceed as follows. We are going to:

  1. Create a new project as our MLOps workbench
  2. Launch an Oshinko Spark Cluster inside the project
  3. Create an RStudio Server instance inside the project
  4. Create a Jupyter r-notebook instance inside the project
  5. Verify success with some sample R scripts

What the heck is MLOps anyway? Check the Trivia section at the end of this blog.

1. Create OpenShift MLOps Project

The first step is to create an OpenShift project to organise all these assets, and to add some extra privileges while we are there, which we will need later for RStudio.

$ cd ~/MLOps
$ oc login -u developer -p developer
$ oc new-project mlops --description="MLOps with OpenShift" --display-name="MLOps with OpenShift Workbench"
$ oc login -u system:admin 
$ oc project mlops 
$ oc adm policy add-scc-to-user anyuid -z default

2. Launch Apache Spark with OpenShift

To launch an Apache Spark cluster on OpenShift, we are going to make use of the Oshinko Project as described at http://radanalytics.io/. Oshinko is a set of technologies that integrate Apache Spark with OpenShift.

2.1 Create Spark Cluster

Visit the OpenShift Console and check whether a Spark Clusters menu item appears in the left side margin. If it does you are good to go; otherwise proceed to the Trivia section of this blog for instructions, then return here upon completion. Click the Spark Clusters menu item https://127.0.0.1:8443/console/project/oshinko/oshinko and choose Deploy. Give it a name, e.g. “spark”, and select at least 2 for Number of Workers.

[Screenshot: Oshinko Deploy Spark cluster dialog]

2.2 Verify Spark Cluster

Expose the ui service (if not already done) and visit the Spark Master console as published by the route, e.g. http://spark-ui-route-mlops.127.0.0.1.nip.io/. Make a note of the Spark Master URL, e.g. spark://172.17.0.2:7077.

$ oc expose svc spark-ui

[Screenshot: Spark Master console]

2.3 Create a Modified Spark Image

Now that we have a vanilla OpenShift Oshinko Apache Spark implementation, we need to make a minor enhancement. The openshift-spark image as described at https://hub.docker.com/r/radanalyticsio/openshift-spark/ needs to be extended to include an R installation, SparkR libraries and a few AWS S3 connectivity jars. This has been done for you in the supplied Dockerfile which you cloned earlier. You can inspect the contents at https://raw.githubusercontent.com/StefanoPicozzi/MLOps/master/spark/Dockerfile. Once built, push it up to your Docker Hub location, amending your access credentials as appropriate. This step is optional, but by doing so we have an image accessible whether we choose to work with a local or remote OpenShift instance for our MLOps workbench.

$ cd ~/MLOps
$ docker build -t stefanopicozzi/spark MLOps/spark
$ docker login -u stefanopicozzi -p XXXXX 
$ docker push stefanopicozzi/spark

2.4 Modify the Spark DeploymentConfig

Now let’s update the deployment configs for our Apache Spark containers to use this new image rather than the installation default (radanalyticsio/openshift-spark). We can do this from the command line using “oc patch”. The dc update will trigger a new deployment pointing to our revised spark image source.

$ oc login -u developer -p developer
$ oc project mlops
$ oc patch dc/spark-m --patch '{ "spec": { "template": { "spec": { "containers": [ { "name": "spark-m", "image": "stefanopicozzi/spark:latest" } ] } } } }'
$ oc patch dc/spark-w --patch '{ "spec": { "template": { "spec": { "containers": [ { "name": "spark-w", "image": "stefanopicozzi/spark:latest" } ] } } } }'
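If you are amending the patch document yourself, it can be worth validating the JSON locally before handing it to “oc patch”, since a malformed body otherwise fails with a less helpful server-side error. A minimal sketch; the image name is my Docker Hub location, so substitute your own, and the final oc command is left commented for use against the live cluster:

```shell
# Sketch: sanity-check the patch JSON locally before applying it.
PATCH='{ "spec": { "template": { "spec": { "containers": [ { "name": "spark-m", "image": "stefanopicozzi/spark:latest" } ] } } } }'

# python3 -m json.tool exits non-zero on malformed input
if echo "$PATCH" | python3 -m json.tool > /dev/null; then
  echo "patch JSON OK"
  # oc patch dc/spark-m --patch "$PATCH"   # run this against the live cluster
fi
```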

3. Install RStudio Server with Apache Spark

We are going to need an RStudio Server instance on our OpenShift cluster with a local Apache Spark installation. The Dockerfile for RStudio contains instructions to add a Spark distribution and S3 libraries to our instance which you can inspect at https://raw.githubusercontent.com/StefanoPicozzi/MLOps/master/rocker/Dockerfile.

3.1 Create the RStudio Server Container

$ cd ~/MLOps
$ cd MLOps/rocker
$ oc login -u developer -p developer 
$ oc project mlops
$ oc new-app . --strategy=docker  -l name='rstudio' --name='rstudio' --allow-missing-images 
$ oc cancel-build bc/rstudio --state=new
$ oc start-build rstudio --from-dir=.
$ oc cancel-build bc/rstudio --state=new
$ oc expose service rstudio

3.2 Create a /home/rstudio PVC

Create a persistent volume claim (PVC) attached to /home/rstudio. This will enable us to install additional R packages and have that system state preserved across an RStudio restart. Details for this are described within R using RStudio Server with OpenShift.

You can also perform these steps using the OpenShift Console, as you prefer. The steps would be to 1) click the Storage menu and create a persistent volume claim named “rstudiovolume”, and then 2) visit the container instance and choose Add Storage for mount point “/home/rstudio”. A redeploy will be triggered.

[Screenshots: creating the “rstudiovolume” claim and adding storage at mount point /home/rstudio]

3.3 Verify RStudio Spark Client

Now launch the RStudio Server client and login as rstudio/rstudio. If you experience some weirdness trying to login, use the OpenShift Console to launch a terminal into the RStudio container and check that /home/rstudio is owned by user=rstudio, group=rstudio. If not, do a “# chown -R rstudio /home/rstudio; # chgrp -R rstudio /home/rstudio”.

Create a new RScript file and copy/paste the contents of the sample R script at https://raw.githubusercontent.com/StefanoPicozzi/MLOps/master/samples/SmokeTest.R. Edit the script as appropriate to reflect your AWS S3 credentials and sample file name. If you do not plan to use AWS S3, then comment out the relevant instructions. Execute a “Source with Echo” to run the script. The RStudio Console window should show something like the following if all is well.

head(df)
  eruptions waiting
1     3.600      79
2     1.800      54
3     3.333      74
4     2.283      62
5     4.533      85
6     2.883      55

4. Install Jupyter Notebook with Apache Spark

Optionally, we are also going to install a Jupyter Notebook instance on our OpenShift cluster with a local Apache Spark installation. The Dockerfile for Jupyter contains instructions to add a Spark distribution and S3 libraries to our instance which you can inspect at https://raw.githubusercontent.com/StefanoPicozzi/MLOps/master/r-notebook/Dockerfile.

4.1 Create the Jupyter Container

$ cd ~/MLOps
$ cd MLOps/r-notebook
$ oc login -u developer -p developer 
$ oc project mlops
$ oc new-app . --strategy=docker -l name='r-notebook' --name='r-notebook' --allow-missing-images
$ oc cancel-build bc/r-notebook --state=new
$ oc start-build r-notebook --from-dir=.
$ oc cancel-build bc/r-notebook --state=new
$ oc expose service r-notebook

4.2 Create a /home/jovyan/work PVC

Create a persistent volume claim (PVC) attached to /home/jovyan/work. This will enable us to install additional R packages and have that system state preserved across a notebook restart. You can also perform these steps using the OpenShift Console, as you prefer. The steps would be to 1) click the Storage menu and create a persistent volume claim named “r-notebookvolume”, and then 2) visit the container instance and choose Add Storage for mount point “/home/jovyan/work”. A redeploy will be triggered.

4.3 Verify Jupyter Client

Now launch the Jupyter notebook and login using the token which you can copy/paste from the r-notebook container log. The log entry will look something like:

Copy/paste this URL into your browser when you connect for the first time,
    to login with a token:
        http://localhost:8888/?token=8a2ec085b2611d08735743b3bd46717860f03d5fdf3666ff 
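Rather than copying the token by eye, you can cut it out of the log with sed. A small sketch; LOGLINE is the sample line from above, and against the live pod you would instead pipe in something like `oc logs <r-notebook-pod> | grep 'token='`:

```shell
# Sketch: extract the Jupyter login token from the r-notebook log line.
LOGLINE='http://localhost:8888/?token=8a2ec085b2611d08735743b3bd46717860f03d5fdf3666ff'

# Capture the hex string following "token="
TOKEN=$(echo "$LOGLINE" | sed -n 's/.*token=\([0-9a-f]*\).*/\1/p')
echo "$TOKEN"
```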

If you do not see the work subdirectory upon login, use the OpenShift Console to launch a terminal into the r-notebook container and check that /home/jovyan/work is owned by user=jovyan, group=users. If not, do a “# chown -R jovyan /home/jovyan/work; # chgrp -R users /home/jovyan/work”.

Create a new R file inside the work subdirectory and copy/paste the contents of the sample R script at https://raw.githubusercontent.com/StefanoPicozzi/MLOps/master/samples/SmokeTest.R into separate notebook cells. Edit the script as appropriate to reflect your Spark home location, AWS S3 credentials and sample file name. If you do not plan to use AWS S3, then comment out the relevant instructions. Step through the script to execute each cell. Output will be similar to the RStudio example if successful.

[Screenshot: Jupyter r-notebook running the smoke test]

5. Verify Lab Success

Let’s verify that an RStudio Server instance can launch a SparkR job on a remote Apache Spark Cluster within an OpenShift cluster instance. Visit the RStudio Server Console and try out these examples.

5.1 RStudio calling Remote Spark Example

Let’s verify that our RStudio Server instance can interact with a remote Spark cluster all inside OpenShift using SparkR. To do this, repeat the experiment described in the RStudio verify section (3.3), but this time replace your Spark master URL with that of your OpenShift Spark cluster instance. The IP address for the Spark cluster is available by inspecting the services entry for our spark instance using the OpenShift Console. If successful, output should be similar to 3.3. Check the Spark Master console again and you should see references to the SparkR job that you issued.

[Screenshot: Spark Master console showing the SparkR application]

5.2 Example based on https://rpubs.com/wendyu/sparkr

Inspect the R script as published at https://raw.githubusercontent.com/StefanoPicozzi/MLOps/master/samples/WendyYu.R then copy/paste as a new R script inside your RStudio instance. Change the master spark:// endpoint as appropriate. Now install the “magrittr” R package from the RStudio “Tools > Install Packages …” menu item. “Source-with-echo” and check the RStudio Console output against that reported on the rpubs website.

5.3 Example showing S3 access based on https://www.codementor.io/jadianes/spark-r-data-frame-operations-sql-du1080rl5

Inspect the R script as published at https://raw.githubusercontent.com/StefanoPicozzi/MLOps/master/samples/Housing.R then copy/paste as a new R script inside your RStudio instance. Change the master spark:// endpoint and S3 access credentials as appropriate. Now install the “ggplot2” R package from the RStudio “Tools > Install Packages …” menu item. “Source-with-echo” and check the RStudio Console output against that reported at the codementor website.

[Screenshot: RStudio plot output for the Housing example]

5.4 Example showing Wine grading based on http://blog.learningtree.com/machine-learning-using-spark-r/

Let’s do this using our r-notebook. As per below, find the name of your running r-notebook pod and then import the Wine.ipynb notebook using “oc rsync”. Now visit your notebook container and edit the spark:// endpoint to match your Spark cluster instance. Open the Wine file and step through the R instruction set. Check your results against the source website.

$ cd ~/MLOps
$ oc login -u developer -p developer
$ oc project mlops
$ oc get pods | grep r-notebook | grep Running
$ oc rsync ./samples/ $PODID:/home/jovyan/work
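The $PODID used by “oc rsync” can be captured from the grep above with awk. A local sketch; PODS below is a hypothetical sample of “oc get pods” output (the pod name is made up), and against a live cluster you would replace the assignment with `PODS=$(oc get pods)`:

```shell
# Sketch: capture the running r-notebook pod name into PODID for "oc rsync".
# PODS is a hypothetical sample of "oc get pods" output; on a live cluster
# use instead: PODS=$(oc get pods)
PODS='NAME                 READY   STATUS    RESTARTS   AGE
r-notebook-1-abcde   1/1     Running   0          5m
rstudio-1-fghij      1/1     Running   0          7m'

# First column of the running r-notebook row is the pod name
PODID=$(echo "$PODS" | grep r-notebook | grep Running | awk '{print $1}')
echo "$PODID"
# oc rsync ./samples/ $PODID:/home/jovyan/work   # then sync the notebooks
```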

Trivia

http://radanalytics.io

The main landing page for OpenShift and Spark related development is http://radanalytics.io/. While you are there, have some fun with http://radanalytics.io/applications/ophicleide!

MLOps

We want to take the goodness of DevOps and apply it to the machine learning (ML) / artificial intelligence (AI) use case, hence MLOps. The objective is to make it easy, safe and fast to operationalise ML/AI assets into production. Standardising and automating on the OpenShift platform is an enabler because we can now represent our ML/AI assets in a form that is portable and promotable across environments, and shareable between the constituents in the software delivery lifecycle. There will be a lot more to say about this, but some fundamentals are described in the free ebook, DevOps with OpenShift.

TensorFlow

Rather knock about with TensorFlow? Look here.

Install Spark Support

If the Spark Clusters menu item is not present, you need to install the feature. These instructions will add a Spark Clusters item to the left-margin menu of the OpenShift Console upon OpenShift cluster restart. The location of your master-config.yaml file may vary depending on where your profile directory is located. This is YAML, so be careful with your spacing! Once completed, restart OpenShift and continue.

$ cd ~/.oc/profiles/MLOps/config/master
$ git clone https://github.com/radanalyticsio/oshinko-console.git
$ vi master-config.yaml
... 
  extensionScripts: 
  - oshinko-console/dist/scripts/templates.js 
  - oshinko-console/dist/scripts/scripts.js 
  extensionStylesheets: 
  - oshinko-console/dist/styles/oshinko.css 
...

 

