Setting up IPython Notebook with Apache Spark on the Hortonworks Sandbox

Apache Spark, if you haven’t heard of it, is a fast, in-memory data processing engine that can be used with data on Hadoop. It offers excellent performance and can handle tasks such as batch processing, streaming, interactive queries and machine learning.

Spark offers APIs in Java, Scala, and Python. Today, I’ll cover how to set up an IPython notebook on your Hortonworks sandbox so you can use it for ad-hoc queries, data exploration, analysis, and visualization.

Let’s get started.

Setup

First, make sure your sandbox’s network adapter is in NAT mode.

While your sandbox is powered off, select ‘Hortonworks Sandbox with HDP 2.3_1’ in the VirtualBox machine list and click Machine > Settings.

Navigate to Network once in the settings. Ensure that your NAT network adapter is turned on, and that it is the only network adapter turned on:

Enable NAT Adapter

The default settings should be fine.

With the proper network adapter enabled, start up your machine.

Install IPython

Next, use yum to install a number of necessary dependencies:

yum install nano centos-release-SCL zlib-devel bzip2-devel openssl-devel ncurses-devel sqlite-devel readline-devel tk-devel gdbm-devel db4-devel libpcap-devel xz-devel libpng-devel libjpeg-devel atlas-devel

Make sure you include all of them. When you’re prompted with a y/N question, be sure to type y.

Next, install the development tools dependency for Python 2.7:

yum groupinstall "Development tools"

Then, install Python 2.7:

yum install python27

Since there are now multiple Python versions on your Hortonworks sandbox, you need to switch to Python 2.7:

source /opt/rh/python27/enable

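This only affects your current shell session. Any new session, including the one you get after restarting the sandbox, goes back to the default Python, so run the command again whenever you need Python 2.7.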
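If you want to confirm that the switch worked, a quick check like the one below can help. It is only a sketch: the file name check_python.py and the helper names are my own, not part of the sandbox. Save it and run it with python check_python.py in the same shell where you ran the source command.

```python
from __future__ import print_function

import sys


def version_label(version_info):
    """Return 'major.minor' for a version tuple such as sys.version_info."""
    return "%d.%d" % (version_info[0], version_info[1])


def is_expected(version_info, expected=(2, 7)):
    """Return True when the major.minor pair matches the expected version."""
    return tuple(version_info[:2]) == expected


print("Active Python:", version_label(sys.version_info))
if not is_expected(sys.version_info):
    # The enable script only changes the shell it was sourced in.
    print("Not 2.7: run 'source /opt/rh/python27/enable' in this shell first.")
```

If it reports anything other than 2.7, you are most likely in a new shell session and need to source the enable script again.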

Next, download ez_setup.py, which installs easy_install. You’ll then use easy_install to install pip. It’s a bit of jumping through hoops, but these tools will get you what you need. Start by navigating to a test directory, or any directory you feel comfortable downloading files to. I chose to create a directory called /test_dev/:

mkdir /test_dev/
cd /test_dev/

Next, download ez_setup.py:

wget http://bitbucket.org/pypa/setuptools/raw/bootstrap/ez_setup.py

Run ez_setup.py:

python ez_setup.py

Now install pip:

easy_install-2.7 pip

With pip installed, you can now install some of the packages you’ll want to use with IPython. You want to install the following packages:

numpy scipy pandas scikit-learn tornado pyzmq pygments matplotlib jinja2 jsonschema

Install them with pip install followed by the names of the packages you want, for example:

pip install scipy pandas scikit-learn

I recommend installing only a few at a time, as you may run into issues if you try to install them all at once. The first couple of installations take a while.
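If you would rather not type each batch by hand, the idea of installing a few packages at a time can be sketched as a small script. The batch size of 3 and the --run flag are my own choices for illustration. Without --run the script only prints the pip commands, so you can review them first.

```python
from __future__ import print_function

import subprocess
import sys

# Package list taken from this post.
PACKAGES = ["numpy", "scipy", "pandas", "scikit-learn", "tornado",
            "pyzmq", "pygments", "matplotlib", "jinja2", "jsonschema"]


def batches(items, size):
    """Split a list into consecutive groups of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def pip_commands(packages, size=3):
    """Build one 'pip install ...' command (as an argument list) per batch."""
    return [["pip", "install"] + group for group in batches(packages, size)]


if __name__ == "__main__":
    run = "--run" in sys.argv  # hypothetical flag: install instead of printing
    for command in pip_commands(PACKAGES):
        print(" ".join(command))
        if run:
            # Stops at the first failing batch so you can see which one broke.
            subprocess.check_call(command)
```

Because check_call raises on a non-zero exit code, a failed batch stops the script and leaves the remaining batches untouched, which makes it easier to retry from that point.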

Install IPython Notebook next:

pip install "ipython[notebook]"
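Once everything is installed, it can be worth checking that each package actually imports under Python 2.7. The sketch below tries to import each one and reports any that fail. The helper name missing_modules is mine. Note that a few import names differ from the pip names: scikit-learn is imported as sklearn and pyzmq as zmq.

```python
from __future__ import print_function

import importlib

# Import names for the packages installed above, plus IPython itself.
MODULES = ["numpy", "scipy", "pandas", "sklearn", "tornado",
           "zmq", "pygments", "matplotlib", "jinja2", "jsonschema", "IPython"]


def missing_modules(names):
    """Return the subset of `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    print("Missing:", missing_modules(MODULES) or "none")
```

Any name it lists is a candidate for another pip install attempt before moving on.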

Next, create an IPython profile for Spark:

ipython profile create pyspark

Before we continue any further, you should set up port forwarding for port 8889 on your sandbox.

Setting up port forwarding

Power off your sandbox. Go into the settings of your Hortonworks sandbox, then navigate to Network:

Network settings

Click Port Forwarding. Then click the plus sign to add a new rule. Call the rule ipython, give it the host IP 127.0.0.1, host port 8889, and guest port 8889. Leave the guest IP blank. It should look like this when you are done:

Port Forwarding in VirtualBox

Click OK, then OK again.
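The same rule can also be added from the host’s command line with VBoxManage instead of the settings dialog. The sketch below only builds and prints the command; it does not run it. It assumes the NAT adapter is adapter 1 (hence --natpf1) and that the VM is named as shown earlier in this post. The helper names are my own. As with the dialog, the sandbox has to be powered off for modifyvm to work.

```python
from __future__ import print_function


def natpf_rule(name, host_ip, host_port, guest_port, guest_ip="", protocol="tcp"):
    """Build a rule in VBoxManage's --natpf format:
    name,protocol,hostip,hostport,guestip,guestport
    """
    fields = [name, protocol, host_ip, str(host_port), guest_ip, str(guest_port)]
    return ",".join(fields)


def modifyvm_command(vm_name, rule, adapter=1):
    """Return the VBoxManage argument list that adds `rule` to a NAT adapter."""
    return ["VBoxManage", "modifyvm", vm_name, "--natpf%d" % adapter, rule]


# Guest IP is left blank, matching the dialog settings described above.
rule = natpf_rule("ipython", "127.0.0.1", 8889, 8889)
command = modifyvm_command("Hortonworks Sandbox with HDP 2.3_1", rule)
print(command)
```

To apply the rule, you could pass the printed list to subprocess.check_call on the host, or type the equivalent command into a terminal, quoting the VM name because it contains spaces.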

Start your sandbox back up.

Create a shell script to launch IPython Notebook

To make launching IPython Notebook easier, let’s create a shell script. Navigate to the folder you’d like to launch IPython Notebook from. For me that was /test_dev/:

cd /test_dev/

Launch nano to create the shell script:

nano start_ipython_notebook.sh

Enter this code in nano:

#!/bin/bash
source /opt/rh/python27/enable
IPYTHON_OPTS="notebook --port 8889 --notebook-dir=u'/usr/hdp/2.3.0.0-2557/spark/' --ip='*' --no-browser" pyspark

Press Ctrl+O, then Enter, to save the file. Then press Ctrl+X to exit nano.

Run the script:

sh start_ipython_notebook.sh

It should look like this:

IPython Notebook running on Hortonworks sandbox

Congratulations! You’re now running IPython Notebook on the Hortonworks sandbox in conjunction with Spark.

Using your new IPython Notebook

To start exploring with your new notebook, open this address in a browser on your host machine: http://127.0.0.1:8889
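If the page does not load, a quick way to narrow the problem down is to test whether anything is answering on the forwarded port. This is a minimal sketch to run on the host machine, not inside the sandbox. The helper name port_is_open and the 2-second timeout are my own choices.

```python
from __future__ import print_function

import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        connection = socket.create_connection((host, port), timeout)
    except (socket.error, socket.timeout):
        return False
    connection.close()
    return True


if port_is_open("127.0.0.1", 8889):
    print("Something is listening on 127.0.0.1:8889.")
else:
    print("Nothing is answering on 127.0.0.1:8889. Check the port "
          "forwarding rule and that the launch script is still running.")
```

A successful connection only shows that the port is forwarded and something is listening on it. It does not prove the listener is the notebook server, so the browser is still the real test.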

That’s it for this guide. In the next guide I’ll cover some basics of IPython Notebook and how you can get started using it with Spark.

Credits

Credit goes out to Hortonworks for writing their own guide, which I used as a basis of knowledge for this post. Since their guide was outdated at the time of writing, this post has updates and modifications which ensure a seamless installation of IPython Notebook with Apache Spark on Hortonworks Sandbox 2.3 and 2.3.1.
