Getting Started with SparkSQL on the Hortonworks Sandbox I: Installing Zeppelin

SparkSQL is a module for Spark which allows you to run SQL queries over the Spark engine. It works alongside Hive, in the sense that it reuses the Hive frontend and metastore to give you complete compatibility with any of your Hive UDFs, data, or queries. The advantage of SparkSQL over classic Hive (which used MapReduce to run queries) is speed and integration with Spark.

Before diving into SparkSQL, it’s worth noting that Hive on Apache Tez, of which integration was brought about by Hortonworks’ Stinger Initiative, increased Hive’s performance on interactive queries. This slide deck by Hortonworks even suggests that Hive on Tez outperforms SparkSQL for short running queries, ETL, large joins and aggregates, and resource utilization. However, SparkSQL still makes for a great way to explore your data on your cluster. With APIs in Java, Python, and Scala, as well as a web based interpreter called Zeppelin, it offers a multitude of options for running queries and even visualizing data.

To start you off easy, we’ll begin this guide by installing Zeppelin so that we can use the web based interpreter. If you’re familiar with R, think of Zeppelin as a bit like RStudio. It allows for data ingestion, discovery, analytics, and visualization.

Install Zeppelin

Download Zeppelin

Let’s start by downloading the Zeppelin files. To do this, we’ll clone it from its Github repository:

git clone https://github.com/apache/incubator-zeppelin.git

Install Maven

In order to build Zeppelin, you’ll need Maven on your sandbox. To install it, first download it:

wget http://apache.cs.utah.edu/maven/maven-3/3.3.3/binaries/apache-maven-3.3.3-bin.tar.gz

Extract it:

tar xvf apache-maven-3.3.3-bin.tar.gz

Move it:

mv apache-maven-3.3.3 /usr/local/apache-maven

Add the environment variables:

export M2_HOME=/usr/local/apache-maven
export M2=$M2_HOME/bin
export PATH=$M2:$PATH

Run this command:

source ~/.bashrc

Then verify that Maven works by running this command:

mvn -version

Build Zeppelin

With Maven installed, you’re now ready to build Zeppelin:

mvn clean install -DskipTests -Pspark-1.3 -Dspark.version=1.3.1 -Phadoop-2.6 -Pyarn

Prepare for it to take about 15-20 minutes to build. Let it run.

When the build finishes you should get a screen that looks like this:

Zeppelin Maven Build

Next, create the zeppelin-env.sh file by copying the zeppelin-env.sh.template:

cp conf/zeppelin-env.sh.template conf/zeppelin-env.sh

The above code assumes you are in the directory that you downloaded zeppelin to.

Configure Zeppelin

Next, edit the zeppelin-env.sh file:

vi conf/zeppelin-env.sh

Hit i to enter edit/insert mode. Add these lines at the end of the file:

export HADOOP_CONF_DIR=/etc/hadoop/conf
export ZEPPELIN_PORT=10008
export ZEPPELIN_JAVA_OPTS="-Dhdp.version=2.3.0.0-2557"

Note: the Dhdp.version should be the version of Hadoop that you are running. If you are running the 2.3 version of the sandbox your version will be the same as mine.To figure out your version in other cases, run this code:

hadoop version

You should get something that looks like this:

Hadoop version

The version you’ll want to type in comes after the first three numbers (2.7.1).

Once you know your version, go back and edit the zeppelin-env.sh file as instructed. Hit escape, then ‘:wq’ to save and quit.

Next, copy the hive-site.xml file to the conf folder:

cp /etc/hive/conf/hive-site.xml conf

The above code assumes you are in the directory you downloaded Zeppelin to.

Switch to the hdfs user next:

su hdfs

Then create a directory in HDFS for zeppelin:

hdfs dfs -mkdir /user/zeppelin
hdfs dfs -chown zeppelin:hdfs /user/zeppelin

You’re almost ready to start Zeppelin. But first you need to change your port forwarding settings on your sandbox.

Add Port Forwarding

Power off your sandbox, then navigate to the Machine > Settings while your Hortonworks sandbox is selected in the VirtualBox Manager. Click on Network once in the settings. You should be in NAT mode. Click on Advanced > Port Forwarding.

Port Forwarding

Next, add a rule by clicking the green plus side. Call the rule zeppelin, and give it the Host Name 127.0.0.1, the Host Port 10008, and the Guest Port 10008.

Zeppelin Port

Click OK twice, then start your sandbox back up.

Start Zeppelin

Ready to start Zeppelin? Navigate to wherever you downloaded Zeppelin (the incubator-zeppelin folder), then type this code into your command line:

bin/zeppelin-daemon.sh start

Congratulations! You should now have Zeppelin up and running on port 10008, i.e. 127.0.0.1:10008.

If you want to stop it, run this code:

bin/zeppelin-daemon stop

With Zeppelin up and running, it’s time to start exploring SparkSQL. Check out Part II of this guide for an introduction to SparkSQL.

Leave a Comment.