SparkSQL is a Spark module that lets you run SQL queries on the Spark engine. It works alongside Hive, reusing the Hive frontend and metastore to give you full compatibility with your existing Hive UDFs, data, and queries. The advantage of SparkSQL over classic Hive (which used MapReduce to run queries) is speed and tight integration with the rest of Spark.
Before diving into SparkSQL, it’s worth noting that Hive on Apache Tez, whose integration came out of Hortonworks’ Stinger Initiative, improved Hive’s performance on interactive queries. This slide deck by Hortonworks even suggests that Hive on Tez outperforms SparkSQL for short-running queries, ETL, large joins and aggregates, and resource utilization. However, SparkSQL is still a great way to explore the data on your cluster. With APIs in Java, Python, and Scala, as well as a web-based interpreter called Zeppelin, it offers a multitude of options for running queries and even visualizing data.
To start you off easy, we’ll begin this guide by installing Zeppelin so that we can use the web based interpreter. If you’re familiar with R, think of Zeppelin as a bit like RStudio. It allows for data ingestion, discovery, analytics, and visualization.
Install Zeppelin
Download Zeppelin
Let’s start by downloading the Zeppelin files. To do this, we’ll clone it from its GitHub repository:
git clone https://github.com/apache/incubator-zeppelin.git
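The rest of this guide assumes you’re working from inside the cloned repository (the incubator-zeppelin folder), so change into it now:
cd incubator-zeppelin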
Install Maven
In order to build Zeppelin, you’ll need Maven on your sandbox. To install it, first download it:
wget http://apache.cs.utah.edu/maven/maven-3/3.3.3/binaries/apache-maven-3.3.3-bin.tar.gz
Extract it:
tar xvf apache-maven-3.3.3-bin.tar.gz
Move it:
mv apache-maven-3.3.3 /usr/local/apache-maven
Add the environment variables to the end of your ~/.bashrc file:
export M2_HOME=/usr/local/apache-maven
export M2=$M2_HOME/bin
export PATH=$M2:$PATH
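If you’d rather append them from the shell instead of opening an editor, a heredoc like this (a minimal sketch that writes the same three lines) does the job:
cat >> ~/.bashrc <<'EOF'
export M2_HOME=/usr/local/apache-maven
export M2=$M2_HOME/bin
export PATH=$M2:$PATH
EOF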
Run this command:
source ~/.bashrc
Then verify that Maven works by running this command:
mvn -version
Build Zeppelin
With Maven installed, you’re now ready to build Zeppelin:
mvn clean install -DskipTests -Pspark-1.3 -Dspark.version=1.3.1 -Phadoop-2.6 -Pyarn
Prepare for it to take about 15-20 minutes to build. Let it run.
When the build finishes, Maven should print a BUILD SUCCESS summary.
Next, create the zeppelin-env.sh file by copying the zeppelin-env.sh.template:
cp conf/zeppelin-env.sh.template conf/zeppelin-env.sh
The above command assumes you are in the incubator-zeppelin directory you cloned Zeppelin into.
Configure Zeppelin
Next, edit the zeppelin-env.sh file:
vi conf/zeppelin-env.sh
Hit i to enter edit/insert mode. Add these lines at the end of the file:
export HADOOP_CONF_DIR=/etc/hadoop/conf
export ZEPPELIN_PORT=10008
export ZEPPELIN_JAVA_OPTS="-Dhdp.version=2.3.0.0-2557"
Note: the value passed to -Dhdp.version should match the HDP version running on your sandbox. If you are running the 2.3 version of the sandbox, your version will be the same as mine. To figure out your version in other cases, run this command:
hadoop version
The first line of the output should look something like Hadoop 2.7.1.2.3.0.0-2557. The version you’ll want to type in is everything after the first three numbers (2.7.1); in this example, that is 2.3.0.0-2557.
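Alternatively, if your sandbox ships Hortonworks’ hdp-select utility (an assumption; it is present on recent HDP sandboxes), it can print the stack version directly, with no parsing needed:
hdp-select versions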
Once you know your version, go back and edit the zeppelin-env.sh file as instructed. Hit escape, then ‘:wq’ to save and quit.
Next, copy the hive-site.xml file to the conf folder:
cp /etc/hive/conf/hive-site.xml conf
The above command assumes you are in the incubator-zeppelin directory.
Switch to the hdfs user next:
su hdfs
Then create a directory in HDFS for zeppelin:
hdfs dfs -mkdir /user/zeppelin
hdfs dfs -chown zeppelin:hdfs /user/zeppelin
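To double-check that the directory exists and is owned by the zeppelin user, you can list it:
hdfs dfs -ls /user | grep zeppelin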
You’re almost ready to start Zeppelin. But first you need to change your port forwarding settings on your sandbox.
Add Port Forwarding
Power off your sandbox, then navigate to Machine > Settings while your Hortonworks sandbox is selected in the VirtualBox Manager. Click on Network once in the settings. You should be in NAT mode. Click on Advanced > Port Forwarding.
Next, add a rule by clicking the green plus sign. Call the rule zeppelin, and give it the Host IP 127.0.0.1, the Host Port 10008, and the Guest Port 10008.
Click OK twice, then start your sandbox back up.
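If you prefer the command line, VirtualBox’s VBoxManage tool can add the same rule while the sandbox is powered off. This is just a sketch; replace "Hortonworks Sandbox" with whatever your VM is named in the VirtualBox Manager:
VBoxManage modifyvm "Hortonworks Sandbox" --natpf1 "zeppelin,tcp,127.0.0.1,10008,,10008"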
Start Zeppelin
Ready to start Zeppelin? Navigate to wherever you downloaded Zeppelin (the incubator-zeppelin folder), then type this code into your command line:
bin/zeppelin-daemon.sh start
Congratulations! You should now have Zeppelin up and running on port 10008, i.e. 127.0.0.1:10008.
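If the page doesn’t come up, check the daemon’s log files; by default they should land in the logs/ directory of your Zeppelin checkout (an assumption based on Zeppelin’s default log location):
tail -n 50 logs/*.log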
If you want to stop it, run this code:
bin/zeppelin-daemon.sh stop
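The same script should also accept a status argument (subcommands can vary slightly between Zeppelin versions) if you just want to check whether the daemon is running:
bin/zeppelin-daemon.sh status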
With Zeppelin up and running, it’s time to start exploring SparkSQL. Check out Part II of this guide for an introduction to SparkSQL.