Getting Started with SparkSQL on the Hortonworks Sandbox I: Installing Zeppelin

SparkSQL is a module for Spark that allows you to run SQL queries over the Spark engine. It works alongside Hive, reusing the Hive frontend and metastore to give you complete compatibility with your existing Hive UDFs, data, and queries. The advantage of SparkSQL over classic Hive (which used MapReduce to run queries) is speed and integration with Spark.

Before diving into SparkSQL, it’s worth noting that Hive on Apache Tez, whose integration was brought about by Hortonworks’ Stinger Initiative, increased Hive’s performance on interactive queries. This slide deck by Hortonworks even suggests that Hive on Tez outperforms SparkSQL for short-running queries, ETL, large joins and aggregates, and resource utilization. However, SparkSQL still makes for a great way to explore the data on your cluster. With APIs in Java, Python, and Scala, as well as a web-based interpreter called Zeppelin, it offers a multitude of options for running queries and even visualizing data.

To start you off easy, we’ll begin this guide by installing Zeppelin so that we can use the web based interpreter. If you’re familiar with R, think of Zeppelin as a bit like RStudio. It allows for data ingestion, discovery, analytics, and visualization.

Install Zeppelin

Download Zeppelin

Let’s start by downloading the Zeppelin files. To do this, we’ll clone it from its Github repository:

git clone

Install Maven

In order to build Zeppelin, you’ll need Maven on your sandbox. To install it, first download it:


Extract it:

tar xvf apache-maven-3.3.3-bin.tar.gz

Move it:

mv apache-maven-3.3.3 /usr/local/apache-maven

Add the environment variables:

export M2_HOME=/usr/local/apache-maven
export M2=$M2_HOME/bin
export PATH=$M2:$PATH

Run this command:

source ~/.bashrc
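Note that `source ~/.bashrc` only picks up the variables if the exports are actually in that file. A minimal sketch of appending them is below; it writes to a temporary file so the effect is easy to see, but on the sandbox you would append to ~/.bashrc itself:

```shell
# Sketch: persist the Maven variables. On the sandbox, replace "$rcfile"
# with ~/.bashrc; a temporary file is used here for illustration.
rcfile=$(mktemp)
cat >> "$rcfile" <<'EOF'
export M2_HOME=/usr/local/apache-maven
export M2=$M2_HOME/bin
export PATH=$M2:$PATH
EOF
# Sourcing the file brings the variables into the current shell.
. "$rcfile"
echo "$M2"
```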

Then verify that Maven works by running this command:

mvn -version

Build Zeppelin

With Maven installed, you’re now ready to build Zeppelin:

mvn clean install -DskipTests -Pspark-1.3 -Dspark.version=1.3.1 -Phadoop-2.6 -Pyarn

Prepare for it to take about 15-20 minutes to build. Let it run.

When the build finishes you should get a screen that looks like this:

Zeppelin Maven Build

Next, create the file by copying the

cp conf/ conf/

The above code assumes you are in the directory that you downloaded Zeppelin to.

Configure Zeppelin

Next, edit the file:

vi conf/

Hit i to enter edit/insert mode. Add these lines at the end of the file:

export HADOOP_CONF_DIR=/etc/hadoop/conf
export ZEPPELIN_PORT=10008
export ZEPPELIN_JAVA_OPTS="-Dhdp.version="

Note: the -Dhdp.version value should be the version of Hadoop that you are running. If you are running the 2.3 version of the sandbox, your version will be the same as mine. To figure out your version in other cases, run this command:

hadoop version

You should get something that looks like this:

Hadoop version

The version you’ll want to type in comes after the first three numbers (2.7.1).
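If you’d rather extract that suffix programmatically, you can strip the leading three-number release from the first line of the `hadoop version` output. A hedged sketch, using a hypothetical sample string in place of the real command’s output (substitute your own):

```shell
# Hypothetical first line of `hadoop version` output on an HDP 2.3 sandbox.
sample="Hadoop 2.7.1.2.3.0.0-2557"
# Strip "Hadoop " plus the first three dot-separated numbers; what remains
# is the value to plug into -Dhdp.version.
hdp_version=$(echo "$sample" | sed 's/^Hadoop [0-9]*\.[0-9]*\.[0-9]*\.//')
echo "export ZEPPELIN_JAVA_OPTS=\"-Dhdp.version=$hdp_version\""
```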

Once you know your version, go back and edit the file as instructed. Hit escape, then ‘:wq’ to save and quit.

Next, copy the hive-site.xml file to the conf folder:

cp /etc/hive/conf/hive-site.xml conf

The above code assumes you are in the directory you downloaded Zeppelin to.

Switch to the hdfs user next:

su hdfs

Then create a directory in HDFS for zeppelin:

hdfs dfs -mkdir /user/zeppelin
hdfs dfs -chown zeppelin:hdfs /user/zeppelin

You’re almost ready to start Zeppelin. But first you need to change your port forwarding settings on your sandbox.

Add Port Forwarding

Power off your sandbox, then navigate to Machine > Settings while your Hortonworks sandbox is selected in the VirtualBox Manager. Click on Network once in the settings. You should be in NAT mode. Click on Advanced > Port Forwarding.

Port Forwarding

Next, add a rule by clicking the green plus sign. Call the rule zeppelin, and give it the Host Name, the Host Port 10008, and the Guest Port 10008.

Zeppelin Port

Click OK twice, then start your sandbox back up.

Start Zeppelin

Ready to start Zeppelin? Navigate to wherever you downloaded Zeppelin (the incubator-zeppelin folder), then type this code into your command line:

bin/ start

Congratulations! You should now have Zeppelin up and running on port 10008, i.e.

If you want to stop it, run this code:

bin/zeppelin-daemon stop

With Zeppelin up and running, it’s time to start exploring SparkSQL. Check out Part II of this guide for an introduction to SparkSQL.

Setting up IPython Notebook with Apache Spark on the Hortonworks Sandbox

Apache Spark, if you haven’t heard of it, is a fast, in-memory data processing engine that can be used with data on Hadoop. It offers excellent performance and can handle tasks such as batch processing, streaming, interactive queries and machine learning.

Spark offers APIs in Java, Scala, and Python. Today, I’ll be covering how to get an IPython notebook set up on your Hortonworks sandbox so you can use it to run ad-hoc queries, data exploration, analysis, and visualization over your data.

Let’s get started


First, make sure your sandbox’s network adapter is in NAT mode.

While your sandbox is powered off, navigate to the settings of your Hortonworks sandbox by clicking Machine > Settings while ‘Hortonworks Sandbox with HDP 2.3_1’ is highlighted.

Navigate to Network once in the settings. Ensure that your NAT network adapter is turned on, and that it is the only network adapter turned on:

Enable NAT Adapter

The default settings should be fine.

With the proper network adapter enabled, start up your machine.

Install IPython

Next, use yum to install a number of necessary dependencies:

yum install nano centos-release-SCL zlib-devel bzip2-devel openssl-devel ncurses-devel sqlite-devel readline-devel tk-devel gdbm-devel db4-devel libpcap-devel xz-devel libpng-devel libjpg-devel atlas-devel

Make sure you include all of them. When you’re prompted with a y/N question, be sure to type y.

Next, install the development tools dependency for Python 2.7:

yum groupinstall "Development tools"

Then, install Python 2.7:

yum install python27

Since there are now multiple Python versions on your Hortonworks sandbox, you need to switch to Python 2.7:

source /opt/rh/python27/enable

Don’t worry, the Python version will switch back to its default the next time you power off your sandbox.

Next, download, which will let you install easy_install. You’ll then use easy_install to install pip. It’s a bit of jumping through hoops, but these tools will get you what you need. Start by navigating to a test directory, or any directory you feel comfortable downloading files to. I chose to create a directory called /test_dev/:

mkdir /test_dev/
cd /test_dev/

Next, download




Now install pip:

easy_install-2.7 pip

With pip installed, you can now install some of the packages you’ll want to use with IPython. You want to install the following packages:

numpy scipy pandas scikit-learn tornado pyzmq pygments matplotlib jinja2 jsonschema

You can install them by running this code, followed by the packages you want to install:

pip install

For example:

pip install scipy pandas scikit-learn

I recommend installing only a few at a time, as you may run into issues if you try to install them all at once. The first couple of packages take a good amount of time to install.
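One way to script the “a few at a time” approach is to loop over the list and install each package individually. The sketch below only echoes each command so the sequence is visible; on the sandbox, drop the `echo` to actually install:

```shell
# Install packages one at a time to avoid resolver/memory issues on the sandbox.
packages="numpy scipy pandas scikit-learn tornado pyzmq pygments matplotlib jinja2 jsonschema"
for pkg in $packages; do
    # Replace `echo` with the bare pip command to perform the real installation.
    echo pip install "$pkg"
done
```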

Install IPython Notebook next:

pip install "ipython[notebook]"

Next, create an IPython profile for Spark:

ipython profile create pyspark

Before we continue any further, you should set up port forwarding for port 8889 on your sandbox.

Setting up port forwarding

Power off your sandbox. Go into the settings of your Hortonworks sandbox, then navigate to Network:

Network settings

Click Port Forwarding. Then click the plus sign to add a new rule. Call the rule ipython, give it the host IP, host port 8889, and guest port 8889. Leave the guest IP blank. It should look like this when you are done:

Port Forwarding in VirtualBox

Click OK, then OK again.

Start your sandbox back up.

Create a shell script to launch IPython Notebook

To make things easier for you to launch IPython Notebook, let’s create a shell script. Navigate to the folder you’d like to launch IPython Notebook from. For me that was /test_dev/:

cd /test_dev/

Launch nano to create the shell script:


Enter this code in nano:

source /opt/rh/python27/enable
IPYTHON_OPTS="notebook --port 8889 --notebook-dir=u'/usr/hdp/' --ip='*' --no-browser" pyspark

Type control + o to write/save the file. Then control + x to quit.

Run it using this code:


It should look like this:

IPython Notebook running on Hortonworks sandbox

Congratulations! You’re now running IPython Notebook on the Hortonworks sandbox in conjunction with Spark.

Using your new IPython Notebook

To start exploring with your new notebook, go to this web address:

That’s it for this guide. In the next guide I’ll cover some basics of IPython Notebook and how you can get started using it with Spark.


Credit goes out to Hortonworks for writing their own guide, which I used as a basis of knowledge for this post. Since their guide was outdated at the time of writing, this post has updates and modifications which ensure a seamless installation of IPython Notebook with Apache Spark on Hortonworks Sandbox 2.3 and 2.3.1.

Connecting to an Accumulo Instance Remotely via Java Client Code

In my last guide, I showed you how to properly set up Accumulo on a Hortonworks sandbox. This time, I’ll be showing you how to remotely connect to that Accumulo instance via a Java client.

Setting up

If you haven’t already set up Eclipse for big data development, go look at this post. It will cover how to set up Maven, as well as how to connect to Github (if you wish).

Before we get started, create a new Java project in Eclipse, then convert it to a Maven project. Recall that to convert to a Maven project you can right click on it, then click Configure > Convert to Maven Project.

Edit the pom.xml file

In order to run the code you’ll be using, you need to add the proper jars to your pom.xml file. Double click your pom.xml file, then click Dependencies.

Click ‘Add…’ then fill out the following information:

Accumulo-core dependency

You can leave the scope as Compile.

Click Done.

Add your sandbox IP to your hosts file

To make it easier to connect to your sandbox, we’ll add the IP address to your hosts file. To do this on a Mac, open up your terminal and type:

sudo nano /etc/hosts

Add a line that looks similar to this (may vary based on your sandbox IP address):

Hosts file
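For reference, a hosts entry pairs an IP address with one or more names, separated by whitespace. The address and hostnames below are illustrative (a common VirtualBox host-only address; yours may differ):

```
192.168.56.101   sandbox.hortonworks.com sandbox
```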

Save and exit.

Edit the Authorizations of the root user

Since the code in this guide covers an Accumulo concept known as Authorizations, you’ll need to give your root user the proper authorizations in order to use it.

To give your root user an authorization to modify Accumulo rows that have a “public” authorization, first power on your sandbox and start up Accumulo:


Log on to the accumulo shell:

accumulo shell

Enter your password (hadoop).

Then run this code to tell Accumulo to give the root user an authorization for any rows with the ‘public’ authorization:

setauths -s public -u root

Finally, quit out of the accumulo shell:


Create a file

This step is optional, but if you choose to ignore it you’ll have to edit out a few lines of code in the Java program you create. A file will allow you to use a Logger with your Java program, which is useful for debugging.

See Customizing Log/Print Statements in the HBase guide for information on how to create the file.

Create the Java class

Now that everything is set up, create the class you’ll be using to connect to Accumulo. I called my class AccumuloConnection.class, but call it whatever you like.

Next, add this code to your class.

Examine the comments of the code to understand how it works. Some notes about the code:

  • It uses a Logger to keep track of system messages as well as any message the developer wants to print
  • It sets up the Logger configuration using the configuration file we created earlier
  • It connects to Accumulo and Zookeeper, then tries to create a table if the table doesn’t already exist
  • Next, it writes a mutation (row) to the server using a BatchWriter
    • Multiple mutations could be written using the BatchWriter, but for this example we just write one
  • Next, a scanner is created
    • Authorizations for the scanner are specified
    • A range to scan over is given
    • A column family is specified to further narrow down the search
  • Finally, the entries the scanner returns are iterated through and printed to console

Hope you enjoyed learning about Accumulo development using Java. If you have any questions feel free to reach out in the comments or an email.

How to Start Accumulo on the Hortonworks Sandbox

The Hortonworks sandbox is a great virtual environment for learning about technologies in the Hadoop ecosystem. It comes bundled with the ability to start and stop services such as MapReduce, Hive, HBase, Kafka, Spark and more in just a few clicks.

Unfortunately, installing and starting Accumulo on the Hortonworks sandbox is a little trickier than that. Luckily for you, all you have to do is follow the steps in this guide and you’ll be up and running Accumulo in no time.

Setting up

Before we get started, make sure your Hortonworks sandbox has the proper network adapter settings.

While your sandbox is powered off, navigate to the settings of your Hortonworks sandbox by clicking Machine > Settings while ‘Hortonworks Sandbox with HDP 2.3_1’ is highlighted. Navigate to Network once in the settings and ensure that your NAT network adapter is turned on, and that it is the only network adapter turned on:

Enable NAT Adapter

The default settings should be fine.

Once you’ve enabled the correct network adapter, start up your virtual machine.

Installing Accumulo

Since Accumulo doesn’t come bundled with the Hortonworks sandbox you’ll have to install it. Run this code:

yum install accumulo

It will install Accumulo under a directory similar to /usr/hdp/

If the install code fails, run this command, then try again:

sudo sed -i "s/mirrorlist=https/mirrorlist=http/" /etc/yum.repos.d/epel.repo

Switch to Host-only adapter

Before we go any further, switch to the host-only adapter on your Hortonworks virtual machine. This will allow you to set things up properly in case you later decide you want to connect to your Accumulo instance remotely.

First, power down your VM.

Navigate to your Hortonworks virtual machine settings: click the Hortonworks sandbox in the VirtualBox Manager, then click Machine > Settings. Navigate to Network and ensure the NAT network adapter is unchecked.

Ensure NAT is turned off

Next, turn on your Host-only network adapter:

Host-only network adapter

The default settings should be fine. If you were able to follow these steps, move on to Copy a Configuration Example.

If you don’t see an option for a Host-only network adapter after Name, navigate to VirtualBox > Preferences > Network > Host-only Networks:

Adding a new host-only network

Click the plus sign to add a new Host-only network. Then go back and ensure that your Hortonworks virtual machine settings (Machine > Settings > Network) are correct, i.e. ensure NAT is unchecked, and create a new network adapter for which you select Host-only and vboxnet0. Refer to the screenshots/steps above if you’re still lost, or reach out to me to ask.

Copy a configuration example

Next, you’ll want to copy the files from one of the configuration examples provided by Accumulo to Accumulo’s config directory:

cp /usr/hdp/* /usr/hdp/

You can choose different sizes ranging from 512MB to 3GB based on your available memory. I chose 1GB since my sandbox’s memory is on the smaller side.


Next, open in a text editor. I’ll use vi:

vi /usr/hdp/

Press i to enter insert/edit mode in vi.

Edit the line about your JAVA_HOME to be:

test -z "$JAVA_HOME" && export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk.x86_64

The above line is most likely sandwiched inside an if…else statement. Don’t modify the if…else statements; just modify the line that looks like the one I gave you. The important change in the line I gave you is the path. Everything else should be as-is.

Next edit ZOOKEEPER_HOME to be:

test -z "$ZOOKEEPER_HOME" && export ZOOKEEPER_HOME=/usr/hdp/

Then, find the line with HADOOP_PREFIX and change it to:

test -z "$HADOOP_PREFIX" && export HADOOP_PREFIX=/usr/hdp/
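All three lines rely on the same shell idiom: `test -z "$VAR" && export VAR=...` exports the value only if the variable is currently empty, which is why it’s safe to change just the path. A quick illustration with a made-up variable (DEMO_HOME is not part of Accumulo’s config):

```shell
# `test -z` succeeds only when the variable is empty, so the export after
# && runs on the first attempt but not the second.
unset DEMO_HOME
test -z "$DEMO_HOME" && export DEMO_HOME=/usr/lib/jvm/java-1.7.0-openjdk.x86_64
# The second attempt is a no-op because DEMO_HOME is no longer empty
# (`|| true` keeps the failing test harmless under `set -e`):
{ test -z "$DEMO_HOME" && export DEMO_HOME=/some/other/path; } || true
echo "$DEMO_HOME"
```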

Finally, uncomment this line:


Press the escape key, then type ‘:wq’ to save and exit vi.

Edit accumulo-site.xml

Open up accumulo-site.xml in vi:

vi /usr/hdp/

Press i to enter edit/insert mode.

Change instance.secret property’s value to be hadoop:

  A secret unique to a given instance that all servers must know in order to communicate with one another. Change it before initialization. To change it later use ./bin/accumulo org.apache.accumulo.server.util.ChangeSecret --old [oldpasswd] --new [newpasswd], and then update this file.
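In XML terms, the edited property should end up looking roughly like the sketch below (the description text is abbreviated; only the value changes):

```xml
<property>
  <name>instance.secret</name>
  <value>hadoop</value>
  <description>A secret unique to a given instance that all servers
    must know in order to communicate with one another.</description>
</property>
```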

Then scroll down and change to hadoop:

Press escape, then ‘:wq’ to save and exit.

Edit gc, masters, monitor, slaves, and tracers files

To ensure you are able to use your Accumulo instance with a client program (Java) you need to replace ‘localhost’ in all of the following files with your sandbox’s IP address:

  • gc
  • masters
  • monitor
  • slaves
  • tracers

The files are located in your accumulo conf folder:

cd /usr/hdp/
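The replacement can be scripted with sed. The sketch below operates on temporary stand-in files so it’s safe to try anywhere; SANDBOX_IP is hypothetical, so substitute your sandbox’s actual address and run the sed line against the real conf files:

```shell
# Hypothetical sandbox IP; substitute your own (see `ifconfig` on the sandbox).
SANDBOX_IP=192.168.56.101
dir=$(mktemp -d)
for f in gc masters monitor slaves tracers; do
    echo "localhost" > "$dir/$f"            # stand-in for the real conf file
    sed -i "s/localhost/$SANDBOX_IP/" "$dir/$f"
done
cat "$dir/masters"
```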

Recall that you can figure out what your ip address is by typing this in your terminal:


Edit Accumulo User Properties

Now you need to change the accumulo user properties. Edit your password file:

vi /etc/passwd

Press i to edit. Scroll all the way to the bottom and edit the accumulo line to read:


Don’t worry if the third entry isn’t 496. The important thing is to change the fourth entry to 501 and the sixth entry to /home/accumulo. Press escape, then ‘:wq’ to save and exit.
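To make the field positions concrete, here is a hypothetical passwd entry (your third field, the UID, may differ) and how the colon-separated fields break down with `cut`:

```shell
# Hypothetical passwd line, fields: name:password:UID:GID:comment:home:shell
entry="accumulo:x:496:501::/home/accumulo:/bin/bash"
gid=$(echo "$entry" | cut -d: -f4)    # fourth field: group ID
home=$(echo "$entry" | cut -d: -f6)   # sixth field: home directory
echo "$gid $home"
```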

Create a home directory for accumulo

Next, we’ll create a home directory for the Accumulo data on both your local filesystem and the Hadoop filesystem. Think of this like a development directory. Create it using this code:

mkdir -p /home/accumulo/data


hadoop fs -mkdir -p /home/accumulo/data

Change the permissions:

chown -R accumulo:hadoop /home/accumulo/data
sudo -u hdfs hadoop fs -chown -R accumulo:hadoop /home/accumulo/data

Initialize Accumulo

Now that you have everything set up, it’s time to initialize Accumulo. Run the following lines of code in your Hortonworks sandbox:

su - accumulo
. /usr/hdp/
accumulo init

Once you run accumulo init a few messages will come up on your screen, followed by a message asking you to give accumulo an instance name. I kept it simple and chose accumulo-instance as mine, but choose whatever you like.

Next, enter the password from earlier: hadoop

Change file permissions of Accumulo folder on HDFS

In order for Accumulo to actually be able to run, we need to change the file permissions of the Accumulo folder on HDFS. To do that, exit the accumulo user by typing ‘su’ into your Hortonworks sandbox terminal.

You will be prompted for a password. Type hadoop.

Now run this code:

sudo -u hdfs hadoop fs -chmod 777 /accumulo

Run Accumulo

If you followed all the above steps, you should now be ready to run Accumulo. Enter this code into your terminal to run the start-all shell script:


You’re done! Accumulo should now be successfully running on your VM. To check, go to this web address:

If you notice that your instance name is null the first time you load that page, simply reload the page. It should then display properly.

If you just wanted to get Accumulo up and running, congratulations! You’ve successfully completed this guide.

As a bonus, here’s how to stop Accumulo.

Stopping Accumulo

Use this code to stop Accumulo:


Starting Accumulo Back Up

Want to start Accumulo back up?


Hope you enjoyed reading, and, as always, feel free to reach out with questions. In my next guide I’ll show you how to connect to Accumulo remotely.

The simple, quick start guide to using SQuirreL

SQuirreL, if you haven’t heard of it yet, is a graphical Java program (GUI) that lets you see the structure of your Phoenix database, browse the data in your tables, and run SQL queries over the data.

In the last big data guide, I helped you set up SQuirreL with Phoenix. If you’re reading this, I’ll assume you’re using Hortonworks sandbox and already have SQuirreL connected to Phoenix. In this guide, we’re going to delve into two SQuirreL basics that will quickly let you start viewing and transforming your data: viewing table information and running queries.

Viewing basic table information

To see information about your table such as the row count, columns, primary key, and indexes, you’ll want to be in the Objects view:


Select Table > YourTableName and you’ll be able to access some stats about your table.

Pretty simple right?

Let’s get to the good stuff: running SQL queries.

Running SQL Queries

SQuirreL allows you to run single SQL queries or batches of SQL queries. To do so, select the SQL tab, enter the queries you’d like to run, and click the icon that looks like a running person:


SQuirreL will run your queries and send the output to the screen, along with the time each took to run. There are also options to store the results of your SQL queries in a table or file:

Store results in table or file

Knowing how to do just these two things will allow you to run most of the SQL queries you might want. If you’re going for something more complex, you might find some of SQuirreL’s other features useful. Similarly, you might want to read up more about Phoenix and its features to understand how to optimize your queries and schemas.