How to Integrate Apache Phoenix with HBase

If you’re looking to get started with using Apache Phoenix, the open source SQL skin for HBase, the first thing you’ll want to do is install it. This guide will show you how to do that on the Hortonworks virtual sandbox.

If you’re running your setup on a machine that isn’t the Hortonworks sandbox, the installation guide over on the Phoenix website should help. Hortonworks also has an installation guide for both unsecure and secure Hadoop clusters. In this guide we’ll be setting up Phoenix on an unsecure cluster (the sandbox).

What is Apache Phoenix?

Before we start, let’s talk briefly about what Phoenix is and what it can do for you. As previously mentioned, Phoenix is an open source SQL skin for HBase. This means that it takes your SQL queries and transforms them into a series of HBase scans. The transformations are all done under the hood. For the most part, you can run SQL queries over HBase as if you were merely using a relational database like MySQL or SQLite.

What are some use cases for Phoenix?

Phoenix can be used for a few different use cases:

  • At SiftScience, they use Phoenix for ad-hoc queries and exposing data insights.
  • At Alibaba, they use Phoenix for queries over a large dataset with a relatively small result (10,000 records or so), or for complicated queries over a large dataset with a large result (millions of records).
  • At eBay, they use Phoenix for path or flow analysis, as well as for real-time data trends.

To see more use cases, go here.

Where to learn more

If you’re inclined to learn more about Phoenix before we get started, check out the FAQ, learn about which SQL statements are supported (a lot), or simply check out the project home page.

Installing Phoenix

Ready to get started? We’re going to be using the open-source, package management utility yum (Yellowdog Updater, Modified) to install Phoenix. To start the installation run:

yum install phoenix

Possible installation errors (and their fixes)

There’s a good chance that command will fail if you haven’t used yum on this machine before.

If it fails with the error message Couldn’t resolve host mirrorlist.centos.org, the issue most likely stems from your network adapter. Check your network adapter settings: Machine > Settings > Network. You should have a network adapter enabled that is attached to NAT. Make sure no other network adapters are enabled. If you don’t have a NAT adapter enabled, power off your machine. Once it’s powered off you can return to the same Machine > Settings > Network menu to add or enable a NAT adapter. The default settings should be fine:

Enable NAT Adapter

If you receive the message Error: Cannot retrieve metalink for repository: epel, you will have to run this command in your VM:

sudo sed -i "s/mirrorlist=https/mirrorlist=http/" /etc/yum.repos.d/epel.repo

It will update yum’s repository to use http instead of https.

Installing Phoenix with Yum

With the above fixes in place, you should be ready to install Phoenix with yum. Run this code:

yum install phoenix

Once the installation finishes, find your Phoenix core jar file. For me it was located at /usr/hdp/2.3.0.0-2557/phoenix/lib/phoenix-core-4.4.0.2.3.0.0-2557.jar. Link the Phoenix core jar file to the HBase Master and Region servers. Here was the code I used to link it:

ln -sf /usr/hdp/2.3.0.0-2557/phoenix/lib/phoenix-core-4.4.0.2.3.0.0-2557.jar /usr/hdp/2.3.0.0-2557/hbase/lib/phoenix.jar

Change the version numbers if you have different versions of Hadoop (HDP) or Phoenix.

Edit the hbase-site.xml

The next step is editing the hbase-site.xml. Run this:

vi /usr/hdp/2.3.0.0-2557/etc/hbase/conf.dist/hbase-site.xml

Again, change the version numbers in that command as necessary. Now that you’re in vi, a Linux text editor, press i to change from command mode to insert mode. Insert mode lets you change the text in the file, while command mode lets you run actions on the file, such as saving it. Place these properties between the two configuration tags:

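<property>
   <name>hbase.defaults.for.version.skip</name>
   <value>true</value>
</property>
<property>
   <name>hbase.regionserver.wal.codec</name>
   <value>org.apache.hadoop.hbase.regionserver.wal.IndexedWALEditCodec</value>
</property>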

The file should look like this when you’re done:

hbase-site.xml file

Notice that ‘:wq’ will allow you to save the file and exit

Save the file by pressing ‘ESC’ to change from Insert Mode to Command mode, then hit ‘:wq’ to save and quit.
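
If you’d like to double-check the edit before restarting HBase, a short script can parse the file and confirm that both properties are set. This is only a minimal sketch and not part of the original setup: the function names are made up for illustration, and it assumes the file uses the standard configuration/property/name/value layout.

```python
import xml.etree.ElementTree as ET

# The two properties this guide adds to hbase-site.xml, with expected values.
EXPECTED = {
    "hbase.defaults.for.version.skip": "true",
    "hbase.regionserver.wal.codec":
        "org.apache.hadoop.hbase.regionserver.wal.IndexedWALEditCodec",
}


def read_hbase_properties(xml_text):
    """Return {name: value} for every <property> element in the XML text."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def missing_phoenix_properties(xml_text):
    """List the expected properties that are absent or set to another value."""
    props = read_hbase_properties(xml_text)
    return [name for name, value in EXPECTED.items() if props.get(name) != value]


# Hypothetical usage (adjust the path to your HDP version):
#   with open("/usr/hdp/2.3.0.0-2557/etc/hbase/conf.dist/hbase-site.xml", "rb") as f:
#       print(missing_phoenix_properties(f.read()))
```

An empty list means both properties were found with the values shown above.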

Start HBase

If HBase isn’t running yet, you need to start it. Similarly, if HBase is already running, you need to restart it.

Log into Ambari in your browser at 127.0.0.1:8080 with username/password admin/admin. If that doesn’t work, check which IP address to use by running this command in your terminal:

ifconfig

Once you’re logged in, start HBase by clicking HBase on the left panel

Starting HBase in Ambari

then Service Actions > Start:

HBase tab in Ambari

Give it a minute or two to start if you get a red alert when it first starts up. If the alert persists, you may have to stop another service to free up memory on your sandbox. I chose to stop MapReduce2 for now. You can always enable it later.

Phoenix should now be installed and ready for use.

Testing your new Phoenix installation

To test your new Phoenix installation, navigate to Phoenix’s bin folder:

cd /usr/hdp/2.3.0.0-2557/phoenix/bin

Let’s run the sqlline.py program:

python sqlline.py localhost:2181:/hbase-unsecure

It may take a minute or two to start up. If it hangs for too long go check Ambari to make sure HBase is still running.
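
The argument passed to sqlline.py is a ZooKeeper connection string: the quorum host, the client port, and the parent znode that HBase uses (/hbase-unsecure on this sandbox). As a rough sketch, the helpers below assemble that string and the matching Phoenix JDBC URL. The function names are made up for illustration, and the defaults are simply the values used in this guide.

```python
def phoenix_connection_string(zk_quorum="localhost", zk_port=2181,
                              znode_parent="/hbase-unsecure"):
    """Build the quorum:port:znode argument that sqlline.py expects."""
    return "{}:{}:{}".format(zk_quorum, zk_port, znode_parent)


def phoenix_jdbc_url(zk_quorum="localhost", zk_port=2181,
                     znode_parent="/hbase-unsecure"):
    """Build the JDBC URL that points at the same cluster."""
    return "jdbc:phoenix:" + phoenix_connection_string(
        zk_quorum, zk_port, znode_parent)


print(phoenix_connection_string())
print(phoenix_jdbc_url())
```

If your cluster uses a different ZooKeeper host or znode, pass those values in instead of relying on the defaults.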

Once the program starts, enter these commands:

create table test (mykey integer not null primary key, mycolumn varchar);
upsert into test values (1,'Hello');
upsert into test values (2,'World!');
select * from test;

The first command creates a table called test with an integer (numeric) key and a varchar (text) column. The next two commands insert rows into the table. The fourth command selects all rows from the table and prints them to the screen:

Phoenix results

That’s it for now! You’ve successfully integrated Apache Phoenix with HBase and used it to create a simple table. If you’d like to use a GUI to interact with Phoenix, go check out this guide. To dive deeper into Phoenix, check out the quick start guide, or the FAQ. And as always, if you have any questions feel free to reach out to me.

MapReduce & Eclipse: a Quick Start Guide for Java Developers

Want to start developing MapReduce programs in Java using Eclipse? This guide will get you up to speed.

It will walk you through setting up Maven (a great build manager) with Eclipse, setting up GitHub (a great host for Git repositories) with Eclipse, and setting up a shared folder between your computer and the Hortonworks sandbox. It concludes with an example MapReduce application written in Java for you to learn from.

Setting up Maven with Eclipse

When developing MapReduce programs in Eclipse, a lot of the code you’ll be using requires you to have certain .jar files on your system. One way to do this is to download the jar files yourself. A better way to do it is to use Maven. Maven helps you manage your project builds.

Installing M2Eclipse

To use Maven with Eclipse, we’ll be using the plugin M2Eclipse. Open Eclipse and navigate to Help > Install New Software. Enter ‘http://download.eclipse.org/technology/m2e/releases’ in the form after Work with: and click Add. Once the download options appear, select Maven Integration for Eclipse.

Finish the installation and restart Eclipse.

Converting a project to a Maven project

To convert your project to a Maven project, right click on it and select Configure > Convert to Maven Project. The default settings it brings up should be fine. Click Finish.

You now have a Maven Project.

Adding dependencies to your Maven project

To add dependencies (jar files) to your Maven project, click on the pom.xml. Click Dependencies, then click Add. You’ll need to input the Group ID, Artifact ID, and Version for each dependency you add. For a MapReduce program on the Hortonworks sandbox running Hadoop version 2.7.1, I include hadoop-client 2.7.1 and commons-logging 1.1.1. The former has the Group ID org.apache.hadoop, the Artifact ID hadoop-client, and the version 2.7.1. The latter has the Group ID commons-logging, the Artifact ID commons-logging, and the version 1.1.1.

You could search the internet for the specific group, artifact, and versions you want, or you could connect Maven to a repository and do the search right from the Add popup.
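
If you would rather edit pom.xml by hand than use the Dependencies tab, each entry becomes a dependency element inside the dependencies section. The sketch below prints those elements for the two dependencies named above. The helper name is made up for illustration, and the coordinates are the ones from this guide, so swap in your own versions as needed.

```python
import xml.etree.ElementTree as ET

# (Group ID, Artifact ID, Version) for the two dependencies used in this guide.
DEPENDENCIES = [
    ("org.apache.hadoop", "hadoop-client", "2.7.1"),
    ("commons-logging", "commons-logging", "1.1.1"),
]


def dependency_xml(group_id, artifact_id, version):
    """Return one Maven <dependency> element as an XML string."""
    dep = ET.Element("dependency")
    ET.SubElement(dep, "groupId").text = group_id
    ET.SubElement(dep, "artifactId").text = artifact_id
    ET.SubElement(dep, "version").text = version
    return ET.tostring(dep, encoding="unicode")


for coords in DEPENDENCIES:
    print(dependency_xml(*coords))
```

Paste the printed elements between the dependencies tags of your pom.xml, then let Eclipse refresh the project.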

Connecting to the Hortonworks repository

By now, you have M2Eclipse and a Maven project set up in Eclipse, but you still don’t have any searchable repositories. To get those, you’ll need to edit or create the settings.xml file in the .m2 folder (${HOME}/.m2/). This folder might be hidden on your computer.

Creating or Editing your settings.xml file

If you don’t have a settings.xml file, you’ll need to create it. Using your favorite text editor (Sublime Text is a nice one) paste the code from here. If you already have a settings.xml file, simply add the standard-extra-repos profile. Save your new settings as settings.xml in the .m2 folder (${HOME}/.m2/).

Now that you have a settings.xml file, go back into Eclipse. Select Eclipse > Preferences > Maven > User Settings. Update your settings so that it points to your new settings.xml file. Click Update Settings and Apply. Restart Eclipse and you should now be able to search for Maven dependencies.

This concludes the section on setting up Maven. If you run into any issues, a Google search will be your friend; plenty of people have probably run into the same bug before. If you’re still stuck, feel free to email me.

Connecting Eclipse with your Github account

GitHub, if you haven’t heard of it, is a popular hosting service for projects that use the Git version control system. To learn more about it, check out the short explanation here, medium explanation here, and longer tutorials here. It’s a great way to collaborate with teammates in a distributed fashion.

Let’s get started. If you don’t already have an account at Github, create one.

Installing Egit and Jgit

Have an account? Time to set up Eclipse to work seamlessly with Github. Open Eclipse and navigate to Help > Install New Software. Enter ‘http://download.eclipse.org/egit/updates’ in the form after Work with: and click Add. Once the download options appear, select Eclipse Team Provider and JGit.

Egit and Jgit Installation for Eclipse

Finish the installation and restart Eclipse.

Creating a repository

Now, before we can do anything in Eclipse, we need a repository to upload our project to. Go to Github in your browser, log in, and click the ‘+ New Repository’ button. Give it a name, set it to Public or Private, and initialize it with a README. You can edit the README if you like by clicking on it and hitting the ‘Edit this file’ button/pencil.

Next, copy the clone URL of your repository by clicking the copy to clipboard button right below where it says ‘HTTPS clone URL’. It’s on the right side of your repository in your browser, towards the bottom of the screen.

Cloning the repository in Eclipse

Head back to Eclipse and make sure the Git Repositories view is showing: Window > Show View > Other… > Git > Git Repositories. Click Clone a Git Repository in the new window that appeared. Egit should automatically fill out URI, Host, and Repository path. Fill out your username and password in the Authentication section. Store it in the secure store if you don’t want to keep typing it in. Now, select your Master branch, hit Next, pick where you want to store it, and hit Finish.

Sharing your project with the repository

The next step is sharing your Eclipse project with the repository you just cloned. To do this, right click on your project and navigate to Team > Share Project. Select the repository you just created and hit Finish.

We’re not done yet. Right click on your project and hit ‘Team > Add to Index’. This will allow you to start tracking changes. Next, create a .gitignore file so that your bin folder won’t be tracked by Git (tracking the bin folder leads to file conflicts).

Egit - adding .gitignore file

Source: http://wiki.eclipse.org/EGit/User_Guide#Getting_Started

If a .gitignore already exists but you cannot see it in your project files, try clicking on the white, downward facing arrow in the Navigator pane. Select Filters… and uncheck .* resources. Make sure /bin/ and /target/ are in the .gitignore file.
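
If you prefer to script this step, the following sketch appends /bin/ and /target/ to a project’s .gitignore only when they are missing. It is an illustration rather than anything EGit does for you: the function name is made up, and the project path is whatever folder holds your Eclipse project.

```python
import os


def ensure_gitignore_entries(project_dir, entries=("/bin/", "/target/")):
    """Append missing entries to project_dir/.gitignore; return those added."""
    path = os.path.join(project_dir, ".gitignore")
    content = ""
    if os.path.exists(path):
        with open(path) as f:
            content = f.read()
    existing = {line.strip() for line in content.splitlines()}
    added = [entry for entry in entries if entry not in existing]
    if added:
        with open(path, "a") as f:
            # Start on a fresh line if the file does not end with a newline.
            if content and not content.endswith("\n"):
                f.write("\n")
            f.write("\n".join(added) + "\n")
    return added
```

Calling ensure_gitignore_entries with your project folder a second time returns an empty list, since both entries are already present.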

You’re now ready to commit your project. Right click your project, select Team > Commit. Put in a comment about the commit and select ‘Commit and Push’.

Check your repository on Github. It should now contain your project.

You have successfully connected Eclipse with your Github account. If you’d like to learn more about git and using Github, refer to the references linked to at the start of this section.

Setting up a shared folder between your computer and the Hortonworks sandbox

Having a shared folder between your computer and the Hortonworks sandbox can help speed up your development process. To set one up using VMware Fusion, follow these steps:

  1. Pause your sandbox if it is already running
  2. Navigate to the sandbox settings > sharing
  3. Click the + icon and select a folder to share between your computer and your sandbox
  4. You’re done! Your shared folder will be accessible in your sandbox at /mnt/hgfs/hdp_shared_folder/ where hdp_shared_folder is the name of your shared folder

To share a folder in VirtualBox the process is similar:

  1. Pause your sandbox if it is already running. You might have to power it off.
  2. Navigate to Settings > Shared Folders
  3. Click the + icon and select a folder to share between your computer and your sandbox
  4. Select the auto-mount option
  5. You’re done! Your shared folder will be accessible in your sandbox at /media/sf_hdp_shared_folder where hdp_shared_folder is the name of your shared folder

To copy a file out of your shared folder into the directory you’re currently working in, use the command below (make sure you include the dot at the end). On VirtualBox, use the /media/sf_hdp_shared_folder path instead:

cp /mnt/hgfs/hdp_shared_folder/filename.txt .

Coding Your MapReduce Program

Now that your development environment is set up, it’s time to start developing. A great way to get started is to do the classic word count example. There are plenty of tutorials out there to guide you through that though, so I’m going to walk you through a new kind of MapReduce program. This one will use one MapReduce job to count the occurrences of each individual character in a text file, followed by another MapReduce job to sort the output by value in descending order of occurrences.
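
Before looking at the Java source, it can help to see the two-job idea in a few lines of plain Python. The sketch below simulates the flow in memory and is not the author’s code: the function names are made up, it assumes every character (including spaces) is counted, and it breaks ties alphabetically, which the real job may not do.

```python
from collections import defaultdict


def map_chars(line):
    """Job 1 mapper: emit a (character, 1) pair for every character."""
    return [(ch, 1) for ch in line]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_sum(groups):
    """Job 1 reducer: sum the counts collected for each character."""
    return {key: sum(values) for key, values in groups.items()}


def sort_by_count(counts):
    """Job 2: order (character, count) pairs by count, highest first."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


def char_count(lines):
    """Chain the two jobs: count characters, then sort by occurrences."""
    pairs = [pair for line in lines for pair in map_chars(line)]
    return sort_by_count(reduce_sum(shuffle(pairs)))


print(char_count(["abca", "ab"]))  # [('a', 3), ('b', 2), ('c', 1)]
```

In the real program each of these steps runs as a distributed Hadoop job, with the first job’s output written to HDFS and read back by the second.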

To get started, go check out the source code, located here.

Next, either Fork the repository (which essentially creates a copy of the project under your Github account so you can modify it as you please) or simply copy and paste the source code into new, properly named class files in Eclipse.

To run the code you’ll need to export it as a runnable jar. Right click, Export > Java > Runnable Jar. Make sure JobChainer is set as the launch configuration, and export the file.

If you are unable to select JobChainer as the launch configuration, you will have to change the .classpath in your project. To do that, you must first be able to see the .classpath file in Eclipse.

Click on the white down arrow in the Package Explorer, then Filters. After that, make sure .* resources is unchecked:

Filters

Enable viewing of dot files in Eclipse

Classpath file

Click okay. You should now be able to view the .classpath file in your project.

In the classpath file, look for the two Hadoop dependencies which have a reference to:

/Users/ryanwolniak/.m2/repository/org/apache/hadoop/...

Classpath dependencies

Change those references to point to the corresponding Hadoop files on your system.

When you are done, you should be able to successfully export JobChainer as a runnable jar. If you are still having issues, try running the JobChainer class in Eclipse. It should generate the necessary launch configuration for you.

Next, upload the jar to your sandbox and run it using:

hadoop jar CharCount.jar JobChainer /test/input.txt /test/output/

Notes

  • You will have to create the input directory and put an input file in there
  • You need to choose a new output folder each time you run the job
  • You may have to change the file permissions of the CharCount.jar file
    • chmod 777 CharCount.jar
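
Since each run needs a new output folder, one option is to add a timestamp to the folder name. The helper below is a small sketch of that idea: the function name and the timestamp format are my own choices, while the jar name, class name, and paths are the ones used in this guide.

```python
from datetime import datetime


def build_run_command(input_path="/test/input.txt", output_root="/test/output",
                      now=None, jar="CharCount.jar", main_class="JobChainer"):
    """Return the hadoop command as a list, with a timestamped output folder."""
    now = now or datetime.now()
    output_dir = "{}-{}/".format(output_root.rstrip("/"),
                                 now.strftime("%Y%m%d-%H%M%S"))
    return ["hadoop", "jar", jar, main_class, input_path, output_dir]


# Print the command so it can be pasted into the sandbox terminal.
print(" ".join(build_run_command()))
```

If you use a timestamped folder, remember to substitute that folder name for /test/output/ when you list or read the results.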

Once the MapReduce job finishes, check that there is an output file:

hadoop fs -ls /test/output/

Then take a look inside the output file for your results:

hadoop fs -cat /test/output/part-r-00000

That’s it for the quick-start guide. If you have any questions, feel free to send me an email or leave a comment.

A Beginner’s Guide to Generating SSH Keys for Github

Hard at work

A mouse sure comes in handy when using bash

I’m currently in the process of setting up an account on a RHEL (Red Hat Enterprise Linux) server (accessed via PuTTY on a Windows machine) to communicate with GitHub. There’s a great guide for generating the necessary SSH keys over at GitHub. If you know your way around Linux, give it a look. It should be pretty straightforward to follow. If Linux is a bit new to you though, you may run into some questions and issues with the guide. This post will address two common issues with the guide.

Problem 1: Could not open a connection to your authentication agent

  • This error comes about as a result of attempting to run ‘ssh-add ~/.ssh/id_rsa’
    • To fix: run ‘eval $(ssh-agent)’ to start the agent and load its settings into your current shell. Then run ‘ssh-add ~/.ssh/id_rsa’ again and you’re good to go.

Problem 2: clip: command not found

  • Less of a problem than an inconvenience. To work around not being able to run ‘clip < ~/.ssh/id_rsa.pub’, you can instead simply run ‘cat ~/.ssh/id_rsa.pub’. Highlight the output, which in PuTTY will copy it to your clipboard. Now paste it into Github where the tutorial tells you to paste it. You may have to remove some white space, but once you do you’re set.

Hope these tips help!