MapReduce & Eclipse: a Quick Start Guide for Java Developers

Want to start developing MapReduce programs in Java using Eclipse? This guide will get you up to speed.

It will walk you through setting up Maven (a great build manager) with Eclipse, setting up GitHub (a great host for Git version control) with Eclipse, and setting up a shared folder between your computer and the Hortonworks sandbox. It concludes with an example MapReduce application written in Java for you to learn from.

Setting up Maven with Eclipse

When developing MapReduce programs in Eclipse, a lot of the code you’ll be using requires you to have certain .jar files on your system. One way to do this is to download the jar files yourself. A better way is to use Maven, which manages your project builds and downloads the jar files your project depends on.

Installing M2Eclipse

To use Maven with Eclipse, we’ll be using the plugin M2Eclipse. Open Eclipse and navigate to Help > Install New Software. Enter ‘http://download.eclipse.org/technology/m2e/releases’ in the form after Work with: and click Add. Once the download options appear, select Maven Integration for Eclipse.

Finish the installation and restart Eclipse.

Converting a project to a Maven project

To convert your project to a Maven project, right click on it and select Configure > Convert to Maven Project. The default settings it brings up should be fine. Click Finish.

You now have a Maven Project.

Adding dependencies to your Maven project

To add dependencies (jar files) to your Maven project, open the pom.xml. Click the Dependencies tab, then click Add. You’ll need to input the Group ID, Artifact ID, and Version for each dependency you add. For a MapReduce program on the Hortonworks sandbox running Hadoop version 2.7.1, I include hadoop-client 2.7.1 and commons-logging 1.1.1. The former has the Group ID org.apache.hadoop, the Artifact ID hadoop-client, and the version 2.7.1. The latter has the Group ID commons-logging, the Artifact ID commons-logging, and the version 1.1.1.

You could search the internet for the specific group, artifact, and versions you want, or you could connect Maven to a repository and do the search right from the Add popup.

Connecting to the Hortonworks repository

By now, you have M2Eclipse and a Maven project set up in Eclipse, but you still don’t have any searchable repositories. To get those, you’ll need to edit or create the settings.xml file in the .m2 folder (${HOME}/.m2/). This folder might be hidden on your computer.

Creating or Editing your settings.xml file

If you don’t have a settings.xml file, you’ll need to create it. Using your favorite text editor (Sublime Text is a nice one) paste the code from here. If you already have a settings.xml file, simply add the standard-extra-repos profile. Save your new settings as settings.xml in the .m2 folder (${HOME}/.m2/).

Now that you have a settings.xml file, go back into Eclipse. Select Eclipse > Preferences > Maven > User Settings. Update your settings so that it points to your new settings.xml file. Click Update Settings and Apply. Restart Eclipse and you should now be able to search for Maven dependencies.

This concludes the section on setting up Maven. If you run into any issues, a Google search will be your friend; plenty of people have probably run into the same bug before. If you’re still stuck, feel free to email me.

Connecting Eclipse with your GitHub account

GitHub, if you haven’t heard of it, is a popular hosting service for projects that use the Git version control system. To learn more about it, check out the short explanation here, medium explanation here, and longer tutorials here. It’s a great way to collaborate with teammates in a distributed fashion.

Let’s get started. If you don’t already have an account at GitHub, create one.

Installing EGit and JGit

Have an account? Time to set up Eclipse to work seamlessly with GitHub. Open Eclipse and navigate to Help > Install New Software. Enter ‘http://download.eclipse.org/egit/updates’ in the form after Work with: and click Add. Once the download options appear, select Eclipse Team Provider and JGit.

Egit and Jgit Installation for Eclipse

Finish the installation and restart Eclipse.

Creating a repository

Now, before we can do anything in Eclipse, we need a repository to upload our project to. Go to GitHub in your browser, log in, and click the ‘+ New Repository’ button. Give it a name, set it to Public or Private, and initialize it with a README. You can edit the README if you like by clicking on it and hitting the ‘Edit this file’ button/pencil.

Next, copy the clone URL of your repository by clicking the copy to clipboard button right below where it says ‘HTTPS clone URL’. It’s on the right side of your repository in your browser, towards the bottom of the screen.

Cloning the repository in Eclipse

Head back to Eclipse and make sure the Git Repositories view is showing: Window > Show View > Other… > Git > Git Repositories. Click Clone a Git Repository in the new window that appeared. EGit should automatically fill out URI, Host, and Repository path. Fill out your username and password in the Authentication section. Store them in the secure store if you don’t want to keep typing them in. Now, select your master branch, hit Next, pick where you want to store it, and hit Finish.

Sharing your project with Git

The next step is sharing your project with Git, which connects the Eclipse project to your repository. To do this, right click on your project and navigate to Team > Share Project. Select the repository you just created and hit Finish.

We’re not done yet. Right click on your project and hit ‘Team > Add to Index’. This will allow you to start tracking changes. Next, create a .gitignore file so that your bin folder won’t be tracked by Git (tracking the bin folder leads to file conflicts).

Egit - adding .gitignore file

Source: http://wiki.eclipse.org/EGit/User_Guide#Getting_Started

If a .gitignore already exists but you cannot see it in your project files, try clicking on the white, downward facing arrow in the Navigator pane. Select Filters… and uncheck .* resources. Make sure /bin/ and /target/ are in the .gitignore file.

You’re now ready to commit your project. Right click your project, select Team > Commit. Put in a comment about the commit and select ‘Commit and Push’.

Check your repository on GitHub. It should now contain your project.

You have successfully connected Eclipse with your GitHub account. If you’d like to learn more about Git and using GitHub, refer to the references linked to at the start of this section.

Setting up a shared folder between your computer and the Hortonworks sandbox

Having a shared folder between your computer and the Hortonworks sandbox can help speed up your development process. To set one up using VMware Fusion, follow these steps:

  1. Pause your sandbox if it is already running
  2. Navigate to the sandbox settings > sharing
  3. Click the + icon and select a folder to share between your computer and your sandbox
  4. You’re done! Your shared folder will be accessible in your sandbox at /mnt/hgfs/hdp_shared_folder/ where hdp_shared_folder is the name of your shared folder

To copy a file out of your shared folder into your current working directory, use the following command (make sure you include the dot at the end):

cp /mnt/hgfs/hdp_shared_folder/filename.txt .

Coding Your MapReduce Program

Now that your development environment is set up, it’s time to start developing. A great way to get started is to do the classic word count example. There are plenty of tutorials out there to guide you through that though, so I’m going to walk you through a different kind of MapReduce program. This one will use one MapReduce job to count the occurrences of each character in a text file, followed by another MapReduce job to sort the output by value in descending order of occurrences.
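If it helps to see the idea before reading the real source, here is a minimal plain-Java sketch of the same two steps. It is not the code from the repository and it does not use any Hadoop classes; the names CharCountSketch, countChars, and sortByCountDescending are made up for illustration. The first method plays the role of the counting job and the second plays the role of the sorting job.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory sketch of the two chained jobs; hypothetical names, no Hadoop classes.
class CharCountSketch {

    // "Job 1": treat every character as a (char, 1) pair and sum the ones per character.
    static Map<Character, Integer> countChars(String text) {
        Map<Character, Integer> counts = new TreeMap<>();
        for (char c : text.toCharArray()) {
            counts.merge(c, 1, Integer::sum);
        }
        return counts;
    }

    // "Job 2": reorder the (char, count) pairs by count, highest first.
    static List<Map.Entry<Character, Integer>> sortByCountDescending(Map<Character, Integer> counts) {
        List<Map.Entry<Character, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort(Map.Entry.<Character, Integer>comparingByValue(Comparator.reverseOrder()));
        return entries;
    }

    public static void main(String[] args) {
        Map<Character, Integer> counts = countChars("hello");
        // Print one "character<TAB>count" pair per line, most frequent first.
        for (Map.Entry<Character, Integer> entry : sortByCountDescending(counts)) {
            System.out.println(entry.getKey() + "\t" + entry.getValue());
        }
    }
}
```

In a Hadoop version of this, the counting would be split between a Mapper and a Reducer and the reordering would be done by the second job; the sketch only mirrors that logic in memory.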

To get started, go check out the source code, located here.

Next, either fork the repository (which essentially creates a copy of the project under your GitHub account so you can modify it as you please) or simply copy and paste the source code into new, properly named class files in Eclipse.

To run the code you’ll need to export it as a runnable jar. Right click the project and select Export > Java > Runnable JAR file. Make sure JobChainer is set as the launch configuration, and export the file.

Upload the jar to your sandbox and run it using:

hadoop jar CharCount.jar JobChainer /test/input.txt /test/output/

Notes

  • You will have to create the input directory and put an input file in there
  • You need to choose a new output folder each time you run the job, because Hadoop refuses to write to an output directory that already exists
  • You may have to change the file permissions of the CharCount.jar file
    • chmod 777 CharCount.jar

Once the MapReduce job finishes, check that there is an output file:

hadoop fs -ls /test/output/

Then take a look inside the output file for your results:

hadoop fs -cat /test/output/part-r-00000

That’s it for the quick-start guide. If you have any questions, feel free to send me an email or leave a comment.

Getting Started With Apache Giraph on CDH 5.1.2

Wondering how to set up a working version of Apache Giraph on CDH 5.1.2? This guide will get you started.

Building Giraph

1. Clone Giraph from GitHub: ‘git clone https://github.com/apache/giraph.git’

2. Modify the hadoop_2 profile in the pom.xml contained in the giraph folder you just cloned

  • Change the hadoop.version to read ‘2.3.0-cdh5.1.2’

3. Compile, package and install Giraph: ‘mvn -Phadoop_2 -fae -DskipTests clean install’

  • Giraph is now located in giraph/giraph-core/target

Running an Example

Now that you’ve built Giraph, it’s time to run an example.

Create a simple graph text file to use as input. For example:

[0,0,[[1,1],[3,3]]]
[1,0,[[0,1],[2,2],[3,1]]]
[2,0,[[1,2],[4,4]]]
[3,0,[[0,3],[1,1],[4,4]]]
[4,0,[[3,4],[2,4]]]
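Each line describes one vertex as [vertex ID, vertex value, [[target vertex ID, edge weight], ...]], which is the layout read by the JsonLongDoubleFloatDoubleVertexInputFormat used later in this post. If you’d rather generate a larger input file than type it by hand, a small Java helper along these lines could do it. This is only a sketch: the GiraphJsonLine class and the vertexLine method are hypothetical names, not part of Giraph, and whole-number values are assumed to match the example above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper, not part of Giraph: formats one vertex per line.
class GiraphJsonLine {

    // Builds a line of the form [vertexId,vertexValue,[[targetId,edgeWeight],...]].
    static String vertexLine(long id, long value, Map<Long, Long> edges) {
        StringJoiner edgeList = new StringJoiner(",", "[", "]");
        for (Map.Entry<Long, Long> edge : edges.entrySet()) {
            edgeList.add("[" + edge.getKey() + "," + edge.getValue() + "]");
        }
        return "[" + id + "," + value + "," + edgeList + "]";
    }

    public static void main(String[] args) {
        // Vertex 0 with edges to vertex 1 (weight 1) and vertex 3 (weight 3).
        Map<Long, Long> edges = new LinkedHashMap<>();
        edges.put(1L, 1L);
        edges.put(3L, 3L);
        System.out.println(vertexLine(0L, 0L, edges)); // prints [0,0,[[1,1],[3,3]]]
    }
}
```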

I called the graph tiny_graph.txt. Next, create a shell script to take care of running the example:

#remove everything from the folder called giraph/output in the hadoop file system
hadoop fs -rm -r giraph/output/*
#remove all text files from the giraph/input folder
hadoop fs -rm giraph/input/*.txt
#put the 2nd argument to this script (located at /path/) into the hdfs folder giraph/input
hadoop fs -put /path/$2 giraph/input/
#change path and ClusterURL:Port as necessary. $1 = name of example to run. $3 = num workers
hadoop jar /path/giraph/giraph-examples/target/giraph-examples-1.1.0-SNAPSHOT-for-hadoop-2.3.0-cdh5.1.2-jar-with-dependencies.jar org.apache.giraph.GiraphRunner -D mapred.child.java.opts="-Xms10240m -Xmx15360m" -D mapred.job.tracker="ClusterURL:Port" -D giraph.zkList="ClusterURL:Port" org.apache.giraph.examples.$1 -vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat -vip giraph/input/$2 -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat -op giraph/output/lcc -w $3 -ca giraph.SplitMasterWorker=false
rm -f part-m-00001

That’s it! You should now be able to run it with ‘sh nameOfScript.sh ExampleName InputFileName.txt NumWorkers’. Thank you to Abdul Quamar, who wrote the shell script mine is based on.

A Beginner’s Guide to Generating SSH Keys for GitHub

Hard at work

A mouse sure comes in handy when using bash

I’m currently in the process of setting up an account on a RHEL (Red Hat Enterprise Linux) server (accessed via PuTTY on a Windows machine) to communicate with GitHub. There’s a great guide for generating the necessary SSH keys over at GitHub. If you know your way around Linux, give it a look. It should be pretty straightforward to follow. If Linux is a bit new to you though, you may run into some questions and issues with the guide. This post covers two common issues with the guide.

Problem 1: Could not open a connection to your authentication agent

  • This error comes about as a result of attempting to run ‘ssh-add ~/.ssh/id_rsa’
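    • To fix: run ‘eval $(ssh-agent)’, which starts the agent and sets its environment variables in your current shell (running ‘ssh-agent’ on its own only prints them). Then run ‘ssh-add’ and you’re good to go.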

Problem 2: clip: command not found

  • Less of a problem than an inconvenience. To work around not being able to run ‘clip < ~/.ssh/id_rsa.pub’, you can instead simply run ‘cat ~/.ssh/id_rsa.pub’. Highlight the output; in PuTTY, selecting text copies it to your clipboard. Now paste into GitHub where the tutorial tells you to paste it. You may have to remove some white space, but once you do you’re set.

Hope these tips help!

Interview with Thomson Nguyen

Last month I had the chance to talk to Thomson Nguyen, co-founder & CEO at Framed Data, about what got him to where he is today, startups, big data, traveling, and more. Here’s what I learned:

Thomson Nguyen

About Thomson

Thomson spent his undergraduate years at Berkeley. He started as a Bioengineering major before moving on to a Pure Mathematics major with an English minor. Or unemployable math, as he lightly calls it. This consisted of a good deal of logic, set theory, and other pure math concepts. While his classes might not have been geared towards solving real-world problems, his formative years built a foundation for his later work with computers. Before he got to that point though, he went off to Cambridge for a Master’s degree in computational biology.

Travel (and how to pay for it)

While at Cambridge, Thomson knew that he wanted to travel and explore Europe. One problem though: travel gets expensive.

To pay for his adventures, Thomson started setting up fish and chips websites for local sellers. These owners were often unfamiliar with the internet, but wanted to have a presence on Google Maps, Yelp, and the web.

How do you get started when you don’t have any existing work to show clients?

Thomson made his first website for a fish and chips owner for free. He wowed the owner, and from there his business started taking off. Able to show off his previous work to other fish and chips sellers, he could justify charging for the sites. In 2 months of contract work, he made enough money to travel around Europe for 6 months. As a current college student myself, I’m going to be taking this advice and laterally shifting it to the photography business to fund my upcoming trip abroad. More on that in another blog post though.

Data Science

After graduate school, Thomson went to New York to work at a hedge fund over the summer. The path didn’t seem right though, so when it came time to find a job Thomson went to NYU to research machine learning. There, he learned to derive insights from data via statistical methods. In mid-2011, he went to work for Lookout, a security company which helps keep users’ mobile phones safe and private and also offers solutions for government and enterprise. When Thomson was at Lookout, the idea was to look at Android app data and figure out which apps were malicious. With Android’s rich ontology, apps are already helpfully grouped into specific subcategories. For example, say the average flashlight application is 70 kb and needs 0 permissions on average. If an app in that category is significantly larger than that, or asks for more permissions, given a specific distribution curve, then it should be flagged for review.
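To make the flashlight example concrete, here is a rough Java sketch of that kind of check. It is only an illustration, not Lookout’s actual method: the z-score rule, the threshold of 3, the sample sizes, and all names (AppOutlierSketch, isSuspicious) are assumptions.

```java
// Illustrative only: a simple z-score check, not Lookout's real detection logic.
class AppOutlierSketch {

    static double mean(double[] values) {
        double sum = 0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    // Population standard deviation of the category's app sizes.
    static double stdDev(double[] values) {
        double m = mean(values);
        double squared = 0;
        for (double v : values) {
            squared += (v - m) * (v - m);
        }
        return Math.sqrt(squared / values.length);
    }

    // Flags an app whose size is far above its category's average,
    // or which requests more permissions than is typical for the category.
    static boolean isSuspicious(double sizeKb, int permissions,
                                double[] categorySizesKb, int typicalPermissions,
                                double zThreshold) {
        double m = mean(categorySizesKb);
        double sd = stdDev(categorySizesKb);
        boolean tooLarge = sd == 0 ? sizeKb > m : (sizeKb - m) / sd > zThreshold;
        boolean tooManyPermissions = permissions > typicalPermissions;
        return tooLarge || tooManyPermissions;
    }

    public static void main(String[] args) {
        double[] flashlightSizesKb = {60, 65, 70, 75, 80}; // made-up sample averaging 70 kb
        System.out.println(isSuspicious(500, 0, flashlightSizesKb, 0, 3.0)); // far larger than its peers
        System.out.println(isSuspicious(72, 0, flashlightSizesKb, 0, 3.0));  // close to the average
    }
}
```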

Thomson enjoyed his time at Lookout, but an opportunity came to take a job at Causes that he couldn’t resist. Here, he was a data scientist lead. His day-to-day experiences here consisted of one-on-one meetings with people, product management, and coding. In this exercise in management, he learned how to scale up a team and products. This is the sort of thing that academia doesn’t teach.

Framed Data

After a year and a half, Thomson left Causes to start working on his own company: Framed Data. Framed Data is a predictive analytics application that takes in user data and predicts when they’re going to leave your application, and why they’re leaving. It helps users figure out how they can improve their application, as well as knowing when to reach out to high-risk users.

The idea wasn’t always this polished though. The original idea for the company was to take data scientists’ models and productionize them. The concept was good, and it got Framed Data into Y Combinator Winter 2014. From there, Framed Data pivoted to provide a marketing automation service which helps applications retain their users.

Since getting into Y Combinator, Thomson mentioned his job has transitioned from coding to hiring. Now much of his day-to-day activities are non-technical, and mainly involve sales, business development, customer support, and management.

Data Science Tips and Advice

With his experience in data science, Thomson expanded upon a number of different ways for someone interested in data science to get more involved. One aspect he mentioned was General Assembly courses, which teach you how to become a data scientist over the course of 12 weeks. These types of courses tend to be better suited for non-technical people, though technical people can still pick up a lot from them too. In the course Thomson taught, 95% of the participants graduated from the course with jobs.

Another option for learning more about data science is Kaggle. Here, you compete with teams from around the world on one of the many different data science problems available. It’s a great way to build your data science portfolio on GitHub. Side tip: your portfolio should consist of code AND plain English talking about the different trends you found. One of the benefits (and also perhaps a drawback) of Kaggle is that it’s graded on a quantitative scale. You are optimizing for a predetermined number. This makes it great for competitions, but in the real world things aren’t always so objective.

Closing Thoughts

Thomson ended our discussion with some closing thoughts on start-ups, career paths, and locations.

Big Companies vs Startups

In his opinion, big companies can destroy creativity. A six-figure salary for a young person starting at their first job makes life too comfortable. Once you’re being compensated at that level, you’re not going to want to take the risk to start your own company.

As an employee at a startup, you’re going to have high visibility to everyone else. You’ll be able to interact with the investors, VPs, CEOs, etc. on a much more regular basis than you would at a big company like Google. Lastly, if you want to work at a startup, consider moving out west. It’s a great place to live, and Framed Data is hiring.

 

HackTech Highlights

In my last post, I talked a little bit about my travels in Venice and Santa Monica. Today, I’ll be talking about my experiences at HackTech, the hackathon put on by the students of Caltech.

Let the Hacking Begin

Around 7 on Friday night, 1/24/14, we arrived at the venue we’d be hacking in: a convention center in the middle of Santa Monica Place. I couldn’t have asked for a better location.

Our (pretty awesome) team, consisting of Britt, Craig, Kunal, and me, grabbed a table with other UMD hackers and started setting up our development environment. I updated to a fresh version of Ubuntu and installed Sublime Text. With some help from Craig and Kunal, and some command line magic, I was ready to start hacking.

Our idea, which didn’t have a name at this point, was to build a web application using eBay’s API which could find the average price of items. The user would enter a search term, select the category, and our application would find the average price for said item. This could be used to allow a buyer to find out how much he should be paying for his item, or for a seller to figure out how much to sell the item for. We didn’t want to stop there though, so we decided to build in another feature which would search for items that were under-priced and allow buyers to find great deals on eBay.

Kunal and I would familiarize ourselves with the eBay API and build the nitty-gritty of the back-end, while Craig and Britt would work on the gorgeous front-end. We chose Python/Django for our back-end and were excited to start. But first, I needed to familiarize myself with Python/Django.

After some quick tutorials in Python I was ready to begin. Having never used Python before, I was glad I had Craig and Kunal around to offer advice. One of the things I love about hackathons is how much learning can take place in such a short period of time. I went from knowing nothing about Python to coding the piece of our backend used to find underpriced items. It was a great learning experience, about both Python and eBay’s API (which was very easy to use, thanks eBay!).

While Kunal and I worked on the back-end, Craig and Britt were busy making an amazing front-end. Craig’s skills in both back-end development and front-end development proved crucial in making our application a success.

On Saturday, a name was decided upon for our hack: Dat Price. Special thank you to Namecheap.com for the free domain name.

Speaking of successes, we presented our application to the judges on Sunday and won eBay’s prize! It was an exciting moment, and I’m honored we were selected to win. Thanks eBay!

Alexis Ohanian and me

Me and a really chill dude

Other notable events

Alexis Ohanian showed up

I got to meet him! He could only stay for a little while, but he was really excited about the hacks going on and took the time to take pictures with everyone (or at least as many people as he could) before his manager told him he had to get going.

Free In-N-Out Burger was given out

It was delicious.

A lot of great tech companies and start-ups were there

Some of my favorites included Pebble, Whisper, Firebase, Fitbit, Namecheap, Dropbox, Mitek, Lob, and eBay, Inc.

Acknowledgements

I’d like to thank all the organizers of HackTech for putting on a great hackathon, and the sponsors for providing the funding to make it happen.

I’d also like to give a shout out to my team. I enjoyed hacking with all of you and would gladly do so again.