Friday, January 30, 2015

Installing SpatialHadoop on an existing Hadoop cluster

I occasionally get questions about how to install SpatialHadoop on an existing cluster that already runs Hadoop, so I decided to write this blog post to answer them.
In this post, I'll describe two techniques to install SpatialHadoop on an existing cluster. The first technique requires administrator access to Hadoop, though not necessarily to the whole system. The second technique is less efficient but works even if you cannot manage or restart the cluster.

The first technique

In this technique, all you need to do is extract the SpatialHadoop binaries on every node in your cluster. This technique has only been tested with Hadoop 1.x, but it should also work with Hadoop 2.x, at least in concept. The layout of the SpatialHadoop binary archive matches that of an Apache Hadoop 1.x installation, so extracting it essentially places the required libraries in the lib folder. Once the required libraries are in place on all machines, you need to restart the cluster to ensure that the libraries are loaded. After that, your cluster is ready to use.
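The steps above can be sketched as a small shell script. This is only a rough sketch under several assumptions that are not from the post: the archive name (spatialhadoop-bin.tar.gz) is hypothetical, the archive is assumed to be a gzipped tarball that can be extracted directly over the Hadoop directory, Hadoop is assumed to live at the same HADOOP_HOME path on every node, and passwordless ssh is assumed. By default it only prints the commands it would run.

```shell
# Sketch: copy a SpatialHadoop binary archive to every node, extract it over
# the Hadoop installation, then restart a Hadoop 1.x cluster.
# ARCHIVE and HADOOP_HOME defaults are assumptions, not values from the post.
ARCHIVE="${ARCHIVE:-spatialhadoop-bin.tar.gz}"   # hypothetical archive file name
HADOOP_HOME="${HADOOP_HOME:-/opt/hadoop}"        # assumed path, same on all nodes
DRY_RUN="${DRY_RUN:-1}"                          # 1 = only print the commands

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Copy the archive to each host given as an argument and extract it there.
install_on_nodes() {
  local host
  for host in "$@"; do
    run scp "$ARCHIVE" "$host:/tmp/"
    run ssh "$host" "tar -xzf /tmp/$(basename "$ARCHIVE") -C $HADOOP_HOME"
  done
}

# Restart all Hadoop 1.x daemons so the new libraries are loaded.
restart_cluster() {
  run "$HADOOP_HOME/bin/stop-all.sh"
  run "$HADOOP_HOME/bin/start-all.sh"
}
```

On a typical Hadoop 1.x setup you could call install_on_nodes with the hosts listed in conf/slaves plus the master, then call restart_cluster, and set DRY_RUN=0 once the printed commands look right.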

Hadoop 2.x

Although not officially supported, you can use the same technique to install SpatialHadoop on Apache Hadoop 2.x. To do that, you first need to grab the source code of SpatialHadoop and build the binary package, then you can install it in your Hadoop distribution.
To grab the latest source code and build the package:
git clone https://github.com/aseldawy/spatialhadoop2.git
cd spatialhadoop2
ant dist2
The created package can then be installed on an Apache Hadoop 2.x cluster in the same way.

The second technique

In this technique, we assume that you don't have administrator access to the cluster, so you can't install the libraries on the Hadoop nodes or restart the cluster. Therefore, we compile the SpatialHadoop libraries along with all required libraries into one jar, which you can run using the 'hadoop jar' command.
To create that jar, you need to grab the latest source code from GitHub and then build the jar using the ant command.
git clone https://github.com/aseldawy/spatialhadoop2.git
cd spatialhadoop2
ant emr-jar1
Once you create the jar file, you can run it using the command hadoop jar.
Similarly, if you're going to run the created jar on Hadoop 2.x, you should use the ant target emr-jar2 instead of emr-jar1.
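If you build on a machine that has the cluster's Hadoop client installed, the choice between the two targets can be scripted. The following is a minimal sketch, assuming that the first line printed by 'hadoop version' has the form "Hadoop X.Y.Z". The helper name emr_target_for is made up for this example; only the two target names come from the post.

```shell
# Map a "hadoop version" banner line to the ant target described above.
# emr_target_for is a hypothetical helper name, not part of SpatialHadoop.
emr_target_for() {
  # $1 is a line such as "Hadoop 2.7.3"; extract the major version number.
  local major
  major="$(printf '%s\n' "$1" | sed -n 's/^Hadoop \([0-9][0-9]*\)\..*/\1/p')"
  case "$major" in
    1) echo "emr-jar1" ;;
    2) echo "emr-jar2" ;;
    *) echo "unsupported Hadoop version: $1" >&2; return 1 ;;
  esac
}
```

Inside the spatialhadoop2 directory, you could then run ant "$(emr_target_for "$(hadoop version | head -n 1)")" to build the matching jar.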