Saturday, February 20, 2016

HadoopViz: Extensible Visualization of Big Spatial Data

With huge spatial datasets, a common functionality that users look for is visualizing the data to see what it looks like. This lets users quickly explore new datasets, even very large ones. For example, the video below summarizes 1 trillion points that represent the temperature of every 1 km² of the earth's surface on every day from 2009 to 2014 (a total of six years).

This video consists of 72 frames, one per month, put together in sequence. While one can use a single machine to produce these 72 images, it might take up to 60 hours due to the huge size of the input.
In this blog post, we describe how to use HadoopViz, an extensible visualization framework based on SpatialHadoop, to visualize the same dataset in just three hours using a cluster of 10 machines.
Besides single-level images, which are typically of low resolution, HadoopViz can also produce multilevel images in which users can interactively zoom in and out to explore huge datasets in great detail. For example, the image below is a visualization of a 92GB dataset that represents all the objects extracted from the OpenStreetMap dataset. You can pan and zoom in this image to view more details about a specific area.

Overview

In a nutshell, HadoopViz uses the parallelization power of MapReduce along with the efficiency of SpatialHadoop to partition the data into smaller parts, visualize each part separately into a smaller image, and then put these partial images together to produce the final image. HadoopViz builds on this idea and provides four key features that make it easy to use and very efficient.
  1. HadoopViz piggybacks data smoothing with visualization allowing it to smooth the data on-the-fly as the image is generated.
  2. HadoopViz automatically decides the best way to partition the data allowing it to scale to generate both small and large images efficiently.
  3. HadoopViz can also generate multilevel images where users can freely pan and zoom to interactively explore a huge dataset.
  4. Instead of customizing the algorithm for a specific use case, e.g., satellite data, HadoopViz provides an extensible implementation that can support a wide range of visualization types.
Below, we first describe how to generate the visualizations shown above using HadoopViz, which ships with recent versions of SpatialHadoop. Then, we describe some technical details about the smoothing, partitioning, and extensibility features.

How to ...

... generate the temperature video

  1. Download and set up the most recent version of SpatialHadoop, which ships with HadoopViz as its visualization package. Check this page for more details about setting up the most recent version of SpatialHadoop on both Hadoop 1.x and Hadoop 2.x.
  2. Download the temperature dataset you would like to visualize. The temperature dataset we used can be obtained from LP DAAC archive on this link.
    You can use this ruby script to download all the data for the six years if you have a good internet connection and enough storage on your machine. Run it using the following command:
    ruby hdf_downloader.rb http://e4ftl01.cr.usgs.gov/MOLA/MYD11A1.005/ time:2009.01.01..2014.12.31
  3. Once you have all the data, upload it to HDFS using the 'copyFromLocal' command. Let's assume the data is available at hdfs://user/hadoop/temperature
  4. To visualize the 72 frames, run the following SpatialHadoop command
    shadoop multihdfplot hdfs://user/hadoop/temperature combine:31 dataset:LST_Day_1km hdfs://user/hadoop/frames/ time:2009.01.01..2014.12.31
  5. The frames will be available in the output path hdfs://user/hadoop/frames. Download them using the 'copyToLocal' command.
  6. Now, upload the frames to YouTube which will put them together into a video similar to the one shown above.

... generate the multilevel image

  1. Follow step 1 above to download and install SpatialHadoop, if you haven't done so already.
  2. Download the 'All objects' dataset at the following link
    http://spatialhadoop.cs.umn.edu/datasets.html#osm2
  3. Upload the file to HDFS using the 'copyFromLocal' command. Let's assume it is uploaded to hdfs://user/hadoop/objects/
    NB: You don't have to decompress the file as SpatialHadoop can decompress it on the fly while visualizing. However, if you upload the compressed file, you need to keep the .bz2 extension to tell SpatialHadoop it is compressed.
  4. To generate a multilevel image with 11 levels similar to the one shown above, type the following command
    shadoop gplot hdfs://user/hadoop/objects -pyramid levels:11 hdfs://user/hadoop/multilevel shape:osm
  5. The generated image will be available at hdfs://user/hadoop/multilevel. Download it to your machine using the 'copyToLocal' command.
  6. To view the image in your browser, open the 'index.html' file available in the output directory.

Smoothing

In visualization, smoothing means fusing nearby records according to the visualization logic to produce a correct result. For example, satellite datasets typically contain holes caused by clouds that obstruct the view of the satellites. A smoothing function can fill these holes by estimating the missing values using simple interpolation techniques. The two figures below show an example of how the smoothing function can recover missing points.
Original data without smoothing
Data is smoothed using HadoopViz
HadoopViz supports on-the-fly smoothing of the data while the visualization is generated. This means that you can easily plug in a different smoothing function and regenerate the image without having to run a complex smoothing job as a separate step.
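
To make the idea of recovering holes concrete, here is a minimal sketch of one simple interpolation: each missing cell takes the mean of its valid neighbours. The class HoleFiller, the NaN encoding of missing values, and the 8-neighbour average are assumptions made for illustration; this is not the smoothing function HadoopViz ships.

```java
/** Hypothetical helper (not part of HadoopViz): fills holes in a small raster. */
final class HoleFiller {
    /**
     * Returns a copy of grid in which every NaN cell is replaced by the mean
     * of its non-NaN 8-neighbours. Cells with no valid neighbour stay NaN.
     */
    static double[][] fillHoles(double[][] grid) {
        int rows = grid.length;
        double[][] out = new double[rows][];
        for (int r = 0; r < rows; r++) {
            out[r] = grid[r].clone();
            for (int c = 0; c < grid[r].length; c++) {
                if (!Double.isNaN(grid[r][c])) continue; // value present, nothing to do
                double sum = 0;
                int count = 0;
                for (int dr = -1; dr <= 1; dr++) {
                    for (int dc = -1; dc <= 1; dc++) {
                        int nr = r + dr, nc = c + dc;
                        if (nr < 0 || nr >= rows || nc < 0 || nc >= grid[nr].length) continue;
                        double v = grid[nr][nc]; // always read the original values
                        if (!Double.isNaN(v)) { sum += v; count++; }
                    }
                }
                if (count > 0) out[r][c] = sum / count;
            }
        }
        return out;
    }
}
```

Note that the function only needs the records around a hole, which is why smoothing depends on nearby records being available together.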

Partitioning

HadoopViz supports two ways of partitioning the data which affect the way it merges intermediate partial images. It can use either the default HDFS partitioning or the spatial partitioning that ships with SpatialHadoop.

Default HDFS Partitioning

By default, when you upload a file to HDFS, it is partitioned into equal-sized chunks of 128MB each. Spatial locations of records are not taken into account, so nearby records can end up in different partitions. This means that every partition may cover the entire input space, and we end up overlaying intermediate images to produce the final image as shown below.
Overlay intermediate images
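
As a rough sketch of what overlaying means, assume each intermediate image is an array of ARGB pixels covering the whole output image, with alpha 0 marking an empty pixel. The class Overlay and this pixel encoding are assumptions for illustration, not HadoopViz code.

```java
/** Hypothetical helper: overlays one full-size partial image onto another. */
final class Overlay {
    /** Copies every non-transparent ARGB pixel of partial onto result, in place. */
    static void overlay(int[] result, int[] partial) {
        if (result.length != partial.length) {
            throw new IllegalArgumentException("images must have the same size");
        }
        for (int i = 0; i < result.length; i++) {
            int alpha = partial[i] >>> 24; // top 8 bits hold the alpha channel
            if (alpha != 0) result[i] = partial[i];
        }
    }
}
```

Every partial image is as large as the final image, so the cost of this step grows with both the image size and the number of partitions.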

Spatial Partitioning

If we use the spatial partitioning that ships with SpatialHadoop, each partition would only contain data from a small limited space and we will end up stitching intermediate images as shown below.
Stitch intermediate images
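
Stitching can be sketched in the same spirit: each intermediate image covers only its own region, so it is copied to its offset in the final image. The class Stitch and the row-major int[][] layout are assumptions for illustration, not HadoopViz code.

```java
/** Hypothetical helper: stitches a small partial image into the final image. */
final class Stitch {
    /** Copies part into image so that part[0][0] lands at image[top][left]. */
    static void stitch(int[][] image, int[][] part, int top, int left) {
        for (int r = 0; r < part.length; r++) {
            System.arraycopy(part[r], 0, image[top + r], left, part[r].length);
        }
    }
}
```

Each output pixel is written once, from a partial image that is only as large as its partition's region.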

Which partitioning technique is better?

While both techniques will end up producing the same final answer, the performance might be different. HadoopViz needs to automatically decide which one to use. First of all, if the data needs to be smoothed, then HadoopViz has to choose spatial partitioning as it is the only one that groups nearby records together in one partition before they can be fused.
If HadoopViz doesn't need to apply a smoothing function, then both techniques are applicable. Depending on the image size, there is a trade-off between the partitioning and merging steps. The default HDFS partitioning is faster than spatial partitioning, but the overlay process is more time-consuming than stitching because each intermediate image is as large as the final image. HadoopViz goes for spatial partitioning if the image size is huge, as the cost of the overlay process grows with the image size.
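
The decision described above can be summarized in a few lines. This is only a sketch of the rule as stated in this post: the names Partitioning and PartitioningChooser are hypothetical, and the pixel threshold is a placeholder parameter, since the post does not say how HadoopViz decides that an image is huge.

```java
/** The two single-level partitioning options discussed in the text. */
enum Partitioning { DEFAULT_HDFS, SPATIAL }

/** Hypothetical sketch of the decision rule; not the actual HadoopViz logic. */
final class PartitioningChooser {
    /**
     * @param needsSmoothing true if a smoothing function must be applied
     * @param imagePixels    width * height of the requested image
     * @param pixelThreshold assumed cut-off above which an image counts as huge
     */
    static Partitioning choose(boolean needsSmoothing, long imagePixels, long pixelThreshold) {
        // Smoothing needs nearby records in the same partition.
        if (needsSmoothing) return Partitioning.SPATIAL;
        // Otherwise, pay for spatial partitioning only when overlaying would be costly.
        return imagePixels > pixelThreshold ? Partitioning.SPATIAL : Partitioning.DEFAULT_HDFS;
    }
}
```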

Multilevel images

A multilevel image consists of a pyramid of fixed-size tiles, typically, each of size 256x256 pixels. The figure below shows an example of a three-level image with 1, 4, and 16 tiles in its three levels, aka, a pyramid of three levels.
A multilevel image of three levels
A naive way to generate a multilevel image is to generate each tile independently using the (single-level) techniques shown above. However, this would require executing the single-level algorithm millions of times. Therefore, HadoopViz provides specialized multilevel visualization algorithms that take the pyramid structure into consideration. Similar to single-level visualization, HadoopViz supports two partitioning techniques, namely, default HDFS partitioning and pyramid partitioning.
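
To see how quickly the number of tiles grows, the following sketch counts the tiles in a pyramid, assuming every level has four times as many tiles as the level above it (1, 4, 16, ... as in the three-level example). The class PyramidMath is hypothetical and only does the arithmetic.

```java
/** Hypothetical helper: tile counts for a pyramid that quadruples per level. */
final class PyramidMath {
    /** Number of tiles at a zero-based level: 4^level. */
    static long tilesAtLevel(int level) {
        return 1L << (2 * level);
    }

    /** Total number of tiles in a pyramid with the given number of levels. */
    static long totalTiles(int levels) {
        long total = 0;
        for (int z = 0; z < levels; z++) {
            total += tilesAtLevel(z);
        }
        return total;
    }
}
```

Under this assumption, totalTiles(3) is 21 and totalTiles(11) is 1,398,101, so an 11-level image already needs more than a million tiles.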

Default HDFS Partitioning

If we use default HDFS partitioning, each partition might contain records from all over the input space. In this case, each machine plots all these records to all overlapping tiles in all pyramid levels. The generated tiles are considered partial images as multiple partitions might overlap the same tile. Thus, a final merge step will need to overlay all intermediate partial images for the same tile to produce the final image for that tile.

Pyramid Partitioning

The other option for HadoopViz is to first repartition the data so that all records that overlap with one tile go to one partition. Then, these records are visualized to generate the final image for that tile. No merging is needed here as each tile is only generated by one machine.
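
The repartitioning step can be illustrated in memory with a small sketch that groups points by the tile they fall in at every level. It assumes points in the unit square and a tile ID of the form "level-column-row"; the class PyramidPartitioner and this ID format are hypothetical, and the real system performs this grouping as a distributed job rather than in a single map.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical in-memory analogue of pyramid partitioning for points. */
final class PyramidPartitioner {
    /** ID of the tile at the given level that contains (x, y), with x and y in [0, 1]. */
    static String tileId(int level, double x, double y) {
        int tilesPerSide = 1 << level;
        // Clamp so that a coordinate of exactly 1.0 falls in the last tile.
        int col = Math.min((int) (x * tilesPerSide), tilesPerSide - 1);
        int row = Math.min((int) (y * tilesPerSide), tilesPerSide - 1);
        return level + "-" + col + "-" + row;
    }

    /** Groups every point under the tile it overlaps at each of the given levels. */
    static Map<String, List<double[]>> partition(List<double[]> points, int levels) {
        Map<String, List<double[]>> tiles = new TreeMap<>();
        for (double[] p : points) {
            for (int z = 0; z < levels; z++) {
                tiles.computeIfAbsent(tileId(z, p[0], p[1]), k -> new ArrayList<>()).add(p);
            }
        }
        return tiles;
    }
}
```

Each key of the resulting map holds all the records needed for one tile, so every tile can be drawn by a single worker with no merge step.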

Which partitioning technique is better?

Again, there is no clear winner here. It all depends on how many tiles are generated. If only a few tiles are generated, then default HDFS partitioning is better as it only needs to merge a few images. However, if a huge number of tiles are generated, pyramid partitioning is better as it avoids altogether the need for merging intermediate tiles.
HadoopViz splits a huge pyramid into two parts, the top and the base of the pyramid. The top of the pyramid contains only a few tiles and is generated by the default HDFS partitioning technique, while the base contains a very large number of tiles and is generated by the pyramid partitioning technique. The tiles are then put together to produce the final image without any extra processing.

Extensibility

While the above techniques can be customized for every visualization type, it would require a huge coding effort to build and maintain all these implementations. Therefore, HadoopViz proposes a visualization abstraction that is used to describe the visualization logic. This abstraction is then plugged into generic implementations of the above algorithms to produce the image efficiently at scale. In short, if you would like to visualize your own data in a new way, all you need to do is write a small class that extends an abstract class, and you're ready to go with both single-level and multilevel visualization techniques.
A new visualization type is defined by extending the base class Plotter. There are five main functions to implement for a new visualization type.

<S extends Shape> Iterable<S> smooth(Iterable<S> r)

This function takes a set of nearby records, fuses them together, and returns a new set of records. This function can be used to apply a user-specified smoothing logic.

Canvas createCanvas(int width, int height, Rectangle mbr)

This function initializes an empty canvas with the given size in width and height. It also associates this canvas with the given MBR in input space. Notice that Canvas can be virtually anything. We provide a simple abstract Canvas class as a skeleton.

void plot(Canvas layer, Shape shape)

The plot function updates the canvas layer by plotting the given shape on it. Users can define their own visualization logic for one shape based on the format of the shape and the canvas layer.

void merge(Canvas finalLayer, Canvas intermediateLayer)

The merge function merges two intermediate canvases. It updates the finalLayer by merging the intermediateLayer into it.

void writeImage(Canvas layer, DataOutputStream out, boolean vflip)

The writeImage function encodes the canvas layer into a standard image that can be displayed to the end user. The image is written to the given DataOutputStream, which typically goes to an output file. If the vflip flag is set to true, the image should be vertically flipped before it is written to the output. The vflip flag is useful when the y-axis of the input points in a different direction than that of the final image. For example, in PNG images, the y-axis increases from top to bottom, while in geographical coordinates, latitude increases from bottom to top (south to north).
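
To give a feel for how the create, plot, merge, and write steps fit together, here is a self-contained sketch of a density (point count) visualization. It deliberately does not use the SpatialHadoop classes: CountCanvas and DensityPlotter are hypothetical stand-ins, the MBR is passed as four doubles, shapes are reduced to (x, y) points, and toImageRows replaces writeImage by returning pixel rows instead of encoding an image.

```java
/** Hypothetical stand-in for a canvas: one counter per pixel over an MBR. */
final class CountCanvas {
    final int width, height;
    final double x1, y1, x2, y2; // MBR of the canvas in input space
    final int[][] counts;        // counts[row][col]; row 0 holds the smallest y

    CountCanvas(int width, int height, double x1, double y1, double x2, double y2) {
        this.width = width;
        this.height = height;
        this.x1 = x1; this.y1 = y1; this.x2 = x2; this.y2 = y2;
        this.counts = new int[height][width];
    }
}

/** Hypothetical density plotter; it does not extend the real Plotter class. */
final class DensityPlotter {
    CountCanvas createCanvas(int width, int height, double x1, double y1, double x2, double y2) {
        return new CountCanvas(width, height, x1, y1, x2, y2);
    }

    /** Increments the pixel that contains (x, y); points outside the MBR are ignored. */
    void plot(CountCanvas layer, double x, double y) {
        if (x < layer.x1 || x > layer.x2 || y < layer.y1 || y > layer.y2) return;
        int col = Math.min((int) ((x - layer.x1) / (layer.x2 - layer.x1) * layer.width), layer.width - 1);
        int row = Math.min((int) ((y - layer.y1) / (layer.y2 - layer.y1) * layer.height), layer.height - 1);
        layer.counts[row][col]++;
    }

    /** Adds the counts of intermediateLayer into finalLayer (same size assumed). */
    void merge(CountCanvas finalLayer, CountCanvas intermediateLayer) {
        for (int r = 0; r < finalLayer.height; r++) {
            for (int c = 0; c < finalLayer.width; c++) {
                finalLayer.counts[r][c] += intermediateLayer.counts[r][c];
            }
        }
    }

    /** Returns rows in image order; with vflip, the row with the largest y comes first. */
    int[][] toImageRows(CountCanvas layer, boolean vflip) {
        int[][] rows = new int[layer.height][];
        for (int r = 0; r < layer.height; r++) {
            int source = vflip ? layer.height - 1 - r : r;
            rows[r] = layer.counts[source].clone();
        }
        return rows;
    }
}
```

In this sketch, merge simply adds counts, so it gives the same result whether the partial canvases come from overlapping partitions or disjoint ones. With vflip set to true, points with the largest y end up in the first image row, matching the top-to-bottom row order of image formats.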

Acknowledgement

This work was partially supported by an AWS in Education Grant.

Further References

  1. SpatialHadoop homepage: http://spatialhadoop.cs.umn.edu/
  2. Ahmed Eldawy, Mohamed F. Mokbel and Christopher Jonathan "HadoopViz: A MapReduce Framework for Extensible Visualization of Big Spatial Data". In Proceedings of the 32nd IEEE International Conference on Data Engineering, IEEE ICDE 2016, Helsinki, Finland, May 16-20, 2016
  3. Ahmed Eldawy, Mohamed F. Mokbel and Christopher Jonathan "A Demonstration of HadoopViz: An Extensible MapReduce System for Visualizing Big Spatial Data". In Proceedings of the International Conference on Very Large Databases, VLDB 2015, Kohala Coast, HI, 2015
