Ahmed Eldawy: HadoopViz: Extensible Visualization of Big Spatial Data

Saturday, February 20, 2016

HadoopViz: Extensible Visualization of Big Spatial Data

With huge spatial datasets, a common need is to visualize the data to see what it looks like. Visualization lets users quickly explore new, very large datasets. For example, the video below summarizes 1 trillion points that represent the temperature of every 1 km² of the earth's surface on every day from 2009 to 2014 (six years in total).

The video consists of 72 frames, one per month, put together in sequence. While a single machine can produce these 72 images, it might take up to 60 hours due to the huge size of the input.
In this blog post, we describe how to use HadoopViz, an extensible visualization framework based on SpatialHadoop, to visualize the same dataset in just three hours using a cluster of 10 machines.
Besides single-level images, which are typically of low resolution, HadoopViz can also produce multilevel images in which users can interactively zoom in and out to explore huge datasets in great detail. For example, the image below is a visualization of a 92GB dataset that represents all the objects extracted from the OpenStreetMap dataset. You can pan and zoom in this image to view more details about a specific area.

Overview

In a nutshell, HadoopViz uses the parallelization power of MapReduce along with the efficiency of SpatialHadoop to partition the data into smaller parts, visualize each part separately into a smaller image, and then put these partial images together to produce the final image. HadoopViz builds on this idea and provides four key features that make it easy to use and very efficient.

HadoopViz piggybacks data smoothing on visualization, which allows it to smooth the data on the fly as the image is generated.
HadoopViz automatically decides the best way to partition the data, which allows it to generate both small and large images efficiently.
HadoopViz can also produce multilevel images in which users can freely pan and zoom to interactively explore a huge dataset.
Instead of customizing the algorithm for a specific use case, e.g., satellite data, HadoopViz provides an extensible implementation that can support a wide range of visualization types.

Below, we first describe how to generate the visualizations shown above using HadoopViz, which ships with the most recent version of SpatialHadoop. Then, we describe some technical details about the smoothing, partitioning, and extensibility features.

How to ...

... generate the temperature video

Download and set up the most recent version of SpatialHadoop, which ships with HadoopViz as its visualization package. Check this page for more details about setting up the most recent version of SpatialHadoop on both Hadoop 1.x and Hadoop 2.x.
Download the temperature dataset you would like to visualize. The temperature dataset we used can be obtained from the LP DAAC archive at this link.
You can use this Ruby script to download all the data for the six years if you have a good internet connection and enough storage on your machine. Run it using the following command:
ruby hdf_downloader.rb http://e4ftl01.cr.usgs.gov/MOLA/MYD11A1.005/ time:2009.01.01..2014.12.31
Once you have all the data, upload it to HDFS using the 'copyFromLocal' command. Let's assume the data is available at hdfs://user/hadoop/temperature
To visualize the 72 frames, run the following SpatialHadoop command:
shadoop multihdfplot hdfs://user/hadoop/temperature combine:31 dataset:LST_Day_1km hdfs://user/hadoop/frames/ time:2009.01.01..2014.12.31
The frames will be available in the output path hdfs://user/hadoop/frames. Download them using the 'copyToLocal' command.
Now, upload the frames to YouTube, which will put them together into a video similar to the one shown above.

... generate the multilevel image

Follow step 1 above to download and install SpatialHadoop, if you haven't done so already.
Download the 'All objects' dataset at the following link:
http://spatialhadoop.cs.umn.edu/datasets.html#osm2
Upload the file to HDFS using the 'copyFromLocal' command. Let's assume it is uploaded to hdfs://user/hadoop/objects/
NB: You don't have to decompress the file as SpatialHadoop can decompress it on the fly while visualizing. However, if you upload the compressed file, you need to keep the .bz2 extension to tell SpatialHadoop it is compressed.
To generate a multilevel image with 11 levels similar to the one shown above, run the following command:
shadoop gplot hdfs://user/hadoop/objects -pyramid levels:11 hdfs://user/hadoop/multilevel shape:osm
The generated image will be available at hdfs://user/hadoop/multilevel. Download it to your machine using the 'copyToLocal' command.
To view the image in your browser, open the 'index.html' file available in the output directory.

Smoothing

In visualization, smoothing means fusing nearby records according to the visualization logic to produce a correct result. For example, satellite datasets typically contain holes caused by clouds that obstruct the satellites' view. A smoothing function can fill these holes by estimating the missing values using simple interpolation techniques. The two figures below show an example of how the smoothing function can recover missing points.

Original data without smoothing

Data is smoothed using HadoopViz

HadoopViz supports on-the-fly smoothing of the data as the visualization is done. This means that you can easily plug in a different smoothing function and regenerate the image without having to run the complex smoothing function as a separate step.
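To make the idea of filling holes concrete, here is a minimal sketch of one possible smoothing function. It is not the interpolation that HadoopViz ships with: the class name HoleFillSketch, the use of NaN as the marker for a missing reading, and the choice of averaging the eight surrounding cells are all assumptions made for illustration. It also assumes a rectangular grid with at least one row.

```java
import java.util.Arrays;

final class HoleFillSketch {
    /** Assumed marker for a missing reading, e.g. a cell hidden by clouds. */
    static final float MISSING = Float.NaN;

    /**
     * Returns a copy of the grid in which every missing cell is replaced by
     * the mean of its valid neighbours (up to eight). Cells with no valid
     * neighbour stay missing. The input grid is not modified.
     */
    static float[][] fillHoles(float[][] grid) {
        int rows = grid.length;
        int cols = grid[0].length;
        float[][] out = new float[rows][];
        for (int r = 0; r < rows; r++) {
            out[r] = Arrays.copyOf(grid[r], cols);
        }
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                if (!Float.isNaN(grid[r][c])) {
                    continue; // valid reading, nothing to recover
                }
                float sum = 0;
                int count = 0;
                for (int dr = -1; dr <= 1; dr++) {
                    for (int dc = -1; dc <= 1; dc++) {
                        int nr = r + dr;
                        int nc = c + dc;
                        if (nr < 0 || nr >= rows || nc < 0 || nc >= cols) {
                            continue; // outside the grid
                        }
                        if (Float.isNaN(grid[nr][nc])) {
                            continue; // skips other holes and the cell itself
                        }
                        sum += grid[nr][nc];
                        count++;
                    }
                }
                if (count > 0) {
                    out[r][c] = sum / count;
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        float[][] grid = {
            {10f, 12f, 14f},
            {10f, MISSING, 14f},
            {10f, 12f, 14f},
        };
        System.out.println(Arrays.deepToString(fillHoles(grid)));
    }
}
```

The sketch reads only from the original grid, so a recovered value never feeds into the estimate of a neighbouring hole. A function like this needs all the neighbours of a cell in the same place, which is why smoothing interacts with how the data is partitioned.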

Partitioning

HadoopViz supports two ways of partitioning the data which affect the way it merges intermediate partial images. It can use either the default HDFS partitioning or the spatial partitioning that ships with SpatialHadoop.

Default HDFS Partitioning

By default, when you upload a file to HDFS, it is partitioned into equal-sized chunks of 128MB each. The spatial locations of records are not taken into account, so nearby records often end up in different partitions. This means that every partition may cover the entire input space, and we end up overlaying intermediate images to produce the final image as shown below.

Overlay intermediate images

Spatial Partitioning

If we use the spatial partitioning that ships with SpatialHadoop, each partition contains only data from a small, bounded region, and we end up stitching intermediate images as shown below.

Stitch intermediate images
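The difference between the two merge steps can be sketched on tiny integer "images". This is only an illustration under simplifying assumptions, not HadoopViz code: the class MergeSketch, the convention that pixel value 0 means "nothing drawn", and the method names overlay and stitch are all hypothetical.

```java
import java.util.Arrays;

final class MergeSketch {
    /** Pixel value treated as "nothing drawn here" in this sketch. */
    static final int TRANSPARENT = 0;

    /**
     * Overlay: the partial image covers the whole canvas, so every pixel of
     * it has to be inspected and the drawn ones copied onto the result.
     */
    static void overlay(int[][] result, int[][] partial) {
        for (int y = 0; y < result.length; y++) {
            for (int x = 0; x < result[y].length; x++) {
                if (partial[y][x] != TRANSPARENT) {
                    result[y][x] = partial[y][x];
                }
            }
        }
    }

    /**
     * Stitch: the partial image covers only a sub-rectangle of the canvas,
     * so it is copied into place at the given pixel offset.
     */
    static void stitch(int[][] result, int[][] partial, int offsetX, int offsetY) {
        for (int y = 0; y < partial.length; y++) {
            System.arraycopy(partial[y], 0, result[offsetY + y], offsetX, partial[y].length);
        }
    }

    public static void main(String[] args) {
        // Default HDFS partitioning: each partial image spans the full 4x2 canvas.
        int[][] overlaid = new int[2][4];
        overlay(overlaid, new int[][] {{1, 0, 0, 4}, {0, 0, 7, 0}});
        overlay(overlaid, new int[][] {{0, 2, 3, 0}, {5, 6, 0, 8}});

        // Spatial partitioning: each partial image covers one 2x2 half.
        int[][] stitched = new int[2][4];
        stitch(stitched, new int[][] {{1, 2}, {5, 6}}, 0, 0);
        stitch(stitched, new int[][] {{3, 4}, {7, 8}}, 2, 0);

        System.out.println(Arrays.deepToString(overlaid));
        System.out.println(Arrays.deepEquals(overlaid, stitched));
    }
}
```

Both paths produce the same picture. In the overlay path, every partition contributes a full-size image that must be scanned in its entirety, while in the stitch path each partition contributes only the pixels of its own region.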

Which partitioning technique is better?

While both techniques produce the same final image, their performance can differ, so HadoopViz needs to decide automatically which one to use. First of all, if the data needs to be smoothed, HadoopViz has to choose spatial partitioning because it is the only technique that groups nearby records together in one partition so that they can be fused.

If HadoopViz doesn't need to apply a smoothing function, then both techniques are applicable, and the choice depends on the image size because there is a trade-off between the partitioning and merging steps. The default HDFS partitioning is faster than spatial partitioning, but the overlay process is more time consuming than stitching due to the huge sizes of the intermediate images. HadoopViz therefore chooses spatial partitioning when the image is huge, because the cost of the overlay process grows with the image size.

Multilevel images

A multilevel image consists of a pyramid of fixed-size tiles, typically 256x256 pixels each. The figure below shows an example of a three-level image with 1, 4, and 16 tiles in its three levels, that is, a pyramid of three levels.

A multilevel image of three levels
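The tile counts follow a simple pattern: each level has four times as many tiles as the one above it. The short sketch below works out the totals, assuming zero-based level numbers and a pyramid in which every level doubles the number of tiles per side; the class and method names are made up for this example.

```java
final class PyramidSizeSketch {
    /** Tiles at a zero-based level: 2^level per side, so 4^level in total. */
    static long tilesAtLevel(int level) {
        return 1L << (2 * level);
    }

    /** Total number of tiles in a pyramid with the given number of levels. */
    static long totalTiles(int levels) {
        long total = 0;
        for (int z = 0; z < levels; z++) {
            total += tilesAtLevel(z);
        }
        return total;
    }

    public static void main(String[] args) {
        // The three-level example: 1 + 4 + 16 tiles.
        System.out.println(totalTiles(3));
        // An 11-level pyramid, like the one generated by the gplot command earlier.
        System.out.println(totalTiles(11));
    }
}
```

For three levels the total is 21 tiles, while an 11-level pyramid already has 1,398,101 tiles, which is why the number of tiles grows quickly with the number of levels.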

A naive way to generate a multilevel image is to generate each tile independently using the single-level techniques shown above. However, this would require executing the single-level algorithm millions of times. Therefore, HadoopViz provides specialized multilevel visualization algorithms that take the pyramid structure into account. As with single-level visualization, HadoopViz supports two partitioning techniques: default HDFS partitioning and pyramid partitioning.

Default HDFS Partitioning

If we use default HDFS partitioning, each partition might contain records from all over the input space. In this case, each machine plots all these records to all overlapping tiles in all pyramid levels. The generated tiles are considered partial images as multiple partitions might overlap the same tile. Thus, a final merge step will need to overlay all intermediate partial images for the same tile to produce the final image for that tile.

Pyramid Partitioning

The other option for HadoopViz is to first repartition the data so that all records that overlap with one tile go to one partition. Then, these records are visualized to generate the final image for that tile. No merging is needed here as each tile is only generated by one machine.
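As a rough sketch of the repartitioning step, the code below assigns each point to the tile that contains it at every level and groups the points by tile. It is a single-machine illustration, not the MapReduce job that HadoopViz runs: the tile id format "z-x-y", the class PyramidPartitionSketch, and the array layout {minX, minY, maxX, maxY} for the input MBR are all assumptions, and it handles only points that lie inside the MBR.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class PyramidPartitionSketch {
    /** Hypothetical tile identifier: level z, column x, row y. */
    static String tileId(int z, int x, int y) {
        return z + "-" + x + "-" + y;
    }

    /**
     * Lists the one tile per level that contains the point (px, py).
     * mbr is {minX, minY, maxX, maxY}; the point is assumed to lie inside it.
     */
    static List<String> tilesForPoint(double px, double py, double[] mbr, int levels) {
        List<String> ids = new ArrayList<>();
        for (int z = 0; z < levels; z++) {
            int tilesPerSide = 1 << z;
            int x = (int) ((px - mbr[0]) / (mbr[2] - mbr[0]) * tilesPerSide);
            int y = (int) ((py - mbr[1]) / (mbr[3] - mbr[1]) * tilesPerSide);
            // Points on the upper boundary belong to the last tile.
            x = Math.min(x, tilesPerSide - 1);
            y = Math.min(y, tilesPerSide - 1);
            ids.add(tileId(z, x, y));
        }
        return ids;
    }

    /** Groups points by tile so that each group can be plotted independently. */
    static Map<String, List<double[]>> partition(List<double[]> points, double[] mbr, int levels) {
        Map<String, List<double[]>> byTile = new TreeMap<>();
        for (double[] p : points) {
            for (String id : tilesForPoint(p[0], p[1], mbr, levels)) {
                byTile.computeIfAbsent(id, k -> new ArrayList<>()).add(p);
            }
        }
        return byTile;
    }

    public static void main(String[] args) {
        double[] world = {-180, -90, 180, 90};
        System.out.println(tilesForPoint(10, 20, world, 3));
    }
}
```

Each record is replicated once per level, which is the price paid for being able to render every tile in one place without a merge step.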

Which partitioning technique is better?

Again, there is no clear winner here. It all depends on how many tiles are generated. If only a few tiles are generated, then default HDFS partitioning is better as it only needs to merge a few images. However, if a huge number of tiles are generated, pyramid partitioning is better as it avoids altogether the need for merging intermediate tiles.

HadoopViz splits a huge pyramid into two parts, the top and the base. The top of the pyramid contains only a few tiles and is generated by the default HDFS partitioning technique, while the base contains a very large number of tiles and is generated by the pyramid partitioning technique. The tiles are then put together to produce the final image without any extra processing.

Extensibility

While the above techniques can be customized for every visualization type, it would require a huge coding effort to build and maintain all these implementations. Therefore, HadoopViz proposes a visualization abstraction that is used to describe the visualization logic. This abstraction is then plugged into generic implementations of the above algorithms to produce the image efficiently at scale. In short, if you would like to visualize your own data in a new way, all you need to do is write a small class that extends an abstract class, and you're ready to go with both single-level and multilevel visualization techniques.
A new visualization type is defined by extending the base class Plotter. There are five main functions to implement for a new visualization type.

<S extends Shape> Iterable<S> smooth(Iterable<S> r)

This function takes a set of nearby records, fuses them together, and returns a new set of records. This function can be used to apply a user-specified smoothing logic.

Canvas createCanvas(int width, int height, Rectangle mbr)

This function initializes an empty canvas with the given size in width and height. It also associates this canvas with the given MBR in input space. Notice that Canvas can be virtually anything. We provide a simple abstract Canvas class as a skeleton.

void plot(Canvas layer, Shape shape)

The plot function updates the canvas layer by plotting the given shape on it. Users can define their own visualization logic for one shape based on the format of the shape and the canvas layer.

void merge(Canvas finalLayer, Canvas intermediateLayer)

The merge function merges two intermediate canvases. It updates the finalLayer by merging the intermediateLayer into it.

void writeImage(Canvas layer, DataOutputStream out, boolean vflip)

The writeImage function encodes the canvas layer into a standard image that can be displayed to the end user. The image is written to the given DataOutputStream, which typically goes to an output file. If the vflip flag is set to true, the image should be vertically flipped before being written to the output. The vflip flag is useful when the y-axis of the input points in the opposite direction to that of the final image. For example, in PNG images, the y-axis increases from top to bottom, while in geographical coordinates, latitude increases from bottom to top.
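To show how these functions fit together, here is a self-contained sketch of a point-density visualization. It deliberately does not extend the real Plotter class, and CountCanvas, DensityPlotterSketch, and the render helper are hypothetical stand-ins rather than SpatialHadoop types. The MBR is passed as four doubles instead of a Rectangle, shapes are reduced to plain points, and the smooth function is left out. The sketch only mirrors the roles of createCanvas, plot, merge, and writeImage.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import javax.imageio.ImageIO;

final class DensityPlotterSketch {
    /** Stand-in canvas: per-pixel point counts plus the input-space MBR it covers. */
    static final class CountCanvas {
        final int width, height;
        final double x1, y1, x2, y2;
        final int[][] counts;

        CountCanvas(int width, int height, double x1, double y1, double x2, double y2) {
            this.width = width;
            this.height = height;
            this.x1 = x1;
            this.y1 = y1;
            this.x2 = x2;
            this.y2 = y2;
            this.counts = new int[height][width];
        }
    }

    /** createCanvas: an empty canvas of the given size tied to the given MBR. */
    static CountCanvas createCanvas(int width, int height, double x1, double y1, double x2, double y2) {
        return new CountCanvas(width, height, x1, y1, x2, y2);
    }

    /** plot: count the point in the pixel that contains it; ignore points outside the MBR. */
    static void plot(CountCanvas layer, double px, double py) {
        int col = (int) Math.floor((px - layer.x1) / (layer.x2 - layer.x1) * layer.width);
        int row = (int) Math.floor((py - layer.y1) / (layer.y2 - layer.y1) * layer.height);
        if (col >= 0 && col < layer.width && row >= 0 && row < layer.height) {
            layer.counts[row][col]++;
        }
    }

    /** merge: add the counts of an intermediate canvas of the same size and MBR. */
    static void merge(CountCanvas finalLayer, CountCanvas intermediateLayer) {
        for (int row = 0; row < finalLayer.height; row++) {
            for (int col = 0; col < finalLayer.width; col++) {
                finalLayer.counts[row][col] += intermediateLayer.counts[row][col];
            }
        }
    }

    /** Converts counts to a grayscale image; with vflip, canvas row 0 becomes the bottom image row. */
    static BufferedImage render(CountCanvas layer, boolean vflip) {
        int max = 1;
        for (int[] row : layer.counts) {
            for (int c : row) {
                max = Math.max(max, c);
            }
        }
        BufferedImage image = new BufferedImage(layer.width, layer.height, BufferedImage.TYPE_INT_RGB);
        for (int row = 0; row < layer.height; row++) {
            int imageRow = vflip ? layer.height - 1 - row : row;
            for (int col = 0; col < layer.width; col++) {
                int gray = 255 * layer.counts[row][col] / max;
                image.setRGB(col, imageRow, (gray << 16) | (gray << 8) | gray);
            }
        }
        return image;
    }

    /** writeImage: encode the canvas as a PNG on the given stream. */
    static void writeImage(CountCanvas layer, DataOutputStream out, boolean vflip) throws IOException {
        ImageIO.write(render(layer, vflip), "png", out);
    }

    public static void main(String[] args) throws IOException {
        CountCanvas a = createCanvas(4, 2, 0, 0, 4, 2);
        CountCanvas b = createCanvas(4, 2, 0, 0, 4, 2);
        plot(a, 0.5, 0.5);
        plot(b, 0.5, 0.5);
        plot(b, 3.5, 1.5);
        merge(a, b);
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        writeImage(a, new DataOutputStream(bytes), true);
        System.out.println(bytes.size() + " bytes of PNG data");
    }
}
```

The canvas here holds counts rather than pixels, so merging is a simple addition and the conversion to colors happens only once, when the image is written. This is one way to read the remark that a canvas can be virtually anything.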

Acknowledgement

This work was partially supported by an AWS in Education Grant.

Further References

SpatialHadoop homepage: http://spatialhadoop.cs.umn.edu/
Ahmed Eldawy, Mohamed F. Mokbel and Christopher Jonathan "HadoopViz: A MapReduce Framework for Extensible Visualization of Big Spatial Data". In Proceedings of the 32nd IEEE International Conference on Data Engineering, IEEE ICDE 2016, Helsinki, Finland, May 16-20, 2016
Ahmed Eldawy, Mohamed F. Mokbel and Christopher Jonathan "A Demonstration of HadoopViz: An Extensible MapReduce System for Visualizing Big Spatial Data". In Proceedings of the International Conference on Very Large Databases, VLDB 2015, Kohala Coast, HI, 2015