Friday, October 20, 2017

Visualize SpatialHadoop indexes

I received several requests asking for help in building visualizations for SpatialHadoop indexes. In many of my papers, posters, and presentations, I display a visualization of spatial indexes like the one shown below.
A Quad-tree-based index for a 400 GB dataset that represents the world road network extracted from OpenStreetMap.
There are several ways to visualize these indexes, and the good news is that all of them are fairly simple. You can choose among them based on your needs.

Prerequisites

You need SpatialHadoop installed and running to be able to build the indexes that we are going to visualize. I assume that you already have a spatial index constructed using SpatialHadoop and you only need to visualize it. For more details on how to set up SpatialHadoop and use it to build distributed indexes for big spatial data, please check the SpatialHadoop website and wiki pages.

Using QGIS

The most straightforward way to visualize your index is through QGIS. As a side product of the SpatialHadoop 'index' command, a WKT file is generated that describes the shape of the index along with some additional information, such as the size of each partition. By loading this small file into QGIS, you can interactively explore the index. Here are the detailed steps.
  1. Build an index in SpatialHadoop using the 'index' command. (Example commands for steps 1 and 2 are shown after this list.)
  2. In the index directory, you will find a file with the extension '.wkt'. Copy that file to the local machine using the 'hdfs dfs -get' command. For example:
    hdfs dfs -get cemetery.str/_str.wkt .
  3. Start QGIS and use the "Add Delimited Text Layer" button.
  4. Use the "Browse" button and choose the wkt file.
  5. Usually, QGIS can automatically detect the format of the file. In case you need to manually set the options, please do the following:
    1. Choose the "Tab" delimiter.
    2. Check the box "First record has field names".
    3. Choose "Well Known Text (WKT)" geometry definition.
    4. Choose "Boundaries" as the geometry field.
  6. Press the OK button to import the file.
  7. Set the correct Coordinate Reference System according to your data format. Usually, you can choose "WGS 84".
  8. The index partitions are displayed in QGIS as in the following picture.
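For reference, steps 1 and 2 map to commands like the ones below. This is only a sketch: the input file name 'cemetery.points', the 'point' shape, and the 'str' index type are example parameters, so adjust them to match your own dataset.
    shadoop index cemetery.points cemetery.str shape:point sindex:str
    hdfs dfs -get cemetery.str/_str.wkt .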
In QGIS, you can select any partition to see all its details, such as the corresponding file name, the size in bytes, and the total number of records. You can also interactively zoom in and out, or color the partitions based on their attributes. The drawback of this method is that you do not see the raw data. While you can load the original file into QGIS as well, it will be too slow if the input file is large.

Using HadoopViz

HadoopViz is the visualization component of SpatialHadoop. You can use HadoopViz to generate two separate images, one for the data and one for the index, and then overlay them on top of each other. Please follow the steps below, assuming the index is in HDFS under the directory 'cemetery.str'.
  1. To plot the data, issue the following command:
    shadoop gplot cemetery.str cemetery.png shape:osm
    The command will produce the image 'cemetery.png', like the one shown below.
  2. Issue the following command to plot the index:
    shadoop gplot cemetery.str/_master.str cemetery_index.png shape:edu.umn.cs.spatialHadoop.indexing.Partition
    The output will be similar to the image below.
  3. All you need to do after this is to overlay the two images on top of each other to get the final picture shown below.
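For the overlay in step 3, any image editor that supports layers will do. As a sketch, and assuming the index image has a transparent background, ImageMagick can also do it in one command:
    convert cemetery.png cemetery_index.png -composite cemetery_overlay.png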
While you will not be able to interactively zoom in and out in this picture, this method can easily scale to very large data. The reason is that the gplot function runs as a MapReduce program in SpatialHadoop and is able to scale to terabytes or more depending on your cluster size.

Thursday, December 22, 2016

Visualize your ideas using Rasem

A major part of a researcher's work is to write papers and articles that describe their work, and to make posters and presentations that better communicate their ideas. We all believe that "a picture is worth a thousand words", and we are always looking for better ways to visualize our ideas. In this blog article, I present Rasem, a library that I built as I started my PhD and have used in many of my papers and presentations to build nice visualizations like the ones shown below.

Thursday, March 31, 2016

Around the world in one hour! (revisit)

In this blog post, we revisit an earlier blog post about extracting data from the OpenStreetMap Planet.osm file. We still use the same extraction script in Pigeon, but we make it modular and easier to reuse. We make use of Pig's macro definitions to extract common code into a separate file. In the following, we first describe the OSMX.pig file, which contains the reusable macros. After that, we describe how to use it in your own Pig script.

Saturday, February 20, 2016

HadoopViz: Extensible Visualization of Big Spatial Data

With the huge sizes of spatial data, a common functionality that users look for is to visualize the data to see what it looks like. This gives users the power to quickly explore new datasets of huge sizes. For example, the video below summarizes 1 trillion points that represent the temperature of every 1 km² of the earth's surface on every day from 2009 to 2014 (a total of six years).

Wednesday, December 2, 2015

Voronoi diagram and Delaunay triangulation construction of Big Spatial Data using SpatialHadoop

Voronoi Diagram and Delaunay Triangulation

A very popular computational geometry problem is the Voronoi Diagram (VD) and its dual, the Delaunay Triangulation (DT). In both cases, the input is a set of points (sites). In VD, the output is a tessellation of the space into convex polygons, one per input site, such that each polygon covers all locations that are closer to the corresponding site than to any other site. In DT, the output is a triangulation, where each triangle connects three sites, such that the circumcircle of each triangle does not contain any other site. These two constructs are dual in the sense that each edge in the DT connects two sites whose cells share a common edge in the VD.
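For reference, the Voronoi cell of a site can be stated formally; this is the standard textbook definition, not anything specific to SpatialHadoop. Given sites S = {s_1, ..., s_n} in the plane, the cell of site s_i is
    V(s_i) = \{\, p \in \mathbb{R}^2 \;:\; \lVert p - s_i \rVert \le \lVert p - s_j \rVert \quad \forall j \ne i \,\}
and the DT is then obtained by connecting every pair of sites whose cells share an edge.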

Monday, November 30, 2015

Reducing the memory footprint of the spatial join operator in Hyracks

This is the fourth blog post in a series that describes how to build an efficient spatial join Hyracks operator in AsterixDB. You can refer to the previous posts below:
  1. An Introduction to Hyracks Operators in AsterixDB
  2. Your first Hyracks operator
  3. A Hyracks operator for plane-sweep join

Scope of this post

In the third post, I described how to implement an efficient plane-sweep join algorithm in a Hyracks operator. That implementation simply caches all data frames, and hence all records, in memory before running the plane-sweep algorithm. As the input datasets grow larger, this approach might require a huge memory footprint, which is not desirable with the big data handled by AsterixDB. In this blog post, I will describe how to improve the previous operator to run within a limited memory budget.

Tuesday, November 24, 2015

A Hyracks operator for plane-sweep join

This is the third blog post in a series of blog posts about creating an efficient Hyracks operator for spatial join. In the previous two posts, we gave an introduction to Hyracks operators and briefly described how to write a simple Hyracks operator. In this blog post, we describe how to make the previously created operator more efficient by using a plane-sweep spatial join algorithm instead of a naive nested loop algorithm.

Scope of this blog post

In this blog post, we will focus on improving the operator we created in the last blog post by replacing the nested-loop join subroutine with the more efficient plane-sweep join algorithm. In addition, we will do some minor code refactoring to keep the code organized. For simplicity, we assume that the two inputs are already sorted on the x coordinate, which can be done using one of the sort operators that ship with AsterixDB, e.g., ExternalSortOperatorDescriptor.
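To make the idea concrete, below is a minimal, self-contained Java sketch of a plane-sweep join over two lists of rectangles sorted on their left edge. This is only an illustration of the algorithm on plain Java collections, not the actual Hyracks operator (which works on data frames rather than Java objects); the Rect class and all names here are invented for the example.
    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.List;

    public class PlaneSweepSketch {
      // A rectangle with lower-left (x1, y1) and upper-right (x2, y2) corners
      record Rect(double x1, double y1, double x2, double y2) {
        boolean overlaps(Rect o) {
          return x1 <= o.x2 && o.x1 <= x2 && y1 <= o.y2 && o.y1 <= y2;
        }
      }

      // Joins two lists of rectangles, both sorted on x1 (the left edge)
      static void join(List<Rect> r, List<Rect> s) {
        Deque<Rect> activeR = new ArrayDeque<>(), activeS = new ArrayDeque<>();
        int i = 0, j = 0;
        while (i < r.size() || j < s.size()) {
          // Advance the sweep line to the rectangle with the smaller left edge
          boolean takeR = j >= s.size()
              || (i < r.size() && r.get(i).x1() <= s.get(j).x1());
          Rect next = takeR ? r.get(i++) : s.get(j++);
          Deque<Rect> other = takeR ? activeS : activeR;
          // Rectangles that end before the sweep line can never overlap
          // 'next' or anything that comes after it, so drop them
          other.removeIf(a -> a.x2() < next.x1());
          for (Rect a : other)
            if (a.overlaps(next))
              report(takeR ? next : a, takeR ? a : next);
          (takeR ? activeR : activeS).addLast(next);
        }
      }

      static void report(Rect a, Rect b) {
        System.out.println(a + " overlaps " + b);
      }

      public static void main(String[] args) {
        List<Rect> r = List.of(new Rect(0, 0, 2, 2), new Rect(5, 5, 7, 7));
        List<Rect> s = List.of(new Rect(1, 1, 3, 3), new Rect(8, 0, 9, 1));
        join(r, s); // prints the single overlapping pair
      }
    }
Note that the eviction step (the removeIf call) is what keeps the active sets small: once the sweep line passes a rectangle's right edge, that rectangle can never join with anything that arrives later. The same observation is the basis of the memory-footprint reduction discussed in the fourth post of this series.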