Thursday, November 8, 2018
In the big data forest, we grow groves not trees
In this blog post, I describe a new indexing mechanism for big data. While the approach is general and can adapt many existing indexes to big data, this post focuses on spatial index trees such as the R-tree, as they tend to be more challenging. The key idea is that regular index structures are designed to write the index to disk pages in a regular file system, where the index is expected to accommodate new records as they arrive. In big data systems, indexes are written to a distributed file system, which does not allow existing files to be modified.
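Because the distributed file system forbids in-place updates, a common workaround is to bulk-load the index in one pass rather than insert records one at a time, so each partition file is written exactly once. The sketch below illustrates this idea with Sort-Tile-Recursive (STR) packing of points into partitions; it is a minimal generic illustration, not the specific mechanism this post describes.

```python
import math

def str_pack(points, num_partitions):
    """Sort-Tile-Recursive packing: slice the points by x, then tile each
    slice by y. Every partition can be written once to an immutable file,
    which suits file systems that forbid in-place updates."""
    s = math.ceil(math.sqrt(num_partitions))   # number of vertical slices
    pts = sorted(points)                       # sort by x (then y)
    slice_size = math.ceil(len(pts) / s)
    partitions = []
    for i in range(0, len(pts), slice_size):
        # sort each vertical slice by y and cut it into tiles
        strip = sorted(pts[i:i + slice_size], key=lambda p: p[1])
        tile_size = math.ceil(len(strip) / s)
        for j in range(0, len(strip), tile_size):
            partitions.append(strip[j:j + tile_size])
    return partitions

# a 4x4 grid of points packed into 4 spatially coherent partitions
parts = str_pack([(x, y) for x in range(4) for y in range(4)], 4)
```

Each partition is spatially coherent, so its bounding rectangle can serve as an index entry without ever rewriting the partition file.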
Friday, October 20, 2017
Visualize SpatialHadoop indexes
I received several requests asking for help in building visualizations of SpatialHadoop indexes. In many of my papers, posters, and presentations, I display a visualization of spatial indexes like the one shown below.
There are actually several ways to visualize these indexes, and the good news is that all of them are fairly simple. You can choose between them based on your needs.
[Figure: A Quad-tree-based index for a 400 GB dataset that represents the world road network extracted from OpenStreetMap.]
Thursday, December 22, 2016
Visualize your ideas using Rasem
A major part of a researcher's work is to write papers and articles that describe their work, and to make posters and presentations that better communicate their ideas. We all believe that "a picture is worth a thousand words," and we are always looking for better ways to visualize our ideas. In this blog article, I present Rasem, a library that I built as I started my PhD and have used in many of my papers and presentations to build nice visualizations like the ones shown below.
Thursday, March 31, 2016
Around the world in one hour! (revisit)
In this blog post, we revisit an earlier blog post about extracting data from the OpenStreetMap Planet.osm file. We still use the same extraction script in Pigeon, but we make it modular and easier to reuse. We make use of macro definitions in Pig to extract common code into a separate file. In the following, we first describe the OSMX.pig file, which contains the reusable macros. After that, we describe how to use it in your own Pig script.
Labels:
OpenStreetMap,
OSM,
Pig,
Pigeon,
PigLatin,
Planet,
spatialhadoop,
tareeg
Saturday, February 20, 2016
HadoopViz: Extensible Visualization of Big Spatial Data
With the huge sizes of spatial data, a common functionality that users look for is visualizing the data to see what it looks like. This gives users the power to quickly explore huge new datasets. For example, the video below summarizes 1 trillion points that represent the temperature of every 1 km² of the earth's surface on every day from 2009 to 2014 (a total of six years).
Wednesday, December 2, 2015
Voronoi diagram and Delaunay triangulation construction of Big Spatial Data using SpatialHadoop
Voronoi Diagram and Delaunay Triangulation
A very popular computational geometry problem is the Voronoi Diagram (VD) and its dual, the Delaunay Triangulation (DT). In both cases, the input is a set of points (sites). In VD, the output is a tessellation of the space into convex polygons, one per input site, such that each polygon covers all locations that are closer to the corresponding site than to any other site. In DT, the output is a triangulation, where each triangle connects three sites, such that the circumcircle of each triangle does not contain any other site. These two constructs are dual in the sense that each edge in the DT connects two sites whose polygons share a common edge in the VD.

Monday, November 30, 2015
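The empty-circumcircle condition that defines the DT can be checked with a direct computation. The sketch below is a minimal standalone illustration of that predicate, not SpatialHadoop's distributed implementation: it computes the circumcircle of a triangle and tests whether a fourth site falls inside it.

```python
import math

def circumcircle(a, b, c):
    """Return (center, radius) of the circle through points a, b, c."""
    (ax, ay), (bx, by), (cx, cy) = a, b, c
    d = 2 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    ux = ((ax**2 + ay**2) * (by - cy) + (bx**2 + by**2) * (cy - ay)
          + (cx**2 + cy**2) * (ay - by)) / d
    uy = ((ax**2 + ay**2) * (cx - bx) + (bx**2 + by**2) * (ax - cx)
          + (cx**2 + cy**2) * (bx - ax)) / d
    return (ux, uy), math.hypot(ux - ax, uy - ay)

def in_circumcircle(a, b, c, p):
    """True if p lies strictly inside the circumcircle of triangle abc,
    i.e., triangle abc violates the Delaunay condition with respect to p."""
    center, r = circumcircle(a, b, c)
    return math.hypot(p[0] - center[0], p[1] - center[1]) < r

# D lies inside the circumcircle of ABC, so a triangulation containing
# triangle ABC is not Delaunay; the shared edge AB would be flipped.
A, B, C, D = (0, 0), (3, 0), (1.5, 2), (1.5, -0.5)
```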
Reducing the memory footprint of the spatial join operator in Hyracks
This is the fourth blog post in a series that describes how to build an efficient spatial join Hyracks operator in AsterixDB. You can refer to the previous posts below:
- An Introduction to Hyracks Operators in AsterixDB
- Your first Hyracks operator
- A Hyracks operator for plane-sweep join
Scope of this post
In the third post, I described how to implement an efficient plane-sweep join algorithm in a Hyracks operator. That implementation simply caches all data frames, i.e., all records, in memory before running the plane-sweep algorithm. As the input datasets grow larger, this algorithm might require a huge memory footprint, which is not desirable for the big data handled by AsterixDB. In this blog post, I will describe how to improve the previous operator to run within a limited memory capacity.
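For reference, the core of the plane-sweep join from the previous post can be sketched as follows. This is a simplified Python illustration of the classic algorithm, not the Hyracks/Java operator itself; rectangles are `(x1, y1, x2, y2)` tuples and both inputs are assumed to fit in memory, which is exactly the limitation this post addresses.

```python
def plane_sweep_join(R, S):
    """Report all overlapping pairs (r, s) by sweeping both inputs,
    sorted by their left edge, from left to right."""
    R, S = sorted(R), sorted(S)  # tuple order sorts by x1 first
    out = []
    i = j = 0
    while i < len(R) and j < len(S):
        if R[i][0] < S[j][0]:
            # R[i] is the leftmost unprocessed rectangle: scan all S
            # rectangles whose left edge falls within its x-extent
            r = R[i]
            k = j
            while k < len(S) and S[k][0] <= r[2]:
                s = S[k]
                if r[1] <= s[3] and s[1] <= r[3]:  # y-extents overlap
                    out.append((r, s))
                k += 1
            i += 1
        else:
            s = S[j]
            k = i
            while k < len(R) and R[k][0] <= s[2]:
                r = R[k]
                if r[1] <= s[3] and s[1] <= r[3]:
                    out.append((r, s))
                k += 1
            j += 1
    return out

pairs = plane_sweep_join([(0, 0, 2, 2), (4, 0, 6, 2)],
                         [(1, 1, 3, 3), (5, 5, 6, 6)])
```

Because the whole of `R` and `S` must be resident for the inner scans, the memory footprint grows with the input; the improvement described in this post bounds it instead.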
Labels:
AsterixDB,
Hyracks,
Memory,
Plane sweep,
Spatial Join
Location:
Irvine, CA, USA