Thursday, November 8, 2018
In the big data forest, we grow groves not trees
In this blog post, I describe a new indexing mechanism for big data. While the approach is general and can adapt many existing indexes to big data, this post focuses on spatial index trees such as the R-tree, as they tend to be more challenging. The key idea is that regular index structures are designed to write the index to disk pages in a regular file system, where the index is expected to accommodate new records as they arrive. In big data systems, indexes are written to a distributed file system, which does not allow existing files to be modified.
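Because the distributed file system forbids in-place updates, a common workaround is to bulk-load the index in one pass rather than insert records one at a time, so each partition file is written exactly once. The sketch below illustrates this idea with Sort-Tile-Recursive (STR) packing of points into partitions; it is a minimal generic illustration, not the specific mechanism this post describes.

```python
import math

def str_pack(points, num_partitions):
    """Sort-Tile-Recursive packing: slice the points by x, then tile each
    slice by y. Every partition can be written once to an immutable file,
    which suits file systems that forbid in-place updates."""
    s = math.ceil(math.sqrt(num_partitions))   # number of vertical slices
    pts = sorted(points)                       # sort by x (then y)
    slice_size = math.ceil(len(pts) / s)
    partitions = []
    for i in range(0, len(pts), slice_size):
        # sort each vertical slice by y and cut it into tiles
        strip = sorted(pts[i:i + slice_size], key=lambda p: p[1])
        tile_size = math.ceil(len(strip) / s)
        for j in range(0, len(strip), tile_size):
            partitions.append(strip[j:j + tile_size])
    return partitions

# a 4x4 grid of points packed into 4 spatially coherent partitions
parts = str_pack([(x, y) for x in range(4) for y in range(4)], 4)
```

Each partition is spatially coherent, so its bounding rectangle can serve as an index entry without ever rewriting the partition file.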
Friday, October 20, 2017
Visualize SpatialHadoop indexes
I received several requests asking for help in building visualizations of SpatialHadoop indexes. In many of my papers, posters, and presentations, I display a visualization of spatial indexes like the one shown below.
There are actually several ways to visualize these indexes, and the good news is that all of them are fairly simple. You can choose between them based on your needs.
[Figure: A Quad-tree-based index for a 400 GB dataset that represents the world road network extracted from OpenStreetMap.]
Thursday, December 22, 2016
Visualize your ideas using Rasem
A major part of a researcher's work is to write papers and articles that describe their work, and to make posters and presentations that better communicate their ideas. We all believe that "a picture is worth a thousand words," and we are always looking for better ways to visualize our ideas. In this blog article, I present Rasem, a library that I built as I started my PhD and have used in many of my papers and presentations to build nice visualizations like the ones shown below.
Thursday, March 31, 2016
Around the world in one hour! (revisit)
In this blog post, we revisit an earlier blog post about extracting data from the OpenStreetMap Planet.osm file. We still use the same extraction script in Pigeon, but we make it modular and easier to reuse. We make use of macro definitions in Pig to extract common code into a separate file. In the following, we first describe the OSMX.pig file, which contains the reusable macros. After that, we describe how to use it in your own Pig script.
Labels:
OpenStreetMap,
OSM,
Pig,
Pigeon,
PigLatin,
Planet,
spatialhadoop,
tareeg
Saturday, February 20, 2016
HadoopViz: Extensible Visualization of Big Spatial Data
With the huge sizes of spatial data, a common functionality that users look for is visualizing the data to see what it looks like. This gives users the power to quickly explore huge new datasets. For example, the video below summarizes 1 trillion points that represent the temperature of every 1 km² of the earth's surface on every day from 2009 to 2014 (a total of six years).
Wednesday, December 2, 2015
Voronoi diagram and Delaunay triangulation construction of Big Spatial Data using SpatialHadoop
Voronoi Diagram and Delaunay Triangulation
A very popular computational geometry problem is the Voronoi Diagram (VD) and its dual, the Delaunay Triangulation (DT). In both cases, the input is a set of points (sites). In VD, the output is a tessellation of the space into convex polygons, one per input site, such that each polygon covers all locations that are closer to the corresponding site than to any other site. In DT, the output is a triangulation, where each triangle connects three sites, such that the circumcircle of each triangle does not contain any other site. These two constructs are dual in the sense that each edge in the DT connects two sites whose polygons share a common edge in the VD.

Monday, November 30, 2015
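The empty-circumcircle condition that defines the DT can be checked with a direct computation. The sketch below is a minimal standalone illustration of that predicate, not SpatialHadoop's distributed implementation: it computes the circumcircle of a triangle and tests whether a fourth site falls inside it.

```python
import math

def circumcircle(a, b, c):
    """Return (center, radius) of the circle through points a, b, c."""
    (ax, ay), (bx, by), (cx, cy) = a, b, c
    d = 2 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    ux = ((ax**2 + ay**2) * (by - cy) + (bx**2 + by**2) * (cy - ay)
          + (cx**2 + cy**2) * (ay - by)) / d
    uy = ((ax**2 + ay**2) * (cx - bx) + (bx**2 + by**2) * (ax - cx)
          + (cx**2 + cy**2) * (bx - ax)) / d
    return (ux, uy), math.hypot(ux - ax, uy - ay)

def in_circumcircle(a, b, c, p):
    """True if p lies strictly inside the circumcircle of triangle abc,
    i.e., triangle abc violates the Delaunay condition with respect to p."""
    center, r = circumcircle(a, b, c)
    return math.hypot(p[0] - center[0], p[1] - center[1]) < r

# D lies inside the circumcircle of ABC, so a triangulation containing
# triangle ABC is not Delaunay; the shared edge AB would be flipped.
A, B, C, D = (0, 0), (3, 0), (1.5, 2), (1.5, -0.5)
```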
Reducing the memory footprint of the spatial join operator in Hyracks
This is the fourth blog post in a series that describes how to build an efficient spatial join Hyracks operator in AsterixDB. You can refer to the previous posts below:
- An Introduction to Hyracks Operators in AsterixDB
- Your first Hyracks operator
- A Hyracks operator for plane-sweep join
Scope of this post
In the third post, I described how to implement an efficient plane-sweep join algorithm in a Hyracks operator. That implementation simply caches all data frames, i.e., all records, in memory before running the plane-sweep algorithm. As the input datasets grow larger, this algorithm might require a huge memory footprint, which is not desirable for the big data handled by AsterixDB. In this blog post, I will describe how to improve the previous operator to run within a limited memory capacity.
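For reference, the core of the plane-sweep join from the previous post can be sketched as follows. This is a simplified Python illustration of the classic algorithm, not the Hyracks/Java operator itself; rectangles are `(x1, y1, x2, y2)` tuples and both inputs are assumed to fit in memory, which is exactly the limitation this post addresses.

```python
def plane_sweep_join(R, S):
    """Report all overlapping pairs (r, s) by sweeping both inputs,
    sorted by their left edge, from left to right."""
    R, S = sorted(R), sorted(S)  # tuple order sorts by x1 first
    out = []
    i = j = 0
    while i < len(R) and j < len(S):
        if R[i][0] < S[j][0]:
            # R[i] is the leftmost unprocessed rectangle: scan all S
            # rectangles whose left edge falls within its x-extent
            r = R[i]
            k = j
            while k < len(S) and S[k][0] <= r[2]:
                s = S[k]
                if r[1] <= s[3] and s[1] <= r[3]:  # y-extents overlap
                    out.append((r, s))
                k += 1
            i += 1
        else:
            s = S[j]
            k = i
            while k < len(R) and R[k][0] <= s[2]:
                r = R[k]
                if r[1] <= s[3] and s[1] <= r[3]:
                    out.append((r, s))
                k += 1
            j += 1
    return out

pairs = plane_sweep_join([(0, 0, 2, 2), (4, 0, 6, 2)],
                         [(1, 1, 3, 3), (5, 5, 6, 6)])
```

Because the whole of `R` and `S` must be resident for the inner scans, the memory footprint grows with the input; the improvement described in this post bounds it instead.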
Labels:
AsterixDB,
Hyracks,
Memory,
Plane sweep,
Spatial Join
Location:
Irvine, CA, USA