Saturday, January 9, 2021

Standardized generation of big spatial data in Spark

If you build a system or algorithm for spatial data processing, you might need to generate large-scale spatial data for benchmarking. The generated data needs to have the following characteristics:
  1. Flexible: You should be able to easily control the characteristics of the data, e.g., size or skewness.
  2. Reproducible: It should be relatively easy to reproduce this dataset to allow others to repeat the experiments.
  3. Efficient: You should be able to generate large-scale data without a problem.
All these characteristics are available in Spider, the award-winning open-source spatial data generator. Spider currently has three implementations: in Python, in Ruby, and in Scala on Spark. Spider was published in SpatialGems 2019 [1], won the best paper award, and was demonstrated at SIGSPATIAL 2020 [2]. It is also publicly available on []. The video below gives an overview of SpiderWeb. This article gives an overview of how the Scala implementation on Spark works.
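
To make the three characteristics concrete, here is a minimal sketch in plain Python. It is not Spider's code or API; the function name `generate_points` and its parameters are hypothetical and chosen only for illustration. The distribution parameter stands in for flexibility, the explicit seed for reproducibility, and the lazy generator (which never holds the full dataset in memory) for efficiency.

```python
import random
from typing import Iterator, Tuple

Point = Tuple[float, float]


def generate_points(n: int, distribution: str = "uniform", seed: int = 0) -> Iterator[Point]:
    """Lazily yield n points in the unit square (hypothetical helper, not Spider's API)."""
    # A private, seeded generator makes the output depend only on the arguments.
    rng = random.Random(seed)

    def clamp(v: float) -> float:
        # Keep skewed samples inside the unit square.
        return min(1.0, max(0.0, v))

    samplers = {
        # Uniform: every location is equally likely.
        "uniform": lambda: (rng.random(), rng.random()),
        # Gaussian: points cluster around the center (0.5, 0.5), i.e., skewed data.
        "gaussian": lambda: (clamp(rng.gauss(0.5, 0.1)), clamp(rng.gauss(0.5, 0.1))),
    }
    if distribution not in samplers:
        raise ValueError(f"unknown distribution: {distribution}")
    sample = samplers[distribution]
    for _ in range(n):
        yield sample()


if __name__ == "__main__":
    # The same arguments always produce the same dataset.
    for x, y in generate_points(3, "gaussian", seed=42):
        print(f"{x:.4f},{y:.4f}")
```

Because the output depends only on the arguments, sharing the triple (size, distribution, seed) is enough for someone else to regenerate the same dataset in this sketch.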

Monday, July 29, 2019

UCR Star reveals Google Maps' poor quality in Beijing, China (Or maybe not!)

We used to hear stories about some mapping applications failing due to poor data quality. However, it makes a big difference when you find a major failure of the most prevalent mapping application in the country with the biggest user base. The story is that while exploring some datasets in UCR Star, I found the following view of buildings in Beijing, China.

Sunday, July 14, 2019

A Star Is Born at UCR


In the Big Data Lab at UCR, we are happy to announce the first release of UCR STAR, formally, the UCR Spatio-temporal Active Repository []. STAR is made available as a service to the research community to provide easy access to existing big spatio-temporal datasets through an interactive exploratory interface. Researchers and developers can choose from the datasets and interactively explore them through a map-based interface. Users can also search and filter those datasets as if they are shopping for their research, except that everything is free. The website is best accessed through a desktop browser but a limited mobile-friendly interface is also provided. You can find more details on how to use the archive at [].

Thursday, November 8, 2018

In the big data forest, we grow groves not trees

In this blog post, I describe a new indexing mechanism for big data. While the approach is general and can adapt many existing indexes to big data, this post particularly focuses on spatial index trees such as the R-tree as they tend to be more challenging. The key idea is that regular index structures are designed to write the index to disk pages in a regular file system and the indexes are expected to accommodate new records. In big data, indexes are written to the distributed file system which does not allow modifying the files.

Friday, October 20, 2017

Visualize SpatialHadoop indexes

I received several requests asking for help in building visualizations for SpatialHadoop indexes. In many of my papers, posters, and presentations, I display a visualization of spatial indexes like the one shown below.
[Click to enlarge] A Quad-tree-based index for a 400 GB dataset that represents the world road network extracted from OpenStreetMap.
There are actually several ways to visualize these indexes and the good news is that all of them are fairly simple. You can choose between them based on your needs.

Thursday, December 22, 2016

Visualize your ideas using Rasem

A major part of a researcher's work is to write papers and articles that describe their work and to make posters and presentations that better communicate their ideas. We all believe that "A picture is worth a thousand words" and we are always looking for better ways to visualize our ideas. In this blog article, I present Rasem, a library that I built as I started my PhD and used in many of my papers and presentations to build nice visualizations like the ones shown below.

Thursday, March 31, 2016

Around the world in one hour! (revisit)

In this blog post, we revisit an earlier blog post about extracting data from the OpenStreetMap Planet.osm file. We still use the same extraction script in Pigeon, but we make it modular and easier to reuse. We make use of the macro definitions in Pig to extract common code into a separate file. In the following part, we first describe the OSMX.pig file, which contains the reusable macros. After that, we describe how to use it in your own Pig script.