Saturday, January 9, 2021

Standardized generation of big spatial data in Spark

If you build a system or algorithm for spatial data processing, you might need to generate large-scale spatial data for benchmarking. The generated data needs to have the following characteristics:
  1. Flexible: You should be able to easily control the characteristics of the data, e.g., size or skewness.
  2. Reproducible: It should be relatively easy to reproduce this dataset to allow others to repeat the experiments.
  3. Efficient: It should be able to generate large-scale data quickly and without running into resource limits.
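To make the first two requirements concrete, here is a minimal sketch (not Spider's actual API; the names and the unit-square distribution are assumptions for illustration) of a seeded uniform point generator. A fixed seed makes the output reproducible, and the cardinality parameter makes the size easy to control; in Spark, the same idea is typically applied per partition with a partition-dependent seed to scale out.

```scala
import scala.util.Random

// Hedged sketch, not Spider's API: a seeded uniform point generator
// illustrating the flexibility (size parameter) and reproducibility
// (fixed seed) requirements above.
object UniformGenerator {
  case class Point(x: Double, y: Double)

  // Generate `n` points uniformly in the unit square [0,1) x [0,1).
  // The same seed always yields the same sequence of points.
  def generate(n: Int, seed: Long): Seq[Point] = {
    val rng = new Random(seed)
    Seq.fill(n)(Point(rng.nextDouble(), rng.nextDouble()))
  }

  def main(args: Array[String]): Unit = {
    val a = generate(5, seed = 42L)
    val b = generate(5, seed = 42L)
    assert(a == b) // reproducible: identical seeds give identical data
    println(a.head)
  }
}
```

Skewed distributions (e.g., Gaussian clusters) fit the same shape: only the body of `generate` changes, while the seed and size parameters keep the output reproducible and controllable.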
All these characteristics are available in Spider, the award-winning open-source spatial data generator. Spider currently has three implementations: in Python, in Ruby, and in Scala on Spark. Spider was published in SpatialGems 2019 [1], where it won the best paper award, and was demonstrated in SIGSPATIAL 2020 [2]. It is also publicly available at [https://spider.cs.ucr.edu]. The video below gives an overview of SpiderWeb. This article gives an overview of how the Scala implementation on Spark works.