Saturday, January 9, 2021

Standardized generation of big spatial data in Spark

If you build a system or algorithm for spatial data processing, you might need to generate large-scale spatial data for benchmarking. The generated data needs to have the following characteristics:
  1. Flexible: You should be able to easily control the characteristics of the data, e.g., size or skewness.
  2. Reproducible: It should be relatively easy to reproduce the dataset so that others can repeat the experiments.
  3. Efficient: It should be possible to generate large-scale data quickly and without hassle.
All these characteristics are available in Spider, the award-winning open-source spatial data generator. Spider currently has three implementations: Python, Ruby, and Scala on Spark. Spider was published in SpatialGems 2019 [1], where it won the best paper award, and was demonstrated in SIGSPATIAL 2020 [2]. It is also publicly available at https://spider.cs.ucr.edu. This article gives an overview of how the Scala implementation on Spark works.

Distributions

Spider supports six data distributions: uniform, Gaussian, diagonal, Sierpinski, bit, and parcel.

Spatial Data Generation in Spark

To generate big spatial data, we wanted to implement the six distributions provided by Spider in Spark. This makes it possible to generate very large-scale data in a short time by letting multiple machines generate the data in parallel. Generating random data in Spark is challenging, however, because it runs against the nature of Spark and most big-data platforms: Spark needs every unit of work to be deterministic so that it can be repeated in case of a task failure, which rules out naive randomization. To solve this problem, we utilize the seed associated with random number generators to make tasks repeatable. At the same time, we still want the dataset to differ from one run to another, and we need to ensure that not all the machines generate the same data.
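To make the idea concrete, here is a minimal sketch (not Spider's actual code) of why seeding solves the repeatability problem: a pseudo-random generator with a fixed seed is fully deterministic, while a different seed yields different data.
import java.util.Random
// Two generators with the same seed produce identical sequences, which is
// what lets a retried task regenerate exactly the same records...
val r1 = new Random(42L)
val r2 = new Random(42L)
assert(r1.nextDouble() == r2.nextDouble())
// ...while a different seed produces different, but equally reproducible, data.
val r3 = new Random(7L)
assert(r3.nextDouble() != new Random(42L).nextDouble())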

How to use the Spark Generator

The following code snippet shows how to generate random spatial data with the Spark generator.
import edu.ucr.cs.bdlab.beast._
val randomData: SpatialRDD = sc.generateSpatialData(UniformDistribution, 1000000000)
This generates one billion random points with a uniform distribution. Following the Spark design, the data is generated lazily, i.e., it is generated only when accessed. Furthermore, it is generated in parallel using all executor nodes.
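Note that nothing is computed at this point; any standard Spark action will trigger the parallel generation, for example:
// Forces the generation on all executors and returns 1000000000
val numPoints: Long = randomData.count()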

RandomSpatialRDD Design

The main component that we build is a new type of RDD called RandomSpatialRDD. This RDD is not dependent on any other RDDs; in other words, it is a root RDD that does not have a parent. To create a RandomSpatialRDD, you need to specify at least the following:
  • SparkContext: Required for any RDD.
  • Distribution: The distribution of the dataset to generate.
  • Cardinality: The number of records to generate.
These parameters are enough to describe how the data will be generated. The data will not be generated right away. Rather, it will be lazily generated when the RDD is computed. When the RandomSpatialRDD is created, a seed is generated to identify the specific instance of random data that will be generated for this RDD. If the RDD is used multiple times throughout the run of the Spark application, the exact same random data is generated. However, if the application is restarted, a new seed is generated and hence new random data. Finally, you can manually specify a seed to ensure that your program will always generate the same data, as in the example below.
import edu.ucr.cs.bdlab.beast._
val randomData: SpatialRDD = sc.generateSpatialData(UniformDistribution,
    1000000000, opts = "seed" -> 0)
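A minimal sketch of this seed logic follows; the names (options, a Map[String, String] of user options, and rddSeed) are hypothetical, not Spider's actual source:
// Resolve the RDD seed once, when the RDD is created (hypothetical sketch)
val rddSeed: Long = options.get("seed") match {
  case Some(s) => s.toLong                          // user-provided: always the same data
  case None    => new java.util.Random().nextLong() // fresh data on each application run
}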

RandomSpatialPartition

The constructor of RandomSpatialRDD creates a number of partitions to allow the data to be generated in parallel. By default, RandomSpatialRDD creates one partition for every one million records, but you can override the number of partitions with an additional parameter to the constructor. These partitions are given sequence numbers starting at zero. When the RandomSpatialRDD is used in a Spark action, the RDD seed is added to the partition number to generate a different seed for each partition. At the same time, if one partition is computed twice, e.g., in case of a task failure, the same seed is used and the same data is generated again.
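The sketch below illustrates this design under simplifying assumptions: points are plain coordinate pairs and the distribution is uniform. The class and field names mirror the description above but are not Spider's actual source.
import java.util.Random
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// One partition of random data; its seed is the RDD seed plus its index
class RandomPartition(override val index: Int, val seed: Long,
                      val count: Long) extends Partition

class RandomPointsRDD(sc: SparkContext, cardinality: Long, rddSeed: Long,
                      recordsPerPartition: Long = 1000000L)
    extends RDD[(Double, Double)](sc, Nil) { // a root RDD: no parent dependencies

  override protected def getPartitions: Array[Partition] = {
    val numPartitions = ((cardinality + recordsPerPartition - 1) / recordsPerPartition).toInt
    Array.tabulate[Partition](numPartitions) { i =>
      val count = math.min(recordsPerPartition, cardinality - i.toLong * recordsPerPartition)
      // rddSeed + i: a distinct seed per partition, yet stable across task retries
      new RandomPartition(i, rddSeed + i, count)
    }
  }

  override def compute(split: Partition, context: TaskContext): Iterator[(Double, Double)] = {
    val partition = split.asInstanceOf[RandomPartition]
    val random = new Random(partition.seed)
    // Lazily produce uniform points in the unit square
    Iterator.fill(partition.count.toInt)((random.nextDouble(), random.nextDouble()))
  }
}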

Point-based Generators

Spider supports six generators; five of them are called point-based generators as they either generate points directly or generate boxes by using these points as center points. The point-based generators are the uniform, Gaussian, diagonal, Sierpinski, and bit distributions. All generators in Spider generate data in the unit square (0, 0) - (1, 1). To generate data in parallel for these five generators, we simply let each worker node generate the data of one partition using the same distribution. Since each partition has its own seed, the data in each partition will be different. When combined, the partitions form the desired random spatial dataset.
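As an illustration, two of the point-based generators could look like the sketch below when driven by a partition's seeded random generator (a simplified sketch, not Spider's actual source; the real Gaussian generator must also handle the few points that fall outside the unit square):
import java.util.Random

// Uniform: both coordinates drawn independently from [0, 1)
def uniformPoint(random: Random): (Double, Double) =
  (random.nextDouble(), random.nextDouble())

// Gaussian: points concentrated around the center of the unit square;
// the divisor controls the spread (an assumed value for illustration)
def gaussianPoint(random: Random): (Double, Double) =
  (0.5 + random.nextGaussian() / 5.0, 0.5 + random.nextGaussian() / 5.0)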

Parcel Generator

The parcel generator cannot be directly parallelized in the same way since it must generate disjoint boxes. Normally, it works sequentially by splitting the unit square into smaller boxes, either vertically or horizontally, until the desired number of boxes is generated. Doing this in parallel is not straightforward since all the generated boxes must appear at roughly the same depth. To overcome this problem, we follow a two-level approach to generate this data in Spark: the first level creates the partitions and the second level creates the records inside each partition.

Parcel Partitions

To create the partitions of the parcel data, the RandomSpatialRDD first determines the number of partitions based on the data size, say n partitions. Then, it runs the parcel generator to split the input space into n boxes, one per partition. To ensure correct behavior, we set the dither parameter to zero in this first phase, which guarantees that the boxes cover the entire input space. For example, to generate 10,000,000 records, RandomSpatialRDD creates 10 partitions by default, each covering one disjoint box of the input space.
Each of these boxes represents one partition. When the RDD is computed, each partition is populated with records by further splitting its box using the same logic; in other words, the parcel generator continues from where it stopped at the first level. When the RandomSpatialRDD is executed, each partition runs the parcel generator to generate all the data in that partition. To reduce the memory footprint, we transform the parcel generator to follow a depth-first traversal rather than a breadth-first traversal, which reduces the memory footprint from O(n) to O(log n). Further details are given in the SpiderWeb demo paper [2] and in the source code of the parcel generator in Scala.
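The following sketch shows the idea of the depth-first splitting (simplified and hypothetical; the split-ratio and dithering details differ in the actual source). Because the iterator concatenation defers the second recursive call, only one root-to-leaf path of the split tree, about log(n) boxes, is materialized at any time:
import java.util.Random

case class Box(x1: Double, y1: Double, x2: Double, y2: Double)

// Recursively split a box into n disjoint parcels, depth-first
def generateParcels(box: Box, n: Long, dither: Double, random: Random): Iterator[Box] = {
  if (n == 1) {
    // Dithering shrinks the final box by a random fraction of its size;
    // dither = 0 keeps the input space fully covered (as in the first phase)
    val dw = (box.x2 - box.x1) * dither * random.nextDouble()
    val dh = (box.y2 - box.y1) * dither * random.nextDouble()
    Iterator.single(Box(box.x1, box.y1, box.x2 - dw, box.y2 - dh))
  } else {
    val n1 = n / 2
    val ratio = 0.25 + random.nextDouble() / 2 // split somewhere near the middle
    // Split the longer side so that parcels stay roughly square
    val (first, second) =
      if (box.x2 - box.x1 > box.y2 - box.y1) {
        val xm = box.x1 + (box.x2 - box.x1) * ratio
        (Box(box.x1, box.y1, xm, box.y2), Box(xm, box.y1, box.x2, box.y2))
      } else {
        val ym = box.y1 + (box.y2 - box.y1) * ratio
        (Box(box.x1, box.y1, box.x2, ym), Box(box.x1, ym, box.x2, box.y2))
      }
    // Iterator '++' evaluates its right operand lazily: memory stays O(log n)
    generateParcels(first, n1, dither, random) ++
      generateParcels(second, n - n1, dither, random)
  }
}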

References

[1] T. Vu, S. Migliorini, A. Eldawy, and A. Belussi. "Spatial Data Generators". In the 1st ACM SIGSPATIAL International Workshop on Spatial Gems (SpatialGems), 2019.
[2] P. Katiyar, T. Vu, S. Migliorini, A. Belussi, and A. Eldawy. "SpiderWeb: A Spatial Data Generator on the Web". In ACM SIGSPATIAL, 2020.
