Thursday, November 8, 2018

In the big data forest, we grow groves not trees

In this blog post, I describe a new indexing mechanism for big data. While the approach is general and can adapt many existing indexes to big data, this post focuses on spatial index trees such as the R-tree, as they tend to be more challenging. The key observation is that traditional index structures are designed to write the index to disk pages in a regular file system, where the index is expected to accommodate new records in place. In big data systems, indexes are written to a distributed file system, which does not allow modifying files once written.

R-trees

The term R-tree was originally introduced by Guttman in his 1984 paper [1] as a way to adapt the B+-tree to store rectangles (or boxes in general). The tree structure itself is simple. Each tree node is associated with a rectangle that represents the minimum bounding rectangle (MBR) of all its children. Searching R-trees is simple. However, inserting new records entails two challenges that do not arise in the single-dimensional B+-tree. First, if the inserted key overlaps multiple nodes or does not overlap any node, there must be a new technique to choose where to insert it. Second, when a node overflows, there must be a new splitting mechanism. The original R-tree paper proposed solutions to these two challenges, but they were not optimal. It was followed by many other papers that refine these two key components to improve the index. In particular, the R*-tree [2] is an appealing solution that employs a few heuristics that were found to work well in practice.

R-tree node size

By design, a node in an R-tree has to have between m (minimum) and M (maximum) children, and, also by design, m≤M/2. This is a reasonable limitation that makes splitting a node possible. During insertion, a node is split when it reaches M+1 children, and when M+1 children are divided between two nodes, the smaller of the two ends up with at most roughly M/2 children, so m cannot exceed M/2. In practice, the R*-tree sets m=0.3M, which results in many underutilized nodes. This reduces disk utilization because each node occupies a full disk page regardless of how many children it holds. However, it has the advantage of reducing the insertion time for future records: when a record is inserted into one of these underutilized nodes, the insertion is as simple as modifying the single disk page that contains that node.

Sample-based big R-trees

To index big spatial data, a common approach is the sample-based approach originally introduced in the SpatialHadoop paper [3]. The idea is to draw a sample of the input, say 1%, insert this sample into an in-memory R-tree, and finally use its leaf nodes as partition boundaries. Ideally, we want the number of leaf nodes in the R-tree to be equal to the number of blocks in the file. The number of blocks (N) is the file size (F) divided by the block size (B=128 MB by default), i.e., (N=F/B). Assuming perfect load balancing, the contents of each partition will exactly fit in one file system block. Since the load is never perfectly balanced, we slightly increase the target number of leaf nodes (L), say, (L=F/B×1.05).
The issue here is that we cannot directly set the number of leaf nodes in an R-tree. We can only adjust the parameters m and M, which indirectly affect the number of leaf nodes. Assuming that we have (S) sample points, the number of leaf nodes can be anything between ⌈S/M⌉ and ⌊S/m⌋. For example, assume that we build an index for an (F=10 GB) file with file block size (B=128 MB). Then, the ideal number of partitions is (N=80 blocks). With a sample of size (S=10,000 points), we would set M=10,000/80=125 to get at least 80 leaf nodes. In this case, m=0.3M=38, which allows the number of leaf nodes to reach 263.
This shows the huge variability in the number of leaf nodes, which results in many underutilized nodes. Keep in mind that all those underutilized partitions in a distributed file system are sealed and cannot accommodate new records, which means the unused space is practically wasted.
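To make the arithmetic concrete, here is the example above worked out in a few lines of Python. The variable names are mine, simply mirroring the symbols in the text; this is not SpatialHadoop code.

```python
import math

F = 10 * 1024          # file size in MB (10 GB)
B = 128                # file system block size in MB
S = 10_000             # sample size in points

N = math.ceil(F / B)           # ideal number of partitions: N = F/B = 80
M = S // N                     # max node capacity: 10,000/80 = 125
m = math.ceil(0.3 * M)         # R*-tree minimum, m = 0.3M = 38

min_leaves = math.ceil(S / M)  # 80
max_leaves = S // m            # 263
print(N, M, m, min_leaves, max_leaves)   # 80 125 38 80 263
```

The spread between 80 and 263 leaf nodes is exactly the variability the text describes.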

STR

The same sample-based idea can be applied with the STR packing algorithm. To use it for big data indexing, the sample is sorted by x and partitioned into equi-sized vertical strips, and then each strip is sorted by y and partitioned into equi-sized parts. The result is shown below.
As shown, STR produces balanced partitions, but the partitions are not of high quality: the very thin and very wide partitions result in sub-par performance for queries.
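A minimal sketch of the STR idea in Python, assuming the sample is a list of (x, y) tuples. This is illustrative code only, not SpatialHadoop's implementation.

```python
import math

def str_partition(points, num_partitions):
    """Sort-Tile-Recursive (STR) packing sketch for 2-D points.

    Splits the sample into ~sqrt(P) vertical strips by x, then each
    strip into ~sqrt(P) cells by y, yielding balanced cell sizes.
    """
    s = math.ceil(math.sqrt(num_partitions))   # strips per axis
    pts = sorted(points)                       # sort by x (then y)
    strip_size = math.ceil(len(pts) / s)
    cells = []
    for i in range(0, len(pts), strip_size):
        # within each strip, sort by y and cut into equi-sized cells
        strip = sorted(pts[i:i + strip_size], key=lambda p: p[1])
        cell_size = math.ceil(len(strip) / s)
        for j in range(0, len(strip), cell_size):
            cells.append(strip[j:j + cell_size])
    return cells

# 100 points on a grid, packed into ~9 cells
pts = [(x, y) for x in range(10) for y in range(10)]
cells = str_partition(pts, 9)
print(len(cells), [len(c) for c in cells])
```

Note that the cells are well balanced in size, but nothing constrains their aspect ratio, which is where the thin and wide partitions come from.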

R-Groves to the rescue

The main drawback of the sample-based approach described above is that it cannot accurately control the number of blocks. Notice that this limitation does not apply to single-dimensional indexes, where you can simply sort all the points to produce equi-sized partitions. In spatial data, however, other considerations must be taken into account, such as the shape of the partitions (squares are preferred over thin rectangles) and the overlap between them (smaller overlap is preferred). The R-tree family produces high-quality partitions in terms of spatial properties but completely ignores the size of the partitions. Below, we describe how the R-Grove overcomes this limitation while still inheriting the nice properties of R-trees and R*-trees.

Balanced partitions

The key idea in the R-Grove is to relax the limitation m≤M/2; thus, we can set m=0.95M, which makes the range [⌈S/M⌉, ⌊S/m⌋] much tighter. For the same example above, setting m=0.95M yields a range of [80, 84] partitions, which is much tighter than the [80, 263] we had earlier with the R*-tree.
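The tightening of the range can be checked with a couple of lines of arithmetic; the values below match the running example in the text.

```python
import math

# Leaf-count range [ceil(S/M), floor(S/m)] for the running example.
S, M = 10_000, 125
ranges = {}
for fill in (0.3, 0.95):           # R*-tree vs. R-Grove minimum fill
    m = math.ceil(fill * M)        # m = 38 and m = 119, respectively
    ranges[fill] = (math.ceil(S / M), S // m)
print(ranges)   # {0.3: (80, 263), 0.95: (80, 84)}
```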

Example

Below is an example of two indexes built with the R*-tree-based and R*-Grove approaches. The colors indicate the size of each partition, i.e., the number of records it contains.

R*-tree-based (with m=0.3M)

While it avoids the thin and wide partitions of the STR-based index, the partition sizes vary widely due to the wide range [m, M].

R*-Grove (with m=0.95M)


By increasing the value of m, the R*-Grove maintains the high spatial quality of the partitions while increasing their disk utilization.

The key idea

The key idea is that after reading the sample, we build an R*-tree-like index in a top-down fashion. We start with one node that contains all the points and recursively split it using the R*-tree node splitting algorithm until each node contains at most M records. However, this alone does not guarantee that each node will contain at least m records. Therefore, we extend the node splitting algorithm with an additional test that ensures every node ends up with at least m records. The test turns out to be very simple: all we need to ensure is that the size of each split satisfies the following inequality.
⌈s/M⌉ ≤ ⌊s/m⌋, where s is the size of the split
The above test guarantees that the final partitions will have between m and M records even if m>M/2. Further details and proofs are available in the R-Grove paper [4]. The source code of the R*-Grove construction method is available in the method RStarTree#partitionPoints.
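Here is a rough sketch of the top-down construction under simplifying assumptions: instead of the geometric R*-tree split heuristic used by the actual R*-Grove, this toy version only tracks group sizes and splits as evenly as possible, accepting a split only when both halves pass the validity test described above. It is not the code in RStarTree#partitionPoints.

```python
import math

def is_valid_size(s, m, M):
    """s records can be packed into nodes of sizes in [m, M]
    iff ceil(s/M) <= floor(s/m)."""
    return math.ceil(s / M) <= math.floor(s / m)

def split_sizes(s, m, M):
    """Top-down partitioning sketch: split a group of s records until
    each part fits in one node (size <= M), accepting only splits
    where both halves pass the validity test."""
    if s <= M:
        return [s]
    half = s // 2
    # try split points from most balanced outward
    for delta in range(half):
        for s1 in (half - delta, half + delta):
            s2 = s - s1
            if s1 and s2 and is_valid_size(s1, m, M) and is_valid_size(s2, m, M):
                return split_sizes(s1, m, M) + split_sizes(s2, m, M)
    raise ValueError("no valid split")

# Running example: S=10,000 sample points, M=125, m=0.95M=119
sizes = split_sizes(10_000, m=119, M=125)
print(len(sizes), min(sizes), max(sizes))
```

Every final group ends up with between m and M records, so each leaf node is well filled even though m > M/2.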

References

[1] Antonin Guttman: R-Trees: A Dynamic Index Structure for Spatial Searching. SIGMOD Conference 1984: 47-57. doi: 10.1145/971697.602266
[2] Norbert Beckmann, Hans-Peter Kriegel, Ralf Schneider, Bernhard Seeger: The R*-Tree: An Efficient and Robust Access Method for Points and Rectangles. SIGMOD Conference 1990: 322-331. doi: 10.1145/93605.98741
[3] Ahmed Eldawy, Mohamed F. Mokbel: SpatialHadoop: A MapReduce Framework for Spatial Data. ICDE 2015: 1352-1363. doi: 10.1109/ICDE.2015.7113382
[4] Tin Vu, Ahmed Eldawy: R-Grove: Growing a Family of R-trees in the Big Data Forest. ACM SIGSPATIAL 2018, Seattle, WA. doi: 10.1145/3274895.3274984

Friday, October 20, 2017

Visualize SpatialHadoop indexes

I have received several requests asking for help in building visualizations of SpatialHadoop indexes. In many of my papers, posters, and presentations, I display a visualization of spatial indexes like the one shown below.
A Quad-tree-based index for a 400 GB dataset that represents the world road network extracted from OpenStreetMap.
There are actually several ways to visualize these indexes and the good news is that all of them are fairly simple. You can choose between them based on your needs.

Thursday, December 22, 2016

Visualize your ideas using Rasem

A major part of a researcher's work is to write papers and articles that describe their work, and to make posters and presentations that better communicate their ideas. We all believe that "a picture is worth a thousand words," and we are always looking for better ways to visualize our ideas. In this blog article, I present Rasem, a library that I built when I started my PhD and have used in many of my papers and presentations to build nice visualizations like the ones shown below.

Thursday, March 31, 2016

Around the world in one hour! (revisit)

In this blog post, we revisit an earlier blog post about extracting data from OpenStreetMap Planet.osm file. We still use the same extraction script in Pigeon but we make it modular and easier to reuse. We make use of the macro definitions in Pig to extract common code into a separate file. In the following part, we first describe the OSMX.pig file which contains the reusable macros. After that, we describe how to use it in your own Pig script.

Saturday, February 20, 2016

HadoopViz: Extensible Visualization of Big Spatial Data

With the huge sizes of spatial data, a common functionality that users look for is visualizing the data to see what it looks like. This gives users the power to quickly explore huge new datasets. For example, the video below summarizes 1 trillion points that represent the temperature of every 1 km² of the earth's surface on every day from 2009 to 2014 (a total of six years).

Wednesday, December 2, 2015

Voronoi diagram and Delaunay triangulation construction of Big Spatial Data using SpatialHadoop

Voronoi Diagram and Delaunay Triangulation

A very popular computational geometry problem is the Voronoi Diagram (VD) and its dual, the Delaunay Triangulation (DT). In both cases, the input is a set of points (sites). In VD, the output is a tessellation of the space into convex polygons, one per input site, such that each polygon covers all locations that are closer to the corresponding site than to any other site. In DT, the output is a triangulation, where each triangle connects three sites, such that the circumcircle of each triangle does not contain any other site. The two constructs are dual in the sense that each edge in the DT connects two sites whose polygons share a common edge in the VD.
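The empty-circumcircle definition translates directly into a deliberately brute-force Delaunay construction. The O(n⁴) sketch below exists only to illustrate the definition on a handful of sites; it is nothing like the distributed algorithms the post discusses.

```python
from itertools import combinations

def circumcircle(a, b, c):
    """Circumcenter and squared radius of triangle abc (None if collinear)."""
    (ax, ay), (bx, by), (cx, cy) = a, b, c
    d = 2 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    if d == 0:
        return None
    ux = ((ax**2 + ay**2) * (by - cy) + (bx**2 + by**2) * (cy - ay)
          + (cx**2 + cy**2) * (ay - by)) / d
    uy = ((ax**2 + ay**2) * (cx - bx) + (bx**2 + by**2) * (ax - cx)
          + (cx**2 + cy**2) * (bx - ax)) / d
    return (ux, uy), (ux - ax) ** 2 + (uy - ay) ** 2

def delaunay(sites):
    """Keep every triangle whose circumcircle contains no other site --
    the DT definition applied literally."""
    triangles = []
    for a, b, c in combinations(sites, 3):
        cc = circumcircle(a, b, c)
        if cc is None:
            continue
        (ux, uy), r2 = cc
        if all((px - ux) ** 2 + (py - uy) ** 2 > r2 + 1e-9
               for px, py in sites if (px, py) not in (a, b, c)):
            triangles.append((a, b, c))
    return triangles

# four square corners plus one interior site
sites = [(0, 0), (4, 0), (0, 4), (4, 4), (2, 1)]
print(len(delaunay(sites)))   # 4 triangles, all incident to (2, 1)
```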

Monday, November 30, 2015

Reducing the memory footprint of the spatial join operator in Hyracks

This is the fourth blog post in a series that describes how to build an efficient spatial join Hyracks operator in AsterixDB. You can refer to the previous posts below:
  1. An Introduction to Hyracks Operators in AsterixDB
  2. Your first Hyracks operator
  3. A Hyracks operator for plane-sweep join

Scope of this post

In the third post, I described how to implement an efficient plane-sweep join algorithm in a Hyracks operator. That implementation simply caches all data frames, i.e., all records, in memory before running the plane-sweep algorithm. As the input datasets grow larger, this approach might require a huge memory footprint, which is not desirable for the big data handled by AsterixDB. In this blog post, I will describe how to improve the previous operator to run within a limited memory budget.
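As a refresher, the plane-sweep join itself can be sketched in a few lines of Python. Like the previous operator, this toy version materializes both inputs in memory, which is exactly the limitation this post addresses; it is not the actual Hyracks operator.

```python
def overlaps(a, b):
    """Do two axis-aligned rectangles (x1, y1, x2, y2) intersect?"""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def plane_sweep_join(r, s):
    """Plane-sweep join: sort both inputs by left edge (x1) and sweep,
    scanning the other list only while rectangles can still overlap
    on the x-axis. Each overlapping pair is reported exactly once."""
    r = sorted(r)
    s = sorted(s)
    out = []
    i = j = 0
    while i < len(r) and j < len(s):
        if r[i][0] <= s[j][0]:
            # r[i] enters the sweep first; scan s until candidates
            # start past r[i]'s right edge
            k = j
            while k < len(s) and s[k][0] <= r[i][2]:
                if overlaps(r[i], s[k]):
                    out.append((r[i], s[k]))
                k += 1
            i += 1
        else:
            k = i
            while k < len(r) and r[k][0] <= s[j][2]:
                if overlaps(r[k], s[j]):
                    out.append((r[k], s[j]))
                k += 1
            j += 1
    return out

R = [(0, 0, 2, 2), (3, 3, 5, 5)]
S = [(1, 1, 4, 4)]
print(plane_sweep_join(R, S))   # both rectangles of R join with S[0]
```

The memory problem is visible right away: both sorted lists live in memory for the whole sweep, which is what the memory-limited operator described in this post avoids.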