Thursday, November 8, 2018

In the big data forest, we grow groves not trees

In this blog post, I describe a new indexing mechanism for big data. While the approach is general and can adapt many existing indexes to big data, this post focuses on spatial index trees such as the R-tree, as they tend to be more challenging. The key observation is that traditional index structures are designed to write the index to disk pages in a regular file system, where the index is expected to accommodate new records. In big data systems, indexes are written to a distributed file system that does not allow modifying files once they are written.

R-trees

The term R-tree was originally introduced by Guttman in his 1984 paper [1] as a way to adapt the B+-tree to store rectangles (or boxes in general). The tree structure itself is simple: each tree node is associated with a rectangle that represents the minimum bounding rectangle (MBR) of all its children. Searching an R-tree is straightforward. However, inserting new records entails two challenges that do not arise in the single-dimensional B+-tree. First, if the inserted key overlaps multiple nodes or does not overlap any node, a new technique is needed to choose where to insert it. Second, when a node overflows, a new splitting mechanism is needed. The original R-tree paper proposed solutions to these two challenges, but they were not optimal. The paper was followed by many others that improve these two key parts. In particular, the R*-tree [2] is an appealing solution that employs a few heuristics that were found to work well in practice.
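To make the structure concrete, here is a minimal sketch of an R-tree range search in Python. The class names and node layout are illustrative only, not taken from any particular implementation:

```python
# A minimal sketch of R-tree range search (illustrative layout, not a real library).

class Rect:
    def __init__(self, x1, y1, x2, y2):
        self.x1, self.y1, self.x2, self.y2 = x1, y1, x2, y2

    def intersects(self, other):
        # Two rectangles overlap unless one lies entirely to one side of the other
        return not (self.x2 < other.x1 or other.x2 < self.x1 or
                    self.y2 < other.y1 or other.y2 < self.y1)

class Node:
    def __init__(self, mbr, children=None, entries=None):
        self.mbr = mbr            # minimum bounding rectangle of all children
        self.children = children  # internal node: list of child Nodes
        self.entries = entries    # leaf node: list of (rect, record) pairs

def range_search(node, query):
    """Report all records whose rectangle intersects the query rectangle."""
    results = []
    if node.entries is not None:   # leaf node
        results += [rec for rect, rec in node.entries if rect.intersects(query)]
    else:                          # internal node: descend only into overlapping subtrees
        for child in node.children:
            if child.mbr.intersects(query):
                results += range_search(child, query)
    return results
```

Note that the search may descend into several subtrees when their MBRs overlap the query, which is exactly why insertion quality (small, non-overlapping MBRs) matters so much.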

R-tree node size

By design, a node in an R-tree has to have between m (minimum) and M (maximum) children, with the additional constraint m≤M/2. This is a reasonable limitation that makes splitting a node possible: during insertion, a node is split when it reaches M+1 children, and for both resulting nodes to have at least m children, m cannot exceed (M+1)/2. In practice, the R*-tree sets m=0.3M, which results in many underutilized nodes. This reduces disk utilization because each node occupies a disk page regardless of its size. However, it has the advantage of reducing the insertion time for future records: when a record is inserted into one of these underutilized nodes, the insertion is as simple as modifying the single disk page that contains that node.

Sample-based big R-trees

To index big spatial data, a common approach is the sample-based approach originally introduced in the SpatialHadoop paper [3]. The idea is to draw a sample of the input, say 1%, insert this sample into an in-memory R-tree, and finally use the boundaries of its leaf nodes as partition boundaries. Ideally, we want the number of leaf nodes in the R-tree to be equal to the number of blocks in the file. The number of blocks (N) is the file size (F) divided by the block size (B=128 MB by default), i.e., N=F/B. Assuming perfect load balancing, the contents of each partition will exactly fit in one file system block. Since the load is never perfectly balanced, we slightly increase the number of leaf nodes (L), say, L=F/B*1.05.
The issue here is that we cannot directly set the number of leaf nodes in an R-tree. We can only adjust the parameters m and M, which indirectly affect the number of leaf nodes. Assuming that we have S sample points, the number of leaf nodes can be anything between ⌈S/M⌉ and ⌊S/m⌋. For example, assume that we build an index for a F=10 GB file with block size B=128 MB. Then, the ideal number of partitions is N=80 blocks. With a sample of size S=10,000 points, we would set M=10,000/80=125 to get at least 80 leaf nodes. In this case, m=0.3M=38, which allows the number of leaf nodes to be as high as ⌊10,000/38⌋=263.
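The arithmetic in this example can be reproduced with a short back-of-envelope sketch; all the numbers below come from the text, with sizes expressed in megabytes:

```python
import math

# Back-of-envelope sizing using the numbers from the example above (MB units).
F = 10 * 1024            # input file size: 10 GB
B = 128                  # file system block size: 128 MB
S = 10_000               # number of sample points

N = math.ceil(F / B)     # ideal number of partitions: 80
M = S // N               # maximum leaf capacity: 125 sample points
m = math.ceil(0.3 * M)   # R*-tree-style minimum: 0.3M, rounded up to 38

min_leaves = math.ceil(S / M)  # at least 80 leaf nodes
max_leaves = S // m            # but possibly as many as 263
```

The gap between 80 and 263 is the variability the next paragraph complains about: any leaf count in that range is a legal R*-tree.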
This example shows the huge variability in the number of leaf nodes, which results in many underutilized nodes. Keep in mind that all those underutilized nodes in a distributed file system are sealed and cannot accommodate new records, so the unused space is practically wasted.

STR

The sample-based idea above can also be applied with the STR packing algorithm. To use it for big data indexing, the sample is sorted by x, partitioned into equi-sized vertical strips, and then each strip is sorted by y and partitioned into equi-sized parts. The result is shown below.
As shown, it produces balanced partitions, but the partitions are not of high quality: the very thin and very wide partitions result in sub-par performance for queries.
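A compact sketch of the two-dimensional STR packing step described above, assuming the input is a plain list of (x, y) tuples (the function name and interface are illustrative):

```python
import math

def str_partition(points, num_partitions):
    """Sort-Tile-Recursive packing sketch: slice the points into vertical strips
    by x, then tile each strip into cells by y. Returns one list per partition."""
    n = len(points)
    s = math.ceil(math.sqrt(num_partitions))   # number of vertical strips
    per_strip = math.ceil(n / s)               # points per strip
    pts = sorted(points)                       # sort by x (y breaks ties)
    partitions = []
    for i in range(0, n, per_strip):
        # Sort each strip by y and cut it into equi-sized cells
        strip = sorted(pts[i:i + per_strip], key=lambda p: p[1])
        per_cell = math.ceil(len(strip) / s)
        for j in range(0, len(strip), per_cell):
            partitions.append(strip[j:j + per_cell])
    return partitions
```

The equi-sized cuts guarantee balanced partition sizes, but nothing constrains the aspect ratio of the cells, which is where the thin and wide partitions come from.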

R-Groves to the rescue

The main drawback of the sample-based approach described above is that it cannot accurately control the number of partitions. Notice that this limitation does not apply to single-dimensional indexes, where you can simply sort all the points to produce equi-sized partitions. In spatial data, however, other considerations must be taken into account, such as the shape of the partitions (squares are preferred over elongated rectangles) and the overlap between them (smaller overlap is preferred). The R-tree family produces high-quality partitions in terms of spatial properties but completely ignores the size of the partitions. Below, we describe how the R-Grove overcomes this limitation while still inheriting the nice properties of the R-tree and R*-tree.

Balanced partitions

The key idea in the R-Grove is to relax the limitation m≤M/2. Thus, we can set m=0.95M, which makes the range [⌈S/M⌉, ⌊S/m⌋] much tighter. For the same example we had above, setting m=0.95M yields a range of [80, 84] leaf nodes, which is much tighter than the [80, 263] we had earlier with the R*-tree.
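The effect of relaxing the constraint can be checked directly with the numbers from the running example (the helper function below is illustrative):

```python
import math

def leaf_count_range(S, M, m):
    """Possible number of leaf nodes when every leaf holds between m and M points."""
    return math.ceil(S / M), S // m

S, M = 10_000, 125
# R*-tree-style minimum gives a wide range of possible leaf counts:
rstar_range = leaf_count_range(S, M, m=math.ceil(0.3 * M))    # (80, 263)
# Relaxing the m <= M/2 constraint as in the R-Grove tightens it:
grove_range = leaf_count_range(S, M, m=math.floor(0.95 * M))  # (80, 84)
```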

Example

Below is an example of two indexes built with the R*-tree-based and R*-Grove approaches. The colors indicate the size of each partition, i.e., the number of records it contains.

R*-tree-based (with m=0.3M)

While it avoids the thin and wide partitions of the STR-based index, the partition sizes vary widely due to the wide range [m, M].

R*-Grove (with m=0.95M)


By increasing the value of m in the R*-Grove, we can still maintain the high quality of the partitions while increasing the disk utilization for the partitions.

The key idea

The key idea is that, after reading the sample, we build an R*-tree-like index in a top-down fashion. We start with one node that contains all the points and then recursively split it using the R*-tree node splitting algorithm until each node contains at most M records. However, this does not yet guarantee that each node will contain at least m records. Therefore, we further extend the node splitting algorithm with an additional test that ensures each node ends up with at least m records. The test turns out to be very simple: all we need to ensure is that the size of each split satisfies the following inequality.
⌈S/M⌉ ≤ ⌊S/m⌋, where S here denotes the number of records in the split.
The above test guarantees that the final partitions will have between m and M records even if m>M/2. Further details and proofs are available in the R-Grove paper [4]. The source code of the R*-Grove construction method is available in the method RStarTree#partitionPoints.
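The top-down construction with the validity test can be sketched as follows. Note that this is an illustrative simplification: it only searches for a split size that passes the test, whereas the actual R*-Grove uses the R*-tree heuristics to choose *which* points go on each side.

```python
import math

def valid_size(s, m, M):
    """A set of s points can be packed into leaves of sizes in [m, M]
    iff ceil(s/M) <= floor(s/m), i.e., the range of leaf counts is non-empty."""
    return math.ceil(s / M) <= s // m

def partition_sizes(n, m, M):
    """Top-down sketch: recursively split n points in two, accepting only splits
    where both sides pass the validity test. Returns the final leaf sizes.
    (Illustrative size search only; the real algorithm picks the actual split
    with R*-tree heuristics.)"""
    if n <= M:
        return [n]  # every accepted subset of size <= M already satisfies m <= n
    half = n // 2
    for delta in range(half):
        for n1 in (half - delta, half + delta):
            n2 = n - n1
            if n1 > 0 and n2 > 0 and valid_size(n1, m, M) and valid_size(n2, m, M):
                return partition_sizes(n1, m, M) + partition_sizes(n2, m, M)
    raise ValueError("no valid split exists")
```

Running it on the example above (S=10,000, M=125, m=118) yields leaves whose sizes all fall in [118, 125], even though m is far above M/2.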

References

[1] Antonin Guttman: R-Trees: A Dynamic Index Structure for Spatial Searching. SIGMOD Conference 1984: 47-57. doi: 10.1145/971697.602266
[2] Norbert Beckmann, Hans-Peter Kriegel, Ralf Schneider, Bernhard Seeger: The R*-Tree: An Efficient and Robust Access Method for Points and Rectangles. SIGMOD Conference 1990: 322-331. doi: 10.1145/93605.98741
[3] Ahmed Eldawy, Mohamed F. Mokbel: SpatialHadoop: A MapReduce Framework for Spatial Data. ICDE 2015: 1352-1363. doi: 10.1109/ICDE.2015.7113382
[4] Tin Vu, Ahmed Eldawy: R-Grove: Growing a Family of R-trees in the Big Data Forest. ACM SIGSPATIAL 2018, Seattle, WA. doi: 10.1145/3274895.3274984
