Monday, November 30, 2015

Reducing the memory footprint of the spatial join operator in Hyracks

This is the fourth blog post in a series that describes how to build an efficient spatial join Hyracks operator in AsterixDB. You can refer to the previous posts below:
  1. An Introduction to Hyracks Operators in AsterixDB
  2. Your first Hyracks operator
  3. A Hyracks operator for plane-sweep join

Scope of this post

In the third post, I described how to implement an efficient plane-sweep join algorithm in a Hyracks operator. That implementation simply caches all data frame, or simply records, in memory before running the plane-sweep algorithm. As the input datasets go larger, this algorithm might require a huge memory footprint which is not desirable with the big data that is handled by AsterixDB. In this blog post, I will describe how to improve the previous operator to run with a limited memory capacity.

Tuesday, November 24, 2015

A Hyracks operator for plane-sweep join

This is the third blog post in a series of blog posts about creating an efficient Hyracks operator for spatial join. In the previous two posts, we gave an introduction to Hyracks operators and briefly described how to write a simple Hyracks operator. In this blog post, we describe how to make the previously created operator more efficient by using a plane-sweep spatial join algorithm instead of a naive nested loop algorithm.

Scope of this blog post

In this blog post, we will focus on improving the operator we created in the last blog post by replacing the nested-loop join subroutine with the more efficient plane-sweep join algorithm. In addition, we will do some minor code refactor to keep the code organized. For simplicity, we assume that the two inputs are already sorted on the x coordinate which can be done using one of the sort operators that ship with AsterixDB, e.g., ExternalSortOperatorDescriptor.

Monday, November 23, 2015

Your first Hyracks operator

In a previous blog post, I introduced Hyracks operators and briefly described how they work. In this blog post, I'll show how to create and use a very simple Hyracks operator that performs spatial join using a naive nested loop algorithm. Although there is already an existing nested loop join operator in AsterixDB, I provide a simpler, probably less efficient, implementation for the sake of demonstration. This blog post will focus on the steps needed to create a Hyracks operator while future blog posts will show how to improve this operator to be more efficient.

Friday, November 20, 2015

An Introduction to Hyracks Operators in AsterixDB

I had the opportunity to collaborate with the AsterixDB team led by Mike Carey and Chen Li in University of California, Irvine. The main objective of this collaboration is to introduce efficient ad-hoc spatial join query processing in AsterixDB. In this blog post, I will try to summarize my work for future developers and users of AsterixDB. I believe that this post could be very helpful as it is coming from an outsider. So, I will try to make it simple and will not assume and previous knowledge of AsterixDB. In fact, this is my first real interaction with AsterixDB system internals.

Sunday, November 1, 2015

Speeding up point-in-polygon query using Pigeon and SpatialHadoop

Point-in-polygon Query

A widely used function in GIS and spatial applications is the point-in-polygon query which finds whether a point is inside a polygon or not. Typically, this function is used as a spatial-join predicate to relate a large set of points to a smaller set of polygons, for example, associate a set of geo-tagged tweets to states in the whole world. If the polygons are relatively simple, e.g., a few tens of points per polygon, then we can use any simple spatial join algorithm such as plane-sweep or partition-based spatial merge join (PBSM). However, if the polygons are rather complex, e.g., tens of thousands of points per polygon, then a single point-in-polygon query becomes too expensive and regular solutions do not scale. In this blog post, I'll describe a simple technique that we applied using Pigeon and SpatialHadoop to speed up the point-in-polygon query in the case of very complex polygons.