Thursday, March 31, 2016

Around the world in one hour! (revisit)

In this blog post, we revisit an earlier post about extracting data from the OpenStreetMap Planet.osm file. We still use the same extraction script in Pigeon, but we make it modular and easier to reuse. We use macro definitions in Pig to move the common code into a separate file. We first describe the osmx.pig file, which contains the reusable macros, and then show how to use it in your own Pig script.
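For readers who have not used Pig macros before, here is a minimal sketch of how a macro is defined and called. It is not taken from osmx.pig: the macro name FilterByLatitude, the tab-separated input layout, and the output path are all hypothetical and only illustrate the DEFINE ... RETURNS syntax.

```pig
-- Hypothetical macro, for illustration only (not part of osmx.pig).
-- Inside the body, parameters and the returned alias are prefixed with '$'.
DEFINE FilterByLatitude(points, min_lat, max_lat) RETURNS filtered {
  $filtered = FILTER $points BY latitude >= $min_lat AND latitude <= $max_lat;
};

-- Assumed input: tab-separated lines of (id, latitude, longitude).
points = LOAD '$input' USING PigStorage('\t')
         AS (id:long, latitude:double, longitude:double);

-- Scalar arguments are passed as quoted literals, as in the Pig documentation.
northern = FilterByLatitude(points, '0.0', '90.0');

STORE northern INTO '$output/northern_points';
```

Moving the DEFINE statement into its own file and pulling it in with IMPORT gives the same result, which is the approach this post takes with osmx.pig.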


The osmx.pig file contains all the common code that is used to extract points, ways, or relations from an OSM file. It contains the following functions.


The LoadOSMNodes macro extracts all the nodes from an OSM file. It returns a dataset that contains tuples of the following format.



The LoadOSMWaysWithSegments macro returns all ways in the file. Each way is returned as a series of line segments, each connecting two consecutive nodes on the way. It returns a dataset with tuples of the following format.

Column       Type               Description
segment_id   long               A generated unique ID for each segment
id1          long               The ID of the starting node
latitude1    double             Latitude of the starting node
longitude1   double             Longitude of the starting node
id2          long               The ID of the ending node
latitude2    double             Latitude of the ending node
longitude2   double             Longitude of the ending node
way_id       long               The ID of the way that contains this segment
tags         map[(chararray)]   All the tags of the way
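
As a small illustration of working with this format, the following sketch counts how many segments each way contains. It assumes the macro exposes the columns under exactly the names listed above and that the script is run with the same JAR files as the example script later in this post. The relation names and the output path are made up for this example.

```pig
REGISTER spatialhadoop-2.4.jar;
REGISTER pigeon-0.2.1.jar;
REGISTER esri-geometry-api-1.2.jar;
REGISTER jts-1.8.jar;
IMPORT 'osmx.pig';

-- One tuple per segment, as described in the table above.
segments = LoadOSMWaysWithSegments('$input');

-- Group the segments by their parent way and count them.
by_way = GROUP segments BY way_id;
segment_counts = FOREACH by_way GENERATE group AS way_id,
                                         COUNT(segments) AS num_segments;

-- Hypothetical output location.
STORE segment_counts INTO '$output/segment_counts';
```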


This macro also returns all ways in the file. However, unlike LoadOSMWaysWithSegments, it returns one tuple for each way, which contains the entire geometry of the way. Each tuple is formatted as follows.

Column          Type               Description
way_id          long               The ID of the way as it appears in the OSM file
first_node_id   long               The ID of the first node in this way
last_node_id    long               The ID of the last node in this way
geom            bytearray          The geometry of the way
tags            map[(chararray)]   The tags of the way as they appear in the OSM file


The LoadOSMObjects macro returns all objects in the OSM file. Each object is one of two kinds:
  1. First-level relations: relations that contain only ways.
  2. Dangling ways: ways that are not part of any relation.
The returned dataset does not contain second-level relations, that is, relations that contain other relations. The format of the returned dataset is as follows.

Column      Type               Description
object_id   long               The ID of either the relation or the way
geom        bytearray          The geometry of the object
tags        map[(chararray)]   The tags of either the way or the relation as they appear in the OSM file
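
Because the tags column is a Pig map, individual tags can be looked up with the standard `#` operator. The sketch below keeps only the objects that carry a 'name' tag and writes out their IDs and names. It assumes the column names listed above and uses Pig's built-in map lookup rather than any SpatialHadoop function. The relation names and the output path are hypothetical.

```pig
REGISTER spatialhadoop-2.4.jar;
REGISTER pigeon-0.2.1.jar;
REGISTER esri-geometry-api-1.2.jar;
REGISTER jts-1.8.jar;
IMPORT 'osmx.pig';

all_objects = LoadOSMObjects('$input');

-- A map lookup returns null when the key is absent,
-- so this keeps only the objects that have a 'name' tag.
named_objects = FILTER all_objects BY tags#'name' IS NOT NULL;

-- Keep the ID and the name; the geometry is dropped in this sketch.
names = FOREACH named_objects GENERATE object_id, tags#'name' AS name;

-- Hypothetical output location.
STORE names INTO '$output/object_names';
```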


The script planet-extractor.pig provides an example that extracts the datasets that are available on the SpatialHadoop datasets page. The header of this file imports the 'osmx.pig' file as well as the required JAR libraries.

REGISTER spatialhadoop-2.4.jar;
REGISTER pigeon-0.2.1.jar;
REGISTER esri-geometry-api-1.2.jar;
REGISTER jts-1.8.jar;
IMPORT 'osmx.pig';

The next two lines extract all nodes and write them to a file.

all_nodes = LoadOSMNodes('$input');

STORE all_nodes INTO '$output/all_nodes.bz2';

This is much simpler than the earlier code, where the extraction was interleaved with writing the output.
Similarly, the following few lines extract the road network and write it to the output.

-- Extract road network
road_network = LoadOSMWaysWithSegments('$input');
road_network = FILTER road_network BY edu.umn.cs.spatialHadoop.osm.HasTag(tags,
road_network = FOREACH road_network GENERATE segment_id,
               id1, latitude1, longitude1,
               id2, latitude2, longitude2,
               way_id, edu.umn.cs.spatialHadoop.osm.MapToJson(tags) AS tags;
STORE road_network INTO '$output/road_network.bz2';

Although the code looks a little bit ugly, it contains only four statements. The first one extracts all the ways as segments using the LoadOSMWaysWithSegments macro. The second statement uses the tags attribute to keep only the segments that belong to the road network. The third statement selects the output columns and converts the tags to JSON, and the fourth statement writes the output.

Similar to the road network, the next few lines extract and store the buildings dataset.

all_objects = LoadOSMObjects('$input');
buildings = FILTER all_objects BY edu.umn.cs.spatialHadoop.osm.HasTag(tags,
STORE buildings INTO '$output/buildings.bz2';

The first statement extracts all the objects from the file. The second statement filters the buildings using the tags attribute. Finally, the third statement stores the output.


This work was partially supported by an AWS in Education Grant.

