osmx.pig
The osmx.pig file contains the common code used to extract points, ways, or relations from an OSM file. It provides the following macros.
LoadOSMNodes
This macro extracts all the nodes from an OSM file. It returns a dataset that contains tuples of the following format.
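To check the exact fields returned by a given version of osmx.pig, Pig's DESCRIBE operator prints the schema of a relation. The following is a minimal sketch, assuming osmx.pig is in the working directory, the required JARs have already been registered, and the input path is supplied as the script parameter $input.

```pig
-- Assumption: the REGISTER statements for the required JARs come first (omitted here).
IMPORT 'osmx.pig';

-- $input is a script parameter, e.g. supplied on the command line with -param input=...
all_nodes = LoadOSMNodes('$input');

-- Print the schema (field names and types) of the relation returned by the macro
DESCRIBE all_nodes;
```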
LoadOSMWaysWithSegments
This macro returns all ways in the file. Each way is returned as a series of line segments, each connecting two consecutive nodes on the way. It returns a dataset with tuples of the following format.
|segment_id||long||A generated unique ID for each segment|
|id1||long||The ID of the starting node|
|latitude1||double||Latitude of the starting node|
|longitude1||double||Longitude of the starting node|
|id2||long||The ID of the ending node|
|latitude2||double||Latitude of the ending node|
|longitude2||double||Longitude of the ending node|
|way_id||long||The ID of the way that contains this segment|
|tags||map[(chararray)]||All the tags of the way|
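Because every segment carries the way_id of its parent way, the segments can be grouped back by way. The sketch below counts the segments of each way; since each segment connects two consecutive nodes, a way with n nodes should yield n - 1 segments. It assumes the required JARs are already registered and that $input and $output are script parameters; the relation names and the output path segment_counts are hypothetical.

```pig
-- Assumption: the REGISTER statements for the required JARs come first (omitted here).
IMPORT 'osmx.pig';

-- One tuple per segment, with the fields listed in the table above
segments = LoadOSMWaysWithSegments('$input');

-- Collect all segments that belong to the same way
segments_by_way = GROUP segments BY way_id;

-- Count the segments of each way
segment_counts = FOREACH segments_by_way
    GENERATE group AS way_id, COUNT(segments) AS num_segments;

-- Hypothetical output location
STORE segment_counts INTO '$output/segment_counts';
```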
This macro also returns all ways in the file. However, unlike LoadOSMWaysWithSegments, it returns one tuple per way, which contains the entire geometry of the way. Each tuple is formatted as follows.
|way_id||long||The ID of the way as it appears in the OSM file|
|first_node_id||long||The ID of the first node in this way|
|last_node_id||long||The ID of the last node in this way|
|geom||bytearray||The geometry of the way|
|tags||map[(chararray)]||The tags of the way as they appear in the OSM file|
LoadOSMObjects
This macro returns all objects in the OSM file. An object is one of two kinds:
- First-level relations: relations that contain only ways.
- Dangling ways: ways that are not part of any relation.
The returned dataset does not contain second-level relations, that is, relations that contain other relations. The format of the returned dataset is as follows.
|object_id||long||The ID of either the relation or the way|
|geom||bytearray||The geometry of the object|
|tags||map[(chararray)]||The tags of either the way or the relation as they appear in the OSM file|
The script planet-extractor.pig provides an example that extracts the datasets that are available on the SpatialHadoop datasets page. The header of this file imports the 'osmx.pig' file as well as the required JAR libraries.
The next two lines extract all nodes and write them to a file.
all_nodes = LoadOSMNodes('$input');
STORE all_nodes INTO '$output/all_nodes.bz2';
This is much simpler than the earlier code, in which the extraction was interleaved with writing the output.
Similarly, the following few lines extract the road network and write it to the output.
-- Extract road network
road_network = LoadOSMWaysWithSegments('$input');
road_network = FILTER road_network BY edu.umn.cs.spatialHadoop.osm.HasTag(tags,
road_network = FOREACH road_network GENERATE segment_id,
id1, latitude1, longitude1,
id2, latitude2, longitude2,
way_id, edu.umn.cs.spatialHadoop.osm.MapToJson(tags) AS tags;
STORE road_network INTO '$output/road_network.bz2';
Although the code looks a little ugly, it contains only four statements. The first extracts all the ways as segments using the LoadOSMWaysWithSegments macro. The second keeps only the segments that belong to the road network, based on the tags attribute. The third selects the output columns and passes the tags through MapToJson, and the fourth writes the output.
Similar to the road network, the next few lines extract and store the buildings dataset.
all_objects = LoadOSMObjects('$input');
buildings = FILTER all_objects BY edu.umn.cs.spatialHadoop.osm.HasTag(tags,
STORE buildings INTO '$output/buildings.bz2';
The first statement extracts all the objects from the file. The second statement filters the buildings using the tags attribute. Finally, the third statement stores the output.
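The same load, filter, and store pattern can be adapted to other tag-based datasets. The sketch below is not part of planet-extractor.pig: it uses Pig's built-in map lookup (tags#'key') instead of the HasTag UDF, and keeps every object that carries a leisure tag. The tag key, the relation name, and the output path are illustrative assumptions, and the required JARs are assumed to be registered already.

```pig
-- Assumption: the REGISTER statements for the required JARs come first (omitted here).
IMPORT 'osmx.pig';

-- First-level relations and ways that are not part of any relation
all_objects = LoadOSMObjects('$input');

-- tags#'leisure' looks up the key in the tags map and yields null when the key is absent,
-- so this keeps only the objects that have a 'leisure' tag (illustrative key)
leisure_objects = FILTER all_objects BY tags#'leisure' IS NOT NULL;

-- Hypothetical output location
STORE leisure_objects INTO '$output/leisure_objects.bz2';
```

Testing for the presence of a key works regardless of the value type stored in the map; comparing the looked-up value against a specific string would additionally depend on how the tag values are typed.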
This work was partially supported by an AWS in Education Grant.
- Louai Alarabi, Ahmed Eldawy, Rami Alghamdi, Mohamed F. Mokbel, "TAREEG: A MapReduce-Based Web System for Extracting Spatial Data from OpenStreetMap", In Proceedings of the ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, (SIGSPATIAL GIS 2014), Dallas, TX, November 2014
- Around the world in one hour!
- SpatialHadoop datasets