Friday, March 27, 2015

Around the world in one hour!


This blog post shows you how you can process the whole Planet file produced by OpenStreetMap in only one hour. We use SpatialHadoop, an extension to Hadoop that supports spatial data, along with its high level language, Pigeon, to distribute the work over 50 machines and get it done within one hour instead of a week. The program is only tens of lines of code and can be easily customized to produce different output datasets or do further processing to output a more suitable output.

Planet file

OpenStreetMap is a great source of maps and geographic data. It allows volunteers to contribute to maps all over the world and make all this data publicly available for use. This data is officially provided in one big XML file called the Planet.osm file. It contains all information from all over the world in one common XML schema. Since this XML file is not in a standard format that GIS software can deal with, you need to convert this file into a more standardized format. Osm2pgsql is one alternative that loads the whole file into a standard DBMS schema where you can further use PostGIS to process it. According to the benchmark, it takes around two days just to load the file into the database. Some optimizations along with a very powerful machine, can get this down to around 7 hours. I have been taking to people who tried it themselves and it takes up to seven days on a commodity machine. With the technique shown in this blog post, we area able to take this down to less than one hour. In the rest of this blog post, we will show how you can do it yourself.


Pigeon is an open source project that adds OGC-standard spatial functions to Pig Latin. This makes it capable of processing very large files efficiently and easily on a cluster of machines running Hadoop. We use it to parse the Planet file and produce the objects of interest out of it.


To run this script, you need the following.
  • A running Hadoop cluster operating on HDFS. Check the setup guide of Hadoop 2.6.0
  • Pig installed and configured to run with Hadoop. Check the setup guide of Pig 0.14.0
  • The script requires the jar of Pigeon to be present in the same folder as the script. Download the jar of Pigeon 0.2.0.
  • The JAR file of SpatialHadoop which contains some UDFs used to parse the Planet.xml file. This JAR file can be found in the binary package of SpatialHadoop 2.3.
  • You also need the JAR file of JTS 1.8 and ESRI-Geometry-API which also ship with SpatialHadoop and can be found in the binary package.
  • You need the JAR of piggybank which contains the XML parser. You will find this JAR file as part of your Pig installation.
  • You need to 'pigeon_import.pig' file which is downloaded here.
  • Finally, you need the extraction script which is available as part of the source code of SpatialHadoop on github.

How to Run

To run the extraction script, you need to place the osmx.pig script along with all required JAR files in the same folder. Then, you need to execute the script using Pig. For example, if your input file is called '/planet.osm.bz2' stored at the root of your HDFS, and you want to store the output to the path '/planet-datasets', you will execute the following command
pig -param input=/planet.osm.bz2 -param output=/planet-datasets osmx.pig
Pig will execute a series of MapReduce jobs depending on the type of data generated. You can track its progress from the job tracker or the resource manager. The next part of this blog post will show how the script runs and how it can be customized to generate data that better matches your need.
Notice that the code contains the keyword 'PARALLEL' in several places to adjust the parallelism according to number of reducers in the cluster. You can adjust this number according to your cluster size to tune up the performance.

Extraction Script

The script runs in three phases, node extraction, way extraction, and relation extraction.

Extracting nodes

The first phase extracts only the data in the nodes section. Each node contains an ID, latitude, longitude, and a set of tags represented as key-value list. The code of this part is shown below.
xml_nodes = LOAD '$input' USING XMLLoader('node') AS (node:chararray);
parsed_nodes = FOREACH xml_nodes GENERATE OSMNode(node) AS node;
If you're interested in extracting data in a specific region, this would be the best place to filter the data according to its location. For example, if you want to extract the data around the state of Minnesota, you can add the following line
parsed_nodes = FILTER parsed_nodes BY ST_Contains(ST_MakeBox(-97.2,43.5,-89.5,49.4), ST_MakePoint(node.lon,;

Extracting ways

In the second phase, ways are extracted from the second section in the Planet file. These ways are joined with nodes to produce shapes that connect the nodes together. This can either produce full objects, if they are very small, or partial objects, if they are too large for a way. The code for this part is shown below
xml_ways = LOAD '$input' USING XMLLoader('way') AS (way:chararray);
parsed_ways = FOREACH xml_ways GENERATE OSMWay(way) AS way;
flattened_ways = FOREACH parsed_ways
  GENERATE AS way_id, FLATTEN(way.nodes), way.tags AS tags;
joined_ways = JOIN node_locations BY id, flattened_ways BY node_id PARALLEL 70;
ways_with_nodes = GROUP joined_ways BY way_id PARALLEL 70;
ways_with_shapes = FOREACH ways_with_nodes {
  ordered = ORDER joined_ways BY pos;
  tags = FOREACH joined_ways GENERATE tags;
  GENERATE group AS way_id, ST_MakeLinePolygon(ordered.node_id, ordered.location) AS geom,
    FLATTEN(TOP(1, 0, tags)) AS tags;
The first two lines read and parse elements of the ways section. Each way is represented by an ID, tags, and a list of node IDs. Connecting the nodes with these IDs produce the shape of the way. Lines 3 and 4 flatten the ways so that each record contains one node ID and then join this with nodes to add the location of each node (latitude, longitude). After that, we perform a GROUP BY operation to group these objects back by way ID after they were annotated with locations. The method MakeLinePolygon creates a geometric shape out of a list of points. If the ID of the first and last points are the same, a polygon is created, otherwise, a linestring is created. If you are only interested in objects formed by ways, you can just write this result to the output file.

If you're interested in line segments instead of full shapes, you can use the following commands which generates the result as a collection of line segments, each connecting two points together.
roads_with_nodes = GROUP joined_ways BY way_id PARALLEL 70;
raod_segments = FOREACH roads_with_nodes {
  ordered = ORDER road_network BY pos;
  tags = FOREACH road_network GENERATE tags;
  GENERATE group AS way_id, ST_MakeSegments(ordered.node_id, ordered.location) AS geom,
    FLATTEN(TOP(1, 0, tags)) AS tags;
raod_segments = FOREACH raod_segments GENERATE way_id, FLATTEN(geom), tags;
It starts from the joined_ways and instead of calling the MakeLinePolygon function, it calls the MakeSegments function which generates a list of road segments for each two consecutive points. This can be used, for example, to generate the road network graph as a set of edges.

Extracting relations

Phase 3 extracts relations in a similar way of extracting ways. It reads and parses relations where each one is represented as a list of way IDs. It flattens it to produce one way ID per line, and then joins it with ways. Finally, the result is grouped again by relation ID and the function Connect is called. The Connect functions connects multiple linestrings together to produce a longer linestring or a polygon, if they form a closed ring. The code is shown below.
xml_relations = LOAD '$input' USING XMLLoader('relation') AS (relation:chararray);
parsed_relations = FOREACH xml_relations GENERATE OSMRelation(relation) AS relation;
flattened_relations = FOREACH filtered_relations
  GENERATE AS relation_id, FLATTEN(relation.members), relation.tags AS tags;
relations_join_ways = JOIN flattened_relations BY member_id RIGHT OUTER, ways_with_shapes BY way_id PARALLEL 70;
relations_with_shapes = FOREACH relations_by_role {
  tags = FOREACH relations_with_ways GENERATE tags;
  GENERATE group.relation_id AS relation_id, group.member_role AS way_role,
    ST_Connect(relations_with_ways.first_node_id, relations_with_ways.last_node_id, relations_with_ways.way_shape) AS shape,
    FLATTEN(TOP(1, 0, tags)) AS tags;


We tried this script on a cluster of around 50 nodes running Hadoop and Pig 0.12.1. It took around one hour to run all of the three phases and generate all relations from the file 'planet-150112.osm.bz2' with total size of 40GB.

Further Reads

[2]  Louai Alarabi, Ahmed Eldawy, Rami Alghamdi, Mohamed F. Mokbel, "TAREEG: A MapReduce-Based Web System for Extracting Spatial Data from OpenStreetMap", In Proceedings of the ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, (SIGSPATIAL GIS 2014), Dallas, TX, November 2014
[3] Louai Alarabi, Ahmed Eldawy, Rami Alghamdi and Mohamed F. Mokbel, "TAREEG: A MapReduce-Based Web Service for Extracting Spatial Data from OpenStreetMap", In Proceedings of ACM SIGMOD Conference on Management of Data, (ACM SIGMOD 2014), Salt Lake City, UT, June, 2014

1 comment:

  1. XML also become a standardized method for the exchange of data as well as documents.
    So XML become a way for databases from different vendors to exchange of data across the Internet. For more information: xml file