
PgOSM: Transform OpenStreetMap data in PostGIS

By Ryan Lambert -- Published January 25, 2019

My previous post, Load OpenStreetMap data to PostGIS, covered how to use osm2pgsql to load a sub-region export of OpenStreetMap data into PostGIS (PostgreSQL). Once the data is in Postgres, one quickly finds it isn't easy to jump in and use right away. The PgOSM project was created to help solve this problem. The main purpose of PgOSM is to restructure the OpenStreetMap data into a friendlier format for relational databases.

This post starts by showing how to use the PgOSM project to transform our OpenStreetMap data, then explains why we restructure the data this way.

Using PgOSM

This post assumes you followed instructions from Load OpenStreetMap data to PostGIS and have a region of OpenStreetMap data loaded into PostgreSQL.

To start, clone the PgOSM repo.

cd ~/git
git clone https://github.com/rustprooflabs/pgosm.git

Deploy schema and define layers

Deploying the table structure is done via sqitch.

sqitch deploy db:pg://your_db_user@localhost/pgosm

The transformation process uses data from two tables: pgosm.layer_group and pgosm.layer_detail. The pgosm.layer_group table defines the logical groupings of data, such as roads, waterways, or buildings. The pgosm.layer_detail table defines how to filter the OSM tags, which columns to include, and so on.

Load default layer groupings:

psql -U your_db_user -d pgosm -f ~/git/pgosm/db/data/layer_definitions.sql

Setup Python environment

The PgOSM transformations are processed through Python. The Python program (included in the pgosm repo cloned above) reads the data loaded from the layer_definitions.sql file and generates the SQL code required to create the transformed data model.

Virtualenv

Create a pgosm virtual environment for Python and install the required modules.

mkdir ~/venv
cd ~/venv
python3 -m venv pgosm
source ~/venv/pgosm/bin/activate
pip install -r ~/git/pgosm/requirements.txt
mkdir ~/git/pgosm/output

Env vars

The Python portion of the process builds a connection string to connect to the database and apply the transformations. Set these variables appropriately to connect to your pgosm database.

export DB_HOST=localhost
export DB_NAME=pgosm
export DB_USER=your_db_user
export DB_PW=NonyaBusine$s
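To illustrate how those variables come together, here is a hypothetical sketch (not PgOSM's actual internals) that assembles a libpq-style connection URI from the environment variables exported above:

```python
import os

# Hypothetical sketch: reads the DB_* variables exported above and
# builds a libpq-style connection URI. The exact format PgOSM uses
# internally is an assumption here.
def build_conn_string() -> str:
    return "postgresql://{user}:{pw}@{host}/{db}".format(
        user=os.environ["DB_USER"],
        pw=os.environ["DB_PW"],
        host=os.environ["DB_HOST"],
        db=os.environ["DB_NAME"],
    )
```

With the variables above set, this would produce something like postgresql://your_db_user:...@localhost/pgosm.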

Run PgOSM

It's time to run the transformation! These commands ensure the virtual environment is active and run the process via Python.

source ~/venv/pgosm/bin/activate
cd ~/git/pgosm
python -c "import pgosm; pgosm.process_layers();"

You should get output similar to:

Starting processing.
File "output/create_pgosm_layers.sql" did not exist, nothing to remove.
24 layers returned
Processing layer place (layer_group_id=1).
Processing layer boundary (layer_group_id=2).
Processing layer admin_area (layer_group_id=3).
Stopping process_layer_classes(), no combined_filter.
Processing layer building (layer_group_id=4).
Processing layer public (layer_group_id=5).
Processing layer health (layer_group_id=6).
Processing layer leisure (layer_group_id=7).
Processing layer food (layer_group_id=8).
Processing layer accomodation (layer_group_id=9).
Processing layer shopping (layer_group_id=10).
Processing layer tourism (layer_group_id=11).
Processing layer miscpoi (layer_group_id=12).
Processing layer pofw (layer_group_id=13).
Processing layer natural (layer_group_id=14).
Processing layer transport (layer_group_id=15).
Stopping process_layer_classes(), no combined_filter.
Processing layer road (layer_group_id=16).
Processing layer traffic (layer_group_id=17).
Processing layer railway (layer_group_id=18).
Processing layer power (layer_group_id=19).
Processing layer powerline (layer_group_id=20).
Processing layer landuse (layer_group_id=21).
Processing layer waterway (layer_group_id=22).
Processing layer water (layer_group_id=23).
Processing layer coastline (layer_group_id=24).
Executing SQL Script...
Executing SQL Script completed.
Finished processing.  Total time elapsed: 46.1 seconds.

The pgosm database now has a schema named osm. At the time of writing it generated 38 tables.

Export results

Remember, in the first post I clearly stated this Postgres instance was configured for temporary purposes.

"The Postgres instance we are loading data into is for temporary purposes only."

Use pg_dump to extract the transformed OSM data.

pg_dump -U your_db_user -d pgosm --schema osm --no-owner -f osm_colorado.sql

Downside: File size

Dumping the raw SQL output yields a file nearly six times larger than the original .pbf (969 MB vs 165 MB). gzip helps, getting the SQL version down to 234 MB, though that is still 42% larger than the original.

gzip osm_colorado.sql
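The size comparisons above are easy to sanity-check with a bit of arithmetic:

```python
# Sizes reported above, in MB.
pbf_mb = 165   # original .pbf export
sql_mb = 969   # raw pg_dump SQL output
gzip_mb = 234  # SQL output after gzip

raw_ratio = sql_mb / pbf_mb                 # raw SQL vs .pbf
gzip_overhead_pct = (gzip_mb / pbf_mb - 1) * 100  # gzipped SQL vs .pbf

print(round(raw_ratio, 1))        # 5.9
print(round(gzip_overhead_pct))   # 42
```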

Why change the structure?

As my previous post showed, OSM data loaded via osm2pgsql is divided into three tables based on geometry type: public.planet_osm_point (586k rows), public.planet_osm_line (1.13M rows), public.planet_osm_polygon (650k rows). While this structure makes sense for the very flexible OpenStreetMap project, it is far from great for an analyst using a tool like QGIS. It's pretty bad from a DBA's perspective too.

To understand why this original structure isn't ideal, let's examine what's in those three tables. With the handy PgDD data dictionary installed, I can run a query showing the number of columns per table.

SELECT t.s_name, t.t_name, t.size_pretty, t.rows::BIGINT, COUNT(c.column_name) AS column_count
    FROM dd.tables t
    INNER JOIN dd.columns c ON t.s_name = c.s_name AND t.t_name = c.t_name
    WHERE t.s_name = 'public' AND t.t_name IN ('planet_osm_line', 'planet_osm_point', 'planet_osm_polygon')
    GROUP BY t.s_name, t.t_name, t.size_pretty, t.rows
;
+--------+--------------------+-------------+---------+--------------+
| s_name | t_name             | size_pretty | rows    | column_count |
+--------+--------------------+-------------+---------+--------------+
| public | planet_osm_line    | 520 MB      | 1135458 | 48           |
| public | planet_osm_point   | 62 MB       | 594772  | 34           |
| public | planet_osm_polygon | 235 MB      | 655846  | 48           |
+--------+--------------------+-------------+---------+--------------+

Two of the three tables have 48 columns; the other has 34. Extremely wide tables are database design anti-pattern #3:

"In general, tables typically shouldn't have more than 30 or 40 columns in them, with most tables having under 20 columns."

The table with 34 columns is in the gray area (30-40 columns), but the other two exceed that guideline with 48 columns. A closer look at the column names in the _line table helps explain what those columns are.

SELECT t_name, column_name, data_type
    FROM dd.columns c
    WHERE c.s_name = 'public' AND c.t_name = 'planet_osm_line'
LIMIT 10;
+-----------------+--------------------+-----------+
| t_name          | column_name        | data_type |
+-----------------+--------------------+-----------+
| planet_osm_line | osm_id             | int8      |
| planet_osm_line | access             | text      |
| planet_osm_line | addr:housename     | text      |
| planet_osm_line | addr:housenumber   | text      |
| planet_osm_line | addr:interpolation | text      |
| planet_osm_line | admin_level        | text      |
| planet_osm_line | aerialway          | text      |
| planet_osm_line | aeroway            | text      |
| planet_osm_line | amenity            | text      |
| planet_osm_line | barrier            | text      |
+-----------------+--------------------+-----------+

The first red flag is that some column names include a colon (:) character. Those are terrible names in a database because you must enclose them in double quotes (e.g. SELECT "addr:housenumber", ew!). Using special characters violates common best practices for database object naming conventions.
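To see the quoting annoyance in action, here is a minimal stand-in using SQLite from Python (the double-quoted identifier rule is standard SQL, the same behavior Postgres follows):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# A column name containing a colon must be double-quoted everywhere.
con.execute('CREATE TABLE addr_demo ("addr:housenumber" TEXT)')
con.execute("INSERT INTO addr_demo VALUES ('42')")
# Referencing the column without quotes is a syntax error;
# with quotes it works:
row = con.execute('SELECT "addr:housenumber" FROM addr_demo').fetchone()
print(row[0])  # 42
```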

The above shows 10 of the 48 columns in that table, and the diversity of topics (addresses vs. aeroways vs. amenities) is immediately apparent. A line that represents a barrier (and has data in the barrier column) isn't likely to also be an aeroway or a street address. This means the columns are sparsely populated; in other words, they have lots of missing values. The following query calculates the percentage of rows with values in a few of these attribute columns and shows exactly that.

SELECT ROUND(COUNT(access) * 1.0 / COUNT(*), 4) AS access, 
        ROUND(COUNT("addr:housenumber") * 1.0 / COUNT(*), 4) AS addr_housenumber,
        ROUND(COUNT(amenity) * 1.0 / COUNT(*), 4) AS amenity,
        ROUND(COUNT(bicycle) * 1.0 / COUNT(*), 4) AS bicycle,
        ROUND(COUNT(boundary) * 1.0 / COUNT(*), 4) AS boundary,
        ROUND(COUNT(building) * 1.0 / COUNT(*), 4) AS building,
        ROUND(COUNT(foot) * 1.0 / COUNT(*), 4) AS foot,
        ROUND(COUNT(highway) * 1.0 / COUNT(*), 4) AS highway,
        ROUND(COUNT(leisure) * 1.0 / COUNT(*), 4) AS leisure,
        ROUND(COUNT("natural") * 1.0 / COUNT(*), 4) AS nature,
        ROUND(COUNT(place) * 1.0 / COUNT(*), 4) AS place
    FROM public.planet_osm_line
;
-[ RECORD 1 ]----+----------------
access           | 0.0430
addr_housenumber | 0.0002
amenity          | 0.0000
bicycle          | 0.0336
boundary         | 0.0064
foot             | 0.0270
highway          | 0.5477
leisure          | 0.0005
nature           | 0.0034
place            | 0.0012

PSQL Tip: Use \x to enable "extended display" when looking at a single row of data.

The above results show that most of these attribute columns have data in a tiny fraction (if any) of the rows. Then there's the highway column, populated in 55% of the rows!
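The query above leans on SQL's COUNT(column) ignoring NULLs, so COUNT(col) * 1.0 / COUNT(*) yields the populated fraction. A tiny SQLite illustration of the same trick:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE lines (highway TEXT, amenity TEXT)")
con.executemany(
    "INSERT INTO lines VALUES (?, ?)",
    [("residential", None), ("footway", None),
     (None, None), ("primary", "parking")],
)
# COUNT(highway) counts only non-NULL values: 3 of the 4 rows.
frac = con.execute(
    "SELECT ROUND(COUNT(highway) * 1.0 / COUNT(*), 4) FROM lines"
).fetchone()[0]
print(frac)  # 0.75
```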

New Structure

Running the earlier columns-per-table query again, this time against the new osm schema, we quickly see a lot more tables. Each table is a nice, reasonable chunk of data, logically organized into "layers" suitable for loading quickly into QGIS. Most of the polygons end up in the building_polygon table, and most of the lines end up in road_line.

SELECT t.s_name, t.t_name, t.size_pretty, t.rows::BIGINT, COUNT(c.column_name) AS column_count
    FROM dd.tables t
    INNER JOIN dd.columns c ON t.s_name = c.s_name AND t.t_name = c.t_name
    WHERE t.s_name = 'osm'
    GROUP BY t.s_name, t.t_name, t.size_pretty, t.rows
;
+--------+----------------------+-------------+--------+--------------+
| s_name | t_name               | size_pretty | rows   | column_count |
+--------+----------------------+-------------+--------+--------------+
| osm    | accomodation_point   | 232 kB      | 2475   | 7            |
| osm    | accomodation_polygon | 648 kB      | 2166   | 7            |
| osm    | boundary_line        | 3616 kB     | 3312   | 6            |
| osm    | boundary_polygon     | 2208 kB     | 486    | 6            |
| osm    | building_point       | 744 kB      | 4158   | 9            |
| osm    | building_polygon     | 128 MB      | 447380 | 9            |
| osm    | food_point           | 560 kB      | 5652   | 8            |
| osm    | food_polygon         | 448 kB      | 1794   | 8            |
| osm    | health_point         | 80 kB       | 668    | 6            |
| osm    | health_polygon       | 152 kB      | 383    | 6            |
| osm    | landuse_polygon      | 40 MB       | 94347  | 8            |
| osm    | leisure_point        | 152 kB      | 1516   | 7            |
| osm    | leisure_polygon      | 5800 kB     | 16955  | 7            |
| osm    | miscpoi_point        | 584 kB      | 7591   | 4            |
| osm    | miscpoi_polygon      | 176 kB      | 733    | 4            |
| osm    | natural_line         | 656 kB      | 1523   | 6            |
| osm    | natural_point        | 19 MB       | 262709 | 6            |
| osm    | natural_polygon      | 152 kB      | 215    | 6            |
| osm    | place_point          | 4536 kB     | 49773  | 5            |
| osm    | place_polygon        | 21 MB       | 32471  | 5            |
| osm    | pofw_point           | 112 kB      | 893    | 5            |
| osm    | pofw_polygon         | 320 kB      | 764    | 5            |
| osm    | power_point          | 4440 kB     | 58992  | 5            |
| osm    | power_polygon        | 232 kB      | 1199   | 5            |
| osm    | powerline_line       | 1864 kB     | 5532   | 5            |
| osm    | public_point         | 368 kB      | 3580   | 6            |
| osm    | public_polygon       | 1056 kB     | 2707   | 6            |
| osm    | railway_line         | 280 kB      | 871    | 7            |
| osm    | road_line            | 163 MB      | 605950 | 7            |
| osm    | shopping_point       | 504 kB      | 5077   | 7            |
| osm    | shopping_polygon     | 640 kB      | 2406   | 7            |
| osm    | tourism_point        | 352 kB      | 3850   | 7            |
| osm    | tourism_polygon      | 312 kB      | 1013   | 7            |
| osm    | traffic_line         | 5808 kB     | 35451  | 7            |
| osm    | traffic_point        | 11 MB       | 140509 | 7            |
| osm    | traffic_polygon      | 7904 kB     | 30555  | 7            |
| osm    | water_polygon        | 27 MB       | 48176  | 7            |
| osm    | waterway_line        | 163 MB      | 399394 | 7            |
+--------+----------------------+-------------+--------+--------------+

This structure makes it easy to pull out just the natural data in line format (osm.natural_line), all 1,523 rows of it, without first filtering out the other 1,133,935 rows (99.87%) from public.planet_osm_line. Each table also includes only the columns relevant to its type of data, like "natural" and elevation; unrelated columns (e.g. powerline, highway, and building) are excluded.
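The row counts behind that percentage check out:

```python
total_lines = 1_135_458   # rows in public.planet_osm_line
natural_lines = 1_523     # rows in osm.natural_line

other = total_lines - natural_lines
print(other)                                # 1133935
print(round(other / total_lines * 100, 2))  # 99.87
```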

See the "Row Count Matters" section for more examples of why managing your spatial data properly is important!

Summary

The first part of this post completed the tutorial on loading OpenStreetMap data into a PostGIS-enabled PostgreSQL database. The second part explained some of the logic behind why I load and transform the data the way I do. The PgOSM project started in 2015 as a private project and was only recently cleaned up and released on GitHub. I intend to continue improving and maintaining this project going forward.

A future post on this topic will explore how the layer definitions tables are populated and how to customize the process by creating your own layer definition data files.

Need help with your PostgreSQL servers or databases? Contact us to start the conversation!

Last Updated January 25, 2019