PgOSM: Transform OpenStreetMap data in PostGIS
My previous post, Load OpenStreetMap data to PostGIS,
covered how to use osm2pgsql
to load a sub-region export of OpenStreetMap data into PostGIS (PostgreSQL).
With the data loaded to Postgres, one quickly finds it isn't easy to jump in and use right away. The PgOSM project was created to help solve this problem. The main purpose of PgOSM is to restructure the OpenStreetMap data into a friendlier format for relational databases. This post starts by showing how PgOSM is used to transform our OpenStreetMap data and ends by explaining why we do it this way.
Using PgOSM
This post assumes you followed the instructions from Load OpenStreetMap data to PostGIS and have a region of OpenStreetMap data loaded into PostgreSQL.
To start, switch to the postgres user (in Linux) and clone the PgOSM repo.
sudo su - postgres
cd ~/git
git clone https://github.com/rustprooflabs/pgosm.git
Deploy schema and define layers
Deploying the table structure is done via sqitch.
cd ~/git/pgosm/db
sqitch deploy db:pg:pgosm
The transformation process uses data from two tables: pgosm.layer_group and pgosm.layer_detail. The pgosm.layer_group table defines the logical groupings of data, such as roads, waterways, or buildings. The pgosm.layer_detail table defines how to filter the OSM tags, which columns to include, and so forth.
Load default layer groupings:
psql -d pgosm -f ~/git/pgosm/db/data/layer_definitions.sql
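With the defaults loaded, you can sanity-check what the transformation will work from by querying the two definition tables directly (the exact columns may vary between PgOSM versions, so SELECT * is used here rather than assuming names):

```sql
-- List the layer groupings that drive the transformation
SELECT * FROM pgosm.layer_group;

-- Peek at a few of the per-layer filter definitions
SELECT * FROM pgosm.layer_detail LIMIT 10;
```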
Set up Python environment
The PgOSM transformations are processed through Python. The Python program (included in the pgosm repo cloned above) reads the data loaded from the layer_definitions.sql file and generates the SQL code required to create the transformed data model.
Virtualenv
Create a pgosm virtual environment for Python and install the required modules.
mkdir ~/venv
cd ~/venv
python3 -m venv pgosm
source ~/venv/pgosm/bin/activate
pip install -r ~/git/pgosm/requirements.txt
mkdir ~/git/pgosm/output
Env vars
The Python portion of the process needs to build a connection string to connect to the database and apply the transformations. Set these appropriately to connect to your pgosm database.
export DB_HOST=localhost
export DB_NAME=pgosm
export DB_USER=your_db_user
This assumes you have a password set up in ~/.pgpass. If you can't do that (or don't want to) you can add:
export DB_PW=NonyaBusine$s
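If you do want password-free connections, a ~/.pgpass entry is a one-liner. The file format is hostname:port:database:username:password, and libpq refuses to use the file unless it has 0600 permissions. The values below are placeholders; substitute your own.

```shell
# Append a placeholder entry for the pgosm database.
# Format: hostname:port:database:username:password
echo 'localhost:5432:pgosm:your_db_user:your_password' >> ~/.pgpass

# libpq ignores ~/.pgpass unless it is readable only by the owner
chmod 600 ~/.pgpass
```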
Run PgOSM
It's time to run the transformation! These commands ensure the virtual environment is active and run the process via Python.
source ~/venv/pgosm/bin/activate
cd ~/git/pgosm
python -c "import pgosm; pgosm.process_layers();"
Note: pgosm.process_layers() has an optional schema parameter to override the default osm schema name. Use pgosm.process_layers(schema='osm_co') to create the transformations in a schema named osm_co.
You should get output similar to:
Starting processing.
File "output/create_pgosm_layers.sql" did not exist, nothing to remove.
24 layers returned
Processing layer place (layer_group_id=1).
Processing layer boundary (layer_group_id=2).
Processing layer admin_area (layer_group_id=3).
Stopping process_layer_classes(), no combined_filter.
Processing layer building (layer_group_id=4).
Processing layer public (layer_group_id=5).
Processing layer health (layer_group_id=6).
Processing layer leisure (layer_group_id=7).
Processing layer food (layer_group_id=8).
Processing layer accomodation (layer_group_id=9).
Processing layer shopping (layer_group_id=10).
Processing layer tourism (layer_group_id=11).
Processing layer miscpoi (layer_group_id=12).
Processing layer pofw (layer_group_id=13).
Processing layer natural (layer_group_id=14).
Processing layer transport (layer_group_id=15).
Stopping process_layer_classes(), no combined_filter.
Processing layer road (layer_group_id=16).
Processing layer traffic (layer_group_id=17).
Processing layer railway (layer_group_id=18).
Processing layer power (layer_group_id=19).
Processing layer powerline (layer_group_id=20).
Processing layer landuse (layer_group_id=21).
Processing layer waterway (layer_group_id=22).
Processing layer water (layer_group_id=23).
Processing layer coastline (layer_group_id=24).
Executing SQL Script...
Executing SQL Script completed.
Finished processing. Total time elapsed: 46.1 seconds.
The pgosm database now has a schema named osm. At the time of writing it generated 38 tables.
Export results
Remember, in the first post I clearly stated this Postgres instance was configured for temporary purposes.
"The Postgres instance we are loading data into is for temporary purposes only."
Use pg_dump to extract the transformed OSM data.
pg_dump -d pgosm --schema osm --no-owner -f osm_colorado.sql
Downside: File size
If you dump the raw SQL output it will be a few times larger than the original .pbf (969 MB vs 165 MB). gzip helps with that, getting the SQL version down to 234 MB, still 42% larger than the original.
gzip osm_colorado.sql
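The dump and compression can also be done in one pass by piping pg_dump straight into gzip, which avoids ever writing the 969 MB uncompressed file to disk. Restoring is the reverse pipe; the target database name here (pgosm_restore) is a placeholder for whatever database you're loading into.

```shell
# Dump and compress in one step -- no uncompressed intermediate file
pg_dump -d pgosm --schema osm --no-owner | gzip > osm_colorado.sql.gz

# Restore later by streaming the SQL back into psql
# (assumes the target database already exists)
gunzip -c osm_colorado.sql.gz | psql -d pgosm_restore
```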
Why change the structure?
As my previous post showed, OSM data loaded via osm2pgsql is divided into three tables based on geometry type: public.planet_osm_point (586k rows), public.planet_osm_line (1.13M rows), and public.planet_osm_polygon (650k rows).
While this structure makes sense for the very flexible OpenStreetMap project, it is far from great for an analyst using a tool like QGIS. It's pretty bad from a DBA's perspective too. To help understand why this original structure isn't ideal, let's examine what we have in those three tables.
With the handy PgDD data dictionary installed I can run a query showing the number of columns per table.
SELECT t.s_name, t.t_name, t.size_pretty, t.rows::BIGINT, COUNT(c.column_name) AS column_count
FROM dd.tables t
INNER JOIN dd.columns c ON t.s_name = c.s_name AND t.t_name = c.t_name
WHERE t.s_name = 'public' AND t.t_name IN ('planet_osm_line', 'planet_osm_point', 'planet_osm_polygon')
GROUP BY t.s_name, t.t_name, t.size_pretty, t.rows
;
+--------+--------------------+-------------+---------+--------------+
| s_name | t_name | size_pretty | rows | column_count |
+--------+--------------------+-------------+---------+--------------+
| public | planet_osm_line | 520 MB | 1135458 | 48 |
| public | planet_osm_point | 62 MB | 594772 | 34 |
| public | planet_osm_polygon | 235 MB | 655846 | 48 |
+--------+--------------------+-------------+---------+--------------+
Two of the three tables have 48 columns, and the other has 34 columns. Extremely wide tables are database design anti-pattern #3:
"In general, tables typically shouldn't have more than 30 or 40 columns in them, with most tables having under 20 columns."
The table with 34 columns is in the gray area (30-40 columns), but the other two exceed that guideline with 48 columns. A closer look at the column names in the _line table helps explain what those columns are.
SELECT t_name, column_name, data_type
FROM dd.columns c
WHERE c.s_name = 'public' AND c.t_name = 'planet_osm_line'
LIMIT 10;
+-----------------+--------------------+-----------+
| t_name | column_name | data_type |
+-----------------+--------------------+-----------+
| planet_osm_line | osm_id | int8 |
| planet_osm_line | access | text |
| planet_osm_line | addr:housename | text |
| planet_osm_line | addr:housenumber | text |
| planet_osm_line | addr:interpolation | text |
| planet_osm_line | admin_level | text |
| planet_osm_line | aerialway | text |
| planet_osm_line | aeroway | text |
| planet_osm_line | amenity | text |
| planet_osm_line | barrier | text |
+-----------------+--------------------+-----------+
The first red flag is that there are column names including a colon (:) character. Those are terrible names in a database because you must enclose them in double quotes (e.g. SELECT "addr:housenumber", ew!). Use of special characters violates common best practices in database object naming conventions.
The above is a look at 10 of the 48 columns in that table, and the incredible diversity of topics (addresses vs aeroways vs amenities) is immediately apparent. A line that represents a barrier (and would have data in the barrier column) isn't likely to also be an aeroway or a street address. This means the columns are sparsely populated, or in other words, they have lots of missing values.
The following query calculates the percentage of rows with values in a few of these attribute columns and shows exactly that.
SELECT ROUND(COUNT(access) * 1.0 / COUNT(*), 4) AS access,
ROUND(COUNT("addr:housenumber") * 1.0 / COUNT(*), 4) AS addr_housenumber,
ROUND(COUNT(amenity) * 1.0 / COUNT(*), 4) AS amenity,
ROUND(COUNT(bicycle) * 1.0 / COUNT(*), 4) AS bicycle,
ROUND(COUNT(boundary) * 1.0 / COUNT(*), 4) AS boundary,
ROUND(COUNT(foot) * 1.0 / COUNT(*), 4) AS foot,
ROUND(COUNT(highway) * 1.0 / COUNT(*), 4) AS highway,
ROUND(COUNT(leisure) * 1.0 / COUNT(*), 4) AS leisure,
ROUND(COUNT("natural") * 1.0 / COUNT(*), 4) AS nature,
ROUND(COUNT(place) * 1.0 / COUNT(*), 4) AS place
FROM public.planet_osm_line
;
-[ RECORD 1 ]----+----------------
access | 0.0430
addr_housenumber | 0.0002
amenity | 0.0000
bicycle | 0.0336
boundary | 0.0064
foot | 0.0270
highway | 0.5477
leisure | 0.0005
nature | 0.0034
place | 0.0012
PSQL Tip: Use \x to enable "extended display" when looking at a single row of data.
The above results show that most of these attribute columns have data in only a tiny fraction (if any) of the rows. Then there's the highway column, with data in 55% of the rows!
New Structure
The above query showing columns per table is used again, but this time we look at the new osm schema. We quickly see that we have a lot more tables. Each table is a nice, reasonable chunk of data, logically organized in a "layer" fashion suitable for loading quickly into QGIS. Most of the polygons end up in the building_polygon table, and most of the lines end up in road_line.
SELECT t.s_name, t.t_name, t.size_pretty, t.rows::BIGINT, COUNT(c.column_name) AS column_count
FROM dd.tables t
INNER JOIN dd.columns c ON t.s_name = c.s_name AND t.t_name = c.t_name
WHERE t.s_name = 'osm'
GROUP BY t.s_name, t.t_name, t.size_pretty, t.rows
;
+--------+----------------------+-------------+--------+--------------+
| s_name | t_name | size_pretty | rows | column_count |
+--------+----------------------+-------------+--------+--------------+
| osm | accomodation_point | 232 kB | 2475 | 7 |
| osm | accomodation_polygon | 648 kB | 2166 | 7 |
| osm | boundary_line | 3616 kB | 3312 | 6 |
| osm | boundary_polygon | 2208 kB | 486 | 6 |
| osm | building_point | 744 kB | 4158 | 9 |
| osm | building_polygon | 128 MB | 447380 | 9 |
| osm | food_point | 560 kB | 5652 | 8 |
| osm | food_polygon | 448 kB | 1794 | 8 |
| osm | health_point | 80 kB | 668 | 6 |
| osm | health_polygon | 152 kB | 383 | 6 |
| osm | landuse_polygon | 40 MB | 94347 | 8 |
| osm | leisure_point | 152 kB | 1516 | 7 |
| osm | leisure_polygon | 5800 kB | 16955 | 7 |
| osm | miscpoi_point | 584 kB | 7591 | 4 |
| osm | miscpoi_polygon | 176 kB | 733 | 4 |
| osm | natural_line | 656 kB | 1523 | 6 |
| osm | natural_point | 19 MB | 262709 | 6 |
| osm | natural_polygon | 152 kB | 215 | 6 |
| osm | place_point | 4536 kB | 49773 | 5 |
| osm | place_polygon | 21 MB | 32471 | 5 |
| osm | pofw_point | 112 kB | 893 | 5 |
| osm | pofw_polygon | 320 kB | 764 | 5 |
| osm | power_point | 4440 kB | 58992 | 5 |
| osm | power_polygon | 232 kB | 1199 | 5 |
| osm | powerline_line | 1864 kB | 5532 | 5 |
| osm | public_point | 368 kB | 3580 | 6 |
| osm | public_polygon | 1056 kB | 2707 | 6 |
| osm | railway_line | 280 kB | 871 | 7 |
| osm | road_line | 163 MB | 605950 | 7 |
| osm | shopping_point | 504 kB | 5077 | 7 |
| osm | shopping_polygon | 640 kB | 2406 | 7 |
| osm | tourism_point | 352 kB | 3850 | 7 |
| osm | tourism_polygon | 312 kB | 1013 | 7 |
| osm | traffic_line | 5808 kB | 35451 | 7 |
| osm | traffic_point | 11 MB | 140509 | 7 |
| osm | traffic_polygon | 7904 kB | 30555 | 7 |
| osm | water_polygon | 27 MB | 48176 | 7 |
| osm | waterway_line | 163 MB | 399394 | 7 |
+--------+----------------------+-------------+--------+--------------+
This structure makes it easy to pull out just the natural data in line format (osm.natural_line), all 1,523 rows of it. This way you don't have to first filter out the other 1,133,935 rows (99.87%) from public.planet_osm_line.
It also includes only the columns relevant to those types of features, like "natural" and elevation. Columns that don't relate (e.g. powerline, highway, and building) are excluded from this table.
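The payoff shows up in everyday queries. Both of the queries below pull natural features in line form, but only the first is pleasant to write. The second is only an approximation of what PgOSM actually does, since the real filtering rules live in the layer definition tables, and it forces you to quote the "natural" column on every use:

```sql
-- New structure: the table itself is the filter
SELECT * FROM osm.natural_line;

-- Raw osm2pgsql structure: filter (and quote) the sparse tag column yourself
SELECT *
    FROM public.planet_osm_line
    WHERE "natural" IS NOT NULL;
```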
See the "Row Count Matters" section for more examples of why managing your spatial data properly is important!
Summary
The first part of this post completed the tutorial of how to load OpenStreetMap data into a PostGIS-enabled PostgreSQL database. The second part explained a little of the logic behind why I load and transform the data in the way I do. The PgOSM project started in 2015 as a private project and only recently was cleaned up and released on GitHub. I intend to continue to improve and maintain this project as time moves forward.
The next post on this topic, PgOSM transformations explained, explores how the layer definition tables are populated and how to customize the process by creating your own layer definitions.
Need help with your PostgreSQL servers or databases? Contact us to start the conversation!