RustProof Labs: blogging for education (logo)
My book Mastering PostGIS and OpenStreetMap is available!

Better OpenStreetMap data using PgOSM Flex 0.6.0

By Ryan Lambert -- Published October 04, 2022

In late 2020 when osm2pgsql released the flex output I eagerly jumped on that bandwagon. The osm2pgsql flex output enabled the type of data structure and cleanup abilities I had always wanted from osm2pgsql. By January 2021 the PgOSM Flex project was up and running and I was phasing out my legacy OpenStreetMap processes. Since then, I have written more than a dozen posts exploring different improvements and use cases for the OpenStreetMap data loaded via PgOSM Flex. This post looks at a few notable improvements to version 0.6.0 over prior versions. The two areas of focus are:

Data quality improvements

The set of improvements that gave me the idea for this post were made in PgOSM Flex versions 0.5.1 and 0.6.0. Version 0.5.1 took advantage of the long awaited addition of multilinestring support to osm2pgsql. Adding that feature in osm2pgsql allowed relations of lines to be added in the same manner that relations of polygons had used. Without the multilinestring support, relations such as 13642053, shown in the following screenshot, were being skipped by the PgOSM Flex import. This improvement targeted roads, waterways, and public transport layers.

Screenshot from DBeaver showing a blue segment representing OSM relation 13642053, a roughly 33 kilometer stretch of road that had previously been excluded from data loaded by PgOSM Flex.

While I was making the changes to add the multilinestring support I realized I had not added relations to some other polygon layers. Those improvements weren't blocked by anything other than I hadn't noticed this issue yet. A rather prominent example of missing data in PgOSM Flex's osm.water_polygon table was Dillon Reservoir. The following query selects the data from a Colorado data set imported using PgOSM Flex 0.5.0. We get one result for Old Dillon Reservoir, which is a small reservoir next to what we know as Dillon Reservoir today.

SELECT osm_id, osm_type, osm_subtype, name,
        ST_Area(ST_Transform(geom, 2773)) AS area_m,
        geom
    FROM osm_0_5_0.water_polygon 
    WHERE name ILIKE '%Dillon Reservoir%'
;

┌──────────┬──────────┬─────────────┬──────────────────────┬──────────┐
│  osm_id  │ osm_type │ osm_subtype │         name         │  area_m  │
╞══════════╪══════════╪═════════════╪══════════════════════╪══════════╡
│ 39410606 │ natural  │ water       │ Old Dillon Reservoir │ 58278.22 │
└──────────┴──────────┴─────────────┴──────────────────────┴──────────┘

Screenshot from DBeaver showing the small Old Dillon Reservoir next to the large lake that is Dillon Reservoir.

Querying the data loaded by PgOSM Flex 0.6.0 shows the main reservoir is now included in the results. The calculated area_m column shows the noticeable difference in size!

SELECT osm_id, osm_type, osm_subtype, name,
        ST_Area(ST_Transform(geom, 2773)) AS area_m,
        geom
    FROM osm.water_polygon 
    WHERE name ILIKE '%Dillon Reservoir%'
;

┌──────────┬──────────┬─────────────┬──────────────────────┬─────────────┐
│  osm_id  │ osm_type │ osm_subtype │         name         │   area_m    │
╞══════════╪══════════╪═════════════╪══════════════════════╪═════════════╡
│ -1954708 │ natural  │ water       │ Dillon Reservoir     │ 13060398.62 │
│ 39410606 │ natural  │ water       │ Old Dillon Reservoir │    58278.22 │
└──────────┴──────────┴─────────────┴──────────────────────┴─────────────┘

Follow-up improvements and breaking changes

The above improvements went into PgOSM Flex 0.5.1. Of course, shortly after version 0.5.1 was out there I found bugs and other long term issues that needed addressing. The bugs were due to my quick implementation and were easy enough to address. The other long term issues revolved around my original decision to use a materialized view (osm.vplace_polygon) in order to remove child geometries included by a parent relation. Reconsidering this approach resulted in a ⚠️ breaking change ⚠️, hence the bump to 0.6.0 instead of 0.5.2.

The materialized views osm.vplace_polygon and osm.places_in_relations are not created in PgOSM Flex 0.6.0. The de-duplication is now handled by deleting the extra rows from the source table during import.

The trouble with the materialized view approach became apparent when I started using the same approach for the osm.road_line and osm.water_line tables. Creating a materialized view writes the data to disk, which results in two copies of most of that data. Further, the materialized views need additional indexes, which uses even more disk space.

The following query shows why I didn't really mind the overhead before on the place_polygon table, and why it started to matter with the water_line and road_line tables.

SELECT t_name, size_pretty, rows
    FROM dd.tables
    WHERE s_name = 'osm'
        AND t_name IN ('road_line', 'water_line', 'place_polygon')
    ORDER BY rows
;

┌───────────────┬─────────────┬────────┐
│    t_name     │ size_pretty │  rows  │
╞═══════════════╪═════════════╪════════╡
│ place_polygon │ 9448 kB     │   3987 │
│ water_line    │ 187 MB      │ 443561 │
│ road_line     │ 260 MB      │ 901777 │
└───────────────┴─────────────┴────────┘

For Colorado, the place data is only 10MB while the road line data is 260 MB. As the size of the region grows (e.g. North America, Europe, Planet) these sizes become considerably larger. Continuing with this approach would significantly increase disk consumption and add non-negligible time to the process of loading and refreshing the OpenStreetMap data. Nobody wants that!

Ease of use

Making PgOSM Flex easy to use has been a big focus. Easy-to-use is a great feature, and I think PgOSM Flex is successful on that front. This is the one project I maintain where I think using Docker is the best approach for both ease of use and reliability. A combination of improvements made the Docker based approach so functional.

Key improvements

The key improvement is that osm2pgsql RAM usage is predictable. Before osm2pgsql 1.5.0, processing required an excessive amount of RAM and it was impossible to really predict how much. With a formula available for RAM consumption, the osm2pgsql-tuner project become a reality to determine the best parameters to use with osm2pgsql.

The PgOSM Flex Docker image uses the logic from osm2pgsql-tuner combined with a --ram input and the file size of the downloaded .osm.pbf file to determine the command to run. This makes the Docker image easy to use for all varieties of region sizes and hardware combinations.

The osm2pgsql-tuner project has a free website and API. You can also use it directly from Python using pip install osm2pgsql-tuner.

Easy to customize

It is easy to customize how you use PgOSM Flex via Docker. You can specify an --input-file instead of using a file downloaded from Geofabrik. You can define custom layersets with --layerset and --layerset-path to either modify built-in styles, or completely replace them with your own styles and post-processing SQL.

If you don't want to use the Docker image you do not have to. The manual import steps cover those details.

See the list of markdown documentation available for more helpful information on using and customizing PgOSM Flex.

Summary

The data provided by PgOSM Flex 0.6.0 is its best quality yet, and it will continue to get better. Now that my book is published I am hoping to have some time to spend looking at the improvements made to geometry processing in osm2pgsql 1.7.0.

Need help with your PostgreSQL servers or databases? Contact us to start the conversation!

By Ryan Lambert
Published October 04, 2022
Last Updated October 04, 2022