RustProof Labs: blogging for education (logo)
My book Mastering PostGIS and OpenStreetMap is available!

PgOSM Flex for Production OpenStreetMap data

By Ryan Lambert -- Published April 21, 2023

The PgOSM Flex Project is looking forward to the 0.8.0! If you aren't familiar with PgOSM Flex, it is a tool that loads high quality OpenStreetMap datasets to PostGIS using osm2pgsql. I have a few examples of using OpenStreetMap data loaded this way.

I am extremely excited about PgOSM Flex 0.8.0 because the project as a whole is really starting to feel "production ready." While I have been using PgOSM Flex in production for more than 2 years, there have been a few rough edges over that time. However, the improvements over the past year have brought a number of amazing components together.

PgOSM Flex 0.8.0 does include a few ⚠️ breaking changes! ⚠️ Read the release notes for full details.

PgOSM Flex in production

What does "in production" mean for a tool in a data pipeline?

This post covers why I think PgOSM Flex meets all of those requirements.

Reliable

After osm2pgsql reworked its usage of memory (RAM) it was finally possible to create the osm2pgsql-tuner Python package. Adding osm2pgsql-tuner into the mix made it possible to programmatically determine which osm2pgsql command to craft for a give size of hardware, data, and additional considerations (i.e. replication). The improvements to osm2pgsql opened up the ability to package all the "messy bits" into a tidy Docker image.

The net sum is an easy way to run osm2pgsql with a low risk of crashing your server, and a good chance of choosing the options for best performance for the given data size and hardware. Other than Docker, you don't have to install a ton of prerequisites. There's no cloning of repositories, figuring out which set of options to configure appropriately, and hoping you don't crash the server. Everything seems to... just work!

You can still crash the process if you provide a --ram value higher than the amount of RAM available. Typically this value should be set lower than the max available RAM since 100% allocation is never expected.

Easy to try out

PgOSM Flex is easy to try out. There is a simple Quick Start guide in the PgOSM Flex documentation. The commands create a directory for the OpenStreetMap data, set a couple environment variables, then pull, run and execute a Docker container. The example show shows loading the district-of-columbia subregion within north-america/us. It only takes a few minutes to follow the steps in the Quick Start section. When the docker exec completes, a Postgres/PostGIS instance is running on port 5433 with OpenStreetMap data in the osm schema.

PgOSM Flex downloads the required data from Geofabrik. It's easy to change the region loaded by adjusting the --region and --subregion values. To change the process to load Ukraine, change --region to europe and change subregion to ukraine.

docker exec -it \
    pgosm python3 docker/pgosm_flex.py \
    --ram=8 \
    --region=europe  \
    --subregion=ukraine

Beyond changing the region, there are a handful of common customizations to consider.

Easy to use in production

The built in Postgres/PostGIS instance is handy for testing and quick exploration, it is not intended for production use. The standard way to use PgOSM Flex for production is to connect to an existing external Postgres instance. Using PgOSM Flex this way puts the Docker container to work for the data processing while loading the data to your database.

To use PgOSM Flex with your external Postgres instance you will want to to setup the database with a user configured with appropriate permissions. These initial steps are documented on the Postgres Permissions page.

Sure, it's possible (and arguably easier) to use a superuser login role. Don't do that!

With a Postgres role setup with appropriate permissions configure the Docker container for the external connection. If it's a one time data load that is all you need. If you want to update your data occasionally check out the additional details in the replication section.

External Connection and Replication

This section shows the commands I run to update the OpenStreetMap data in my production databases using PgOSM Flex.

Running source ~/.pgosm-db-myproject sets the necessary environment variables, the contents of the file are explained in the documentation. To switch between projects/databases, I adjust the first line pointing the the environment variables. The other adjustments to make to this example are for the region and subregion to load, and often changing the --layerset.

One nice part about this process is the same commands can be used for both the first import and every subsequent update via replication.

source ~/.pgosm-db-myproject

docker run --name pgosm -d --rm \
    -v ~/pgosm-data:/app/output \
    -v /etc/localtime:/etc/localtime:ro \
    -e POSTGRES_USER=$POSTGRES_USER \
    -e POSTGRES_PASSWORD=$POSTGRES_PASSWORD \
    -e POSTGRES_HOST=$POSTGRES_HOST \
    -e POSTGRES_DB=$POSTGRES_DB \
    -e POSTGRES_PORT=$POSTGRES_PORT \
    -p 5433:5432 -d rustprooflabs/pgosm-flex:0.7.2

docker exec -it \
    pgosm python3 docker/pgosm_flex.py \
    --ram=8 \
    --region=north-america/us \
    --subregion=colorado \
    --replication

PgOSM Flex imports are tracked in the target database using the osm.pgosm_flex table. The 0.8.0 release is bringing some improvements to the osm.pgosm_flex table. The following query shows the new layerset column, and the improved import_status column (new in 0.7.1).

SELECT osm_date, region, pgosm_flex_version, layerset, import_status
    FROM osm.pgosm_flex
    ORDER BY imported DESC
;

Results

┌────────────┬───────────────────────────┬─────────────────────┬──────────┬───────────────┐
│  osm_date  │          region           │ pgosm_flex_version  │ layerset │ import_status │
╞════════════╪═══════════════════════════╪═════════════════════╪══════════╪═══════════════╡
│ 2023-04-18 │ north-america/us-colorado │ 0.8.0-dev.1-dcd6454 │ default  │ Completed     │
│ 2023-03-26 │ north-america/us-colorado │ 0.7.2-0b91c81       │ ¤        │ ¤             │
└────────────┴───────────────────────────┴─────────────────────┴──────────┴───────────────┘

Low friction upgrades

With PgOSM Flex, everything is bundled into the Docker image. The process to update to the latest of PgOSM Flex is a simple docker pull command. If you want a specific tagged version, swap out latest for the version number.

docker pull rustprooflabs/pgosm-flex:latest

If you are using --replication you will need to pay attention to your version numbers (aka don't use :latest!) and watch the release notes for manual steps required in the "Upgrade notes for --replication". Some changes to the structure created by PgOSM Flex require manual steps to keep replication running smoothly as PgOSM Flex is updated.

Beware max_connections

My only real concern with running PgOSM Flex against production instances is related to connection volume with --replication. Replication should reduce load times significantly of subsequent updates. However, the current osm2pgsql connection model opens and holds 3 + (1 + number of threads) * number of tables for the duration of processing. This works great with the legacy 3-table OpenStreetMap data structure (point, line and polygon tables). By splitting data out in meaningful subgroups, this results in quite a large overhead in terms of available connections.

SELECT s_name, COUNT(*) AS table_count,
        3 + (1 + 1) * COUNT(*) AS required_connections_lower,
        3 + (1 + 4) * COUNT(*) AS required_connections
    FROM dd.tables 
    WHERE s_name = 'osm'
    GROUP BY s_name
;

┌────────┬─────────────┬────────────────────────────┬──────────────────────┐
│ s_name │ table_count │ required_connections_lower │ required_connections │
╞════════╪═════════════╪════════════════════════════╪══════════════════════╡
│ osm    │          41 │                         85 │                  208 │
└────────┴─────────────┴────────────────────────────┴──────────────────────┘

The above query uses the dd.tables view from our PgDD extension to calculate how many connections are needed with the default layerset. The default layerset uses 41 tables, so needs 85 - 208 connections depending on 1 to 4 threads. Considering the Postgres default for max_connections = 100 this will catch folks off guard. Some positive news is Postgres versions 14 and newer continue to improve performance with higher connection volumes, so hopefully this isn't too big of an issue.

Check your connection limit with SHOW max_connections; and consider increasing this value to give a bit more headroom! If resource constrained, you may want to consider skipping replication and using --pg-dump to transfer data into production.

Unfortunately, the bulk of the connections are not even actively used during the majority of the processing time. The vast majority appear to be used a brief amount at the very beginning, and then gain at the very end of processing. The remainder of the time is simply holding idle connections. The following chart from this comment shows one example of what I found related to connection handling.

Line chart showing active and idle connections during osm2pgsql processing with only 1 active connection and  40 idle connections during the vast majority of the processing time.

Summary

PgOSM Flex is an easy to use tool to load OpenStreetMap data into PostGIS. The data loaded is high quality, ready to use, and if it doesn't suit your exact needs out of the box, customization is in mind at every level.

I'm obviously biased, though I think you will find it a helpful tool as well!

Need help with your PostgreSQL servers or databases? Contact us to start the conversation!

By Ryan Lambert
Published April 21, 2023
Last Updated April 23, 2023