RustProof Labs: blogging for education (logo)

Why Partition OpenStreetMap data?

By Ryan Lambert -- Published February 14, 2021

This post covers the first part of my path in considering native Postgres partitioning and how it might be helpful to my work with OpenStreetMap data in PostGIS. Partitioning tables in Postgres can have significant benefits when working with larger data sets, and OpenStreetMap data as a whole is generally considered a large data set. The post following this one will outline the steps I am taking to implement partitioning with data loaded by PgOSM-Flex. A third post is planned to dive into the impacts on performance this change has.

Table partitioning is not an architecture that should be implemented casually without planning and good reason. The consequences of a poorly planned and implemented partitioning scheme can be severe. This is why it is worth the extra time to plan, evaluate and test this option before making any lasting implementation decisions. This post starts by examining the work flow I have used with OpenStreetMap data, challenges with my legacy approach, and highlights where I think Postgres partitioning can provide serious improvement. My next post shows how I am approaching the task of partitioning OpenStreetMap data.

At the time of writing this post I have not decided if this is a path I will continue down for production use. I have not started testing and collecting data for the 3rd post. I will likely make the "Go / No-Go" decision while I am collecting data for the performance related post.

OpenStreetMap for Analytics

The main way I use OpenStreetMap data is within analytics style projects. Routing, travel times, watersheds, urban growth, and land usage are all easily within scope for OpenStreetMap data in PostGIS. Reproducibility is an important detail when producing results from data. To achieve reproducibility in analytics projects we need to retain two main elements:

With the original data and the code used (SQL, Python, etc.) it is easy to reproduce results. With this in mind, my OpenStreetMap data recency and retention needs can be summarized in a few points.

Project lifecycle

At the beginning of a project I load the latest OpenStreetMap data for the region of interest, we'll assume Colorado for now. The data ends up in a schema named osm_co and I write some queries against the data.

SELECT COUNT(*) AS road_count
    FROM osm_co.road_line;

Fast forward through a bunch of querying, polishing, refining and an end result is produced. The SQL code is archived and time moves along. A new project comes in, again focused in Colorado and I need to load new data. The previous osm_co needs to be retained so is renamed to make way for an updated Colorado data set. If I'm being good to my future self I'll leave a comment to tie it to the project; if not, "Future Ryan" will end up doing a bit of digging to link data to projects.

ALTER SCHEMA osm_co RENAME TO osm_co_20210115;
COMMENT ON SCHEMA osm_co_20210115 IS 'Schema for Project ZYX...';

Downsides

The "archived schema" approach is easy to implement but is not perfect. A downside to this approach is if I need to use the archived code against the archived data to reproduce results, I first need to update the code for the new schema name. Updating schema names in code is trivial for simple examples but in complex code this introduces its own set of possible errors. From the perspective of reproducible results these potential errors can be a big problem.

SELECT COUNT(*) AS road_count
    FROM osm_co_20210115.road_line;

Using search_path is another option. I'm not a fan of that option because it obscures the source of the data from the user reading the SQL. I strongly prefer fully qualified object names (schema.table) for readability.

Another downside to this simple approach comes into play if you want to perform analysis over time. With each historic snapshot in its own schema, queries like this are eventually required. Even a trivial COUNT(*) query starts to become unnecessarily complex and error prone.

SELECT '20190715' AS osm_date, COUNT(*) AS road_count
    FROM osm_co_20190715.road_line
UNION
SELECT '20200715' AS osm_date, COUNT(*) AS road_count
    FROM osm_co_20200715.road_line
UNION
SELECT '20210115' AS osm_date, COUNT(*) AS road_count
    FROM osm_co_20210115.road_line;

Typos in queries like the above UNION query are difficult to spot. This makes it all too easy to introduce accidental errors.

At this point (without partitioning) it may seem desirable to create a stacked table to store all the data with a new column osm_date. If you are using only a single small sub-region like Colorado, and only occasional updates, that might be a good solution. Though long term maintenance still can cause headaches even with only Colorado. The full Colorado data set I load is roughly 1 GB, it doesn't take too many data loads of this data set to start reaching a size where partitioning might be a good solution. Inserts, deletes, updates, indexes and vacuums can get cumbersome depending on your usage and hardware. When working with larger regions such as North America, Europe or the full planet these bottlenecks can approach quickly.

Variable Regions

For me it isn't just a matter of using Colorado over time, the region I load also varies by project. A lot of the time I can get away with only loading Colorado, after all we are located in Colorado and our clients are predominantly in Colorado. On the other hand, a project focused around Wray, CO would likely require Colorado, Nebraska and Kansas all being loaded as one data set due to the close proximity to those borders. Other times I may load the US-West region, or even the full North America or Europe regions. It all depends on the project at hand and its requirements.

This requirement for variable regions, along with the date of the load, is important to my potential partitioning scheme.

Geofabrik's download server makes getting the right-sized region for your project an easy problem to solve!

Why Partition?

Now to answering the question: Why partition OpenStreetMap data? The Postgres documentation lists four main benefits for partitioning. The main reason I am considering partitioning is for the 3rd benefit listed (emphasis mine).

"Bulk loads and deletes can be accomplished by adding or removing partitions, if the usage pattern is accounted for in the partitioning design. Dropping an individual partition using DROP TABLE, or doing ALTER TABLE DETACH PARTITION, is far faster than a bulk operation. These commands also entirely avoid the VACUUM overhead caused by a bulk DELETE."

Improving the performance and ease of data loads, removals, and maintenance is a big deal for my OpenStreetMap data sets. I want the benefit of having a single table to query (including all the versions of data) but the maintenance downsides of doing this in a traditional stacked table were reason enough for me to continue using the archived schema approach. Using a partitioned table approach seems like it will be worthwhile for these goals.

Another possible benefit to table partitioning is improved query time performance. With a properly designed partitioning scheme Postgres can perform partition pruning when the planner is certain that the data must live in a specific partition. This can make certain types of queries significantly faster than the same operation on one single big table. Faster query performance is always a happy benefit to have, but it is not my main goal with this. This would be a bonus benefit!

Postgres' table partitioning has been built-in since Postgres 10 with improvements made in each subsequent version of Postgres. Now that partitioning has been around for 4 major versions, it should be quite mature and stable.

Summary

This post has covered details behind why I am considering partitioning OpenStreetMap data in Postgres. My main goal is to improve long term maintainability and ease of querying for my analytic projects. At this phase of the project I am optimistic that testing will support my hypothesis and I will move forward with partitioning at least some of the largest portions of OpenStreetMap data in my PostGIS databases.

My next post covers the how of implementing partitioning for OpenStreetMap data. The final post is still to come!

Need help with your PostgreSQL servers or databases? Contact us to start the conversation!

By Ryan Lambert
Published February 14, 2021
Last Updated February 16, 2021