OpenStreetMap to PostGIS is getting lighter
If you have ever wanted OpenStreetMap data in Postgres/PostGIS, you are probably familiar with the osm2pgsql tool. Lately I have been writing about the osm2pgsql developments with the new Flex output and how it is enabling improved data quality. This post changes focus away from the flex output and examines the performance of the osm2pgsql load itself.
One challenge with osm2pgsql over the years has been generic
recommendations have been difficult to make. The safest recommendation
for nearly any combination of hardware and source data size was
to use osm2pgsql --slim --drop
to put most of the intermediate data
into Postgres instead of relying directly on RAM, which it needed a lot of.
This choice has offsetting costs of putting all that data into Postgres (only to be deleted) in terms of disk usage and I/O performance.
A few days ago, a pull request from Jochen Topf to create a new RAM middle caught my eye. The text that piqued my interest (emphasis mine):
When not using two-stage processing the memory requirements are much much smaller than with the old ram middle. Rule of thumb is, you'll need about 1GB plus 2.5 times the size of the PBF file as memory. This makes it possible to import even continent-sized data on reasonably-sized machines.
Wait... what?! Is this for real??
(Webinar) OpenStreetMap to PostGIS: Easier and Better!
This page has the resources and recording for the OpenStreetMap to PostGIS: Easier and Better! webinar from Wednesday March 31, 2021.
Downloads for session
Scripts used for the demo:
Round Two: Partitioning OpenStreetMap
A few weeks ago I decided to seriously consider Postgres' declarative table partitioning for our OpenStreetMap data. Once the decision was made to investigate this option, I outlined our use case with requirements to keep multiple versions of OpenStreetMap data over time. That process helped draft my initial plan for how to create and manage the partitioned data. When I put the initial code to the test I found a snag and adjusted the plan.
This post shows a working example of how to partition OpenStreetMap data loaded using PgOSM-Flex.
TLDR
Spoiler alert!
It works, I love it! I am moving forward with the plan outlined in this post. Some highlights from testing with Colorado sized data:
- Bulk import generates 17% less WAL
- Bulk delete generates 99.8% less WAL
- Simple aggregate query runs 75% faster
First Review of Partitioning OpenStreetMap
My previous two posts set the stage to evaluate declarative Postgres partitioning for OpenStreetMap data. This post outlines what I found when I tested my plan and outlines my next steps. The goal with this series is to determine if partitioning is a path worth going down, or if the additional complexity outweighs any benefits. The first post on partitioning outlined my use case and why I thought partitioning would be a potential benefit. The maintenance aspects of partitioning are my #1 hope for improvement, with easy and fast loading and removal of entire data sets being a big deal for me.
The second post detailed my approach to partitioning to allow me to partition based on date and region. In that post I even bragged that a clever workaround was a suitable solution.
"No big deal, creating the
osmp.pgosm_flex_partition
table gives eachosm_date
+region
a single ID to use to define list partitions." -- Arrogant MeRead on to see where that assumption fell apart and my planned next steps.
I was hoping to have made a "Go / No-Go" decision by this point... I am currently at a solid "Probably!"
Load data
For testing I simulated Colorado data being loaded once per month on the 1st of each month and North America once per year on January 1. This was conceptually easier to implement and test than trying to capture exactly what I described in my initial post. This approach resulted in 17 snapshots of OpenStreetMap being loaded, 15 with Colorado and two with North America. I loaded this data twice, once using the planned partitioned setup and the other using a simple stacked table to compare performance against.
Partition OpenStreetMap data in PostGIS
This post continues my quest to explore Postgres native
partitioning and determine if it is a good fit for my OpenStreetMap data in PostGIS.
I show how I am planning to implement a partitioning
scheme in a way that a) works well for my use case, and b) is easy
to implement and maintain.
My previous post covered why
I think partitioning will be a benefit
in our databases.
The following steps are the result of a few iterations of ideas, including asking the osm2pgsql team about one idea I had. Ultimately, I think it is good that osm2pgsql will not support the idea I had asked about there. It forced me to rethink my approach and end up at a better solution. The reality is that partitioning only make sense if the partitioning scheme supports the end use, and end uses are quite varied. Trying to automate partitioning directly in the PgOSM-Flex project would have greatly increased costs of maintaining that project, and likely wasted a ton of time.
The post after this one shows that the plan outlined in this post is not perfect, though it shows great promise. There will be at least one more post to outline how everything all work out!
Prepare for partitioning
I am using OpenStreetMap data loaded by
PgOSM-Flex,
testing with the osm.road_line
table
(see more posts on PgOSM-Flex). Eventually I plan
to partition a few tables from the imported data, but to start with I am working
only with the roads.