Partition OpenStreetMap data in PostGIS
This post continues my quest to explore Postgres native
partitioning and determine if it is a good fit for my OpenStreetMap data in PostGIS.
I show how I am planning to implement a partitioning
scheme in a way that a) works well for my use case, and b) is easy
to implement and maintain.
My previous post covered why I think partitioning will be a benefit in our databases.
The following steps are the result of a few iterations of ideas, including asking the osm2pgsql team about one idea I had. Ultimately, I think it is good that osm2pgsql will not support the idea I had asked about there. It forced me to rethink my approach and end up at a better solution. The reality is that partitioning only make sense if the partitioning scheme supports the end use, and end uses are quite varied. Trying to automate partitioning directly in the PgOSM-Flex project would have greatly increased costs of maintaining that project, and likely wasted a ton of time.
The post after this one shows that the plan outlined in this post is not perfect, though it shows great promise. There will be at least one more post to outline how everything all work out!
Prepare for partitioning
I am using OpenStreetMap data loaded by
testing with the
(see more posts on PgOSM-Flex). Eventually I plan
to partition a few tables from the imported data, but to start with I am working
only with the roads.
Data loaded by PgOSM-Flex is already grouped into a schema
in Postgres and I plan to take advantage of that for partitioning.
As my previous post showed,
my work flow already uses a schema-based approach so that should not cause
The PgOSM-Flex project automatically tracks the
region, lucky for us these are the details needed for my planned
partitioning scheme. (Ok, fine... I added them in to support this idea!)
It also tracks the version of PgOSM-Flex used (
making it possible to track and manage what version of data transformations
were used to perform any given osm2pgsql import.
The particular import shown below is from an older OpenStreetMap extract from December 2018 that I just now loaded in.
SELECT osm_date, region, pgosm_flex_version FROM osm.pgosm_flex ; ┌─[ RECORD 1 ]───────┬────────────────────────────┐ │ osm_date │ 2018-12-10 │ │ region │ north-america/us--colorado │ │ pgosm_flex_version │ 0.1.1-f488d7b │ └────────────────────┴────────────────────────────┘
The tables for the partitioned objects are being created in a new
dedicated schema. Create the schema for the partitioned OSM data,
osmp. This avoids the standard
osm name used by PgOSM-Flex
while still being easy to type at query time.
CREATE SCHEMA osmp;
The partitioning scheme needs to include
gives us 3 methods
for defining partitions,
HASH. I want to have each region/date in
a dedicated partition. It seems that the
LIST scheme is the best option
for my use case,
but unfortunately you cannot use two columns in a
LIST partition scheme.
No big deal, creating the
osmp.pgosm_flex_partition table gives each
region a single ID to use to define list partitions.
CREATE TABLE osmp.pgosm_flex_partition ( id BIGINT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY, osm_date DATE NOT NULL, region TEXT NOT NULL, CONSTRAINT uq_osmp_pgosm_flex_partition UNIQUE (osm_date, region) );
Add a row to
osmp.pgosm_flex_partition for the new partition.
This takes advantage of the data loaded to the
meta table by PgOSM-Flex.
INSERT statement returns the ID for the partition, this will be needed
in the final steps of this process.
-- Add record, this id is used later as pgosm_flex_partition_id INSERT INTO osmp.pgosm_flex_partition (osm_date, region) SELECT osm_date, region FROM osm.pgosm_flex RETURNING id ; ┌────┐ │ id │ ╞════╡ │ 1 │ └────┘
Create partitioned table
Next is to create the parent partitioned table. This is the table that
users will write queries against but does not store any actual data.
This partitioned table has all of the columns in the
(partitioned tables must have identical columns and data types)
plus the additional
includes a foreign key back to
This ensures partitions cannot be added if they do not have the associated
record in the partition table.
CREATE TABLE osmp.road_line ( osm_id BIGINT NOT NULL, osm_type TEXT NOT NULL, "name" TEXT NULL, "ref" TEXT NULL, maxspeed INT NULL, oneway SMALLINT NULL, layer INT NOT NULL, tunnel TEXT NULL, bridge TEXT NULL, major BOOLEAN NOT NULL, route_foot BOOLEAN NULL, route_cycle BOOLEAN NULL, route_motor BOOLEAN NULL, geom GEOMETRY(LINESTRING, 3857) NULL, pgosm_flex_partition_id BIGINT NOT NULL REFERENCES osmp.pgosm_flex_partition (id), CONSTRAINT pk_osm_road_line_osm_id PRIMARY KEY (pgosm_flex_partition_id, osm_id) ) PARTITION BY LIST (pgosm_flex_partition_id) ;
We can query the partitioned table and verify there is no data in there yet.
SELECT pgosm_flex_partition_id, COUNT(*) FROM osmp.road_line GROUP BY pgosm_flex_partition_id ; ┌─────────────────────────┬───────┐ │ pgosm_flex_partition_id │ count │ ╞═════════════════════════╪═══════╡ └─────────────────────────┴───────┘
Prepare data schema
The data schema is the schema with tables that actually contain the data for
a given region and date. These tables are attached as a partition of the
partitioned table. PgOSM-Flex defaults to using
osm for the data schema's name
and that is used in the following code for convenience.
osm schema is renamed to a non-conflicting name after it is
attached as a partition. Saving that step to the end makes the code
below as reusable as possible.
To get the
osm.road_line table ready to be attached as a partition, alter the
table to include the required ID and populate it with the proper value.
This query again leverages the data in the
to easily get the proper partition ID.
ALTER TABLE osm.road_line ADD pgosm_flex_partition_id BIGINT NULL; UPDATE osm.road_line r SET pgosm_flex_partition_id = p.id FROM osm.pgosm_flex f INNER JOIN osmp.pgosm_flex_partition p ON f.osm_date = p.osm_date AND f.region = p.region WHERE r.pgosm_flex_partition_id IS NULL ;
pgosm_flex_partition_id was originally created
NULLable, update it
ALTER TABLE osm.road_line ALTER COLUMN pgosm_flex_partition_id SET NOT NULL;
You cannot create a
PRIMARY KEYon a column allowing
PRIMARY KEY created by PgOSM-Flex (on
osm_id) must be
dropped and a new
PRIMARY KEY is needed to include the partitioning column.
ALTER TABLE osm.road_line DROP CONSTRAINT pk_osm_road_line_osm_id ; ALTER TABLE osm.road_line ADD CONSTRAINT pk_osm_road_line_partition_id_osm_id PRIMARY KEY (pgosm_flex_partition_id, osm_id) ;
The table with data (
osm.road_line) can now be attached to the partitioned table
bulk load operations such as this to create a short lived
CHECK constraint on the
partitioning key. This will provide the best performance.
CHECKconstraint, the table will be scanned to validate the partition constraint while holding both an
ACCESS EXCLUSIVElock on that partition and a
SHARE UPDATE EXCLUSIVElock on the parent table."
CHECK constraint. Here you need the ID value returned from the
ALTER TABLE osm.road_line ADD CONSTRAINT ck_osm_co_road_line_partition CHECK ( pgosm_flex_partition_id = 1 ) ;
Attach the partition, again using the appropriate partition ID.
ALTER TABLE osmp.road_line ATTACH PARTITION osm.road_line FOR VALUES IN ( 1 ) ;
Remove the now unnecessary
ALTER TABLE osm.road_line DROP CONSTRAINT ck_osm_co_road_line_partition;
The IDs are manually used in the above commands because DDL (Data Definition Language) statements cannot take advantage of a dynamic query like previous steps did. Attempting to use a subquery in the DDL results in an error such as
SQL Error [0A000]: ERROR: cannot use subquery in check constraint
Verify data is in partition
Verify that querying the parent partitioned table,
the child data from the partition. This query joins to the
osmp.pgosm_flex_partition to make it easy to see what data is loaded
SELECT p.region, p.osm_date, COUNT(*) FROM osmp.road_line r INNER JOIN osmp.pgosm_flex_partition p ON r.pgosm_flex_partition_id = p.id GROUP BY p.region, p.osm_date ; ┌────────────────────────────┬────────────┬────────┐ │ region │ osm_date │ count │ ╞════════════════════════════╪════════════╪════════╡ │ north-america/us--colorado │ 2018-12-10 │ 618874 │ └────────────────────────────┴────────────┴────────┘
The value for
regionis automatically generated by the PgOSM-Flex Docker process. The format is
<region>--<sub-region>. This value can be manually set via the
PGOSM_REGIONenvironment variable without the Docker image.
Rinse and repeat
The final step to adding a single partition is to
ALTER the schema to change the
default schema name (
osm) to a more identifying name (
This updated name helps avoid name collisions with future imports.
One difference between
my old method
and this method is I anticipate renaming the partition at the end of processing
instead of waiting until new data is being brought in.
ALTER SCHEMA osm RENAME TO osm_co_20181210;
I ran the above import and partition three (3) more times against other Colorado OpenStreetMap extracts I have handy. Running the above query again gives us these results. It is easy to see the growth of OpenStreetMap data over time!
SELECT p.region, p.osm_date, COUNT(*) FROM osmp.road_line r INNER JOIN osmp.pgosm_flex_partition p ON r.pgosm_flex_partition_id = p.id GROUP BY p.region, p.osm_date ; ┌────────────────────────────┬────────────┬────────┐ │ region │ osm_date │ count │ ╞════════════════════════════╪════════════╪════════╡ │ north-america/us--colorado │ 2018-12-10 │ 618874 │ │ north-america/us--colorado │ 2020-11-20 │ 793082 │ │ north-america/us--colorado │ 2021-01-13 │ 802671 │ │ north-america/us--colorado │ 2021-02-14 │ 812456 │ └────────────────────────────┴────────────┴────────┘
The post before this showed an example query of how
cropped into trend reporting queries when using my legacy schema based approach.
Comparing the above query to the adapted example from the
illustrates a number of improvements over the old
SELECT '20181210' AS osm_date, COUNT(*) AS road_count FROM osm_co_20181210.road_line UNION SELECT '20201120' AS osm_date, COUNT(*) AS road_count FROM osm_co_20201120.road_line UNION SELECT '20210113' AS osm_date, COUNT(*) AS road_count FROM osm_co_20210113.road_line UNION SELECT '20210214' AS osm_date, COUNT(*) AS road_count FROM osm_co_20210214.road_line;
UNION query is more difficult to read and more difficult to maintain
compared to the query that simply joins two tables. Need to change which regions
and/or dates are included? It's a simple update in the
WHERE clause using
the partitioned table where the old method would require completely changing the
The new partitioned method still keeps the data separated in a easy-to-manage schema based approach, but it makes writing queries significantly easier.
This post is the second part of evaluating Postgres native partitioning and if it will help me better manage my OpenStreetMap data. So far the implementation is looking really good. With the partitioning scheme defined and the basic process worked out, adding new partitions should be a trivial operation.
At this point I still have not fully decided that I will move forward with implementing partitioning. Though if I had to bet: I think I will find it to be a significant and worthwhile improvement!
The next post reviews the structure outlined here and identifies one query-time issue.
Need help with your PostgreSQL servers or databases? Contact us to start the conversation!
Published February 16, 2021
Last Updated February 21, 2021