Partition OpenStreetMap data in PostGIS
This post continues my quest to explore Postgres native
partitioning and determine if it is a good fit for my OpenStreetMap data in PostGIS.
I show how I am planning to implement a partitioning
scheme in a way that a) works well for my use case, and b) is easy
to implement and maintain.
My previous post covered why
I think partitioning will be a benefit
in our databases.
The following steps are the result of a few iterations of ideas, including asking the osm2pgsql team about one idea I had. Ultimately, I think it is good that osm2pgsql will not support the idea I had asked about there. It forced me to rethink my approach and end up at a better solution. The reality is that partitioning only make sense if the partitioning scheme supports the end use, and end uses are quite varied. Trying to automate partitioning directly in the PgOSM-Flex project would have greatly increased costs of maintaining that project, and likely wasted a ton of time.
The post after this one shows that the plan outlined in this post is not perfect, though it shows great promise. There will be at least one more post to outline how everything all work out!
Prepare for partitioning
I am using OpenStreetMap data loaded by
PgOSM-Flex,
testing with the osm.road_line
table
(see more posts on PgOSM-Flex). Eventually I plan
to partition a few tables from the imported data, but to start with I am working
only with the roads.
Data loaded by PgOSM-Flex is already grouped into a schema
in Postgres and I plan to take advantage of that for partitioning.
As my previous post showed,
my work flow already uses a schema-based approach so that should not cause
any issue.
The PgOSM-Flex project automatically tracks the osm_date
and region
, lucky for us these are the details needed for my planned
partitioning scheme. (Ok, fine... I added them in to support this idea!)
It also tracks the version of PgOSM-Flex used (pgosm_flex_version='0.1.1-f488d7b'
),
making it possible to track and manage what version of data transformations
were used to perform any given osm2pgsql import.
The particular import shown below is from an older OpenStreetMap extract from December 2018 that I just now loaded in.
SELECT osm_date, region, pgosm_flex_version
FROM osm.pgosm_flex ;
┌─[ RECORD 1 ]───────┬────────────────────────────┐
│ osm_date │ 2018-12-10 │
│ region │ north-america/us--colorado │
│ pgosm_flex_version │ 0.1.1-f488d7b │
└────────────────────┴────────────────────────────┘
The tables for the partitioned objects are being created in a new
dedicated schema. Create the schema for the partitioned OSM data,
named osmp
. This avoids the standard osm
name used by PgOSM-Flex
while still being easy to type at query time.
CREATE SCHEMA osmp;
The partitioning scheme needs to include osm_date
and region
.
Postgres
gives us 3 methods
for defining partitions,
RANGE
, LIST
and HASH
. I want to have each region/date in
a dedicated partition. It seems that the LIST
scheme is the best option
for my use case,
but unfortunately you cannot use two columns in a LIST
partition scheme.
No big deal, creating the osmp.pgosm_flex_partition
table gives each
osm_date
+ region
a single ID to use to define list partitions.
CREATE TABLE osmp.pgosm_flex_partition
(
id BIGINT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
osm_date DATE NOT NULL,
region TEXT NOT NULL,
CONSTRAINT uq_osmp_pgosm_flex_partition UNIQUE (osm_date, region)
);
Add a row to osmp.pgosm_flex_partition
for the new partition.
This takes advantage of the data loaded to the osm.pgosm_flex
meta table by PgOSM-Flex.
The INSERT
statement returns the ID for the partition, this will be needed
in the final steps of this process.
-- Add record, this id is used later as pgosm_flex_partition_id
INSERT INTO osmp.pgosm_flex_partition (osm_date, region)
SELECT osm_date, region
FROM osm.pgosm_flex
RETURNING id
;
┌────┐
│ id │
╞════╡
│ 1 │
└────┘
Create partitioned table
Next is to create the parent partitioned table. This is the table that
users will write queries against but does not store any actual data.
This partitioned table has all of the columns in the road_line
table
(partitioned tables must have identical columns and data types)
plus the additional pgosm_flex_partition_id
column.
The pgosm_flex_partition_id
column
includes a foreign key back to osmp.pgosm_flex_partition
table.
This ensures partitions cannot be added if they do not have the associated
record in the partition table.
CREATE TABLE osmp.road_line (
osm_id BIGINT NOT NULL,
osm_type TEXT NOT NULL,
"name" TEXT NULL,
"ref" TEXT NULL,
maxspeed INT NULL,
oneway SMALLINT NULL,
layer INT NOT NULL,
tunnel TEXT NULL,
bridge TEXT NULL,
major BOOLEAN NOT NULL,
route_foot BOOLEAN NULL,
route_cycle BOOLEAN NULL,
route_motor BOOLEAN NULL,
geom GEOMETRY(LINESTRING, 3857) NULL,
pgosm_flex_partition_id BIGINT NOT NULL
REFERENCES osmp.pgosm_flex_partition (id),
CONSTRAINT pk_osm_road_line_osm_id
PRIMARY KEY (pgosm_flex_partition_id, osm_id)
) PARTITION BY LIST (pgosm_flex_partition_id)
;
We can query the partitioned table and verify there is no data in there yet.
SELECT pgosm_flex_partition_id, COUNT(*)
FROM osmp.road_line
GROUP BY pgosm_flex_partition_id
;
┌─────────────────────────┬───────┐
│ pgosm_flex_partition_id │ count │
╞═════════════════════════╪═══════╡
└─────────────────────────┴───────┘
Prepare data schema
The data schema is the schema with tables that actually contain the data for
a given region and date. These tables are attached as a partition of the
partitioned table. PgOSM-Flex defaults to using osm
for the data schema's name
and that is used in the following code for convenience.
The osm
schema is renamed to a non-conflicting name after it is
attached as a partition. Saving that step to the end makes the code
below as reusable as possible.
To get the osm.road_line
table ready to be attached as a partition, alter the
table to include the required ID and populate it with the proper value.
This query again leverages the data in the osm.pgosm_flex
table
to easily get the proper partition ID.
ALTER TABLE osm.road_line ADD pgosm_flex_partition_id BIGINT NULL;
UPDATE osm.road_line r
SET pgosm_flex_partition_id = p.id
FROM osm.pgosm_flex f
INNER JOIN osmp.pgosm_flex_partition p
ON f.osm_date = p.osm_date AND f.region = p.region
WHERE r.pgosm_flex_partition_id IS NULL
;
The pgosm_flex_partition_id
was originally created NULL
able, update it
to NOT NULL
.
ALTER TABLE osm.road_line
ALTER COLUMN pgosm_flex_partition_id
SET NOT NULL;
You cannot create a
PRIMARY KEY
on a column allowingNULL
values.
The original PRIMARY KEY
created by PgOSM-Flex (on osm_id
) must be
dropped and a new PRIMARY KEY
is needed to include the partitioning column.
ALTER TABLE osm.road_line
DROP CONSTRAINT pk_osm_road_line_osm_id
;
ALTER TABLE osm.road_line
ADD CONSTRAINT pk_osm_road_line_partition_id_osm_id
PRIMARY KEY (pgosm_flex_partition_id, osm_id)
;
Attach partition
The table with data (osm.road_line
) can now be attached to the partitioned table
(osmp.road_line
).
The
documentation recommends
bulk load operations such as this to create a short lived CHECK
constraint on the
partitioning key. This will provide the best performance.
"Without the
CHECK
constraint, the table will be scanned to validate the partition constraint while holding both anACCESS EXCLUSIVE
lock on that partition and aSHARE UPDATE EXCLUSIVE
lock on the parent table."
Create the CHECK
constraint. Here you need the ID value returned from the
INSERT
statement.
ALTER TABLE osm.road_line
ADD CONSTRAINT ck_osm_co_road_line_partition
CHECK ( pgosm_flex_partition_id = 1 )
;
Attach the partition, again using the appropriate partition ID.
ALTER TABLE osmp.road_line
ATTACH PARTITION osm.road_line
FOR VALUES IN ( 1 )
;
Remove the now unnecessary CHECK
constraint.
ALTER TABLE osm.road_line
DROP CONSTRAINT ck_osm_co_road_line_partition;
The IDs are manually used in the above commands because DDL (Data Definition Language) statements cannot take advantage of a dynamic query like previous steps did. Attempting to use a subquery in the DDL results in an error such as
SQL Error [0A000]: ERROR: cannot use subquery in check constraint
Verify data is in partition
Verify that querying the parent partitioned table, osmp.road_line
contains
the child data from the partition. This query joins to the
osmp.pgosm_flex_partition
to make it easy to see what data is loaded
by partition.
SELECT p.region, p.osm_date, COUNT(*)
FROM osmp.road_line r
INNER JOIN osmp.pgosm_flex_partition p
ON r.pgosm_flex_partition_id = p.id
GROUP BY p.region, p.osm_date
;
┌────────────────────────────┬────────────┬────────┐
│ region │ osm_date │ count │
╞════════════════════════════╪════════════╪════════╡
│ north-america/us--colorado │ 2018-12-10 │ 618874 │
└────────────────────────────┴────────────┴────────┘
The value for
region
is automatically generated by the PgOSM-Flex Docker process. The format is<region>--<sub-region>
. This value can be manually set via thePGOSM_REGION
environment variable without the Docker image.
Rinse and repeat
The final step to adding a single partition is to ALTER
the schema to change the
default schema name (osm
) to a more identifying name (osm_co_20181210
).
This updated name helps avoid name collisions with future imports.
One difference between
my old method
and this method is I anticipate renaming the partition at the end of processing
instead of waiting until new data is being brought in.
ALTER SCHEMA osm RENAME TO osm_co_20181210;
I ran the above import and partition three (3) more times against other Colorado OpenStreetMap extracts I have handy. Running the above query again gives us these results. It is easy to see the growth of OpenStreetMap data over time!
SELECT p.region, p.osm_date, COUNT(*)
FROM osmp.road_line r
INNER JOIN osmp.pgosm_flex_partition p
ON r.pgosm_flex_partition_id = p.id
GROUP BY p.region, p.osm_date
;
┌────────────────────────────┬────────────┬────────┐
│ region │ osm_date │ count │
╞════════════════════════════╪════════════╪════════╡
│ north-america/us--colorado │ 2018-12-10 │ 618874 │
│ north-america/us--colorado │ 2020-11-20 │ 793082 │
│ north-america/us--colorado │ 2021-01-13 │ 802671 │
│ north-america/us--colorado │ 2021-02-14 │ 812456 │
└────────────────────────────┴────────────┴────────┘
The post before this showed an example query of how UNION
queries
cropped into trend reporting queries when using my legacy schema based approach.
Comparing the above query to the adapted example from the
previous post
(below)
illustrates a number of improvements over the old UNION
method.
SELECT '20181210' AS osm_date, COUNT(*) AS road_count
FROM osm_co_20181210.road_line
UNION
SELECT '20201120' AS osm_date, COUNT(*) AS road_count
FROM osm_co_20201120.road_line
UNION
SELECT '20210113' AS osm_date, COUNT(*) AS road_count
FROM osm_co_20210113.road_line
UNION
SELECT '20210214' AS osm_date, COUNT(*) AS road_count
FROM osm_co_20210214.road_line;
This UNION
query is more difficult to read and more difficult to maintain
compared to the query that simply joins two tables. Need to change which regions
and/or dates are included? It's a simple update in the WHERE
clause using
the partitioned table where the old method would require completely changing the
UNION
query.
The new partitioned method still keeps the data separated in a easy-to-manage schema based approach, but it makes writing queries significantly easier.
Summary
This post is the second part of evaluating Postgres native partitioning and if it will help me better manage my OpenStreetMap data. So far the implementation is looking really good. With the partitioning scheme defined and the basic process worked out, adding new partitions should be a trivial operation.
At this point I still have not fully decided that I will move forward with implementing partitioning. Though if I had to bet: I think I will find it to be a significant and worthwhile improvement!
The next post reviews the structure outlined here and identifies one query-time issue.
Need help with your PostgreSQL servers or databases? Contact us to start the conversation!