RustProof Labs: blogging for education (logo)

Partition OpenStreetMap data in PostGIS

By Ryan Lambert -- Published February 16, 2021

This post continues my quest to explore Postgres native partitioning and determine if it is a good fit for my OpenStreetMap data in PostGIS. I show how I am planning to implement a partitioning scheme in a way that a) works well for my use case, and b) is easy to implement and maintain.
My previous post covered why I think partitioning will be a benefit in our databases.

The following steps are the result of a few iterations of ideas, including asking the osm2pgsql team about one idea I had. Ultimately, I think it is good that osm2pgsql will not support the idea I had asked about there. It forced me to rethink my approach and end up at a better solution. The reality is that partitioning only make sense if the partitioning scheme supports the end use, and end uses are quite varied. Trying to automate partitioning directly in the PgOSM-Flex project would have greatly increased costs of maintaining that project, and likely wasted a ton of time.

The post after this one shows that the plan outlined in this post is not perfect, though it shows great promise. There will be at least one more post to outline how everything all work out!

Prepare for partitioning

I am using OpenStreetMap data loaded by PgOSM-Flex, testing with the osm.road_line table (see more posts on PgOSM-Flex). Eventually I plan to partition a few tables from the imported data, but to start with I am working only with the roads.

Data loaded by PgOSM-Flex is already grouped into a schema in Postgres and I plan to take advantage of that for partitioning. As my previous post showed, my work flow already uses a schema-based approach so that should not cause any issue. The PgOSM-Flex project automatically tracks the osm_date and region, lucky for us these are the details needed for my planned partitioning scheme. (Ok, fine... I added them in to support this idea!) It also tracks the version of PgOSM-Flex used (pgosm_flex_version='0.1.1-f488d7b'), making it possible to track and manage what version of data transformations were used to perform any given osm2pgsql import. The particular import shown below is from an older OpenStreetMap extract from December 2018 that I just now loaded in.

SELECT osm_date, region, pgosm_flex_version
    FROM osm.pgosm_flex ;

┌─[ RECORD 1 ]───────┬────────────────────────────┐
│ osm_date           │ 2018-12-10                 │
│ region             │ north-america/us--colorado │
│ pgosm_flex_version │ 0.1.1-f488d7b              │
└────────────────────┴────────────────────────────┘

The tables for the partitioned objects are being created in a new dedicated schema. Create the schema for the partitioned OSM data, named osmp. This avoids the standard osm name used by PgOSM-Flex while still being easy to type at query time.

CREATE SCHEMA osmp;

The partitioning scheme needs to include osm_date and region. Postgres gives us 3 methods for defining partitions, RANGE, LIST and HASH. I want to have each region/date in a dedicated partition. It seems that the LIST scheme is the best option for my use case, but unfortunately you cannot use two columns in a LIST partition scheme. No big deal, creating the osmp.pgosm_flex_partition table gives each osm_date + region a single ID to use to define list partitions.

CREATE TABLE osmp.pgosm_flex_partition
(
    id BIGINT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
    osm_date DATE NOT NULL,
    region TEXT NOT NULL,
    CONSTRAINT uq_osmp_pgosm_flex_partition UNIQUE (osm_date, region)
);

Add a row to osmp.pgosm_flex_partition for the new partition. This takes advantage of the data loaded to the osm.pgosm_flex meta table by PgOSM-Flex. The INSERT statement returns the ID for the partition, this will be needed in the final steps of this process.

-- Add record, this id is used later as pgosm_flex_partition_id
INSERT INTO osmp.pgosm_flex_partition (osm_date, region)
SELECT osm_date, region 
    FROM osm.pgosm_flex
    RETURNING id
;
┌────┐
│ id │
╞════╡
│  1 │
└────┘

Create partitioned table

Next is to create the parent partitioned table. This is the table that users will write queries against but does not store any actual data. This partitioned table has all of the columns in the road_line table (partitioned tables must have identical columns and data types) plus the additional pgosm_flex_partition_id column. The pgosm_flex_partition_id column includes a foreign key back to osmp.pgosm_flex_partition table. This ensures partitions cannot be added if they do not have the associated record in the partition table.

CREATE TABLE osmp.road_line (
    osm_id BIGINT NOT NULL,
    osm_type TEXT NOT NULL,
    "name" TEXT NULL,
    "ref" TEXT NULL,
    maxspeed INT NULL,
    oneway SMALLINT NULL,
    layer INT NOT NULL,
    tunnel TEXT NULL,
    bridge TEXT NULL,
    major BOOLEAN NOT NULL,
    route_foot BOOLEAN NULL,
    route_cycle BOOLEAN NULL,
    route_motor BOOLEAN NULL,
    geom GEOMETRY(LINESTRING, 3857) NULL,
    pgosm_flex_partition_id BIGINT NOT NULL
        REFERENCES osmp.pgosm_flex_partition (id),
    CONSTRAINT pk_osm_road_line_osm_id
        PRIMARY KEY (pgosm_flex_partition_id, osm_id)
) PARTITION BY LIST (pgosm_flex_partition_id)
;

We can query the partitioned table and verify there is no data in there yet.

SELECT pgosm_flex_partition_id, COUNT(*)
    FROM osmp.road_line
    GROUP BY pgosm_flex_partition_id
;
┌─────────────────────────┬───────┐
│ pgosm_flex_partition_id │ count │
╞═════════════════════════╪═══════╡
└─────────────────────────┴───────┘

Prepare data schema

The data schema is the schema with tables that actually contain the data for a given region and date. These tables are attached as a partition of the partitioned table. PgOSM-Flex defaults to using osm for the data schema's name and that is used in the following code for convenience. The osm schema is renamed to a non-conflicting name after it is attached as a partition. Saving that step to the end makes the code below as reusable as possible.

To get the osm.road_line table ready to be attached as a partition, alter the table to include the required ID and populate it with the proper value. This query again leverages the data in the osm.pgosm_flex table to easily get the proper partition ID.

ALTER TABLE osm.road_line ADD pgosm_flex_partition_id BIGINT NULL;

UPDATE osm.road_line r
    SET pgosm_flex_partition_id = p.id
    FROM osm.pgosm_flex f 
    INNER JOIN osmp.pgosm_flex_partition p
        ON f.osm_date = p.osm_date AND f.region = p.region
    WHERE r.pgosm_flex_partition_id IS NULL
;

The pgosm_flex_partition_id was originally created NULLable, update it to NOT NULL.

ALTER TABLE osm.road_line
    ALTER COLUMN pgosm_flex_partition_id
    SET NOT NULL;

You cannot create a PRIMARY KEY on a column allowing NULL values.

The original PRIMARY KEY created by PgOSM-Flex (on osm_id) must be dropped and a new PRIMARY KEY is needed to include the partitioning column.

ALTER TABLE osm.road_line
    DROP CONSTRAINT pk_osm_road_line_osm_id
;
ALTER TABLE osm.road_line
    ADD CONSTRAINT pk_osm_road_line_partition_id_osm_id
    PRIMARY KEY (pgosm_flex_partition_id, osm_id)
;

Attach partition

The table with data (osm.road_line) can now be attached to the partitioned table (osmp.road_line). The documentation recommends bulk load operations such as this to create a short lived CHECK constraint on the partitioning key. This will provide the best performance.

"Without the CHECK constraint, the table will be scanned to validate the partition constraint while holding both an ACCESS EXCLUSIVE lock on that partition and a SHARE UPDATE EXCLUSIVE lock on the parent table."

Create the CHECK constraint. Here you need the ID value returned from the INSERT statement.

ALTER TABLE osm.road_line
    ADD CONSTRAINT ck_osm_co_road_line_partition
    CHECK ( pgosm_flex_partition_id = 1 )
;

Attach the partition, again using the appropriate partition ID.

ALTER TABLE osmp.road_line
    ATTACH PARTITION osm.road_line
    FOR VALUES IN ( 1 )
;

Remove the now unnecessary CHECK constraint.

ALTER TABLE osm.road_line
    DROP CONSTRAINT ck_osm_co_road_line_partition;

The IDs are manually used in the above commands because DDL (Data Definition Language) statements cannot take advantage of a dynamic query like previous steps did. Attempting to use a subquery in the DDL results in an error such as SQL Error [0A000]: ERROR: cannot use subquery in check constraint

Verify data is in partition

Verify that querying the parent partitioned table, osmp.road_line contains the child data from the partition. This query joins to the osmp.pgosm_flex_partition to make it easy to see what data is loaded by partition.

SELECT p.region, p.osm_date, COUNT(*)
    FROM osmp.road_line r
    INNER JOIN osmp.pgosm_flex_partition p
        ON r.pgosm_flex_partition_id = p.id
    GROUP BY p.region, p.osm_date
;

┌────────────────────────────┬────────────┬────────┐
│           region           │  osm_date  │ count  │
╞════════════════════════════╪════════════╪════════╡
│ north-america/us--colorado │ 2018-12-10 │ 618874 │
└────────────────────────────┴────────────┴────────┘

The value for region is automatically generated by the PgOSM-Flex Docker process. The format is <region>--<sub-region>. This value can be manually set via the PGOSM_REGION environment variable without the Docker image.

Rinse and repeat

The final step to adding a single partition is to ALTER the schema to change the default schema name (osm) to a more identifying name (osm_co_20181210). This updated name helps avoid name collisions with future imports. One difference between my old method and this method is I anticipate renaming the partition at the end of processing instead of waiting until new data is being brought in.

ALTER SCHEMA osm RENAME TO osm_co_20181210;

I ran the above import and partition three (3) more times against other Colorado OpenStreetMap extracts I have handy. Running the above query again gives us these results. It is easy to see the growth of OpenStreetMap data over time!

SELECT p.region, p.osm_date, COUNT(*)
    FROM osmp.road_line r
    INNER JOIN osmp.pgosm_flex_partition p
        ON r.pgosm_flex_partition_id = p.id
    GROUP BY p.region, p.osm_date
;

┌────────────────────────────┬────────────┬────────┐
│           region           │  osm_date  │ count  │
╞════════════════════════════╪════════════╪════════╡
│ north-america/us--colorado │ 2018-12-10 │ 618874 │
│ north-america/us--colorado │ 2020-11-20 │ 793082 │
│ north-america/us--colorado │ 2021-01-13 │ 802671 │
│ north-america/us--colorado │ 2021-02-14 │ 812456 │
└────────────────────────────┴────────────┴────────┘

The post before this showed an example query of how UNION queries cropped into trend reporting queries when using my legacy schema based approach. Comparing the above query to the adapted example from the previous post (below) illustrates a number of improvements over the old UNION method.

SELECT '20181210' AS osm_date, COUNT(*) AS road_count
    FROM osm_co_20181210.road_line
UNION
SELECT '20201120' AS osm_date, COUNT(*) AS road_count
    FROM osm_co_20201120.road_line
UNION
SELECT '20210113' AS osm_date, COUNT(*) AS road_count
    FROM osm_co_20210113.road_line
UNION
SELECT '20210214' AS osm_date, COUNT(*) AS road_count
    FROM osm_co_20210214.road_line;

This UNION query is more difficult to read and more difficult to maintain compared to the query that simply joins two tables. Need to change which regions and/or dates are included? It's a simple update in the WHERE clause using the partitioned table where the old method would require completely changing the UNION query.

The new partitioned method still keeps the data separated in a easy-to-manage schema based approach, but it makes writing queries significantly easier.

Summary

This post is the second part of evaluating Postgres native partitioning and if it will help me better manage my OpenStreetMap data. So far the implementation is looking really good. With the partitioning scheme defined and the basic process worked out, adding new partitions should be a trivial operation.

At this point I still have not fully decided that I will move forward with implementing partitioning. Though if I had to bet: I think I will find it to be a significant and worthwhile improvement!

The next post reviews the structure outlined here and identifies one query-time issue.

Need help with your PostgreSQL servers or databases? Contact us to start the conversation!

By Ryan Lambert
Published February 16, 2021
Last Updated February 21, 2021