RustProof Labs: blogging for education (logo)

H3 indexes for performance with PostGIS data

By Ryan Lambert -- Published June 24, 2022

I recently started using the H3 hex grid extension in Postgres with the goal of making some not-so-fast queries faster. My previous post, Using Uber's H3 hex grid in PostGIS, has an introduction to the H3 extension. The focus in that post, admittedly, is a PostGIS focused view instead of an H3 focused view. This post takes a closer look at using the H3 extension to enhance performance of spatial searches.

The two common spatial query patterns considered in this post are:

Setup and Point of Focus

This post uses two tables to examine performance. The following queries add an h3_ix column to the osm.natural_point and osm.building_polygon tables. This approach uses GENERATED columns and adds an index to the column. Going through these steps allow us to remove the need for PostGIS joins at query time for rough distance searches. See my previous post for details about installing the H3 extension and the basics of how it works.

The data used in this post is Colorado OpenStreetMap data loaded using PgOSM Flex.

First, add the H3 index to the osm.natural_point table. It takes about 3 seconds to calculate the H3 index and create an index on the osm.natural_point table.

ALTER TABLE osm.natural_point ADD h3_ix H3INDEX
    GENERATED ALWAYS AS (
        h3_geo_to_h3(ST_Transform(geom, 4326), 7)
    ) STORED
;
CREATE INDEX gix_natural_point_h3_ix
    ON osm.natural_point (h3_ix);

To create the H3 index on the osm.building_polygon table we need to add the ST_Centroid() step into the definition. It takes about 12 seconds to run these steps on the building polygon table.

ALTER TABLE osm.building_polygon ADD h3_ix H3INDEX
    GENERATED ALWAYS AS (
        h3_geo_to_h3(ST_Transform(ST_Centroid(geom), 4326), 7)
    ) STORED
;
CREATE INDEX gix_building_polygon_h3_ix
    ON osm.building_polygon (h3_ix);

The H3 indexes are now calculated for these two tables. My next step is to pick a building to use for a point of focus for the coming spatial search examples. I select a building, Pangea Coffee Roasters, in Golden, Colorado using its osm_id. The following query returns the h3_ix, the point is shown on the map in the following screenshot.

SELECT osm_id, name, h3_ix, geom
    FROM osm.building_polygon
    WHERE osm_id = 392853178
;

┌───────────┬────────────────────────┬─────────────────┐
│  osm_id   │          name          │      h3_ix      │
╞═══════════╪════════════════════════╪═════════════════╡
│ 392853178 │ Pangea Coffee Roasters │ 872681364ffffff │
└───────────┴────────────────────────┴─────────────────┘

Screenshot from DBeaver's spatial viewer showing Pangea Coffee Roasters as a blue point in Downtown Golden Colorado

The query above returned the h3_ix value (872681364ffffff) for the resolution 7 hex containing Pangea Coffee Roasters. We can see the polygon of the hex using the h3_to_geo_boundary_geometry() function.

SELECT h3_to_geo_boundary_geometry('872681364ffffff');

Screenshot from DBeaver's spatial viewer showing the H3 hexagon for index 872681364ffffff, containing Pangea Coffee Roasters

Using H3's grid traversal functions it is easy to find a basic relationship between the number of rings of hexagons and the size of radius. The following query uses the h3_k_ring() function to include 3 rings of hexes surrounding the hex of focus (872681364ffffff). It just so happens, that this 3-ring hex buffer roughly lines up with a 10 kilometer buffer calculated using SRID 3857. The 10 km buffer is also shown for comparison.

-- Resolution 7, looking out 3 rings -- Roughly 10 km radius!
SELECT b.osm_id,
        ST_Buffer(b.geom, 10000) AS buffer_10km,
        b.geom,
        h3_to_geo_boundary_geometry(
            h3_k_ring(h3_geo_to_h3(ST_Transform(b.geom, 4326), 7), 3)
            ) AS ix_ring
    FROM osm.vbuilding_all b
    WHERE b.osm_id = 392853178
;

Screenshot from DBeaver's spatial viewer Pangea Coffee Roasters as a blue point in Downtown Golden Colorado

The h3_k_ring() is one of many functions offered by the H3 index. Check out the full API reference for a look at the rest of the functionality.

To get a radius 4 times larger we can multiply both values by 4 to use a 40 km buffer and a 12-ring hex buffer in h3_k_ring() function. This results in a grid of 469 hexagons.

-- Looking out 3*4 (12) rings -- Roughly 10*4 km (40km) radius!
SELECT b.osm_id,
        ST_Buffer(b.geom, 10000*4) AS buffer_10km,
        b.geom,
        h3_to_geo_boundary_geometry(
            h3_k_ring(h3_geo_to_h3(ST_Transform(b.geom, 4326), 7), 3*4)
            ) AS ix_ring
    FROM osm.vbuilding_all b
    WHERE b.osm_id = 392853178
;

Screenshot from DBeaver's spatial viewer Pangea Coffee Roasters as a blue point in Downtown Golden Colorado

The above examples show that I was able to quickly establish a rough relationship between a measurement we know (10 km) with a more abstract measurement (3 hex rings at resolution 7). It is helpful for us humans and usability to have a somewhat familiar frame of reference.

Nearest Neighbor Style Searches

This example uses PostGIS's ST_DWithin() function to find trees within 10 kilometers from the focus point. This query with a traditional spatial join takes 14 - 25 ms to run. It returns around 12,000 rows.

EXPLAIN (ANALYZE, BUFFERS, VERBOSE, FORMAT JSON)
SELECT b.osm_id, b.name, b.height, n.geom
    FROM osm.building_polygon b
    INNER JOIN osm.natural_point n
        ON n.osm_type = 'tree' AND ST_DWithin(b.geom, n.geom, 10000)
    WHERE b.osm_id = 392853178
;

The query plan visualized by pgMustard shows this query is decently efficient. The following screenshot shows the main overview of the nodes with the total time and the node timings. The GIST index on natural_point was used, taking 8ms of the query time.

Screenshot from pgMustard showing the index plan used for the spatial query identifying trees within 10 km of the selected building.

If you look closely at row estimates (open that link to pgMustard!) you'll notice that the estimated rows returned from natural_point was significantly off from the actual number of rows. In the case of this query, it was "out by a factor of 390.4." Spatial joins commonly have poor row estimates used by the query planner and are often the cause of poor spatial query performance.

The next query uses the H3 index from the selected building, and passes h3_ix into the h3_k_ring() function. This creates the 3-ring hex buffer to use for the join to the osm.natural_point table, replacing the spatial join. The query with the H3 index is 73% faster, returning in 4 - 6 ms. The query plan shows that two indexes are used again. This query also returns around 12,000 rows.

EXPLAIN (ANALYZE, BUFFERS, VERBOSE, FORMAT JSON)
WITH rng AS (
SELECT h3_k_ring(h3_ix, 3)
            AS ix_ring
    FROM osm.building_polygon b
    WHERE b.osm_id = 392853178
)
SELECT *
    FROM rng b
    INNER JOIN osm.natural_point n
        ON n.osm_type = 'tree'
            AND n.h3_ix = b.ix_ring
;

Screenshot from pgMustard showing the index plan used for the H3 query identifying trees within 3-ring hex grid of the selected building at resolution 7.

Increasing result size

This section switches the query to look for buildings within a 40 km radius instead of trees within 10 km. With this change the following queries return roughly 340,000 rows each. The spatial join for with ST_DWithin() and a 40km radius returns in 459 ms. Using a 12-ring hex buffer returns in about 100 ms, 78% faster.

SELECT b.osm_id, b.name, b.height, n.geom
    FROM osm.building_polygon b
    INNER JOIN osm.building_polygon n
        ON ST_DWithin(b.geom, n.geom, 10000*4)
    WHERE b.osm_id = 392853178
;

WITH rng AS (
SELECT h3_k_ring(h3_ix, 3*4)
            AS ix_ring
    FROM osm.building_polygon b
    WHERE b.osm_id = 392853178
)
SELECT *
    FROM rng b
    INNER JOIN osm.building_polygon n
        ON n.h3_ix = b.ix_ring
;

Regional Analysis

The queries above use the pattern of finding "those" within "a distance" of "this." This section switches to the a query to answer "show me this pattern across a region" questions. My post Find missing crossings in OpenStreetMap with PostGIS is an example that type of analysis.

Since my test database already has H3 indexes available for trees and buildings, this example calculates the number of trees per building within each hex. The first query uses spatial joins between the feature table and the h3.hex table.

The h3.hex table was created to cover Colorado following the steps I outlined in my previous post under "Save the hexes."

WITH bldgs AS (
SELECT h.ix AS h3_ix, 'building' AS src, COUNT(*) AS cnt
    FROM osm.building_polygon b
    INNER JOIN h3.hex h ON ST_Contains(h.geom_3857, ST_Centroid(b.geom))
    GROUP BY h.ix
), trs AS (
SELECT h.ix AS h3_ix, n.osm_type AS src, COUNT(*) AS cnt
    FROM osm.natural_point n
    INNER JOIN h3.hex h ON ST_Contains(h.geom_3857, n.geom)
    WHERE n.osm_type = 'tree'
    GROUP BY h.ix, n.osm_type
)
SELECT COALESCE(b.h3_ix, t.h3_ix) AS h3_ix,
        CASE WHEN b.cnt > 0
            THEN 1.0 * t.cnt / b.cnt
            ELSE NULL
            END AS tree_per_bldg
    FROM bldgs b
    FULL JOIN trs t ON b.h3_ix = t.h3_ix
;

The next query uses the H3 indexes. By using these indexes, the aggregate queries do not need to know anything about the "where" of the data, only the H3 index. The h3_ix column is basically just a short piece of text.

WITH bldgs AS (
SELECT h3_ix, 'building' AS src, COUNT(*) AS cnt
    FROM osm.building_polygon
    GROUP BY h3_ix
), trs AS (
SELECT h3_ix, osm_type AS src, COUNT(*) AS cnt
    FROM osm.natural_point
    WHERE osm_type = 'tree'
    GROUP BY h3_ix, osm_type
)
SELECT COALESCE(b.h3_ix, t.h3_ix) AS h3_ix,
        CASE WHEN b.cnt > 0
            THEN 1.0 * t.cnt / b.cnt
            ELSE NULL
            END AS tree_per_bldg
    FROM bldgs b
    FULL JOIN trs t ON b.h3_ix = t.h3_ix
;

The query using the spatial join took 22 seconds (query plan).

The query using the H3 indexes took only 182 ms... That's 99% faster! (query plan)

Sum of buffers: measure of work

When looking at the previous query plan using spatial joins in pgMustard, the "Sum of buffers" detail caught my attention. The following screenshot shows a reported 51 GB of buffers used!

Screenshot from pgMustard showing the Sum of buffers calculation reporting 51 GB of buffers, in total, were used.

I knew the total of the tables involved was well under 1 GB in size. This made me curious, so I asked Michael Christofides, co-founder of pgMustard, what was behind the sum of buffers calculation.

"pgMustard does some arithmetic to calculate (and display) these on a per node basis, in the Buffers section. It also now a works out roughly how much data each of them represents in kB/MB/GB/TB, assuming an 8kB block size.

The sum is simply a crude addition of all of these per-node numbers put together! Naturally blocks can be read more than once for the same query, and reads and writes to disk are double counted, but since I see it as a very rough measure of “work done”, I thought the sum worked surprisingly well. It’s not perfect, and I still focus more on timings, but I think it helps folks understand why some things are slower than they expected!" -- Michael Christofides

Admittedly, the total number by itself threw me off at first. After looking at this value on a number of query plans, it has grown on me! It seems to be a decent indicator how much data a query is using and that is a good thing to have an eye on. Another great addition to an already solid tool.

Summary

This post has explored using the H3 extension in Postgres, with the goal of improving performance for specific types of spatial queries.

The two queries I tested for the nearest neighbor searches performed 73% - 77% faster with H3 indexes. The regional analysis was 99% faster. These are huge wins, especially as data sizes grow and the complexity of spatial queries increases. It is common to have queries using these patterns that can take a few minutes to run. Cutting out 70-99% of the query time sounds good to me!

Need help with your PostgreSQL servers or databases? Contact us to start the conversation!

By Ryan Lambert
Published June 24, 2022
Last Updated June 24, 2022