RustProof Labs: blogging for education (logo)
My book Mastering PostGIS and OpenStreetMap is available!

Identify OpenStreetMap changes with Postgres

By Ryan Lambert -- Published August 30, 2021

The data in the main OpenStreetMap database is constantly changing. Folks around the world are almost certainly saving changes via JOSM, iD, and other editors as you read these words. With change constantly occurring in the data, it is often desirable to have an idea of what has actually changed in the data. This post explores one approach to tracking changes to the tags attribute data once it has been loaded to Postgres.

The topic of this post surfaced while I was working on refreshing a project involving travel times (routing). In the process I noticed a few instances where the analysis had shifted significantly. My first hunch was that entire segments of road had been added or removed, but that was not the cause. It became apparent that tags in the area had been improved. It was easy to specifically point to the value associated with the highway key but I also knew there were other changes happening, I just wasn't sure what all was involved and at what scale.

Calculate tag hash

The database I am working in has five (5) Colorado snapshots loaded spanning back to 2018. The tags data is loaded to a table named osmts.tags, read my post Timescale, Compression and OpenStreetMap Tags for how this table was created. The tags table has one row for every OpenStreetMap feature and stores the full key/value attribute data in a JSONB column (osmts.tags.tags). A relatively simple way to detecting change in data is to calculate the hash for each feature's key/value data. Comparing hashes for any change will identify rows that had changes to their attribute data.

For this post I filter the tags down to features with a highway tag (tags->>'highway' IS NOT NULL) and limit the dates compared to the two August 16th dates from 2020 and 2021. The filter geom_type = 'W' excludes nodes (points) but will include lines and polygons. The MD5 hash of each row's attribute data is calculated in the tags_md5 column by casting the JSONB to TEXT inside the md5() function. The 1.6 million highway rows are saved to the highway_hashes temporary table.

DROP TABLE IF EXISTS highway_hashes;
CREATE TEMP TABLE highway_hashes AS
SELECT osm_date, region, geom_type, osm_id,
        md5(t.tags::TEXT) AS tags_md5,
        t.tags
    FROM osmts.tags t
    WHERE geom_type = 'W'
        AND osm_date IN ('2020-08-16', '2021-08-16')
        AND region = 'colorado'
        AND tags->>'highway' IS NOT NULL
;

SELECT 1613111
Time: 8138.200 ms (00:08.138)

Note: MD5 hashes are used with the understanding there's a low probability of hash collisions in this non-security related application. A hash collision is where two different inputs produce the same hash output. We are comparing hashes linked on osm_id and geom_type so we are only comparing two hashes at a time. The accidental clash rate for two arbitrary values hashed with MD5 is 2^-128.

The highway_hashes table is filled with rows such as this.

┌────────────┬──────────┬───────────┬─────────┬──────────────────────────────────┬────────────────────────────────────────────────┐
│  osm_date  │  region  │ geom_type │ osm_id  │             tags_md5             │                      tags                      │
╞════════════╪══════════╪═══════════╪═════════╪══════════════════════════════════╪════════════════════════════════════════════════╡
│ 2020-08-16 │ colorado │ W         │ 4271726 │ aee4dc0d594b82ee885743702a10584a │ {"name": "30th Street", "lanes": "1", "oneway"…│
│            │          │           │         │                                  │…: "yes", "bicycle": "designated", "highway": "…│
│            │          │           │         │                                  │…tertiary", "surface": "asphalt", "maxspeed": "…│
│            │          │           │         │                                  │…20 mph", "cycleway:right": "shared_lane", "sou…│
│            │          │           │         │                                  │…rce:maxspeed": "City of Boulder Ordinance (Jun…│
│            │          │           │         │                                  │…e 2020)"}                                      │
└────────────┴──────────┴───────────┴─────────┴──────────────────────────────────┴────────────────────────────────────────────────┘

Examine changes

I use the LAG() window function to line the two versions of each OpenStreetMap record side by side. The results from this query are saved in a highway_tag_changes temporary table, only saving the rows where the hashed value has changed (WHERE tags_md5 <> prior_hash).

DROP TABLE IF EXISTS highway_tag_changes;
CREATE TEMP TABLE highway_tag_changes AS
WITH current_prior AS (
SELECT osm_date,
        LAG(osm_date, 1) OVER
        (PARTITION BY region, geom_type, osm_id ORDER BY osm_date)
            AS prior_date,
        region, geom_type, osm_id, tags_md5,
        LAG(tags_md5, 1) OVER
        (PARTITION BY region, geom_type, osm_id ORDER BY osm_date)
            AS prior_hash,
        tags,
        LAG(tags, 1) OVER
        (PARTITION BY region, geom_type, osm_id ORDER BY osm_date)
            AS tags_prior
    FROM highway_hashes
)
SELECT *
    FROM current_prior
    WHERE tags_md5 <> prior_hash
;

SELECT 57170
Time: 8156.579 ms (00:08.157)

The highway_tag_changes now contains records that had changes detected in the MD5 hash.

When testing queries it can be handy to identify one row to work with before attempting to scale it to the entire data set. I picked the way 16990294 for my test record. The difference in the hashes are not particularly exciting to see.

SELECT tags_md5, prior_hash
    FROM highway_tag_changes 
    WHERE osm_id = 16990294
;

┌──────────────────────────────────┬──────────────────────────────────┐
│             tags_md5             │            prior_hash            │
╞══════════════════════════════════╪══════════════════════════════════╡
│ 87811e0d29912a11c68ed5791e46a12a │ b5a464e16e2f8c1a13bfdc4062960d31 │
└──────────────────────────────────┴──────────────────────────────────┘

More interesting is looking at the two versions of the tags data. We can see the tags_prior had 4 keys while the tags has 6 keys. The value of the highway tag also changed.

SELECT jsonb_pretty(tags_prior) AS tags_prior,
        jsonb_pretty(tags) AS tags
    FROM highway_tag_changes
    WHERE osm_id = 16990294
;

┌─────────────────────────────────────┬─────────────────────────────────────┐
│             tags_prior              │                tags                 │
╞═════════════════════════════════════╪═════════════════════════════════════╡
│ {                                  ↵│ {                                  ↵│
│     "highway": "residential",      ↵│     "name": "Monaghan Road",       ↵│
│     "tiger:cfcc": "A41",           ↵│     "highway": "tertiary",         ↵│
│     "tiger:county": "Arapahoe, CO",↵│     "surface": "paved",            ↵│
│     "tiger:reviewed": "no"         ↵│     "tiger:cfcc": "A41",           ↵│
│ }                                   │     "tiger:county": "Arapahoe, CO",↵│
│                                     │     "tiger:reviewed": "no"         ↵│
│                                     │ }                                   │
└─────────────────────────────────────┴─────────────────────────────────────┘

When the highway key changes from residential to tertiary, the assumed speed limit changes from 25 mph to 45 mph. Obviously, a travel time analysis will be impacted when a road's classification has changed. This type of change explains most of the variance I saw in the travel time analysis.

SELECT osm_type, maxspeed_mph 
    FROM pgosm.road
    WHERE osm_type IN ('residential', 'tertiary')
;

┌─────────────┬──────────────┐
│  osm_type   │ maxspeed_mph │
╞═════════════╪══════════════╡
│ tertiary    │        45.00 │
│ residential │        25.00 │
└─────────────┴──────────────┘

There are a number of ways to track changes in the OpenStreetMap data, including using the Overpass API.

Comparing JSON

The previous section showed how to see the full JSON of both versions, knowing there are differences. The downside is it requires visual inspection to identify what actually changed. What would be handy is to have a query show me the differences between the JSON objects. A quick search turned up Dmitry Savinkov's function in an answer on StackOverflow. I have modified the function here slightly.

CREATE OR REPLACE FUNCTION osmts.jsonb_diff(val1 JSONB, val2 JSONB)
RETURNS JSONB
LANGUAGE plpgsql
AS
$$
DECLARE
  result JSONB;
  v RECORD;
BEGIN
    result = val1;
    FOR v IN
    SELECT *
        FROM jsonb_each(val2)
    LOOP
        IF result @> jsonb_build_object(v.key, v.value)
            THEN result = result - v.key;
        ELSIF result ? v.key
            THEN CONTINUE;
        ELSE
            result = result || jsonb_build_object(v.key, '<missing>');
        END IF;
    END LOOP;
    RETURN result;
END;
$$;

One of the changes I made is to avoid reporting 'null' as a string. I do not like having "null" strings represent actual NULL values as I have at least one database with a legitimate "NULL" string in a name. The value <missing> is both clear and is unlikely to appear as a valid string anywhere in data.

The following query uses the osmts.jsonb_diff() function twice, once to calculate the changed_from value and another to calculate the changed_to value. This query makes it easy to see the name and surface tags are both new and the highway tag was changed from "residential" to "tertiary". The unchanged keys are not reported.

SELECT 
        jsonb_pretty(osmts.jsonb_diff(tags_prior, tags)) AS changed_from,
        jsonb_pretty(osmts.jsonb_diff(tags, tags_prior)) AS changed_to
    FROM highway_tag_changes c
    WHERE osm_id = 16990294
;

┌───────────────────────────────┬──────────────────────────────┐
│         changed_from          │          changed_to          │
╞═══════════════════════════════╪══════════════════════════════╡
│ {                            ↵│ {                           ↵│
│     "name": "<missing>",     ↵│     "name": "Monaghan Road",↵│
│     "highway": "residential",↵│     "highway": "tertiary",  ↵│
│     "surface": "<missing>"   ↵│     "surface": "paved"      ↵│
│ }                             │ }                            │
└───────────────────────────────┴──────────────────────────────┘

To get a listing of just the keys that were changed without their associated values, the jsonb_object_keys() function can be used.

SELECT osm_id,
        jsonb_object_keys(osmts.jsonb_diff(c.tags, c.tags_prior)) AS changed_tag
    FROM highway_tag_changes c
    WHERE osm_id = 16990294
;

┌──────────┬─────────────┐
│  osm_id  │ changed_tag │
╞══════════╪═════════════╡
│ 16990294 │ name        │
│ 16990294 │ highway     │
│ 16990294 │ surface     │
└──────────┴─────────────┘

Changes by key

The highway tag is only one of the keys that can impact our travel time analysis and I need to get an idea of what other changes are happening. The following query shows the five (5) most commonly changed keys in the highway data, excluding the tiger% keys. The tiger% keys are common, but go unused by this analysis so are ignored here.

WITH changed_tags AS (
SELECT osm_id,
        jsonb_object_keys(osmts.jsonb_diff(c.tags, c.tags_prior)) AS changed_tag
    FROM highway_tag_changes c
)
SELECT changed_tag, COUNT(osm_id) AS osm_objects_changed
    FROM changed_tags
    WHERE changed_tag NOT LIKE 'tiger%'
    GROUP BY changed_tag
    ORDER BY 2 DESC
    LIMIT 5
;

Looking at the top 5 changed tags, 4 of them affect routing and travel times. The name tag is the only key shown below that is not important for routing.

┌─────────────┬─────────────────────┐
│ changed_tag │ osm_objects_changed │
╞═════════════╪═════════════════════╡
│ surface     │               27356 │
│ highway     │               10597 │
│ maxspeed    │                5408 │
│ name        │                3470 │
│ access      │                3340 │
└─────────────┴─────────────────────┘

The most common change was the surface key with 27,356 changes reported. The following query creates some aggregates using the surface tag to give an idea of how the data has changed.

WITH changed_tags AS (
SELECT osm_id,
        jsonb_object_keys(osmts.jsonb_diff(c.tags, c.tags_prior)) AS changed_tag
    FROM highway_tag_changes c
), surface_changes AS (
SELECT c.osm_id, c.changed_tag, tags->>'surface' AS surface_current,
        tags_prior->>'surface' AS surface_prior
    FROM changed_tags c
    INNER JOIN highway_tag_changes t ON c.osm_id = t.osm_id
    WHERE c.changed_tag = 'surface'
)
SELECT changed_tag, COUNT(osm_id), 
        COUNT(*) FILTER (WHERE surface_prior IS NULL) AS new_tag,
        COUNT(*) FILTER (WHERE surface_current IS NULL) AS removed_tag,
        COUNT(*) FILTER (WHERE surface_current <> surface_prior) AS changes
    FROM surface_changes
    GROUP BY changed_tag
;

The results from the above query show that 25,722 of the changes in the surface tag are new additions of the tags.

┌─────────────┬───────┬─────────┬─────────────┬─────────┐
│ changed_tag │ count │ new_tag │ removed_tag │ changes │
╞═════════════╪═══════╪═════════╪═════════════╪═════════╡
│ surface     │ 27356 │   25772 │          48 │    1536 │
└─────────────┴───────┴─────────┴─────────────┴─────────┘

Versions

The following query shows the versions of Postgres, PostGIS, PgOSM Flex and osm2pgsql used for this post.

SELECT version(),
        postgis_version(),
        pgosm_flex_version, osm2pgsql_version
    FROM osm_co20210816.pgosm_flex
;
┌─[ RECORD 1 ]───────┬─────────────────────────────────────────────────────────────────────────────────────┐
│ version            │ PostgreSQL 13.4 (Ubuntu 13.4-1.pgdg20.04+1) on x86_64-pc-linux-gnu, compiled by gcc…│
│                    │… (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0, 64-bit                                       │
│ postgis_version    │ 3.1 USE_GEOS=1 USE_PROJ=1 USE_STATS=1                                               │
│ pgosm_flex_version │ 0.2.1-0d6772b                                                                       │
│ osm2pgsql_version  │ 1.5.0                                                                               │
└────────────────────┴─────────────────────────────────────────────────────────────────────────────────────┘

Summary

Change in our data is a constant expectation regardless of the data source. This post has covered an approach to tracking changes in OpenStreetMap attribute data in Postgres. My focus here was to identify the changes that caused a shift in analysis and this method quickly focused my attention on the related changes in the data. The basic concept of this comparison can be expanded to look for any type of changes across the OpenStreetMap data set.

Need help with your PostgreSQL servers or databases? Contact us to start the conversation!

By Ryan Lambert
Published August 30, 2021
Last Updated August 30, 2021