Identify OpenStreetMap changes with Postgres
The data in the main OpenStreetMap database is constantly changing. Folks around the world are almost certainly saving changes via JOSM, iD, and other editors as you read these words. With change constantly occurring in the data, it is often desirable to have an idea of what has actually changed in the data. This post explores one approach to tracking changes to the tags attribute data once it has been loaded to Postgres.
The topic of this post surfaced while I was working on refreshing
a project involving travel times (routing). In the process I noticed a
few instances where the analysis had shifted significantly.
My first hunch was that entire segments of road had been
added or removed, but that was not
the cause. It became apparent that tags in the area had been
improved. It was easy to specifically point to the value associated
with the highway
key but I also knew there were other changes
happening, I just wasn't sure what all was involved and at what scale.
Calculate tag hash
The database I am working in has five (5) Colorado
snapshots loaded spanning back to 2018. The tags data is loaded
to a table named osmts.tags
, read my post
Timescale, Compression and OpenStreetMap Tags
for how this table was created. The tags
table has one row for
every OpenStreetMap feature and stores the full key/value attribute data
in a JSONB column (osmts.tags.tags
).
A relatively simple way to detecting change in data is to
calculate the hash for each feature's key/value data. Comparing
hashes for any change will identify rows that had changes to their
attribute data.
For this post I filter the tags down to features with a highway
tag (tags->>'highway' IS NOT NULL
) and limit the dates compared
to the two August 16th dates from 2020 and 2021.
The filter geom_type = 'W'
excludes nodes (points) but will
include lines and polygons.
The MD5 hash of each row's attribute data is calculated in the tags_md5
column by casting the JSONB
to TEXT
inside the md5()
function.
The 1.6 million highway rows are saved to the highway_hashes
temporary table.
DROP TABLE IF EXISTS highway_hashes;
CREATE TEMP TABLE highway_hashes AS
SELECT osm_date, region, geom_type, osm_id,
md5(t.tags::TEXT) AS tags_md5,
t.tags
FROM osmts.tags t
WHERE geom_type = 'W'
AND osm_date IN ('2020-08-16', '2021-08-16')
AND region = 'colorado'
AND tags->>'highway' IS NOT NULL
;
SELECT 1613111
Time: 8138.200 ms (00:08.138)
Note: MD5 hashes are used with the understanding there's a low probability of hash collisions in this non-security related application. A hash collision is where two different inputs produce the same hash output. We are comparing hashes linked on
osm_id
andgeom_type
so we are only comparing two hashes at a time. The accidental clash rate for two arbitrary values hashed with MD5 is 2^-128.
The highway_hashes
table is filled with rows such as this.
┌────────────┬──────────┬───────────┬─────────┬──────────────────────────────────┬────────────────────────────────────────────────┐
│ osm_date │ region │ geom_type │ osm_id │ tags_md5 │ tags │
╞════════════╪══════════╪═══════════╪═════════╪══════════════════════════════════╪════════════════════════════════════════════════╡
│ 2020-08-16 │ colorado │ W │ 4271726 │ aee4dc0d594b82ee885743702a10584a │ {"name": "30th Street", "lanes": "1", "oneway"…│
│ │ │ │ │ │…: "yes", "bicycle": "designated", "highway": "…│
│ │ │ │ │ │…tertiary", "surface": "asphalt", "maxspeed": "…│
│ │ │ │ │ │…20 mph", "cycleway:right": "shared_lane", "sou…│
│ │ │ │ │ │…rce:maxspeed": "City of Boulder Ordinance (Jun…│
│ │ │ │ │ │…e 2020)"} │
└────────────┴──────────┴───────────┴─────────┴──────────────────────────────────┴────────────────────────────────────────────────┘
Examine changes
I use the
LAG()
window function
to line the two versions of each OpenStreetMap record side by side.
The results from this query are saved in a highway_tag_changes
temporary table, only saving the rows where the hashed value has
changed (WHERE tags_md5 <> prior_hash
).
DROP TABLE IF EXISTS highway_tag_changes;
CREATE TEMP TABLE highway_tag_changes AS
WITH current_prior AS (
SELECT osm_date,
LAG(osm_date, 1) OVER
(PARTITION BY region, geom_type, osm_id ORDER BY osm_date)
AS prior_date,
region, geom_type, osm_id, tags_md5,
LAG(tags_md5, 1) OVER
(PARTITION BY region, geom_type, osm_id ORDER BY osm_date)
AS prior_hash,
tags,
LAG(tags, 1) OVER
(PARTITION BY region, geom_type, osm_id ORDER BY osm_date)
AS tags_prior
FROM highway_hashes
)
SELECT *
FROM current_prior
WHERE tags_md5 <> prior_hash
;
SELECT 57170
Time: 8156.579 ms (00:08.157)
The highway_tag_changes
now contains records
that had changes detected in the MD5 hash.
When testing queries it can be handy to identify one row to work with before attempting to scale it to the entire data set. I picked the way 16990294 for my test record. The difference in the hashes are not particularly exciting to see.
SELECT tags_md5, prior_hash
FROM highway_tag_changes
WHERE osm_id = 16990294
;
┌──────────────────────────────────┬──────────────────────────────────┐
│ tags_md5 │ prior_hash │
╞══════════════════════════════════╪══════════════════════════════════╡
│ 87811e0d29912a11c68ed5791e46a12a │ b5a464e16e2f8c1a13bfdc4062960d31 │
└──────────────────────────────────┴──────────────────────────────────┘
More interesting is looking at the two versions of the tags
data.
We can see the tags_prior
had 4 keys while the tags
has 6 keys.
The value of the highway
tag also changed.
SELECT jsonb_pretty(tags_prior) AS tags_prior,
jsonb_pretty(tags) AS tags
FROM highway_tag_changes
WHERE osm_id = 16990294
;
┌─────────────────────────────────────┬─────────────────────────────────────┐
│ tags_prior │ tags │
╞═════════════════════════════════════╪═════════════════════════════════════╡
│ { ↵│ { ↵│
│ "highway": "residential", ↵│ "name": "Monaghan Road", ↵│
│ "tiger:cfcc": "A41", ↵│ "highway": "tertiary", ↵│
│ "tiger:county": "Arapahoe, CO",↵│ "surface": "paved", ↵│
│ "tiger:reviewed": "no" ↵│ "tiger:cfcc": "A41", ↵│
│ } │ "tiger:county": "Arapahoe, CO",↵│
│ │ "tiger:reviewed": "no" ↵│
│ │ } │
└─────────────────────────────────────┴─────────────────────────────────────┘
When the highway key changes from residential
to tertiary
,
the assumed speed limit changes from 25 mph to 45 mph.
Obviously, a travel time analysis will be impacted when a road's
classification has changed. This type of change explains most of
the variance I saw in the travel time analysis.
SELECT osm_type, maxspeed_mph
FROM pgosm.road
WHERE osm_type IN ('residential', 'tertiary')
;
┌─────────────┬──────────────┐
│ osm_type │ maxspeed_mph │
╞═════════════╪══════════════╡
│ tertiary │ 45.00 │
│ residential │ 25.00 │
└─────────────┴──────────────┘
There are a number of ways to track changes in the OpenStreetMap data, including using the Overpass API.
Comparing JSON
The previous section showed how to see the full JSON of both versions, knowing there are differences. The downside is it requires visual inspection to identify what actually changed. What would be handy is to have a query show me the differences between the JSON objects. A quick search turned up Dmitry Savinkov's function in an answer on StackOverflow. I have modified the function here slightly.
CREATE OR REPLACE FUNCTION osmts.jsonb_diff(val1 JSONB, val2 JSONB)
RETURNS JSONB
LANGUAGE plpgsql
AS
$$
DECLARE
result JSONB;
v RECORD;
BEGIN
result = val1;
FOR v IN
SELECT *
FROM jsonb_each(val2)
LOOP
IF result @> jsonb_build_object(v.key, v.value)
THEN result = result - v.key;
ELSIF result ? v.key
THEN CONTINUE;
ELSE
result = result || jsonb_build_object(v.key, '<missing>');
END IF;
END LOOP;
RETURN result;
END;
$$;
One of the changes I made is to avoid reporting
'null'
as a string. I do not like having "null" strings represent actualNULL
values as I have at least one database with a legitimate "NULL" string in a name. The value<missing>
is both clear and is unlikely to appear as a valid string anywhere in data.
The following query uses the osmts.jsonb_diff()
function twice,
once to calculate the changed_from
value and another to calculate the
changed_to
value.
This query makes it easy to see the name
and surface
tags are both
new and the highway
tag was changed from "residential" to "tertiary".
The unchanged keys are not reported.
SELECT
jsonb_pretty(osmts.jsonb_diff(tags_prior, tags)) AS changed_from,
jsonb_pretty(osmts.jsonb_diff(tags, tags_prior)) AS changed_to
FROM highway_tag_changes c
WHERE osm_id = 16990294
;
┌───────────────────────────────┬──────────────────────────────┐
│ changed_from │ changed_to │
╞═══════════════════════════════╪══════════════════════════════╡
│ { ↵│ { ↵│
│ "name": "<missing>", ↵│ "name": "Monaghan Road",↵│
│ "highway": "residential",↵│ "highway": "tertiary", ↵│
│ "surface": "<missing>" ↵│ "surface": "paved" ↵│
│ } │ } │
└───────────────────────────────┴──────────────────────────────┘
To get a listing of just the keys that were changed without
their associated values, the jsonb_object_keys()
function can be used.
SELECT osm_id,
jsonb_object_keys(osmts.jsonb_diff(c.tags, c.tags_prior)) AS changed_tag
FROM highway_tag_changes c
WHERE osm_id = 16990294
;
┌──────────┬─────────────┐
│ osm_id │ changed_tag │
╞══════════╪═════════════╡
│ 16990294 │ name │
│ 16990294 │ highway │
│ 16990294 │ surface │
└──────────┴─────────────┘
Changes by key
The highway
tag is only one of the keys
that can impact our travel time analysis and I need to get an
idea of what other changes are happening. The following query
shows the five (5) most commonly changed keys in the highway data,
excluding the tiger%
keys. The tiger%
keys are common,
but go unused by this analysis so are ignored here.
WITH changed_tags AS (
SELECT osm_id,
jsonb_object_keys(osmts.jsonb_diff(c.tags, c.tags_prior)) AS changed_tag
FROM highway_tag_changes c
)
SELECT changed_tag, COUNT(osm_id) AS osm_objects_changed
FROM changed_tags
WHERE changed_tag NOT LIKE 'tiger%'
GROUP BY changed_tag
ORDER BY 2 DESC
LIMIT 5
;
Looking at the top 5 changed tags, 4 of them affect routing and travel
times. The name
tag is the only key shown below that is not important
for routing.
┌─────────────┬─────────────────────┐
│ changed_tag │ osm_objects_changed │
╞═════════════╪═════════════════════╡
│ surface │ 27356 │
│ highway │ 10597 │
│ maxspeed │ 5408 │
│ name │ 3470 │
│ access │ 3340 │
└─────────────┴─────────────────────┘
The most common change was the surface
key with 27,356 changes reported.
The following query creates some aggregates using the surface tag
to give an idea of how the data has changed.
WITH changed_tags AS (
SELECT osm_id,
jsonb_object_keys(osmts.jsonb_diff(c.tags, c.tags_prior)) AS changed_tag
FROM highway_tag_changes c
), surface_changes AS (
SELECT c.osm_id, c.changed_tag, tags->>'surface' AS surface_current,
tags_prior->>'surface' AS surface_prior
FROM changed_tags c
INNER JOIN highway_tag_changes t ON c.osm_id = t.osm_id
WHERE c.changed_tag = 'surface'
)
SELECT changed_tag, COUNT(osm_id),
COUNT(*) FILTER (WHERE surface_prior IS NULL) AS new_tag,
COUNT(*) FILTER (WHERE surface_current IS NULL) AS removed_tag,
COUNT(*) FILTER (WHERE surface_current <> surface_prior) AS changes
FROM surface_changes
GROUP BY changed_tag
;
The results from the above query show that 25,722 of the changes in the
surface
tag are new additions of the tags.
┌─────────────┬───────┬─────────┬─────────────┬─────────┐
│ changed_tag │ count │ new_tag │ removed_tag │ changes │
╞═════════════╪═══════╪═════════╪═════════════╪═════════╡
│ surface │ 27356 │ 25772 │ 48 │ 1536 │
└─────────────┴───────┴─────────┴─────────────┴─────────┘
Versions
The following query shows the versions of Postgres, PostGIS, PgOSM Flex and osm2pgsql used for this post.
SELECT version(),
postgis_version(),
pgosm_flex_version, osm2pgsql_version
FROM osm_co20210816.pgosm_flex
;
┌─[ RECORD 1 ]───────┬─────────────────────────────────────────────────────────────────────────────────────┐
│ version │ PostgreSQL 13.4 (Ubuntu 13.4-1.pgdg20.04+1) on x86_64-pc-linux-gnu, compiled by gcc…│
│ │… (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0, 64-bit │
│ postgis_version │ 3.1 USE_GEOS=1 USE_PROJ=1 USE_STATS=1 │
│ pgosm_flex_version │ 0.2.1-0d6772b │
│ osm2pgsql_version │ 1.5.0 │
└────────────────────┴─────────────────────────────────────────────────────────────────────────────────────┘
Summary
Change in our data is a constant expectation regardless of the data source. This post has covered an approach to tracking changes in OpenStreetMap attribute data in Postgres. My focus here was to identify the changes that caused a shift in analysis and this method quickly focused my attention on the related changes in the data. The basic concept of this comparison can be expanded to look for any type of changes across the OpenStreetMap data set.
Need help with your PostgreSQL servers or databases? Contact us to start the conversation!