
Working with GPS data in PostGIS

By Ryan Lambert -- Published December 18, 2023

One of the key elements to using PostGIS is having spatial data to work with! Lucky for us, one big difference today compared to the not-so-distant past is that essentially everyone is carrying a GPS unit with them nearly everywhere. This makes it easy to create your own GPS data that you can then load into PostGIS! This post explores some basics of loading GPS data to PostGIS and cleaning it for use. It turns out, GPS data from nearly any GPS-enabled device comes with some... character. Getting from the raw input to usable spatial data takes a bit of effort.

This post starts with using ogr2ogr to load the .gpx data to PostGIS. Once the data is in PostGIS we actually want to do something with it, but before the data is completely usable we should spend some time cleaning it. Technically you can start querying the data right away, however I have found there is always data cleanup and processing involved first to make the data truly useful.

This is especially true when using data collected over longer periods of time with a variety of users and data sources.

Travel database project

Before I started writing this post I had assumed that all of the code would be contained in the body of this post. It turned out that to get to the quality I wanted, I had to create a new travel database project (MIT licensed) to share the code. This is already a long post without including a few hundred more lines of code! The travel-db project creates a few tables, a view, and a stored procedure in the travel schema. The stored procedure travel.load_bad_elf_points is responsible for cleaning and importing the data and weighs in at nearly 500 lines of code itself.

Create database and deploy schema

To follow along you need a Postgres database with the PostGIS and convert extensions installed. Connect to your instance as a superuser with psql and run the following commands.

CREATE DATABASE travel_demo;
\c travel_demo

CREATE EXTENSION IF NOT EXISTS postgis;

-- See: https://github.com/rustprooflabs/convert
CREATE EXTENSION IF NOT EXISTS convert;

Some queries in this post use the PgDD extension (GitHub). This extension is not required for using the travel database; install it if you want to follow along with those queries.

CREATE EXTENSION IF NOT EXISTS pgdd;

To deploy the travel database schema, clone the project and deploy with Sqitch.

git clone git@github.com:rustprooflabs/travel-db.git
cd travel-db/db
sqitch db:pg:travel_demo deploy

The travel schema is created with a few tables, a view, and a stored procedure to clean and load the data from the ogr2ogr tables. The code in the project is currently targeted at the Bad Elf GPS unit used to collect the data I've worked with recently.

Data source

The data sources I'm using for this post are personal GPS traces from a few trips we have taken in 2023. The data was recorded by a Bad Elf GNSS Surveyor unit. What's especially fun about this project for me is it includes data from our travels (already fun!) with four (4) modes of travel: foot, train, car, and airplane. With multiple modes of travel, and knowing each mode has unique nuances, I get to write a lot of fun queries for different scenarios! 🤓

While I used a Bad Elf unit, there are plenty of apps that can be installed on your phone or tablet to track your location over time without an additional GPS unit. The apps I have used make it easy to send yourself the trace in .gpx format.

GPX data to PostGIS

This section shows how to load gpx data into PostGIS using the powerful ogr2ogr tool. The first step I take to load GPX data is to drop and recreate the staging schema. This step assumes that the only processes using the staging schema will auto-create their tables, which is easy enough when using ogr2ogr.

psql -d travel_demo -c "DROP SCHEMA IF EXISTS staging CASCADE; CREATE SCHEMA staging;"

The following command uses ogr2ogr to load the Flight-to-PASS-2023.gpx file into the staging schema of the travel_demo database. You can download this GPX file (2.5 MB) to use as a real-world example along with this post.

ogr2ogr -t_srs EPSG:3857 \
    -f "PostgreSQL" PG:"host=localhost dbname=travel" \
    -overwrite -lco SCHEMA=staging \
    /data/badelf/Travel/Seattle-PASS-2023/Flight-to-PASS-2023.gpx

The above command creates a handful of tables, though I am only using one of those tables here. The staging.track_points table has everything I want to work with for this post. The following query using PgDD shows the table has 10,347 rows from the one .gpx file imported.

SELECT s_name, t_name, rows, size_plus_indexes
    FROM dd.tables
    WHERE s_name = 'staging'
        AND t_name = 'track_points'
;

┌─────────┬──────────────┬───────┬───────────────────┐
│ s_name  │    t_name    │ rows  │ size_plus_indexes │
╞═════════╪══════════════╪═══════╪═══════════════════╡
│ staging │ track_points │ 10347 │ 1360 kB           │
└─────────┴──────────────┴───────┴───────────────────┘
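
If you skipped installing PgDD, Postgres' own statistics views give a similar (approximate) picture. This is a minimal alternative; note that n_live_tup is an estimate maintained by (auto)vacuum and analyze, not an exact count.

-- Approximate row counts without PgDD, from the statistics collector.
SELECT schemaname, relname, n_live_tup
    FROM pg_stat_user_tables
    WHERE schemaname = 'staging'
    ORDER BY relname
;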

At this point, you have some data in the staging schema. Now what? If you visualize the first couple hundred rows in DBeaver you'd notice it appears I was wandering around on the tarmac outside the airport terminal. I promise, I was sitting safely inside in a crowded nook on an uncomfortable and over-worn seat! Instead, the following image skips over those initial points to visualize the points from when the airplane was taxiing and taking off. The following query selects these points based on the ogc_fid column.

SELECT *
    FROM staging.track_points
    WHERE ogc_fid > 1500 AND ogc_fid < 2300
;

Screenshot showing points from GPX trace of taxiing and takeoff from Denver International Airport, headed toward Seattle.

The above screenshot shows the first hint of data quality issues that need to be dealt with: GPS units are far from perfect, especially when real humans are using them. There is a 35 second gap between ogc_fid = 2185 and ogc_fid = 2186 during takeoff. This can be seen in the above screenshot in the lower right corner as an obvious gap of missing blue dots. During that time we accelerated from 61.9 m/s to 103.5 m/s (138 mph to 231 mph) and climbed just over 200 meters in elevation. My suspicion is that some combination of the severe acceleration and change in pitch during takeoff caused the Bad Elf to lose track of the satellites.

My Bad Elf was secured in a side pocket of my laptop bag, which was underneath the seat in front of me.
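
If you want to find gaps like that yourself, a quick window function query is enough. The following is an ad-hoc sketch, not part of the travel-db project; it uses LAG() over the time column ogr2ogr creates to compare each point's timestamp to the prior point's.

-- Ad-hoc gap check: largest time gaps between consecutive points.
-- "time" is the timestamp column created by ogr2ogr's GPX driver.
SELECT ogc_fid, "time",
        "time" - LAG("time") OVER (ORDER BY "time") AS gap
    FROM staging.track_points
    ORDER BY gap DESC NULLS LAST
    LIMIT 5
;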

Setting up Travel DB

The previous section showed how to use ogr2ogr to import GPS traces from GPX format into PostGIS. The next sections show how to set up and use our new travel database to import data in a way that tracks more than one trip, with higher data quality and usability.

My goals for this project were to support any number of trips where each trip has one or more legs of travel. Each leg of travel can have one or more steps. To illustrate these components, we'll look closer at our trip to the PASS Summit in November. That trip consisted of everything between leaving home and returning home. The first leg of the trip involved driving from home to Denver International Airport (DIA), walking through DIA and onto a plane, flying to Seattle's airport (SEATAC), and walking through the airport to get to a taxi that took us to the hotel. I recorded that as four steps to make up one leg of this trip.

In the travel database, the travel.trip table defines the main trip record. The travel.trip_step table captures the detail of the various legs and steps, where "to Seattle" is a leg, and each mode of travel (car, foot, airplane) is represented as a different step.
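
The real DDL lives in the travel-db project; a rough sketch of the two tables' shape, inferred only from the columns used in the queries below, looks something like the following.

-- Inferred sketch only; see the travel-db project for the actual DDL.
CREATE TABLE travel.trip (
    trip_id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    trip_name TEXT NOT NULL,
    timeframe TSTZRANGE NOT NULL,
    trip_desc TEXT NULL
);

CREATE TABLE travel.trip_step (
    trip_step_id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    trip_id BIGINT NOT NULL REFERENCES travel.trip (trip_id),
    leg_name TEXT NOT NULL,
    step_name TEXT NOT NULL,
    timeframe TSTZRANGE NOT NULL,
    travel_mode_id BIGINT NOT NULL
        REFERENCES travel.travel_mode (travel_mode_id)
);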

Set up records for trips and steps

The following queries set up records in the travel.trip and travel.trip_step tables, using an anonymous code block to enable setting and re-using the _trip_id variable. The travel.trip_step table defines a timeframe column as a TSTZRANGE. This timeframe column is very important: it is used to match data from the input files (.gpx) to the trip steps. This is helpful in most cases, though points with timestamps outside every step's timeframe are silently dropped, which may be unexpected.

DO $$
    DECLARE
        _trip_id BIGINT;
    BEGIN

    WITH add_trip AS (
    INSERT INTO travel.trip (trip_name, timeframe, trip_desc)
        VALUES ('PASS 2023', TSTZRANGE('2023-11-12 00:00:00 -07'::TIMESTAMPTZ,
                                    '2023-12-16 23:59:59 -07'::TIMESTAMPTZ)
                , 'Trip to PASS 2023 to present PostGIS pre-conference session!')
        RETURNING trip_id
    )
    SELECT trip_id INTO _trip_id
        FROM add_trip
    ;

    INSERT INTO travel.trip_step (trip_id, leg_name, step_name, timeframe, travel_mode_id)
    SELECT _trip_id, 'To Seattle', 'At DIA',
            '["2023-11-12 10:00:00 -07", "2023-11-12 10:39:43 -07")'::TSTZRANGE,
            tm.travel_mode_id
        FROM travel.travel_mode tm
        WHERE tm.travel_mode_name = 'foot'
    UNION
    SELECT _trip_id, 'To Seattle', 'Flight to Seattle',
            '["11/12/23 10:39:43-07", "2023-11-12 13:37:27.000 -07")'::TSTZRANGE,
            tm.travel_mode_id
        FROM travel.travel_mode tm
        WHERE tm.travel_mode_name = 'airplane'

    <add additional UNION queries to set up the remaining steps...>
    ;

END
$$
;

The above code is abbreviated to only show two of the travel.trip_step records. The detail of the other two steps from the shared .gpx file is shown in the following table. If you are following along you'll need to add the other two records to the query above to get the same results.

┌──────────────┬─────────┬────────────┬─────────────────────┬─────────────────────────────────────────────────────┐
│ trip_step_id │ trip_id │  leg_name  │      step_name      │                      timeframe                      │
╞══════════════╪═════════╪════════════╪═════════════════════╪═════════════════════════════════════════════════════╡
│            6 │       1 │ To Seattle │ At DIA              │ ["2023-11-12 10:00:00-07","2023-11-12 10:39:43-07") │
│            8 │       1 │ To Seattle │ Flight to Seattle   │ ["2023-11-12 10:39:43-07","2023-11-12 13:37:27-07") │
│            5 │       1 │ To Seattle │ SEATAC to Taxi      │ ["2023-11-12 13:37:27-07","2023-11-12 13:45:44-07") │
│            7 │       1 │ To Seattle │ Taxi to Grand Hyatt │ ["2023-11-12 13:45:44-07","2023-11-12 14:26:04-07") │
└──────────────┴─────────┴────────────┴─────────────────────┴─────────────────────────────────────────────────────┘

The way I run this step is to create a .sql script containing the commands, then use psql's handy \i command to run the .sql file. The following example illustrates this.

([local] 🐘) ryanlambert@travel_demo=# \i /data/badelf/Travel/Seattle-PASS-2023/trip--setup-pass-2023-records.sql
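
To make the role of timeframe concrete: matching a point to a step is just a range-containment join. The following is an illustrative sketch, not the procedure's actual code. A point whose timestamp falls outside every step's timeframe matches nothing, which is how data gets silently dropped.

-- Illustrative only: a point belongs to a step when the step's TSTZRANGE
-- contains the point's timestamp. The @> operator tests containment.
SELECT ts.trip_step_id, ts.step_name, COUNT(tp.ogc_fid) AS matched_points
    FROM travel.trip_step ts
    LEFT JOIN staging.track_points tp ON ts.timeframe @> tp."time"
    GROUP BY ts.trip_step_id, ts.step_name
;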

Run procedure to import

Running the travel.load_bad_elf_points() procedure cleans and imports the raw data from the staging tables into the tables in the travel schema. The following command runs the procedure for trip_id = 1.

CALL travel.load_bad_elf_points(1);

If everything has been set up properly, you should see output similar to the following.

NOTICE:  Ensure data is populated to travel.trip AND travel.trip_step before running this procedure.
NOTICE:  Loading Bad Elf data to points table for Trip ID: 1 (PASS 2023)
NOTICE:  table "trip_cleanup" does not exist, skipping
NOTICE:  table "dup_cleanup_ids" does not exist, skipping
NOTICE:  table "dup_cleanup" does not exist, skipping
NOTICE:  table "dup_remove_ids" does not exist, skipping
NOTICE:  Initial quality check passed...
NOTICE:  table "trip_cleanup_2" does not exist, skipping
NOTICE:  Timestamp gap analysis...
9378 rows
30 greater than 1 second gap
4 greater than 10 seconds gap
1 greater than 60 seconds gap
NOTICE:  Adding data to travel.trip_point
NOTICE:  Building line data and updating travel.trip_step geom.
NOTICE:  Import summary:
9378 points added
Data included covers 4 steps from trip_step. 
4 steps were updated with line geometry
CALL
Time: 2597.017 ms (00:02.597)

Data loaded

The next examples use my local database with four (4) trips loaded and a total of 15 trip steps. The travel.trip_point table now has nearly 100,000 rows, compared to the 9,378 points from the examples above.

SELECT t_name, rows
    FROM dd.tables
    WHERE s_name = 'travel'
    ORDER BY t_name
;

┌──────────────────────────┬───────┐
│          t_name          │ rows  │
╞══════════════════════════╪═══════╡
│ import_duplicate_cleanup │    51 │
│ travel_mode              │     4 │
│ trip                     │     4 │
│ trip_point               │ 96941 │
│ trip_step                │    15 │
└──────────────────────────┴───────┘

The following image gives an idea of the other travels I am tracking in my database so far.

Image of a map showing flights to/from Seattle, to Nebraska, and around Colorado.

Aggregated data

The records in travel.trip_step are updated during the import with aggregates from the detailed point data. The following query further aggregates those, grouping by travel_mode_name and sorting by highest max elevation. Air travel naturally was the highest, at nearly 38,000 feet. Motor travel came in second at 11,191 feet, which was from Eisenhower Tunnel. Note, my GPS recorded me driving 33 feet above the tunnel's actual 11,158 foot elevation.

SELECT tm.travel_mode_name, COUNT(DISTINCT trip_step_id) AS trip_step_count,
        SUM(ST_NPoints(ts.geom)) AS point_count,
        convert.speed_m_s_to_mph(AVG(ts.speed_avg)) AS speed_avg_mph,
        convert.speed_m_s_to_mph(AVG(ts.speed_avg_moving)) AS speed_avg_moving_mph,
        convert.dist_m_to_ft(MIN(ts.ele_min)) AS ele_min_ft,
        convert.dist_m_to_ft(MAX(ts.ele_max)) AS ele_max_ft
    FROM travel.trip_step ts
    INNER JOIN travel.travel_mode tm ON ts.travel_mode_id = tm.travel_mode_id
    GROUP BY tm.travel_mode_name
    ORDER BY ele_max_ft DESC
;

┌──────────────────┬─────────────────┬─────────────┬───────────────┬──────────────────────┬────────────┬────────────┐
│ travel_mode_name │ trip_step_count │ point_count │ speed_avg_mph │ speed_avg_moving_mph │ ele_min_ft │ ele_max_ft │
╞══════════════════╪═════════════════╪═════════════╪═══════════════╪══════════════════════╪════════════╪════════════╡
│ airplane         │               2 │        9705 │         265.6 │                283.0 │      -90.9 │    37621.4 │
│ motor            │               7 │       77326 │          40.6 │                 46.5 │       -5.2 │    11191.3 │
│ foot             │               5 │        8236 │           1.4 │                  1.5 │      -64.3 │     5859.9 │
│ lightrail        │               1 │        1674 │          26.6 │                 29.2 │      -29.2 │      582.0 │
└──────────────────┴─────────────────┴─────────────┴───────────────┴──────────────────────┴────────────┴────────────┘

Detailed data

The next query shows the same basic aggregates can be calculated using the full point data. The easiest way to use the point data is to query the travel.trip_point_detail view. Queries against the full point data will be slower than using the pre-aggregated data from travel.trip_step, though they give you more control when necessary.

SELECT travel_mode_name,
        COUNT(*) AS point_count,
        AVG(speed_mph) AS speed_mph_avg,
        AVG(speed_mph) FILTER (WHERE travel_mode_status <> 'Stopped') AS speed_moving_mph_avg,
        convert.dist_m_to_ft(AVG(ele)) AS ele_avg_ft,
        convert.dist_m_to_ft(MAX(ele)) AS ele_max_ft
    FROM travel.trip_point_detail
    GROUP BY travel_mode_name
    ORDER BY ele_max_ft DESC
;

┌──────────────────┬─────────────┬───────────────┬──────────────────────┬────────────┬────────────┐
│ travel_mode_name │ point_count │ speed_mph_avg │ speed_moving_mph_avg │ ele_avg_ft │ ele_max_ft │
╞══════════════════╪═════════════╪═══════════════╪══════════════════════╪════════════╪════════════╡
│ airplane         │        9705 │         271.9 │                  290 │    16088.0 │    37621.4 │
│ motor            │       77326 │          50.6 │                   58 │     5035.4 │    11191.3 │
│ foot             │        8236 │           1.3 │                    1 │     2538.9 │     5859.9 │
│ lightrail        │        1674 │          26.6 │                   29 │      103.1 │      582.0 │
└──────────────────┴─────────────┴───────────────┴──────────────────────┴────────────┴────────────┘

Errors in GPS data

When using real-world data you will run into real-world data issues. There are plenty of ways that errors manifest in GPS trace data. This is the character I referred to in the intro paragraph. With GPS data, you should expect a variety of subtle nuances within your data. These include:

- Duplicated points and timestamps
- Gaps in time between points, such as the 35 second gap during takeoff
- Position drift while stationary, like my apparent wandering on the tarmac while seated inside the terminal
- Elevation errors, like recording me driving 33 feet above the road at Eisenhower Tunnel

Data quality improvements

As mentioned earlier, the travel.load_bad_elf_points() procedure includes a significant amount of data cleanup. Feel free to look at those 260 lines of code for a more complete idea of the cleanup; I'll briefly explain a few highlights here.

The cleanup includes removing basic duplication on timestamps and will fail with a warning if duplication remains after cleanup. This avoids violations of uniqueness in subsequent steps. If that warning is encountered you'll need to add a manual cleanup step between loading with ogr2ogr and running the import procedure to resolve the duplicates.
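
A manual cleanup in staging could look something like the following sketch. The column names are those ogr2ogr creates; which duplicate to keep is a judgment call, and here the lowest ogc_fid wins.

-- Find timestamps recorded more than once in the staging data.
SELECT "time", COUNT(*) AS point_count
    FROM staging.track_points
    GROUP BY "time"
    HAVING COUNT(*) > 1
;

-- Remove duplicates, keeping the row with the lowest ogc_fid per timestamp.
DELETE FROM staging.track_points tp
    USING staging.track_points keep
    WHERE keep."time" = tp."time"
        AND keep.ogc_fid < tp.ogc_fid
;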

After the initial cleanup is completed, a few columns are calculated with lead/lag window functions. This is done on the geometry and timestamp columns, enabling calculations of rolling distances and rolling speeds. One use of these calculations is printing this output during the import.

NOTICE:  Timestamp gap analysis...
9378 rows
30 greater than 1 second gap
4 greater than 10 seconds gap
1 greater than 60 seconds gap
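
A minimal sketch of that lead/lag approach is below. It is not the procedure's exact code: the geometry column name (wkb_geometry) is what ogr2ogr creates by default, and since the data was loaded in EPSG:3857 the distances are only approximate unless you reproject or cast to geography.

-- Rolling distance, elapsed time, and speed from the previous point.
WITH ordered AS (
    SELECT ogc_fid, "time", wkb_geometry,
            LAG("time") OVER (ORDER BY "time") AS prior_time,
            LAG(wkb_geometry) OVER (ORDER BY "time") AS prior_geom
        FROM staging.track_points
)
SELECT ogc_fid,
        "time" - prior_time AS elapsed,
        ST_Distance(wkb_geometry, prior_geom) AS dist,
        ST_Distance(wkb_geometry, prior_geom)
            / NULLIF(EXTRACT(EPOCH FROM "time" - prior_time), 0) AS speed
    FROM ordered
;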

These lead/lag results are also used when calculating the travel_mode_status column. That calculation produces values such as Accelerating, Braking, Cruising, and so on.
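
As an illustration of how such a status could be derived, the hypothetical thresholds below compare each point's rolling speed to the prior point's. The rolling_speeds source and its speed / prior_speed columns are placeholders; the procedure's actual rules and cutoffs live in travel-db.

-- Hypothetical classification; the real thresholds are in the procedure.
SELECT ogc_fid,
        CASE WHEN speed < 0.5 THEN 'Stopped'
            WHEN speed > prior_speed * 1.10 THEN 'Accelerating'
            WHEN speed < prior_speed * 0.90 THEN 'Braking'
            ELSE 'Cruising'
            END AS travel_mode_status
    FROM rolling_speeds -- placeholder for the lead/lag results above
;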

Summary

This post has explored how I load GPS tracks from .gpx format into PostGIS, clean the data, and prepare it for use. There is plenty more I want to show about using this data, however this post has gotten long enough already. Until next time... 🐘

Need help with your PostgreSQL servers or databases? Contact us to start the conversation!

By Ryan Lambert
Published December 18, 2023
Last Updated December 18, 2023