RustProof Labs: blogging for education (logo)
My book Mastering PostGIS and OpenStreetMap is available!

Setup Geocoder with PostGIS and Tiger/LINE

By Ryan Lambert -- Published October 08, 2023

Geocoding addresses is the process of taking a street address and converting it to its location on a map. This post shows how to create a PostGIS geocoder using the U.S. Census Bureau's TIGER/Line data set. This is part one of a series of posts exploring geocoding addresses. The next post illustrates how to geocode in bulk with a focus on evaluating the accuracy of the resulting geometry data.

Before diving in, let's look at an example of geocoding. The address for Union Station (see on OpenStreetMap) is 1701 Wynkoop Street, Denver, CO, 80202. This address was the input to geocode. The blue point shown in the following screenshot is the resulting point from the PostGIS geocode() function. The pop-up dialog shows the address, a rating of 0, and the calculated distance away from the OpenStreetMap polygon representing that address (13 meters), shown in red under the pop-up dialog.

Screenshot showing the source polygon for Union Station next to the geocoded point using the street address and the PostGIS Geocoder function.

The 0 rating score is the best score, indicating it is likely to be accurate. The query used to generate the above image is below.

SELECT b.osm_id, b.address, gc.rating, 
        ST_Distance(ST_Transform(b.geom, 2773),
                ST_Transform(gc.geomout, 2773)
            )::INT AS distance,
        ST_Transform(gc.geomout, 3857) AS geom_geocoded,
        b.geom
    FROM osm.building_polygon b
    LEFT JOIN LATERAL geocode(b.address, 1) gc ON True
    WHERE b.osm_id = 25650822
;

Setting up PostGIS with TIGER/Line geocoding

My process to setup the geocoder is based on Michele Tobias' tutorial from 2017. The steps I show here mainly outline the changes made to run on Ubuntu 22.04 instead of Windows.

At the time of writing, the 2022 TIGER/Line data was the latest available. I check this URL for newer data by adjusting the year: https://www2.census.gov/geo/tiger/TIGER2022. The 2021 and 2022 data had dates in September of their respective years, so I expect the 2023 data will be available soon.

Create Role and Database

The next thing to do is create a Postgres role to use for the geocoder process. The resulting database and table will be owned by this role. The load process involves creating extensions, schemas, and tables, so granting the superuser role is an easy solution. My approach to these needs has been to create the role initially with superuser and login access. These privileges are later revoked when no longer necessary.

sudo su - postgres
psql

CREATE ROLE geocode_user
    WITH LOGIN SUPERUSER PASSWORD 'GeocodeUserPassword'
    ;

\q

Create the geocode2022 database and set the geocode_user to the owner.

psql -U geocode_user -d postgres \
    -c  "CREATE DATABASE geocode2022 WITH OWNER geocode_user ; "

Connect to the geocode2022 database as the geocode_user role.

psql -U geocode_user -d geocode2022

Create the required extensions.

CREATE EXTENSION postgis;
CREATE EXTENSION fuzzystrmatch;
CREATE EXTENSION postgis_tiger_geocoder;
CREATE EXTENSION address_standardizer;

Update loader platform

The next steps are to configure the details for the postgis_tiger_geocoder extension. This is done via the tiger.loader_platform and tiger.loader_variables tables. First, the loader_platform is modified for the os = 'sh' row. The PGBIN path should be updated if necessary, as well as all of the Postgres connection details.

UPDATE tiger.loader_platform
    SET declare_sect = 'TMPDIR="${staging_fold}/temp/"
        UNZIPTOOL=unzip
        WGETTOOL="/usr/bin/wget"
        export PGBIN=/usr/lib/postgresql/15/bin
        export PGPORT=5432
        export PGHOST=localhost
        export PGUSER=geocode_user
        export PGPASSWORD=GeocodeUserPassword
        export PGDATABASE=geocode2022
        PSQL=${PGBIN}/psql
        SHP2PGSQL=shp2pgsql
        cd ${staging_fold}
        '
    WHERE os = 'sh';

WARNING: The above example stores the geocode_user's password in plain text in the tiger.loader_platform table. This risk is mitigated in the cleanup steps below by revoking both superuser and login from this role.

Update the tiger.loader_variables table to set the year, the website path (including year) and the path to the staging folder.

UPDATE tiger.loader_variables
    SET tiger_year = '2022',
        website_root = 'https://www2.census.gov/geo/tiger/TIGER2022',
        staging_fold = '/gisdata'
;

The above command sets the staging_fold to the /gisdata path. The following commands create the folder and set ownership to the postgres role for the next steps.

sudo mkdir /gisdata
sudo chown postgres:postgres /gisdata

Prepare and run geocoder scripts

There are two geocoder scripts that need to be created and ran. Creating the scripts is done as the postgres Linux user.

sudo su - postgres
cd /gisdata

Create the nation script using the Loader_Generate_Nation_Script() function. The --no-psqlrc -tA avoids formatting issues in the generated shell script.

psql -c "SELECT Loader_Generate_Nation_Script('sh')" \
    -d geocode2022 \
    --no-psqlrc -tA > /gisdata/nation_script_load.sh

Take a look at the generated script. After all, you wouldn't just auto-run scripts as a superuser suggested from someone on the internet, right?

nano /gisdata/nation_script_load.sh

Set the execution flag on the generated script and run it. This takes less than 2 minutes on my laptop.

chmod +x nation_script_load.sh
./nation_script_load.sh

The script above populates the tiger_data.county_all and tiger_data.state_all tables. Verify data was loaded.

SELECT COUNT(*) FROM tiger_data.county_all;
SELECT COUNT(*) FROM tiger_data.state_all;

┌───────┐
│ count │
╞═══════╡
│  3235 │
└───────┘

┌───────┐
│ count │
╞═══════╡
│    56 │
└───────┘

The same basic process is used to load the state data as was for the nation data. This step takes about 40 minutes on my laptop.

psql -c "SELECT Loader_Generate_Script(ARRAY['CO'], 'sh')" \
    -d geocode2022 \
    --no-psqlrc -tA > /gisdata/co_script_load.sh

chmod +x co_script_load.sh
./co_script_load.sh

The above script populates 11 tables with the necessary data for Colorado.

Cleanup

Checking the geocode_user via \du shows the role is a superuser that can log in to Postgres.

([local] 🐘) ryanlambert@geocode2022=# \du geocode_user 
        List of roles
┌──────────────┬────────────┐
│  Role name   │ Attributes │
╞══════════════╪════════════╡
│ geocode_user │ Superuser  │
└──────────────┴────────────┘

The following query revokes both Superuser and login abilities from the geocode_user role.

ALTER ROLE geocode_user WITH NOSUPERUSER NOLOGIN;

Check \du again, the Superuser attribute is gone and the Cannot login attribute is displayed.

┌──────────────┬──────────────┐
│  Role name   │  Attributes  │
╞══════════════╪══════════════╡
│ geocode_user │ Cannot login │
└──────────────┴──────────────┘

Run the tiger.install_missing_indexes() to ensure all columns used in geocoding are properly indexed. Last, it won't hurt to ANALYZE the database.

SELECT tiger.install_missing_indexes();
ANALYZE;

Geocode

The following query geocodes the address for Union Station found in OpenStreetMap data from the osm.building_polygon table.

SELECT b.osm_id, b.address, gc.rating, 
        ST_Distance(ST_Transform(b.geom, 2773),
                ST_Transform(gc.geomout, 2773)
            )::INT AS distance,
        ST_Transform(gc.geomout, 3857) AS geom_geocoded,
        b.geom
    FROM osm.building_polygon b
    LEFT JOIN LATERAL geocode(b.address, 1) gc ON True
    WHERE b.osm_id = 25650822
;


┌──────────┬────────────────────────────────────────┬────────┬──────────┐
│  osm_id  │                address                 │ rating │ distance │
╞══════════╪════════════════════════════════════════╪════════╪══════════╡
│ 25650822 │ 1701 Wynkoop Street, Denver, CO, 80202 │      0 │       13 │
└──────────┴────────────────────────────────────────┴────────┴──────────┘

Screenshot showing the source polygon for Union Station next to the geocoded point using the street address and the PostGIS Geocoder function.

Why not use an API?

In a world of APIs, why go through all this work to setup your own geocoder? There's even an API from the U.S. Census Bureau if you specifically want that data source. There are quite a few reasons it is beneficial to stand up your own internal geocoding service. Some commonly encountered reasons:

Using a geocoder locally where your data is stored can avoid a lot of caveats. Location data is often considered sensitive, this gives you more assurances of how the data is (and isn't!) being used.

Summary

The instructions above created a geocode2022 database loaded with data to serve as a PostGIS TIGER/Line geocoder for Colorado. Continue reading with Geocode Addresses with PostGIS and Tiger/LINE to explore bulk geocoding and the accuracy of geocoding results.

Need help with your PostgreSQL servers or databases? Contact us to start the conversation!

By Ryan Lambert
Published October 08, 2023
Last Updated October 10, 2023