<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/"><channel><title>RustProof Labs blog planet-postgresql posts</title><link>https://blog.rustprooflabs.com/rss/planet-postgresql</link><description>RustProof Labs: blogging for knowledge is a technical blog focused on PostgreSQL, data, and other technical interests.</description><language>en-US</language><copyright>© RustProof Labs</copyright><lastBuildDate>Wed, 18 Mar 2026 13:11:38 GMT</lastBuildDate><generator>rfeed v1.1.1</generator><docs>https://github.com/svpino/rfeed/blob/master/README.md</docs><item><title>PostgreSQL: Integers, on-disk order, and performance</title><link>https://blog.rustprooflabs.com/2020/08/postgres-integer-index-performance</link><description>&lt;p&gt;This post examines how the on-disk order of integer data can influence performance
in PostgreSQL. When working with relational databases, you often do not need to
think about data storage, though there are times when these details can have a noticeable impact on your database's performance.&lt;/p&gt;
&lt;p&gt;This post uses PostgreSQL 12.3 on Ubuntu 18.04 on a DigitalOcean droplet with 4 CPU and 8 GB RAM, aka "Rig B" from &lt;a href="/2019/10/osm2pgsql-scaling"&gt;Scaling osm2pgsql: Process and costs&lt;/a&gt;.&lt;/p&gt;
</description><pubDate>Wed, 05 Aug 2020 05:01:00 GMT</pubDate><guid isPermaLink="true">https://blog.rustprooflabs.com/2020/08/postgres-integer-index-performance</guid></item><item><title>Updated for 2020: Load OpenStreetMap data to PostGIS</title><link>https://blog.rustprooflabs.com/2020/01/postgis-osm-load-2020</link><description>&lt;p&gt;I originally wrote about
how to &lt;a href="/2019/01/postgis-osm-load"&gt;load OpenStreetMap data into PostGIS&lt;/a&gt;
just under one year ago.
Between then and now, 364 days, a number of major changes have
occurred in the form of new versions of all the software involved.
Due to the combination of changes I decided to write an updated
version of the post.  After all, I was no longer able to copy/paste my own
post as part of my procedures!&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Update, January 2, 2021:  This post documents the original "pgsql" output of osm2pgsql.  I am no longer using this method for my OpenStreetMap data in PostGIS. The new &lt;a href="postgis-openstreetmap-flex-structure"&gt;osm2pgsql Flex output and PgOSM-Flex project&lt;/a&gt; provides a superior experience and improved final quality of data.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The goal of this post is to cover the typical machine and configuration
I use to load OpenStreetMap data into PostGIS with the latest versions.
The first part of the post covers how to set up and load
OpenStreetMap data to PostGIS.
The second portion explains some of the reasoning behind why I do
things this way. The latest versions of the software at this time are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;PostgreSQL 12&lt;/li&gt;
&lt;li&gt;PostGIS 3&lt;/li&gt;
&lt;li&gt;osm2pgsql 1.2&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote&gt;
&lt;p&gt;This post is part of the series &lt;a href="/2018/06/pg-series-toc"&gt;&lt;em&gt;PostgreSQL:  From Idea to Database&lt;/em&gt;&lt;/a&gt;. Also, see our recorded session &lt;a href="/2020/01/webinar-load-postgis-with-osm2pgsql"&gt;Load PostGIS with osm2pgsql&lt;/a&gt; for similar content in video format.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;Changes to software&lt;/h2&gt;
&lt;p&gt;PostgreSQL 12, PostGIS 3, and osm2pgsql 1.2 are all new releases
since the original post. The original used Postgres 11, PostGIS 2.5 and
osm2pgsql 0.94.  The version numbers may look like small increases, but the whole package is chock-full of improvements!&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;If you can upgrade to the latest and greatest, I recommend you do so!&lt;/p&gt;
&lt;/blockquote&gt;
</description><pubDate>Sat, 04 Jan 2020 05:01:00 GMT</pubDate><guid isPermaLink="true">https://blog.rustprooflabs.com/2020/01/postgis-osm-load-2020</guid></item><item><title>Hands on with osm2pgsql's new Flex output</title><link>https://blog.rustprooflabs.com/2020/12/osm2gpsql-flex-output-to-postgis</link><description>&lt;p&gt;The &lt;a href="https://github.com/openstreetmap/osm2pgsql"&gt;osm2pgsql project&lt;/a&gt; has seen quite a bit of development over the past couple of years.  This is a core piece of software
used by a large number of people to load OpenStreetMap data into
PostGIS / PostgreSQL databases, so it has been great to see the activity and improvements.  Recently, I was contacted by Jochen Topf to see
if I would give one of those (big!) improvements,
&lt;a href="https://blog.jochentopf.com/2020-05-10-new-flex-output-in-osm2pgsql.html"&gt;osm2pgsql's new Flex output&lt;/a&gt;,
a try.  While the flex output is still marked as "experimental", it is already
quite robust.  In fact, I have already started thinking of the typical pgsql
output I have used for nearly a decade as "the old output!"&lt;/p&gt;
&lt;p&gt;So what does this new Flex output do for us?  It gives us control over
the imported data's format, structure and quality. This process uses Lua
styles (scripts) to achieve powerful results.
The legacy pgsql output from osm2pgsql
gave you three (3) main tables with everything organized into points, lines
and polygons, solely by geometry type.  From a database design perspective
this would be like keeping product prices,
employee salaries and expense reports all in one table using the justification
"they all deal with money."  With the flex output we are no longer constrained by
this legacy design.  With that in mind, the rest of this post explores 
osm2pgsql's Flex output.&lt;/p&gt;
</description><pubDate>Thu, 10 Dec 2020 05:01:00 GMT</pubDate><guid isPermaLink="true">https://blog.rustprooflabs.com/2020/12/osm2gpsql-flex-output-to-postgis</guid></item><item><title>Large Text in PostgreSQL: Performance and Storage</title><link>https://blog.rustprooflabs.com/2020/07/postgres-storing-large-text</link><description>&lt;p&gt;Storing large amounts of data in a single cell in the database
has long been a point of discussion.  This topic has
surfaced in the form of design questions
in database projects I have been involved with over the years.
Often, it surfaces as a request to store images, PDFs, or other 
"non-relational" data directly in the database.
I was an advocate for storing files on the file system
for many, if not all, of those scenarios.&lt;/p&gt;
&lt;p&gt;Then, after years of working with PostGIS data
I had the realization that much of my vector data, which performs
so well when properly structured and queried, was larger and more complex
than many other blobs of data I had previously resisted storing.
Two years ago I made the decision to store images in a production database
using &lt;code&gt;BYTEA&lt;/code&gt;. We could guarantee a limited number of images
with a controlled maximum resolution (limiting size) and a specific use
case.  There was also the knowledge that caching the images in the frontend
would be an easy solution if performance started declining.
This system is approaching two years in production with great performance.
I am so glad the project has a singular data source:  PostgreSQL!&lt;/p&gt;
</description><pubDate>Sun, 05 Jul 2020 05:01:00 GMT</pubDate><guid isPermaLink="true">https://blog.rustprooflabs.com/2020/07/postgres-storing-large-text</guid></item><item><title>Find your local SRID in PostGIS</title><link>https://blog.rustprooflabs.com/2020/11/postgis-find-local-srid</link><description>&lt;p&gt;The past few weeks I had been tossing around some ideas that resulted in me
looking for a particular data set.  I needed to get the
&lt;a href="https://en.wikipedia.org/wiki/Minimum_bounding_rectangle"&gt;bounding boxes&lt;/a&gt;
for the most commonly used SRIDs
(&lt;a href="https://desktop.arcgis.com/en/arcmap/10.3/manage-data/using-sql-with-gdbs/what-is-an-srid.htm"&gt;&lt;strong&gt;S&lt;/strong&gt;patial &lt;strong&gt;R&lt;/strong&gt;eference &lt;strong&gt;ID&lt;/strong&gt;entifier&lt;/a&gt;)
in PostGIS to join with the
&lt;code&gt;public.spatial_ref_sys&lt;/code&gt; table.  My hope was to be able to use the data to quickly identify local
SRIDs for geometries spread across the U.S.  This data was needed to support
another idea where I want both accurate spatial calculations and the best possible 
performance when working with large OpenStreetMap data sets.&lt;/p&gt;
&lt;p&gt;The good news is that I now have the exact data I was looking for.  The unexpected bonus is that there is a much broader use case for this data
in providing an &lt;strong&gt;easy&lt;/strong&gt; way to find which SRIDs might be appropriate for a specific area!  &lt;/p&gt;
&lt;p&gt;This post explores this new data with an example of how to use it with pre-existing
spatial data.&lt;/p&gt;
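&lt;p&gt;Conceptually, finding a local SRID comes down to checking which SRID bounding boxes contain your geometry.  Here is a minimal Python sketch of that lookup; the SRIDs and bounding boxes are made-up approximate values for illustration, not the actual data set built in the post:&lt;/p&gt;

```python
# Hypothetical SRID -> (xmin, ymin, xmax, ymax) bounding boxes in degrees.
# These extents are rough illustrations, not authoritative values.
srid_bounds = {
    2773: (-106.2, 37.7, -104.0, 39.6),    # NAD83 / Colorado Central (approx.)
    32613: (-108.0, 0.0, -102.0, 84.0),    # WGS 84 / UTM zone 13N (approx.)
    3857: (-180.0, -85.1, 180.0, 85.1),    # Web Mercator (near-global)
}

def candidate_srids(lon, lat, bounds=srid_bounds):
    """Return SRIDs whose bounding box contains the (lon, lat) point."""
    matches = []
    for srid, (xmin, ymin, xmax, ymax) in bounds.items():
        if lon >= xmin and xmax >= lon and lat >= ymin and ymax >= lat:
            matches.append(srid)
    return sorted(matches)

print(candidate_srids(-105.0, 38.8))   # Colorado point matches all three
```

&lt;p&gt;In the database, the equivalent check becomes a spatial join between the bounding-box data and your existing geometries.&lt;/p&gt;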
</description><pubDate>Wed, 04 Nov 2020 05:01:00 GMT</pubDate><guid isPermaLink="true">https://blog.rustprooflabs.com/2020/11/postgis-find-local-srid</guid></item><item><title>PostgreSQL 13Beta3: B-Tree index deduplication</title><link>https://blog.rustprooflabs.com/2020/09/postgres-beta3-btree-dedup</link><description>&lt;p&gt;PostgreSQL 13 development is coming along nicely; Postgres 13 Beta3 was
&lt;a href="https://www.postgresql.org/about/news/2060/"&gt;released on 8/13/2020&lt;/a&gt;,
following the Beta 1 and Beta 2 releases in May and June 2020.
One of the features that has my interest in Postgres 13 is the B-Tree deduplication effort.  B-Tree indexes are the default indexing method
in Postgres, and are likely the most-used indexes in production 
environments.
Any improvements to this part of the database are likely to have wide-reaching benefits.
Removing duplication from indexes keeps their physical size smaller,
reduces I/O overhead, and should help keep &lt;code&gt;SELECT&lt;/code&gt; queries fast!&lt;/p&gt;
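&lt;p&gt;The core idea of B-Tree deduplication can be illustrated outside of Postgres: instead of one index entry per (key, row pointer) pair, duplicate keys collapse into a single entry with a "posting list" of row pointers.  A conceptual Python sketch of that idea (an illustration only, not Postgres' on-disk format):&lt;/p&gt;

```python
from collections import defaultdict

# One index entry per (key, row pointer) pair, as before deduplication.
entries = [("red", 1), ("red", 2), ("red", 3), ("blue", 4), ("blue", 5)]

# With deduplication: one entry per distinct key plus a posting list.
posting_lists = defaultdict(list)
for key, tid in entries:
    posting_lists[key].append(tid)

print(len(entries))         # 5 entries without deduplication
print(len(posting_lists))   # 2 entries with posting lists
```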
</description><pubDate>Sun, 06 Sep 2020 05:01:00 GMT</pubDate><guid isPermaLink="true">https://blog.rustprooflabs.com/2020/09/postgres-beta3-btree-dedup</guid></item><item><title>Postgres 13 Performance with OpenStreetMap data</title><link>https://blog.rustprooflabs.com/2020/10/postgres13-openstreetmap-performance</link><description>&lt;p&gt;With &lt;a href="https://www.postgresql.org/docs/13/release-13.html"&gt;Postgres 13 released&lt;/a&gt;
recently, a new season of testing has begun!
I recently wrote about the &lt;a href="/2020/09/postgres-beta3-btree-dedup"&gt;impact of BTree index deduplication&lt;/a&gt;, finding that the improvement is a serious win.
This post continues looking at Postgres 13 by examining performance
through a few steps of an OpenStreetMap data (PostGIS) workflow.&lt;/p&gt;
&lt;h2&gt;Reasons to upgrade&lt;/h2&gt;
&lt;p&gt;Performance appears to be a strong advantage of Postgres 13 over Postgres 12.
&lt;a href="https://www.enterprisedb.com/blog/whats-new-postgresql-13"&gt;Marc Linster wrote&lt;/a&gt; there's "not one headline-grabbing feature, but rather a wide variety of improvements along with updates to previously released features."  I am finding that to be an
appropriate description.
At this point I intend to upgrade our servers for the improved performance,
plus a few other cool benefits.&lt;/p&gt;
</description><pubDate>Mon, 19 Oct 2020 05:01:00 GMT</pubDate><guid isPermaLink="true">https://blog.rustprooflabs.com/2020/10/postgres13-openstreetmap-performance</guid></item><item><title>Use PostgreSQL file_fdw to Access External Data</title><link>https://blog.rustprooflabs.com/2020/03/postgresql-fdw-remote-file</link><description>&lt;p&gt;Loading external data is a core task for many database administrators.  With Postgres we have the
powerful option of using
&lt;a href="https://www.postgresql.org/docs/current/ddl-foreign-data.html"&gt;Foreign Data Wrappers&lt;/a&gt; (FDW)
to bring in external data sources.
FDWs allow us to access data that is external to the database by using SQL from within the database.
These external data sources can be in a number of formats, including
other relational databases (Postgres, Oracle, MySQL),
NoSQL sources (MongoDB, Redis), raw data files (csv, JSON) and
&lt;a href="https://wiki.postgresql.org/wiki/Foreign_data_wrappers"&gt;many more&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This post shows how to use
&lt;a href="https://www.postgresql.org/docs/current/file-fdw.html"&gt;&lt;code&gt;file_fdw&lt;/code&gt;&lt;/a&gt;
to load remote data from CSV files available from GitHub. Sharing data
via public source control tools like GitHub has become a common way
to make data sets widely available. Other public data is available
from various government and non-profit sites, so this is a handy tool to have available and reuse.&lt;/p&gt;
&lt;h2&gt;External data source&lt;/h2&gt;
&lt;p&gt;For this post I am using the
COVID-19 data Johns Hopkins University is curating.  See
&lt;a href="https://github.com/CSSEGISandData/COVID-19"&gt;the main GitHub page&lt;/a&gt; for full attribution
and meta-details about the data.</description><pubDate>Fri, 27 Mar 2020 05:01:00 GMT</pubDate><guid isPermaLink="true">https://blog.rustprooflabs.com/2020/03/postgresql-fdw-remote-file</guid></item><item><title>Pi 4 Performance:  PostgreSQL and PostGIS</title><link>https://blog.rustprooflabs.com/2020/03/pi4-performance-postgresql-postgis</link><description>&lt;p&gt;Happy Pi (π) Day!  I decided to celebrate with another post
looking at the Raspberry Pi 4's
performance running Postgres and PostGIS.
A few months ago I added another Raspberry Pi to my collection,
the new Model 4 with 4 GB RAM.  My
&lt;a href="/2019/08/postgresql-pgbench-raspberry-pi-4-initial"&gt;initial review&lt;/a&gt;
focused on &lt;code&gt;pgbench&lt;/code&gt; results comparing the Pi 4 against the previous
3B models as well as the even lower-powered Raspberry Pi Zero W.
This post continues testing the Raspberry Pi 4's performance with
PostgreSQL and PostGIS, this time with a look at a more suitable 
setup for production use cases.
The main differences are the use of an external SSD drive and
full-disk encryption.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Raspberry Pi Logo (Small)" src="/static/images/Raspi-PGB001-sm.png" /&gt;&lt;/p&gt;
&lt;h2&gt;Hardware and Configuration&lt;/h2&gt;
&lt;p&gt;The Raspberry Pi 4 is mounted in an enclosed CanaKit case
with a small fan in the top running on the 3.3 V rail (lower power),
powered by a dedicated 3.5A power supply.&lt;/p&gt;
</description><pubDate>Sat, 14 Mar 2020 05:01:00 GMT</pubDate><guid isPermaLink="true">https://blog.rustprooflabs.com/2020/03/pi4-performance-postgresql-postgis</guid></item><item><title>PostGIS Trajectory: Space plus Time</title><link>https://blog.rustprooflabs.com/2020/11/postgis-trajectory-intro</link><description>&lt;p&gt;A few months ago I started experimenting with a few project ideas
involving data over space and time.  Naturally, I want to use
Postgres and PostGIS as the main workhorse for these projects;
the challenge was working out exactly how to pull it all together.
After a couple of false starts I had to put those projects aside for other priorities.
In my free time I have continued working through some related reading on the topic.
I found
&lt;a href="https://anitagraser.com/2018/04/16/movement-data-in-gis-12-why-you-should-be-using-postgis-trajectories/"&gt;why you should be using PostGIS trajectories&lt;/a&gt;
by Anita Graser and recommend reading that before continuing with this post.
In fact, read &lt;a href="https://www.austriaca.at/0xc1aa5576_0x00390cbe"&gt;Evaluating Spatio-temporal Data
Models for Trajectories in PostGIS
Databases&lt;/a&gt;
while you're at it!  There is great information in those resources
with more links to other resources.&lt;/p&gt;
&lt;p&gt;This post outlines examples of how to use these new PostGIS trajectory tricks with
OpenStreetMap data I already have available
(&lt;a href="/2020/01/postgis-osm-load-2020"&gt;load&lt;/a&gt;
and
&lt;a href="https://github.com/rustprooflabs/pgosm"&gt;prepare&lt;/a&gt;
).
Often, trajectory examples assume using data collected from our
new age of IoT sensors sending GPS points and timestamps.  This example approaches
trajectories from a data modeling perspective instead, showing how to synthesize trajectory data using &lt;code&gt;pgrouting&lt;/code&gt;.
Visualization of data is a critical component of sharing information, and
&lt;a href="https://qgis.org"&gt;QGIS&lt;/a&gt;
has long been my favorite GIS application to use with PostGIS data.&lt;/p&gt;
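&lt;p&gt;A trajectory is simply a geometry where each point also carries a timestamp.  As a conceptual sketch (made-up coordinates and times, plain Python rather than PostGIS' LINESTRING M representation), interpolating a position at an arbitrary time looks like:&lt;/p&gt;

```python
# Trajectory as (x, y, t) tuples; values are illustrative only.
trajectory = [
    (0.0, 0.0, 0),
    (10.0, 0.0, 100),
    (10.0, 5.0, 150),
]

def position_at(traj, t):
    """Linearly interpolate the (x, y) position at time t."""
    for (x1, y1, t1), (x2, y2, t2) in zip(traj, traj[1:]):
        if t >= t1 and t2 >= t:
            frac = (t - t1) / (t2 - t1)
            return (x1 + frac * (x2 - x1), y1 + frac * (y2 - y1))
    raise ValueError("time is outside the trajectory")

print(position_at(trajectory, 125))   # (10.0, 2.5)
```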
</description><pubDate>Sun, 29 Nov 2020 05:01:00 GMT</pubDate><guid isPermaLink="true">https://blog.rustprooflabs.com/2020/11/postgis-trajectory-intro</guid></item><item><title>PgDD extension moves to pgrx</title><link>https://blog.rustprooflabs.com/2021/10/pgdd-extension-using-pgx-rust</link><description>&lt;p&gt;Our data dictionary extension, &lt;a href="https://github.com/rustprooflabs/pgdd"&gt;PgDD&lt;/a&gt;,
has been re-written using the &lt;a href="https://github.com/pgcentralfoundation/pgrx"&gt;pgrx framework&lt;/a&gt; in Rust!
At this time I have &lt;a href="https://github.com/rustprooflabs/pgdd/releases/tag/0.4.0.rc3"&gt;tagged &lt;code&gt;0.4.0.rc3&lt;/code&gt;&lt;/a&gt; and just need to do a bit more testing before
the official &lt;code&gt;0.4.0&lt;/code&gt; release.
While I am excited for the news for PgDD, what is more exciting is the
pgrx framework and the ease it brings to developing Postgres extensions!
Getting started with pgrx is straightforward and using &lt;code&gt;cargo pgrx run&lt;/code&gt; makes it
simple to build your extension against multiple versions of Postgres.&lt;/p&gt;
&lt;p&gt;This post outlines how I came to the decision to use pgrx for Postgres extension development.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Note:  pgrx was originally named pgx.  This post has been updated to reflect its current name.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;Progression of PgDD&lt;/h2&gt;
&lt;p&gt;Before now, PgDD was a &lt;a href="/2019/11/pgdd-now-postgresql-extension"&gt;raw SQL extension&lt;/a&gt;, with that version being an evolution from prior iterations.
Shortly after I converted PgDD to a raw SQL extension I wanted it to do more,
specifically related to supporting newer features such as
&lt;a href="/2019/12/postgres12-generated-columns-postgis"&gt;generated columns&lt;/a&gt;
and
&lt;a href="/2021/02/postgres-postgis-why-partition-openstreetmap"&gt;native partitioning&lt;/a&gt;.
Supporting new features in new versions of Postgres is a good idea, but I couldn't
drop support for older versions at that time either.
Using generated columns as an example, the feature was added in Postgres 12 and
came along with an update to the &lt;a href="https://www.postgresql.org/docs/current/catalog-pg-attribute.html"&gt;&lt;code&gt;pg_catalog.pg_attribute&lt;/code&gt;&lt;/a&gt; system
catalog.  In Pg12 and newer, &lt;code&gt;pg_attribute&lt;/code&gt;
has a column named &lt;code&gt;attgenerated&lt;/code&gt; while earlier versions of Postgres do not have
that column.&lt;/p&gt;
</description><pubDate>Fri, 08 Oct 2021 05:01:00 GMT</pubDate><guid isPermaLink="true">https://blog.rustprooflabs.com/2021/10/pgdd-extension-using-pgx-rust</guid></item><item><title>Timescale, Compression and OpenStreetMap Tags</title><link>https://blog.rustprooflabs.com/2021/08/timescale-compression-openstreetmap-tags</link><description>&lt;p&gt;This post captures my initial exploration with the Timescale DB extension in Postgres.
I have watched Timescale with interest for quite some time but had not really
experimented with it before now. I am considering Timescale as another
solid option for improving my long-term storage of OpenStreetMap data snapshots.
Naturally, I am using PostGIS enabled databases filled with OpenStreetMap data.&lt;/p&gt;
&lt;p&gt;I started looking at restructuring our OpenStreetMap data with my post
&lt;a href="/2021/02/postgres-postgis-why-partition-openstreetmap"&gt;Why Partition OpenStreetMap data?&lt;/a&gt;
That post has an overview of the historic use case I need to support.
While my &lt;a href="/2021/02/postgres-partition-openstreetmap-road-v1-review"&gt;1st attempt at declarative partitioning&lt;/a&gt;
ran into a snag,
my &lt;a href="/2021/02/postgres-partition-openstreetmap-road-v2-review"&gt;2nd attempt&lt;/a&gt;
worked rather well.  This post looks beyond my initial requirements for the project
and establishes additional benefits from adding Timescale into our databases.&lt;/p&gt;
&lt;h2&gt;Timescale benefits&lt;/h2&gt;
&lt;p&gt;There are two main reasons I am looking into Timescale as an option over 
Postgres' built-in declarative partitioning:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;No need to manually create partitions&lt;/li&gt;
&lt;li&gt;Compression is tempting&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;New partitions with Postgres' declarative partitioning must be created manually.
The syntax isn't terribly tricky and the process can be automated, but the step still
exists and therefore still needs to be managed.
When using Timescale's hypertables, new partitions are handled behind
the scenes without my direct intervention.
The other temptation from Timescale is their columnar-style compression
on row-based data.  In standard Postgres, the only time compression kicks in is
at the row level when a single row will exceed a specified size (default 2kb).
See my post on &lt;a href="/2020/07/postgres-storing-large-text"&gt;large text data in Postgres&lt;/a&gt;
that discusses compression in Postgres.
Timescale has been
&lt;a href="https://blog.timescale.com/blog/building-columnar-compression-in-a-row-oriented-database/"&gt;writing&lt;/a&gt;
about
&lt;a href="https://blog.timescale.com/blog/time-series-compression-algorithms-explained/"&gt;their compression&lt;/a&gt;
so I figured it was time to give it a go.
While compression wasn't one of the
original goals I &lt;a href="/2021/02/postgres-postgis-why-partition-openstreetmap"&gt;had outlined&lt;/a&gt;...
it would be nice!!&lt;/p&gt;
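&lt;p&gt;The "no manual partitions" benefit boils down to chunks being derived from the data itself.  A conceptual sketch of time-based chunk assignment follows; the chunk interval and epoch are made-up values, and Timescale manages all of this internally once &lt;code&gt;create_hypertable()&lt;/code&gt; is called:&lt;/p&gt;

```python
from datetime import datetime, timedelta

# Hypothetical chunking parameters, for illustration only.
CHUNK_INTERVAL = timedelta(days=7)
EPOCH = datetime(2021, 1, 1)

def chunk_start(ts, epoch=EPOCH, interval=CHUNK_INTERVAL):
    """Return the start of the chunk a timestamp falls into."""
    n = (ts - epoch) // interval
    return epoch + n * interval

print(chunk_start(datetime(2021, 1, 9)))    # 2021-01-08 00:00:00
```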
</description><pubDate>Fri, 20 Aug 2021 05:01:00 GMT</pubDate><guid isPermaLink="true">https://blog.rustprooflabs.com/2021/08/timescale-compression-openstreetmap-tags</guid></item><item><title>Better OpenStreetMap places in PostGIS</title><link>https://blog.rustprooflabs.com/2021/01/pgosm-flex-improved-openstreetmap-places-postgis</link><description>&lt;p&gt;Data quality is important.  This post continues exploring the
improved quality of OpenStreetMap data loaded to Postgres/PostGIS
&lt;a href="https://github.com/rustprooflabs/pgosm-flex"&gt;via PgOSM-Flex&lt;/a&gt;.
These improvements are enabled by the
&lt;a href="https://osm2pgsql.org/doc/manual.html#the-flex-output"&gt;new flex output of osm2pgsql&lt;/a&gt;,
making it easier to understand and consume OpenStreetMap data for
analytic purposes.&lt;/p&gt;
&lt;p&gt;I &lt;a href="/2020/12/osm2gpsql-flex-output-to-postgis"&gt;started exploring the Flex output&lt;/a&gt;
a few weeks ago, and
the &lt;a href="/2021/01/postgis-openstreetmap-flex-structure"&gt;post before this one used PgOSM-Flex v0.0.3&lt;/a&gt;.
This post uses PgOSM-Flex v0.0.7 and highlights a few cool improvements
by exploring the OSM place data.  Some of the improvements made over the past
few weeks were ideas brought over from the legacy PgOSM project.
Other improvements were spurred by questions and conversations with the
community, such as the
&lt;a href="https://github.com/rustprooflabs/pgosm-flex/issues/37"&gt;nested admin polygons&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Improved places&lt;/h2&gt;
&lt;p&gt;This post focuses on the &lt;code&gt;osm.place_polygon&lt;/code&gt; data that stores things like
city, county and country boundaries, along with neighborhoods and other details.
The format of place data has a number of improvements covered in this post:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Consolidated name&lt;/li&gt;
&lt;li&gt;Remove duplication between relation/member polygons&lt;/li&gt;
&lt;li&gt;Boundary hierarchy&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The data loaded for this post is the &lt;a href="https://download.geofabrik.de/north-america/us-west.html"&gt;U.S. West sub-region&lt;/a&gt; from
Geofabrik.  It was loaded using the &lt;code&gt;run-all.lua&lt;/code&gt; and &lt;code&gt;run-all.sql&lt;/code&gt; scripts
in
&lt;a href="https://github.com/rustprooflabs/pgosm-flex"&gt;PgOSM-Flex&lt;/a&gt;.&lt;/p&gt;
</description><pubDate>Sat, 23 Jan 2021 05:01:00 GMT</pubDate><guid isPermaLink="true">https://blog.rustprooflabs.com/2021/01/pgosm-flex-improved-openstreetmap-places-postgis</guid></item><item><title>Partition OpenStreetMap data in PostGIS</title><link>https://blog.rustprooflabs.com/2021/02/postgres-postgis-partition-openstreetmap-road</link><description>&lt;p&gt;This post continues my quest to explore Postgres native 
partitioning and determine if it is a good fit for my OpenStreetMap data in PostGIS.
I show how I am planning to implement a partitioning
scheme in a way that &lt;strong&gt;a)&lt;/strong&gt; works well for my use case, and &lt;strong&gt;b)&lt;/strong&gt; is easy
to implement and maintain.&lt;br /&gt;
My previous post covered why
&lt;a href="/2021/02/postgres-postgis-why-partition-openstreetmap"&gt;I think partitioning will be a benefit&lt;/a&gt;
in our databases.&lt;/p&gt;
&lt;p&gt;The following steps are the result of a few iterations of ideas,
including &lt;a href="https://github.com/openstreetmap/osm2pgsql/discussions/1415"&gt;asking the osm2pgsql team&lt;/a&gt;
about one idea I had. Ultimately, I think it is good
that osm2pgsql will not support the idea I had asked about there. It forced me
to rethink my approach and end up at a better solution.
The reality is that partitioning only makes sense
if the partitioning scheme supports the end use, and end uses are
quite varied. Trying to automate
partitioning directly in the PgOSM-Flex project would have greatly increased
costs of maintaining that project, and likely wasted a ton of time.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The &lt;a href="/2021/02/postgres-partition-openstreetmap-road-v1-review"&gt;post after this one&lt;/a&gt;
shows that the plan outlined in this post is not perfect, though it shows great promise.
There will be at least one more post
to outline how everything works out!&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;Prepare for partitioning&lt;/h2&gt;
&lt;p&gt;I am using OpenStreetMap data loaded by
&lt;a href="https://github.com/rustprooflabs/pgosm-flex"&gt;PgOSM-Flex&lt;/a&gt;,
testing with the &lt;code&gt;osm.road_line&lt;/code&gt; table
(see more &lt;a href="/category/pgosm-flex"&gt;posts on PgOSM-Flex&lt;/a&gt;). Eventually I plan
to partition a few tables from the imported data, but to start with I am working
only with the roads.</description><pubDate>Tue, 16 Feb 2021 05:01:00 GMT</pubDate><guid isPermaLink="true">https://blog.rustprooflabs.com/2021/02/postgres-postgis-partition-openstreetmap-road</guid></item><item><title>First Review of Partitioning OpenStreetMap</title><link>https://blog.rustprooflabs.com/2021/02/postgres-partition-openstreetmap-road-v1-review</link><description>&lt;p&gt;My previous two posts set the stage to evaluate declarative Postgres partitioning
for OpenStreetMap data.
This post outlines what I found when I tested my plan and outlines my next steps.
The goal with this series is to determine if partitioning
is a path worth going down, or if the additional complexity outweighs any benefits.
The first post on partitioning outlined &lt;a href="/2021/02/postgres-postgis-why-partition-openstreetmap"&gt;my use case and &lt;strong&gt;why&lt;/strong&gt; I thought&lt;/a&gt;
partitioning would be a potential benefit. The maintenance aspects of
partitioning are my #1 hope for improvement, with easy and fast loading and removal
of entire data sets being a big deal for me.&lt;/p&gt;
&lt;p&gt;The second post
&lt;a href="/2021/02/postgres-postgis-partition-openstreetmap-road"&gt;detailed my approach to partitioning&lt;/a&gt;
to allow me to partition based on date and region. In that post I
even bragged that a clever workaround was a suitable solution.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;"No big deal, creating the &lt;code&gt;osmp.pgosm_flex_partition&lt;/code&gt; table gives each &lt;code&gt;osm_date&lt;/code&gt; + &lt;code&gt;region&lt;/code&gt; a single ID to use to define list partitions."
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;-- &lt;em&gt;Arrogant Me&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Read on to see where that assumption fell apart and my planned next steps.&lt;/p&gt;
&lt;/blockquote&gt;
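&lt;p&gt;The mapping described in that quote can be sketched in a few lines.  The table and column names follow the post; the Python itself is only an illustration of the lookup, not the actual SQL implementation:&lt;/p&gt;

```python
# Mimic the idea behind osmp.pgosm_flex_partition: each (osm_date, region)
# pair gets one integer ID that list partitions can be defined against.
partitions = {}

def partition_id(osm_date, region):
    """Assign, or look up, the single ID for an osm_date + region pair."""
    key = (osm_date, region)
    if key not in partitions:
        partitions[key] = len(partitions) + 1
    return partitions[key]

print(partition_id("2021-01-01", "colorado"))        # 1
print(partition_id("2021-01-01", "north-america"))   # 2
print(partition_id("2021-01-01", "colorado"))        # 1, same pair reuses its ID
```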
&lt;p&gt;I was hoping to have made a "Go / No-Go" decision by this point... I am currently
at a solid "Probably!"&lt;/p&gt;
&lt;h2&gt;Load data&lt;/h2&gt;
&lt;p&gt;For testing I simulated Colorado data being loaded once per
month on the 1st of each month and North America once per year on January 1.
This was conceptually easier to implement and test than trying to capture
exactly what I described in my initial post.
This approach resulted in 17 snapshots of OpenStreetMap being loaded,
15 with Colorado and 2 with North America.
I loaded this data twice, once using the planned partitioned setup and
the other using a simple stacked table to compare performance against.&lt;/p&gt;
</description><pubDate>Sun, 21 Feb 2021 05:01:00 GMT</pubDate><guid isPermaLink="true">https://blog.rustprooflabs.com/2021/02/postgres-partition-openstreetmap-road-v1-review</guid></item><item><title>Identify OpenStreetMap changes with Postgres</title><link>https://blog.rustprooflabs.com/2021/08/postgres-openstreetmap-changes-over-time-json</link><description>&lt;p&gt;The data in the main OpenStreetMap database is constantly changing.
Folks around the world are almost certainly saving changes via
&lt;a href="https://wiki.openstreetmap.org/wiki/JOSM"&gt;JOSM&lt;/a&gt;, &lt;a href="https://wiki.openstreetmap.org/wiki/ID"&gt;iD&lt;/a&gt;,
and other editors as you read these words.
With change constantly occurring in the data, it is often desirable
to know what has actually changed.
This post explores one approach to tracking changes to
the &lt;a href="https://wiki.openstreetmap.org/wiki/Tags"&gt;tags&lt;/a&gt; attribute
data once it has been loaded to Postgres.&lt;/p&gt;
&lt;p&gt;The topic of this post surfaced while I was working on refreshing
a project involving travel times (routing).  In the process I noticed a
few instances where the analysis had shifted significantly.
My first hunch was that entire segments of road had been
added or removed, but that was not
the cause.  It became apparent that tags in the area had been
improved. It was easy to specifically point to the value associated
with the &lt;code&gt;highway&lt;/code&gt; key, but I also knew there were other changes
happening; I just wasn't sure what all was involved and at what scale.&lt;/p&gt;
&lt;h2&gt;Calculate tag hash&lt;/h2&gt;
&lt;p&gt;The database I am working in has five (5) Colorado
snapshots loaded spanning back to 2018.  The tags data is loaded
to a table named &lt;code&gt;osmts.tags&lt;/code&gt;, read my post
&lt;a href="/2021/08/timescale-compression-openstreetmap-tags"&gt;Timescale, Compression and OpenStreetMap Tags&lt;/a&gt;
for how this table was created.  The &lt;code&gt;tags&lt;/code&gt; table has one row for
every OpenStreetMap feature and stores the full key/value attribute data
in a JSONB column (&lt;code&gt;osmts.tags.tags&lt;/code&gt;).
A relatively simple way to detect change in data is to
calculate a hash for each feature's key/value data.  Comparing
hashes for any change will identify rows that had changes to their
attribute data.&lt;/p&gt;
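&lt;p&gt;The hash-per-feature idea can be sketched as follows.  The choice of &lt;code&gt;md5&lt;/code&gt; over canonical JSON here is an illustrative one, not necessarily the exact expression used against the JSONB column in the post:&lt;/p&gt;

```python
import hashlib
import json

def tag_hash(tags):
    """Deterministic hash of a tags dict; sorted keys keep output stable."""
    canonical = json.dumps(tags, sort_keys=True)
    return hashlib.md5(canonical.encode()).hexdigest()

before = {"highway": "residential", "name": "Main St"}
after = {"name": "Main St", "highway": "tertiary"}

# Key order does not matter, only the key/value content does.
print(tag_hash(before) == tag_hash({"name": "Main St", "highway": "residential"}))  # True
print(tag_hash(before) == tag_hash(after))   # False, the highway value changed
```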
</description><pubDate>Mon, 30 Aug 2021 05:01:00 GMT</pubDate><guid isPermaLink="true">https://blog.rustprooflabs.com/2021/08/postgres-openstreetmap-changes-over-time-json</guid></item><item><title>Psycopg3 Initial Review</title><link>https://blog.rustprooflabs.com/2021/09/psycopg3-initial-review</link><description>&lt;p&gt;If you use Postgres and Python together you are almost certainly
familiar with psycopg2.
Daniele Varrazzo has been the maintainer of the psycopg project for
many years.
In 2020 Daniele started working full-time on creating psycopg3,
the successor to psycopg2. Recently, the
&lt;a href="https://www.psycopg.org/articles/2021/08/30/psycopg-30-beta1-released/"&gt;Beta 1 release of psycopg3&lt;/a&gt;
was made &lt;a href="https://www.psycopg.org/psycopg3/docs/basic/install.html"&gt;available via PyPI install&lt;/a&gt;.
This post highlights two pieces of happy news with psycopg3:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Migration is easy&lt;/li&gt;
&lt;li&gt;The connection pool rocks&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;As the first section shows, migration from psycopg2 to psycopg3 is quite easy.
The majority of this post is dedicated to examining psycopg3's connection pool
and the difference this feature can make to your application's performance.&lt;/p&gt;
&lt;h2&gt;Migration&lt;/h2&gt;
&lt;p&gt;Easy migration is an important feature to encourage developers to upgrade.
It is frustrating when a "simple upgrade" turns into a cascade
of error after error throughout your application.
Luckily for us, psycopg3 got this part right!  In the past week I fully migrated
two projects to psycopg3 and started migrating two more. So far
the friction has been very low and confined to edge cases.&lt;/p&gt;
&lt;p&gt;The following is a simplified example of how my projects have
used &lt;code&gt;psycopg2&lt;/code&gt;.&lt;/p&gt;
</description><pubDate>Tue, 07 Sep 2021 05:01:00 GMT</pubDate><guid isPermaLink="true">https://blog.rustprooflabs.com/2021/09/psycopg3-initial-review</guid></item><item><title>Find missing crossings in OpenStreetMap with PostGIS</title><link>https://blog.rustprooflabs.com/2021/11/postgis-find-openstreetmap-missing-crossing</link><description>&lt;p&gt;The &lt;a href="https://www.gislounge.com/november-map-challenge/"&gt;#30DayMapChallenge&lt;/a&gt;
is going on again this November.  Each day of the month has a different theme
for that day's map challenge.  These challenges do not have a requirement
for technology, so naturally I am using
&lt;a href="https://www.openstreetmap.org/about"&gt;OpenStreetMap data&lt;/a&gt; stored in
&lt;a href="https://postgis.net/"&gt;PostGIS&lt;/a&gt; with
&lt;a href="https://www.qgis.org/"&gt;QGIS&lt;/a&gt; for
the visualization component.&lt;/p&gt;
&lt;p&gt;The challenge for Day 5 was an OpenStreetMap data challenge.
I decided to find and visualize missing
&lt;a href="https://wiki.openstreetmap.org/wiki/Key:crossing"&gt;crossing tags&lt;/a&gt;.
Crossing tags are added to the node (point) where a pedestrian
highway (e.g. &lt;code&gt;highway=footway&lt;/code&gt;) intersects a motorized highway
(e.g. &lt;code&gt;highway=tertiary&lt;/code&gt;).
This post explains how I used PostGIS and OpenStreetMap data to
find intersections missing a dedicated crossing tag.&lt;/p&gt;
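&lt;p&gt;A minimal sketch of this type of query finds the intersection points between
pedestrian and motorized highways; the real check then excludes points that already
have a &lt;code&gt;crossing&lt;/code&gt; tag.  Table and column names here follow PgOSM-Flex
conventions but are illustrative, not the exact queries behind the map.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;SELECT ST_Intersection(f.geom, r.geom) AS geom
    FROM osm.road_line f
    INNER JOIN osm.road_line r
        ON ST_Intersects(f.geom, r.geom)
    WHERE f.osm_type = 'footway'
        AND r.osm_type IN ('tertiary', 'secondary', 'residential')
;
&lt;/code&gt;&lt;/pre&gt;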
&lt;p&gt;Without further ado, here was my submission for Day 5.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Map of the Denver, Colorado metro area with a shaded hex grid overlay. Title reads &amp;quot;% of Footway Intersections missing Crossing&amp;quot;.  Subtitles read &amp;quot;Denver Metro area, November 2021&amp;quot; and &amp;quot;#30DayMapChallenge - 2021 Day 5: OpenStreetMap&amp;quot;. The hex grid is shaded from light red to dark red (5 gradients), with only 4 of the lightest shaded areas around Denver proper.  Throughout the rest of the inner-metro area are shades 2-4 (35% through 94% missing) with most of the outer regions in the 100% or &amp;quot;no data&amp;quot; area." src="/static/images/30daychallenge-day5-missing-crossings-denver.png" /&gt;&lt;/p&gt;
</description><pubDate>Sun, 07 Nov 2021 05:01:00 GMT</pubDate><guid isPermaLink="true">https://blog.rustprooflabs.com/2021/11/postgis-find-openstreetmap-missing-crossing</guid></item><item><title>Round Two: Partitioning OpenStreetMap</title><link>https://blog.rustprooflabs.com/2021/02/postgres-partition-openstreetmap-road-v2-review</link><description>&lt;p&gt;A few weeks ago I decided to seriously consider Postgres' declarative table partitioning
for our OpenStreetMap data. Once the decision was made to investigate this option, I
&lt;a href="/2021/02/postgres-postgis-why-partition-openstreetmap"&gt;outlined our use case&lt;/a&gt;
with requirements to keep multiple versions of OpenStreetMap data over time.
That process helped draft &lt;a href="/2021/02/postgres-postgis-partition-openstreetmap-road"&gt;my initial plan&lt;/a&gt;
for how to create and manage the partitioned data.
When I put the initial code to the test
&lt;a href="/2021/02/postgres-partition-openstreetmap-road-v1-review"&gt;I found a snag and adjusted the plan&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This post shows a working example of how to partition OpenStreetMap data
loaded using &lt;a href="/2021/01/postgis-openstreetmap-flex-structure"&gt;PgOSM-Flex&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;TLDR&lt;/h2&gt;
&lt;blockquote&gt;
&lt;p&gt;Spoiler alert!&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It works, I love it! I am moving forward
with the plan outlined in this post.  Some highlights from testing with
Colorado sized data:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Bulk import generates 17% less WAL&lt;/li&gt;
&lt;li&gt;Bulk delete generates 99.8% less WAL&lt;/li&gt;
&lt;li&gt;Simple aggregate query runs 75% faster&lt;/li&gt;
&lt;/ul&gt;
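&lt;p&gt;To illustrate the general shape of the approach (a simplified sketch, not the
exact PgOSM-Flex structure), declarative partitioning on the snapshot date looks
like this.  Dropping an old snapshot then becomes a &lt;code&gt;DROP TABLE&lt;/code&gt; on one
partition, which is where the huge WAL savings on bulk delete come from.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;CREATE TABLE osmp.road_line
(
    osm_date DATE NOT NULL,
    osm_id BIGINT NOT NULL,
    geom GEOMETRY NOT NULL
)
    PARTITION BY LIST (osm_date)
;

CREATE TABLE osmp.road_line_20210226
    PARTITION OF osmp.road_line
    FOR VALUES IN ('2021-02-26')
;
&lt;/code&gt;&lt;/pre&gt;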
</description><pubDate>Fri, 26 Feb 2021 05:01:00 GMT</pubDate><guid isPermaLink="true">https://blog.rustprooflabs.com/2021/02/postgres-partition-openstreetmap-road-v2-review</guid></item><item><title>Load and query Pi-hole data from Postgres</title><link>https://blog.rustprooflabs.com/2021/02/postgresql-sqlite-fdw-pihole</link><description>&lt;p&gt;I have used &lt;a href="https://pi-hole.net/"&gt;Pi-hole&lt;/a&gt; on our local network for a
few years now.  It is running on a dedicated Raspberry Pi 3B attached
to the router (Netgear Nighthawk) to provide fast local DNS/DHCP while
blocking ads at the network level.
The built-in Pi-hole web interface allows for some basic querying/reporting
of the collected data, but it's a bit limited and quite slow as the data
grows over time. My current &lt;code&gt;pihole-FTL.db&lt;/code&gt; database is 1.4 GB and contains 12 months of data.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ ls -alh /etc/pihole/pihole-FTL.db
-rw-r--r--  1 pihole pihole 1.4G Jan 31 14:04 pihole-FTL.db
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Pi-hole saves its data in a few &lt;a href="https://docs.pi-hole.net/database/"&gt;SQLite databases&lt;/a&gt; with the &lt;a href="https://docs.pi-hole.net/database/ftl/"&gt;FTL database&lt;/a&gt;
(Faster Than Light) being the most interesting.
While I could try to work with the data directly in SQLite, I strongly prefer
Postgres and decided this was a great time to give the
&lt;a href="https://github.com/pgspider/sqlite_fdw"&gt;pgspider/sqlite_fdw&lt;/a&gt; extension
a try.  This post goes over the steps I took to bring Pi-hole data into
Postgres from its SQLite data source.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;See &lt;a href="/2020/03/postgresql-fdw-remote-file"&gt;my previous post on using &lt;code&gt;file_fdw&lt;/code&gt;&lt;/a&gt; for more about Postgres' Foreign Data Wrappers.&lt;/p&gt;
&lt;/blockquote&gt;
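&lt;p&gt;The general shape of a foreign data wrapper setup is consistent across FDWs.
A sketch of what this looks like with &lt;code&gt;sqlite_fdw&lt;/code&gt;, assuming the Pi-hole
database file has been copied somewhere the Postgres server can read it (the
server name and paths here are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;CREATE EXTENSION sqlite_fdw;

CREATE SERVER pihole_server
    FOREIGN DATA WRAPPER sqlite_fdw
    OPTIONS (database '/tmp/pihole-FTL.db')
;

CREATE SCHEMA pihole;

IMPORT FOREIGN SCHEMA public
    FROM SERVER pihole_server
    INTO pihole
;
&lt;/code&gt;&lt;/pre&gt;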
</description><pubDate>Mon, 01 Feb 2021 05:01:00 GMT</pubDate><guid isPermaLink="true">https://blog.rustprooflabs.com/2021/02/postgresql-sqlite-fdw-pihole</guid></item><item><title>Postgres Permissions and Materialized Views</title><link>https://blog.rustprooflabs.com/2021/07/postgres-permission-mat-view</link><description>&lt;p&gt;Materialized views in Postgres are a handy way to persist the result
of a query to disk. This is helpful when the underlying query
is expensive and slow, yet high-performance &lt;code&gt;SELECT&lt;/code&gt; queries are required.
When using &lt;a href="https://www.postgresql.org/docs/current/rules-materializedviews.html"&gt;materialized views&lt;/a&gt;
they need to be explicitly refreshed to show changes to the underlying
table. This is done through the &lt;code&gt;REFRESH MATERIALIZED VIEW &amp;lt;name&amp;gt;;&lt;/code&gt;
syntax.&lt;/p&gt;
&lt;p&gt;Keeping materialized views regularly refreshed is often delegated
to a scheduled cron job. There is also often a need for database
users to manually refresh the data on demand. At this point many
users stub their toe on permissions because
&lt;a href="https://www.postgresql.org/docs/current/sql-refreshmaterializedview.html"&gt;refreshing a materialized view&lt;/a&gt;
can only be done by the owner of the materialized view.
This post uses a simple example to illustrate how to delegate refresh
permissions to other Postgres roles.&lt;/p&gt;
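&lt;p&gt;One common approach (a sketch with hypothetical names, not necessarily the exact
method shown in the full post) is a &lt;code&gt;SECURITY DEFINER&lt;/code&gt; function owned by the
materialized view's owner, with &lt;code&gt;EXECUTE&lt;/code&gt; granted to the roles that need
refresh rights:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;CREATE OR REPLACE FUNCTION public.refresh_mv_example()
    RETURNS void
    LANGUAGE plpgsql
    SECURITY DEFINER
AS $$
BEGIN
    REFRESH MATERIALIZED VIEW public.mv_example;
END;
$$;

GRANT EXECUTE ON FUNCTION public.refresh_mv_example() TO app_role;
&lt;/code&gt;&lt;/pre&gt;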
</description><pubDate>Mon, 05 Jul 2021 05:01:00 GMT</pubDate><guid isPermaLink="true">https://blog.rustprooflabs.com/2021/07/postgres-permission-mat-view</guid></item><item><title>Permissions required for PostGIS</title><link>https://blog.rustprooflabs.com/2021/12/postgis-permissions-required</link><description>&lt;p&gt;&lt;a href="https://postgis.net/"&gt;PostGIS&lt;/a&gt; is a widely popular spatial database extension for Postgres.
It's also one of my favorite tools!
A recent discussion on the &lt;a href="https://discord.gg/EFqQfBx7av"&gt;People, Postgres, Data&lt;/a&gt;
Discord server highlighted that the permissions required for various PostGIS
operations were not clearly explained in the PostGIS documentation.
As it turned out, I didn't know exactly what was required either.
The basic &lt;a href="https://postgis.net/install/"&gt;PostGIS install page&lt;/a&gt; provides resources
for installing the binary on the server and the basic &lt;code&gt;CREATE EXTENSION&lt;/code&gt; commands,
but does not explain permissions required.&lt;/p&gt;
&lt;p&gt;This post explores the permissions required for three types of PostGIS interactions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Install/Create PostGIS&lt;/li&gt;
&lt;li&gt;Use PostGIS&lt;/li&gt;
&lt;li&gt;Load data from &lt;code&gt;pg_dump&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Database and Users&lt;/h2&gt;
&lt;p&gt;I am using Postgres installed on my laptop for these tests: Postgres 13.5 and
PostGIS 3.1.
I created an empty database named &lt;code&gt;postgis_perms&lt;/code&gt; and used the &lt;code&gt;\du&lt;/code&gt;
slash command in psql to see the current roles.  This instance has
my &lt;code&gt;ryanlambert&lt;/code&gt; role, a superuser, and the default &lt;code&gt;postgres&lt;/code&gt; role.
The &lt;code&gt;postgres&lt;/code&gt; role is not used in this post outside of this example.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;([local] 🐘) ryanlambert@postgis_perms=# \du
                                     List of roles
┌─────────────┬────────────────────────────────────────────────────────────┬───────────┐
│  Role name  │                         Attributes                         │ Member of │
╞═════════════╪════════════════════════════════════════════════════════════╪═══════════╡
│ postgres    │ Superuser, Create role, Create DB, Replication, Bypass RLS │ {}        │
│ ryanlambert │ Superuser, Create role, Create DB                          │ {}        │
└─────────────┴────────────────────────────────────────────────────────────┴───────────┘
&lt;/code&gt;&lt;/pre&gt;
</description><pubDate>Wed, 01 Dec 2021 05:01:00 GMT</pubDate><guid isPermaLink="true">https://blog.rustprooflabs.com/2021/12/postgis-permissions-required</guid></item><item><title>OpenStreetMap to PostGIS is getting lighter</title><link>https://blog.rustprooflabs.com/2021/05/osm2pgsql-reduced-ram-load-to-postgis</link><description>&lt;p&gt;If you have ever wanted OpenStreetMap data in Postgres/PostGIS, you
are probably familiar with the
&lt;a href="https://osm2pgsql.org/"&gt;osm2pgsql&lt;/a&gt; tool.
Lately I have been writing about the osm2pgsql developments with the
&lt;a href="/2020/12/osm2gpsql-flex-output-to-postgis"&gt;new Flex output&lt;/a&gt; and
how it is enabling &lt;a href="/2021/01/postgis-openstreetmap-flex-structure"&gt;improved&lt;/a&gt;
data &lt;a href="/2021/01/pgosm-flex-improved-openstreetmap-places-postgis"&gt;quality&lt;/a&gt;.
This post shifts focus away from the Flex output and examines the
performance of the osm2pgsql load itself.&lt;/p&gt;
&lt;p&gt;One challenge with osm2pgsql over the years has been that generic
recommendations were difficult to make. The safest recommendation
for nearly any combination of hardware and source data size was
to use &lt;code&gt;osm2pgsql --slim --drop&lt;/code&gt; to put most of the intermediate data
into Postgres instead of relying directly on RAM, which osm2pgsql otherwise needed a lot of.
This choice has offsetting costs: putting all that data into Postgres (only to be deleted) consumes disk space and I/O.&lt;/p&gt;
&lt;p&gt;A few days ago, a
&lt;a href="https://github.com/openstreetmap/osm2pgsql/pull/1461"&gt;pull request&lt;/a&gt; from Jochen Topf to create a new RAM middle caught my eye.
The text that piqued my interest (emphasis mine):&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;When not using two-stage processing the memory requirements are much much smaller than with the old ram middle. Rule of thumb is, &lt;strong&gt;you'll need about 1GB plus 2.5 times the size of the PBF file as memory.&lt;/strong&gt; This makes it possible to import even continent-sized data on reasonably-sized machines.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Wait... what?!  Is this for real??&lt;/p&gt;
</description><pubDate>Sat, 01 May 2021 05:01:00 GMT</pubDate><guid isPermaLink="true">https://blog.rustprooflabs.com/2021/05/osm2pgsql-reduced-ram-load-to-postgis</guid></item><item><title>Using Query ID in Postgres 14</title><link>https://blog.rustprooflabs.com/2021/10/postgres-14-query-id</link><description>&lt;p&gt;&lt;a href="https://www.postgresql.org/docs/14/release-14.html"&gt;Postgres 14 was released&lt;/a&gt;
on September 30, 2021.  With a new major version comes new features
to explore!
This post takes a look at the unique query id option
&lt;a href="https://www.postgresql.org/docs/14/runtime-config-statistics.html#GUC-COMPUTE-QUERY-ID"&gt;enabled with &lt;code&gt;compute_query_id&lt;/code&gt;&lt;/a&gt;
in &lt;code&gt;postgresql.conf&lt;/code&gt;.
This particular backend improvement, included with Postgres 14, is one I am
excited about because it makes investigating and
monitoring query related performance easier.
This post covers how to enable the new feature and explores how it can be used
in real life performance tuning.&lt;/p&gt;
&lt;h2&gt;Enable query id&lt;/h2&gt;
&lt;p&gt;For testing I created a new instance with Postgres 14 installed
and edited the &lt;code&gt;postgresql.conf&lt;/code&gt; file to change a few configuration options
related to the query id.
I set &lt;code&gt;compute_query_id&lt;/code&gt; to &lt;code&gt;on&lt;/code&gt; instead of &lt;code&gt;auto&lt;/code&gt;
and set &lt;code&gt;shared_preload_libraries&lt;/code&gt; to load the &lt;code&gt;pg_stat_statements&lt;/code&gt; extension.
Additionally, I turned on &lt;code&gt;log_duration&lt;/code&gt;, set &lt;code&gt;log_statement&lt;/code&gt; to &lt;code&gt;all&lt;/code&gt;,
and updated &lt;a href="https://www.postgresql.org/docs/14/runtime-config-logging.html#GUC-LOG-LINE-PREFIX"&gt;&lt;code&gt;log_line_prefix&lt;/code&gt; to include&lt;/a&gt;
&lt;code&gt;query_id=%Q&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;compute_query_id = on
shared_preload_libraries = 'pg_stat_statements'

log_duration = on
log_statement = 'all'
log_line_prefix = '%t [%p]: [%l-1] user=%u,db=%d,app=%a,client=%h,query_id=%Q '
&lt;/code&gt;&lt;/pre&gt;
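&lt;p&gt;With these settings in place, the &lt;code&gt;queryid&lt;/code&gt; reported by
&lt;code&gt;pg_stat_statements&lt;/code&gt; matches the &lt;code&gt;query_id&lt;/code&gt; written to the logs and
shown in &lt;code&gt;pg_stat_activity&lt;/code&gt;.  For example, a query like this surfaces
candidates worth investigating:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;SELECT queryid, calls, mean_exec_time, query
    FROM pg_stat_statements
    ORDER BY mean_exec_time DESC
    LIMIT 5
;
&lt;/code&gt;&lt;/pre&gt;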
</description><pubDate>Fri, 15 Oct 2021 05:01:00 GMT</pubDate><guid isPermaLink="true">https://blog.rustprooflabs.com/2021/10/postgres-14-query-id</guid></item><item><title>Improved OpenStreetMap data structure in PostGIS</title><link>https://blog.rustprooflabs.com/2021/01/postgis-openstreetmap-flex-structure</link><description>&lt;p&gt;It was nearly a decade ago when I first loaded OpenStreetMap data to PostGIS.
Over the years my fingers have typed &lt;code&gt;osm2pgsql --slim --drop ...&lt;/code&gt; countless times
and I do not see an end to that trend anytime soon.
One thing that is changing is that getting &lt;strong&gt;high quality&lt;/strong&gt; OpenStreetMap data into 
PostGIS is easier than ever!
This improvement in data quality is made possible by the new Flex output available in osm2pgsql 1.4.0,
I &lt;a href="/2020/12/osm2gpsql-flex-output-to-postgis"&gt;wrote about my initial impressions of the Flex output&lt;/a&gt; a few weeks ago.&lt;/p&gt;
&lt;p&gt;This post looks at how I am starting to use osm2pgsql's Flex output to provide a 
standardized and sanitized OpenStreetMap data set in Postgres/PostGIS.
No longer is osm2pgsql limited to loading data to the 3-table structure
(&lt;code&gt;planet_osm_point&lt;/code&gt;, &lt;code&gt;planet_osm_line&lt;/code&gt; and &lt;code&gt;planet_osm_polygon&lt;/code&gt;)
so I am eagerly converting to the Flex output and taking advantage of these changes!
It is also easier than ever to create customized mix-and-match data loads
for customized needs of specific projects.&lt;/p&gt;
</description><pubDate>Sun, 03 Jan 2021 05:01:00 GMT</pubDate><guid isPermaLink="true">https://blog.rustprooflabs.com/2021/01/postgis-openstreetmap-flex-structure</guid></item><item><title>Use BIGINT in Postgres</title><link>https://blog.rustprooflabs.com/2021/06/postgres-bigint-by-default</link><description>&lt;p&gt;This post examines a common database design decision
involving the choice of using &lt;code&gt;BIGINT&lt;/code&gt; versus &lt;code&gt;INT&lt;/code&gt; data types.
You may already know that the &lt;code&gt;BIGINT&lt;/code&gt; data type uses
twice the storage on disk (8 bytes per value) compared to
the &lt;code&gt;INT&lt;/code&gt; data type (4 bytes per value).
Knowing this, a common
decision is to use &lt;code&gt;INT&lt;/code&gt; wherever possible, only resorting to
&lt;code&gt;BIGINT&lt;/code&gt; when it was obvious&amp;#42;
that the column would be storing
values greater than 2.147 Billion (the
&lt;a href="https://www.postgresql.org/docs/current/datatype-numeric.html"&gt;max of &lt;code&gt;INT&lt;/code&gt;&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;That's what I did too, until 2-3 years ago!
I started changing my default mindset to using &lt;code&gt;BIGINT&lt;/code&gt; over &lt;code&gt;INT&lt;/code&gt;,
reversing my long-held habit.
This post explains why I default to using &lt;code&gt;BIGINT&lt;/code&gt;
and examines the performance impacts of the decision.&lt;/p&gt;
&lt;h2&gt;TLDR;&lt;/h2&gt;
&lt;p&gt;As I conclude at the end:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The tests I ran here show that a production-scale database with properly sized hardware can handle that slight overhead with no problem.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;Why default to &lt;code&gt;BIGINT&lt;/code&gt;?&lt;/h2&gt;
&lt;p&gt;The main reason to default to &lt;code&gt;BIGINT&lt;/code&gt; is to avoid
&lt;code&gt;INT&lt;/code&gt; to &lt;code&gt;BIGINT&lt;/code&gt; migrations. The need to do an &lt;code&gt;INT&lt;/code&gt; 
to &lt;code&gt;BIGINT&lt;/code&gt; migration comes up at the
least opportune time and the task is time consuming.
This type of migration typically involves at least one column used
as a &lt;code&gt;PRIMARY KEY&lt;/code&gt; and that is often used elsewhere as a &lt;code&gt;FOREIGN KEY&lt;/code&gt;
on other table(s) that must also be migrated.&lt;/p&gt;
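&lt;p&gt;In its simplest form the migration is a one-liner (&lt;code&gt;example_table&lt;/code&gt; here is
hypothetical), but on a large table it takes an &lt;code&gt;ACCESS EXCLUSIVE&lt;/code&gt; lock and
rewrites the table, and every dependent &lt;code&gt;FOREIGN KEY&lt;/code&gt; column needs the same
treatment:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ALTER TABLE example_table
    ALTER COLUMN id TYPE BIGINT
;
&lt;/code&gt;&lt;/pre&gt;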
&lt;p&gt;In the spirit of defensive database design, &lt;code&gt;BIGINT&lt;/code&gt;
is the safest choice. Remember the &lt;strong&gt;&amp;#42;obvious&lt;/strong&gt; part mentioned
above? Planning and estimating is a difficult topic and
people (myself included) get it wrong all the time!
Yes, there is overhead for using &lt;code&gt;BIGINT&lt;/code&gt;,
but I believe the overhead associated with the extra 4 bytes 
is trivial for the majority of production databases.&lt;/p&gt;
</description><pubDate>Sat, 05 Jun 2021 05:01:00 GMT</pubDate><guid isPermaLink="true">https://blog.rustprooflabs.com/2021/06/postgres-bigint-by-default</guid></item><item><title>Why Partition OpenStreetMap data?</title><link>https://blog.rustprooflabs.com/2021/02/postgres-postgis-why-partition-openstreetmap</link><description>&lt;p&gt;This post covers the first part of my path in considering native
Postgres partitioning and how it might be helpful to my work with OpenStreetMap data in PostGIS.
Partitioning tables in Postgres can have significant
benefits when working with larger data sets, and OpenStreetMap
data as a whole is generally considered a large data set.
The post following this one will outline the steps I am taking to implement
partitioning with data loaded by &lt;a href="https://github.com/rustprooflabs/pgosm-flex"&gt;PgOSM-Flex&lt;/a&gt;. A third post is planned to dive
into the performance impacts of this change.&lt;/p&gt;
&lt;p&gt;Table partitioning is not an architecture that should be implemented
casually without planning and good reason.
The consequences of a poorly planned and implemented partitioning scheme can be severe.  This is why it is worth the extra time to
plan, evaluate and test this option before making any lasting implementation
decisions. This post starts by examining the work flow I have
used with OpenStreetMap data, challenges with my legacy approach,
and highlights where I think Postgres partitioning can provide serious
improvement. My &lt;a href="/2021/02/postgres-postgis-partition-openstreetmap-road"&gt;next post&lt;/a&gt;
shows how I am approaching the task of partitioning OpenStreetMap data.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;At the time of writing this post I have &lt;strong&gt;not decided&lt;/strong&gt; if this is a path I will continue down for production use. I have not started testing and collecting data for the 3rd post. I will likely make the "Go / No-Go" decision while I am collecting data for the performance related post.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;OpenStreetMap for Analytics&lt;/h2&gt;
&lt;p&gt;The main way I use OpenStreetMap data is within analytics style projects.
Routing, travel times, watersheds, urban growth, and land usage are all
easily within scope for OpenStreetMap data in PostGIS.&lt;/p&gt;
</description><pubDate>Sun, 14 Feb 2021 05:01:00 GMT</pubDate><guid isPermaLink="true">https://blog.rustprooflabs.com/2021/02/postgres-postgis-why-partition-openstreetmap</guid></item><item><title>Using Uber's H3 hex grid in PostGIS</title><link>https://blog.rustprooflabs.com/2022/04/postgis-h3-intro</link><description>&lt;p&gt;This post explores using the &lt;a href="https://eng.uber.com/h3/"&gt;H3 hex grid system&lt;/a&gt;
within PostGIS. H3 was developed by Uber and has some cool benefits
over the PostGIS native &lt;code&gt;ST_HexagonGrid()&lt;/code&gt; function used
in my post &lt;a href="/2021/11/postgis-find-openstreetmap-missing-crossing"&gt;Find missing crossings in OpenStreetMap with PostGIS&lt;/a&gt;.
The hex grid built-in to PostGIS is great for one-off projects covering a specific region,
though it has shortcomings for larger scale consistency.
On the other hand, the H3 grid is a globally defined grid that scales up and down
through resolutions neatly. For more details, read
&lt;a href="https://eng.uber.com/h3/"&gt;Uber's description&lt;/a&gt;.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;This post used the H3 v3 extension. See &lt;a href="/2023/05/postgis-h3-v4-refresh"&gt;Using v4 of the Postgres H3 extension&lt;/a&gt; for usage in the latest version.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This post works through a few of the functions available in the H3 extension
and how they can be used for spatial aggregation in an analysis. One additional
focus is how to generate a table of H3 hexagons for a given resolution.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Note:  This post does not focus on using H3 for the best performance.  See my post &lt;a href="/2022/06/h3-indexes-on-postgis-data"&gt;H3 indexes for performance with PostGIS data&lt;/a&gt; for a look into high performance spatial searches with H3.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;Install H3 in Postgres&lt;/h2&gt;
&lt;p&gt;The H3 library is available to PostGIS as a Postgres extension. I am using
the &lt;a href="https://github.com/bytesandbrains/h3-pg"&gt;bytesandbrains h3-pg project&lt;/a&gt;
available on GitHub.  The extension can be installed using
&lt;code&gt;pgxn install h3&lt;/code&gt;.  Once installed, create the H3 extension in the database.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;CREATE EXTENSION h3;
&lt;/code&gt;&lt;/pre&gt;
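&lt;p&gt;A quick smoke test of the install uses the v3 API's &lt;code&gt;h3_geo_to_h3()&lt;/code&gt;
function to convert a Postgres &lt;code&gt;point&lt;/code&gt; (longitude, latitude) to its hex index
at a given resolution.  The coordinates here are near Denver, Colorado.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;SELECT h3_geo_to_h3(POINT(-104.99, 39.74), 7);
&lt;/code&gt;&lt;/pre&gt;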
</description><pubDate>Sun, 24 Apr 2022 05:01:00 GMT</pubDate><guid isPermaLink="true">https://blog.rustprooflabs.com/2022/04/postgis-h3-intro</guid></item><item><title>Backups for Postgres - PGSQL Phriday #002</title><link>https://blog.rustprooflabs.com/2022/11/pgsql-phriday-002</link><description>&lt;p&gt;This blog post is for PGSQL Phriday #002.  Read
&lt;a href="https://www.softwareandbooz.com/introducing-psql-phriday/"&gt;Ryan Booz' introduction&lt;/a&gt;
from September for more details on PGSQL Phriday.
Andreas Scherbaum
&lt;a href="https://andreas.scherbaum.la/blog/archives/1122-PGSQL-Phriday-002-PostgreSQL-Backup-and-Restore.html"&gt;is this month's host&lt;/a&gt; and chose
the topic: Postgres backups!&lt;/p&gt;
&lt;p&gt;The topic reposted here:&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;Which tool(s) are you using, where do you store backups, how often do you do backups?
Are there any recommendations you can give the reader how to improve their backups?
Any lesser known features in your favorite backup tool?
Any new and cool features in a recently released version?&lt;/p&gt;
&lt;p&gt;Bonus question: Is pg_dump a backup tool?&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;What is a backup?&lt;/h2&gt;
&lt;p&gt;I address most of ads' questions in this post, but before we dive in
we need to define "backup."
&lt;a href="https://www.merriam-webster.com/dictionary/backup"&gt;Merriam Webster&lt;/a&gt;
has a definition for backup in the context of computers:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;backup (3): "a copy of computer data (such as a file or the contents of a hard drive)"&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I'm running with this simple defintion of backup for today's post.
To frame the definition of "backup" in a non-Postgres context:  Suppose I have a "business document."  I want to make some major changes but am afraid of accidentally losing something important.  What do I do?  I copy / paste the file, change the name to include today's date, and edit away.  Did I create a backup of the original document? Sure.  In a way.
Is it the same thing as when the IT department backs up the network drive
where both the original and newly modified documents are saved?
Nope. Do both approaches serve their purpose?  Yes!&lt;/p&gt;
&lt;p&gt;Database backups are similar.  There isn't a one-size-fits-all solution.&lt;/p&gt;
</description><pubDate>Fri, 04 Nov 2022 05:01:00 GMT</pubDate><guid isPermaLink="true">https://blog.rustprooflabs.com/2022/11/pgsql-phriday-002</guid></item><item><title>Routing with Lines through Polygons</title><link>https://blog.rustprooflabs.com/2022/10/pgrouting-lines-through-polygons</link><description>&lt;p&gt;One of my favorite layers to route with &lt;a href="https://pgrouting.org/"&gt;pgRouting&lt;/a&gt; is the water layer.  I am interested
in where water comes from, where it goes, where runoff happens,
and how urban development interacts with this powerful force
of nature. The OpenStreetMap water layer, however,
presents a challenge when routing with PostGIS and pgRouting: Polygons.&lt;/p&gt;
&lt;p&gt;Why are polygons a challenge?  A routing network using pgRouting is built
from &lt;strong&gt;lines&lt;/strong&gt; (edges).
Now, to state the obvious: polygons are not lines.&lt;/p&gt;
&lt;p&gt;Real world waterway networks are made up of both lines and polygons.
Rivers, streams, and drainage routes are predominantly (but not exclusively!)
mapped using lines. These lines feed into and out of ponds, lakes, and reservoirs. 
The following animation shows how much impact the water polygons
can have on a waterway network... some very important paths simply
disappear when they are excluded.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Animated image showing an area northwest of Denver, Colorado with a few large reservoirs (polygons) connected by waterways (lines).  The animation shows the impact of taking these large polygons out of the routing equation, many important route options disappear." src="/static/images/water-polygons-are-important-for-routing.gif" /&gt;&lt;/p&gt;
&lt;p&gt;To make the full water network route-able we need to create a combined
line layer.  The combined line layer will include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Initial &lt;code&gt;osm.water_line&lt;/code&gt; inputs&lt;/li&gt;
&lt;li&gt;Medial axis lines from &lt;code&gt;osm.water_polygon&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Lines to connect initial inputs to medial axis&lt;/li&gt;
&lt;/ul&gt;
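&lt;p&gt;For the medial axis step, PostGIS' SFCGAL-backed
&lt;code&gt;ST_ApproximateMedialAxis()&lt;/code&gt; is one way to approximate centerlines through
the polygons.  This is a sketch, not the full workflow: it requires
&lt;code&gt;CREATE EXTENSION postgis_sfcgal&lt;/code&gt;, and the column names are illustrative.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;SELECT osm_id,
        ST_ApproximateMedialAxis(geom) AS geom
    FROM osm.water_polygon
;
&lt;/code&gt;&lt;/pre&gt;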
</description><pubDate>Sun, 23 Oct 2022 05:01:00 GMT</pubDate><guid isPermaLink="true">https://blog.rustprooflabs.com/2022/10/pgrouting-lines-through-polygons</guid></item><item><title>Postgres 15 Configuration Changes</title><link>https://blog.rustprooflabs.com/2022/10/postgres-15-config-changes</link><description>&lt;p&gt;A few years ago around the time PostgreSQL 12 was released, I
&lt;a href="/2019/12/exploring-pgconfig-comparison-tool"&gt;created a tool&lt;/a&gt; to help
identify the changes to &lt;code&gt;postgresql.conf&lt;/code&gt;. The pgConfig tool has helped me become
(and stay) aware of important changes to Postgres configuration as I work with 
various major version upgrades.
Now that &lt;a href="https://www.postgresql.org/about/news/postgresql-15-released-2526/"&gt;Postgres 15&lt;/a&gt;
is available,
&lt;a href="https://pgconfig.rustprooflabs.com/param/change/14/15"&gt;pgConfig is updated&lt;/a&gt; 
with the latest
configuration.
This post provides a quick look at changes in the Postgres 15
version of the &lt;code&gt;postgresql.conf&lt;/code&gt; options.&lt;/p&gt;
&lt;h2&gt;Summary of changes&lt;/h2&gt;
&lt;p&gt;The &lt;code&gt;postgresql.conf&lt;/code&gt; for Postgres 15 has 6 new items,
3 changed items and 1 removed item. Visit the
&lt;a href="https://pgconfig.rustprooflabs.com/param/change/14/15"&gt;pgConfig site&lt;/a&gt;
to see the full list of changes.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Screenshot showing the summary of changes in configuration from Postgres 14 to Postgres 15.  6 new parameters, 3 updated defaults, and 1 removed." src="/static/images/pgconfig-changes-pg14-to-pg15.png" /&gt;&lt;/p&gt;
</description><pubDate>Sun, 16 Oct 2022 05:01:00 GMT</pubDate><guid isPermaLink="true">https://blog.rustprooflabs.com/2022/10/postgres-15-config-changes</guid></item><item><title>Postgres 15 improves UNIQUE and NULL</title><link>https://blog.rustprooflabs.com/2022/07/postgres-15-unique-improvement-with-null</link><description>&lt;p&gt;Postgres 15 beta 2 &lt;a href="https://www.postgresql.org/about/news/postgresql-15-beta-2-released-2479/"&gt;was released&lt;/a&gt;
recently! I enjoy Beta season... reviewing and testing new features
is a fun diversion from daily tasks. This post takes a look at an improvement
to &lt;code&gt;UNIQUE&lt;/code&gt; constraints on columns with &lt;a href="https://en.wikipedia.org/wiki/Null_%28SQL%29"&gt;&lt;code&gt;NULL&lt;/code&gt; values&lt;/a&gt;. While the nuances of unique constraints are not as flashy
as &lt;a href="https://www.citusdata.com/blog/2022/05/19/speeding-up-sort-performance-in-postgres-15/"&gt;making sorts faster&lt;/a&gt; (that's exciting!),
improving the database developer's control over data quality is always a good benefit.&lt;/p&gt;
&lt;p&gt;This &lt;a href="https://www.postgresql.org/message-id/flat/84e5ee1b-387e-9a54-c326-9082674bde78%40enterprisedb.com"&gt;email chain&lt;/a&gt;
has the history behind this change.  The
&lt;a href="https://www.postgresql.org/docs/15/release-15.html"&gt;Postgres 15 release notes&lt;/a&gt;
summarize this improvement:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;"Allow unique constraints and indexes to treat NULL values as not distinct (Peter Eisentraut)&lt;/p&gt;
&lt;p&gt;Previously &lt;code&gt;NULL&lt;/code&gt; values were always indexed as distinct values, but this can now be changed by creating constraints and indexes using &lt;code&gt;UNIQUE NULLS NOT DISTINCT&lt;/code&gt;."&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;Two styles of &lt;code&gt;UNIQUE&lt;/code&gt;&lt;/h2&gt;
&lt;p&gt;To take a look at what this change does, we create two tables.
The &lt;code&gt;null_old_style&lt;/code&gt; table has a 2-column &lt;code&gt;UNIQUE&lt;/code&gt; constraint
on &lt;code&gt;(val1, val2)&lt;/code&gt;.  The &lt;code&gt;val2&lt;/code&gt; column allows &lt;code&gt;NULL&lt;/code&gt; values.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;CREATE TABLE null_old_style
(
    id BIGINT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
    val1 TEXT NOT NULL,
    val2 TEXT NULL,
    CONSTRAINT uq_val1_val2
        UNIQUE (val1, val2)
);
&lt;/code&gt;&lt;/pre&gt;
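&lt;p&gt;For comparison, a sketch of the new style using the Postgres 15 syntax quoted above (the &lt;code&gt;null_new_style&lt;/code&gt; table name here is illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;-- Postgres 15+: NULL values in val2 now count as duplicates
CREATE TABLE null_new_style
(
    id BIGINT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
    val1 TEXT NOT NULL,
    val2 TEXT NULL,
    CONSTRAINT uq_val1_val2_new
        UNIQUE NULLS NOT DISTINCT (val1, val2)
);
&lt;/code&gt;&lt;/pre&gt;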
</description><pubDate>Mon, 11 Jul 2022 05:01:00 GMT</pubDate><guid isPermaLink="true">https://blog.rustprooflabs.com/2022/07/postgres-15-unique-improvement-with-null</guid></item><item><title>Postgres Data Dictionary for everyone</title><link>https://blog.rustprooflabs.com/2022/01/pgdd-for-everyone</link><description>&lt;p&gt;A data dictionary is an important tool for anyone that stores and consumes
data. The &lt;a href="https://github.com/rustprooflabs/pgdd"&gt;PgDD extension&lt;/a&gt;
makes it easy to inspect and explore your data structures in Postgres.
This post shows how PgDD provides access to
current and accurate information about your databases for a variety of users:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Analysts&lt;/li&gt;
&lt;li&gt;DBAs and Developers&lt;/li&gt;
&lt;li&gt;The Business&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;PgDD makes this data dictionary information available through standard
SQL queries against a small set of views.&lt;/p&gt;
&lt;h2&gt;Background&lt;/h2&gt;
&lt;p&gt;Relational databases, including Postgres, track the majority of the information
needed for a data dictionary. This is done in the underlying
&lt;a href="https://www.postgresql.org/docs/current/catalogs-overview.html"&gt;system catalogs&lt;/a&gt;;
Postgres' system catalogs are in the &lt;code&gt;pg_catalog&lt;/code&gt; schema.
The challenge with the system catalogs is that they are not very
user-friendly to query for the type of details commonly needed.
PgDD does not do anything magical; it is simply
a wrapper around the Postgres system catalogs!&lt;/p&gt;
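&lt;p&gt;As an illustrative example (assuming PgDD's default &lt;code&gt;dd&lt;/code&gt; schema; exact column names can vary by version), listing user tables with their descriptions is a single query:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;-- List non-system tables with their descriptions via PgDD
SELECT s_name, t_name, description
    FROM dd.tables
    WHERE NOT system_object
;
&lt;/code&gt;&lt;/pre&gt;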
</description><pubDate>Tue, 04 Jan 2022 05:01:00 GMT</pubDate><guid isPermaLink="true">https://blog.rustprooflabs.com/2022/01/pgdd-for-everyone</guid></item><item><title>Better OpenStreetMap data using PgOSM Flex 0.6.0</title><link>https://blog.rustprooflabs.com/2022/10/pgosm-flex-improvements-0-6-0</link><description>&lt;p&gt;In late 2020 when osm2pgsql released &lt;a href="https://osm2pgsql.org/doc/manual.html#the-flex-output"&gt;the flex output&lt;/a&gt;
I eagerly jumped on that bandwagon.
The osm2pgsql flex output enabled the type of data structure and cleanup abilities
I had always wanted from osm2pgsql. By
&lt;a href="/2021/01/pgosm-flex-improved-openstreetmap-places-postgis"&gt;January 2021&lt;/a&gt;
the &lt;a href="https://github.com/rustprooflabs/pgosm-flex"&gt;PgOSM Flex&lt;/a&gt;
project was up and running and I was phasing out my legacy
OpenStreetMap processes.
Since then, I have written &lt;a href="/category/pgosm-flex"&gt;more than a dozen posts&lt;/a&gt;
exploring different improvements and use cases for the OpenStreetMap data
loaded via PgOSM Flex.  This post looks at a few notable improvements
in version 0.6.0 over prior versions.  The two areas of focus are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Data quality&lt;/li&gt;
&lt;li&gt;Usability&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Data quality improvements&lt;/h2&gt;
&lt;p&gt;The set of improvements that gave me the idea for this post were made
in PgOSM Flex versions 0.5.1 and 0.6.0.
&lt;a href="https://github.com/rustprooflabs/pgosm-flex/releases/tag/0.5.1"&gt;Version 0.5.1&lt;/a&gt;
took advantage of the long-awaited &lt;a href="https://github.com/openstreetmap/osm2pgsql/issues/1386#issuecomment-1210376552"&gt;addition of &lt;code&gt;multilinestring&lt;/code&gt; support&lt;/a&gt;
to osm2pgsql.  Adding that feature to osm2pgsql allowed relations of lines to
be imported in the same manner as
&lt;a href="/2021/01/pgosm-flex-improved-openstreetmap-places-postgis"&gt;relations of polygons&lt;/a&gt;.
Without &lt;code&gt;multilinestring&lt;/code&gt; support, relations such as
&lt;a href="https://www.openstreetmap.org/relation/13642053"&gt;13642053&lt;/a&gt;,
shown in the following screenshot, were being skipped by the PgOSM Flex import.
This improvement targeted roads, waterways, and public transport layers.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Screenshot from DBeaver showing a blue segment representing OSM relation 13642053, a roughly 33 kilometer stretch of road that had previously been excluded from data loaded by PgOSM Flex." src="/static/images/pgosm-road-line-relation-13642053.png" /&gt;&lt;/p&gt;
</description><pubDate>Tue, 04 Oct 2022 05:01:00 GMT</pubDate><guid isPermaLink="true">https://blog.rustprooflabs.com/2022/10/pgosm-flex-improvements-0-6-0</guid></item><item><title>What is the PostgreSQL community to you? - PGSQL Phriday #003</title><link>https://blog.rustprooflabs.com/2022/12/pgsql-phriday-003</link><description>&lt;p&gt;This blog post is for PGSQL Phriday #003.  Read
&lt;a href="https://www.softwareandbooz.com/introducing-psql-phriday/"&gt;Ryan Booz' introduction&lt;/a&gt;
from September for more details on PGSQL Phriday.
Pat Wright (SQL Asylum)
&lt;a href="https://sqlasylum.wordpress.com/2022/11/29/pgsql-phriday-003-what-is-the-community-to-you/"&gt;is this month's host&lt;/a&gt; and chose
the topic: &lt;strong&gt;What is the PostgreSQL community to you?&lt;/strong&gt;&lt;/p&gt;
&lt;h2&gt;TLDR;&lt;/h2&gt;
&lt;p&gt;The Postgres community is helpful.&lt;/p&gt;
&lt;h2&gt;One big community with many layers&lt;/h2&gt;
&lt;p&gt;The remainder of this post explores why I say the Postgres community is helpful.
Postgres is an open source project with multiple layers and locations of community.
Membership is open, free, and no invite is needed.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Prefer having an invite? You're invited!&lt;/p&gt;
&lt;/blockquote&gt;
</description><pubDate>Fri, 02 Dec 2022 05:01:00 GMT</pubDate><guid isPermaLink="true">https://blog.rustprooflabs.com/2022/12/pgsql-phriday-003</guid></item><item><title>Book Release!  Mastering PostGIS and OpenStreetMap</title><link>https://blog.rustprooflabs.com/2022/10/announce-mastering-postgis-openstreetmap</link><description>&lt;p&gt;I'm excited to announce my book,
&lt;a href="https://postgis-osm.com" target="_blank"&gt;Mastering PostGIS and OpenStreetMap&lt;/a&gt;, is
&lt;a href="https://postgis-osm.com/buy" target="_blank"&gt;
available to purchase
&lt;/a&gt; as of October 1, 2022!
This book provides a practical guide
to introduce readers to PostGIS, OpenStreetMap data, and
spatial querying. Queries used for examples are
written against
&lt;a href="https://postgis-osm.com/data" target="_blank"&gt;real OpenStreetMap data&lt;/a&gt;
(included)
to help you learn how to navigate and explore complex spatial data.
The examples start simple and quickly progress through a
variety of clever spatial queries and powerful techniques.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Section 12.3, &lt;em&gt;Create Denver specific tables&lt;/em&gt;, is available as a
&lt;a href="https://book.postgis-osm.com/ch_pgosm/denver-area-extract.html" target='_blank'&gt;free preview&lt;/a&gt; section. The full Table of Contents is available from the free preview page.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;Who is this book for?&lt;/h2&gt;
&lt;p&gt;&lt;em&gt;Mastering PostGIS and OpenStreetMap&lt;/em&gt; is for anyone who wants
to learn more about PostGIS and/or OpenStreetMap data.
The hefty Appendix helps keep new users
on track without distracting more experienced users.
The following table gives an idea of the topics covered.&lt;/p&gt;
&lt;div class="col col-md-8"&gt;
&lt;table class="table table-striped table-hover"&gt;
    &lt;tr&gt;
        &lt;th&gt;Topic&lt;/th&gt;
        &lt;th&gt;Included?&lt;/th&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;Install PostGIS&lt;/td&gt;
        &lt;td&gt;✅&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;Spatial SQL queries&lt;/td&gt;
        &lt;td&gt;✅&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;Basics of OpenStreetMap tagging&lt;/td&gt;
        &lt;td&gt;✅&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;Load OpenStreetMap data to PostGIS&lt;/td&gt;
        &lt;td&gt;✅&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;Find and use local SRIDs everywhere&lt;/td&gt;
        &lt;td&gt;✅&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;Handle real-world (dirty!) data&lt;/td&gt;
        &lt;td&gt;✅&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;Performance of Geometry vs. Geography&lt;/td&gt;
        &lt;td&gt;✅&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;Routing&lt;/td&gt;
        &lt;td&gt;✅&lt;/td&gt;
    &lt;/tr&gt;
&lt;/table&gt;
&lt;/div&gt;

</description><pubDate>Sat, 01 Oct 2022 05:01:00 GMT</pubDate><guid isPermaLink="true">https://blog.rustprooflabs.com/2022/10/announce-mastering-postgis-openstreetmap</guid></item><item><title>Stubbing toes with &lt;code&gt;auto_explain&lt;/code&gt;</title><link>https://blog.rustprooflabs.com/2022/12/be-careful-auto-explain</link><description>&lt;p&gt;Postgres has a handy module called &lt;code&gt;auto_explain&lt;/code&gt;.  The
&lt;a href="https://www.postgresql.org/docs/current/auto-explain.html"&gt;&lt;code&gt;auto_explain&lt;/code&gt; module&lt;/a&gt;
lives up to its name: it runs &lt;code&gt;EXPLAIN&lt;/code&gt; automatically for you.
The intent of this module is to automatically provide useful
troubleshooting information about your slow queries as they happen.
This post outlines a pitfall I recently discovered with
&lt;code&gt;auto_explain&lt;/code&gt;.  Luckily for us, it's an easy thing to avoid.&lt;/p&gt;
&lt;p&gt;I discovered this by running &lt;code&gt;CREATE EXTENSION postgis;&lt;/code&gt;
and watching it run for quite a while before failing with an out of disk space error.
That is not my typical experience with a simple &lt;code&gt;CREATE EXTENSION&lt;/code&gt; command!&lt;/p&gt;
&lt;h2&gt;Standard use of &lt;code&gt;auto_explain&lt;/code&gt;&lt;/h2&gt;
&lt;p&gt;A common way to use &lt;code&gt;auto_explain&lt;/code&gt; is to target "slow queries" through the setting
&lt;code&gt;auto_explain.log_min_duration&lt;/code&gt;.  This setting defines the threshold, in milliseconds, above
which the &lt;code&gt;EXPLAIN&lt;/code&gt; output is logged.  If your queries are typically 10-50 ms,
you might decide to set &lt;code&gt;auto_explain.log_min_duration = 100&lt;/code&gt; to log queries taking twice as
long as your goal.  An instance serving big analytic queries might want to set that much higher,
say 2 or 5 seconds.&lt;/p&gt;
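&lt;p&gt;A sketch of what that looks like in &lt;code&gt;postgresql.conf&lt;/code&gt; (values illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Load the module and log plans for queries over 100 ms
shared_preload_libraries = 'auto_explain'
auto_explain.log_min_duration = 100
&lt;/code&gt;&lt;/pre&gt;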
&lt;h2&gt;Innocent testing&lt;/h2&gt;
&lt;p&gt;I say my testing was innocent because I wasn't &lt;strong&gt;trying&lt;/strong&gt; to break something.
That makes it innocent, right?  I was playing around with &lt;code&gt;auto_explain&lt;/code&gt; trying out
&lt;a href="https://www.pgmustard.com/docs/scoring-api"&gt;PgMustard's scoring API&lt;/a&gt;.
At the time I didn't want to think about where to set that threshold... I just wanted to capture
some &lt;code&gt;explain&lt;/code&gt; output for testing.
The &lt;a href="https://www.postgresql.org/docs/current/auto-explain.html"&gt;&lt;code&gt;auto_explain&lt;/code&gt; documentation&lt;/a&gt;
explains that setting &lt;code&gt;auto_explain.log_min_duration = 0&lt;/code&gt;
will capture "all plans."  Sounds good, let's do that!&lt;/p&gt;
</description><pubDate>Tue, 20 Dec 2022 05:01:00 GMT</pubDate><guid isPermaLink="true">https://blog.rustprooflabs.com/2022/12/be-careful-auto-explain</guid></item><item><title>H3 indexes for performance with PostGIS data</title><link>https://blog.rustprooflabs.com/2022/06/h3-indexes-on-postgis-data</link><description>&lt;p&gt;I recently started using the H3 hex grid extension in Postgres with the
goal of making some not-so-fast queries faster.
My previous post,
&lt;a href="/2022/04/postgis-h3-intro"&gt;Using Uber's H3 hex grid in PostGIS&lt;/a&gt;,
has an introduction to the H3 extension.
The focus in that post, admittedly, was on PostGIS rather than
on H3 itself.
This post takes a closer look at using the H3 extension to enhance
performance of spatial searches.&lt;/p&gt;
&lt;p&gt;The two common spatial query patterns considered in this post are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Nearest neighbor style searches&lt;/li&gt;
&lt;li&gt;Regional analysis&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote&gt;
&lt;p&gt;This post used the H3 v3 extension. See &lt;a href="/2023/05/postgis-h3-v4-refresh"&gt;Using v4 of the Postgres H3 extension&lt;/a&gt; for usage in the latest version.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;Setup and Point of Focus&lt;/h2&gt;
&lt;p&gt;This post uses two tables to examine performance.
The following queries add an &lt;code&gt;h3_ix&lt;/code&gt; column to the &lt;code&gt;osm.natural_point&lt;/code&gt;
and &lt;code&gt;osm.building_polygon&lt;/code&gt; tables.  This approach uses
&lt;a href="/2019/12/postgres12-generated-columns-postgis"&gt;&lt;code&gt;GENERATED&lt;/code&gt; columns&lt;/a&gt;
and adds an index to the column. Going through these steps allows us
to remove the need for PostGIS joins at query time for rough distance searches.
See my
&lt;a href="/2022/04/postgis-h3-intro"&gt;previous post&lt;/a&gt; for details about installing
the H3 extension and the basics of how it works.&lt;/p&gt;
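&lt;p&gt;A rough sketch of that setup follows, using v3-era h3-pg naming (the resolution, column details, and SRID assumption are illustrative, not the post's exact code):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;-- h3-pg v3 naming; v4 renamed these functions.
-- Assumes geom stores SRID 4326 points.
ALTER TABLE osm.natural_point
    ADD h3_ix h3index
    GENERATED ALWAYS AS (h3_geo_to_h3(geom::point, 8)) STORED;
CREATE INDEX ix_osm_natural_point_h3_ix
    ON osm.natural_point (h3_ix);
&lt;/code&gt;&lt;/pre&gt;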
</description><pubDate>Fri, 24 Jun 2022 05:01:00 GMT</pubDate><guid isPermaLink="true">https://blog.rustprooflabs.com/2022/06/h3-indexes-on-postgis-data</guid></item><item><title>Audit Data with Triggers: PGSQL Phriday #007</title><link>https://blog.rustprooflabs.com/2023/04/pgsqlphriday-007-audit-with-triggers</link><description>&lt;p&gt;Welcome to another &lt;a href="/category/pgsqlphriday"&gt;#PGSQLPhriday post&lt;/a&gt;!
This month's host is Lætitia Avrot, who picked the
&lt;a href="https://mydbanotebook.org/post/triggers/"&gt;topic of Triggers&lt;/a&gt;
with these questions:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;"Do you love them? Do you hate them? Do you sometimes love them sometimes hate them? And, most importantly, why? Do you know legitimate use cases for them? How to mitigate their drawbacks (if you think they have any)?"&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Let's dive in!&lt;/p&gt;
&lt;h2&gt;Triggers are a specialized tool&lt;/h2&gt;
&lt;p&gt;I rarely use triggers.  I don't hate triggers, I just think they should be used
sparingly. Like any specialized tool, you should not expect to use triggers for every
occasion where &lt;em&gt;they could be used&lt;/em&gt;.
However... there is one
notable use case where I really like triggers: &lt;strong&gt;audit tables&lt;/strong&gt;.
Part of the magic of using triggers for auditing data changes in Postgres is
the &lt;code&gt;JSON&lt;/code&gt;/&lt;code&gt;JSONB&lt;/code&gt; support available.&lt;/p&gt;
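&lt;p&gt;A minimal sketch of the idea (schema, table, and trigger names here are illustrative, not from a specific project):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;CREATE SCHEMA IF NOT EXISTS audit;

-- Audit table storing before/after images of each row as JSONB
CREATE TABLE audit.change_log
(
    id BIGINT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
    changed_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    table_name TEXT NOT NULL,
    old_row JSONB NULL,
    new_row JSONB NULL
);

CREATE FUNCTION audit.log_change()
    RETURNS TRIGGER
    LANGUAGE plpgsql
AS $func$
BEGIN
    INSERT INTO audit.change_log (table_name, old_row, new_row)
        VALUES (TG_TABLE_NAME, to_jsonb(OLD), to_jsonb(NEW));
    RETURN NEW;
END;
$func$;

-- Attach to any table to audit, e.g.:
--   CREATE TRIGGER tr_audit AFTER UPDATE ON my_schema.my_table
--       FOR EACH ROW EXECUTE FUNCTION audit.log_change();
&lt;/code&gt;&lt;/pre&gt;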
</description><pubDate>Fri, 07 Apr 2023 05:01:00 GMT</pubDate><guid isPermaLink="true">https://blog.rustprooflabs.com/2023/04/pgsqlphriday-007-audit-with-triggers</guid></item><item><title>Setup Geocoder with PostGIS and Tiger/LINE</title><link>https://blog.rustprooflabs.com/2023/10/geocode-with-postgis-setup</link><description>&lt;p&gt;Geocoding addresses is the process of taking a street address
and converting it to its location on a map. This post
shows how to create a PostGIS geocoder using the
U.S. Census Bureau's TIGER/Line data set. This is part one of
a series of posts exploring geocoding addresses.
The &lt;a href="/2023/10/geocode-with-postgis"&gt;next post&lt;/a&gt;
illustrates how to geocode in bulk with a focus on evaluating the accuracy of the resulting geometry data. &lt;/p&gt;
&lt;p&gt;Before diving in, let's look at an example of geocoding.
The address for Union Station
(&lt;a href="https://www.openstreetmap.org/way/25650822"&gt;see on OpenStreetMap&lt;/a&gt;)
is 1701 Wynkoop Street, Denver, CO, 80202.
This address was the input to geocode.
The blue point shown in the following screenshot
is the resulting point from the
&lt;a href="https://postgis.net/docs/Geocode.html"&gt;PostGIS &lt;code&gt;geocode()&lt;/code&gt;&lt;/a&gt;
function. The pop-up dialog shows the address, a rating of 0,
and the calculated distance away from the OpenStreetMap polygon representing that address (13 meters), shown in red under the pop-up
dialog.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Screenshot showing the source polygon for Union Station next to the geocoded point using the street address and the PostGIS Geocoder function." src="/static/images/union-station-source-and-geocoded.png" /&gt;&lt;/p&gt;
</description><pubDate>Sun, 08 Oct 2023 05:01:00 GMT</pubDate><guid isPermaLink="true">https://blog.rustprooflabs.com/2023/10/geocode-with-postgis-setup</guid></item><item><title>Working with GPS data in PostGIS</title><link>https://blog.rustprooflabs.com/2023/12/gps-gpx-postgis-processing</link><description>&lt;p&gt;One of the key elements to using PostGIS is having spatial data to work with!
Lucky for us,
one big difference today compared to the not-so-distant past is that essentially
everyone is carrying a GPS unit with them nearly everywhere.  This makes it
easy to create your own GPS data that you can then load into PostGIS!
This post explores some basics of loading GPS data to PostGIS and cleaning
it for use.  It turns out, GPS data from nearly any GPS-enabled device comes
with some... character.  Getting from the raw input to usable spatial data takes a
bit of effort.&lt;/p&gt;
&lt;p&gt;This post starts with using &lt;a href="https://gdal.org/programs/ogr2ogr.html"&gt;ogr2ogr&lt;/a&gt;
to load the &lt;code&gt;.gpx&lt;/code&gt; data to PostGIS.  Once the data is in PostGIS,
we want to actually do something with it.  Before the data is completely
usable, though, we should spend some time cleaning it.
Technically you can start querying the data right away; however, I have found there
is always data cleanup and processing involved first to make the data truly useful.&lt;/p&gt;
&lt;p&gt;This is especially true when using data collected over longer periods of time
with a variety of users and data sources.&lt;/p&gt;
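&lt;p&gt;As an illustration of the ogr2ogr step (connection details and file name are placeholders), loading the track points layer from a GPX file could look like:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ogr2ogr -f PostgreSQL \
    PG:"host=localhost dbname=travel user=your_user" \
    input.gpx track_points
&lt;/code&gt;&lt;/pre&gt;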
&lt;h2&gt;Travel database project&lt;/h2&gt;
&lt;p&gt;Before I started writing this post I had assumed that all of the code would
be contained in the body of this post.  It turned out that to get to the quality
I wanted, I had to create a new
&lt;a href="https://github.com/rustprooflabs/travel-db"&gt;travel database&lt;/a&gt; project (MIT licensed)
to share the code.  This is already a long post without including a few hundred
more lines of code! The &lt;code&gt;travel-db&lt;/code&gt; project creates a few tables, a view, and a stored procedure
in the &lt;code&gt;travel&lt;/code&gt; schema.  The stored procedure &lt;code&gt;travel.load_bad_elf_points&lt;/code&gt; is
responsible for cleaning and importing the data and weighs in at nearly 500 lines
of code itself.&lt;/p&gt;
</description><pubDate>Mon, 18 Dec 2023 05:01:00 GMT</pubDate><guid isPermaLink="true">https://blog.rustprooflabs.com/2023/12/gps-gpx-postgis-processing</guid></item><item><title>PgOSM Flex for Production OpenStreetMap data</title><link>https://blog.rustprooflabs.com/2023/04/pgosm-flex-production-openstreetmap</link><description>&lt;p&gt;The PgOSM Flex Project is looking forward to the 0.8.0 release!
If you aren't familiar with &lt;a href="https://pgosm-flex.com/"&gt;PgOSM Flex&lt;/a&gt;,
it is a tool that loads high quality OpenStreetMap datasets to PostGIS
using osm2pgsql.
I have a
&lt;a href="/2021/11/postgis-find-openstreetmap-missing-crossing"&gt;few&lt;/a&gt;
&lt;a href="/2022/10/pgrouting-lines-through-polygons"&gt;examples&lt;/a&gt;
of
&lt;a href="/2021/01/pgosm-flex-improved-openstreetmap-places-postgis"&gt;using&lt;/a&gt; OpenStreetMap data loaded
this way.&lt;/p&gt;
&lt;p&gt;I am extremely excited about PgOSM Flex 0.8.0 because the project
as a whole is really starting to feel "production ready."
While I have been using PgOSM Flex in production for
&lt;a href="/2020/12/osm2gpsql-flex-output-to-postgis"&gt;more than 2 years&lt;/a&gt;,
there have been a few rough edges over that time.  However,
the improvements over the past year have brought a number of amazing components
together.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;PgOSM Flex 0.8.0 does include a few ⚠️ breaking changes! ⚠️  Read the &lt;a href="https://github.com/rustprooflabs/pgosm-flex/releases/tag/0.8.0"&gt;release notes&lt;/a&gt; for full details.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;PgOSM Flex in production&lt;/h2&gt;
&lt;p&gt;What does "in production" mean for a tool in a data pipeline?&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Reliable&lt;/li&gt;
&lt;li&gt;Easy to try out&lt;/li&gt;
&lt;li&gt;Easy to load/update to prod&lt;/li&gt;
&lt;li&gt;Low friction software updates&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This post covers why I think PgOSM Flex meets all of those requirements.&lt;/p&gt;
</description><pubDate>Fri, 21 Apr 2023 05:01:00 GMT</pubDate><guid isPermaLink="true">https://blog.rustprooflabs.com/2023/04/pgosm-flex-production-openstreetmap</guid></item><item><title>PGSQL Phriday #005 Recap</title><link>https://blog.rustprooflabs.com/2023/02/pgsql-phriday-005--recap</link><description>&lt;p&gt;Thank you everyone who contributed to &lt;a href="/2023/01/pgsqlphriday-005--postgres-relational-and-otherwise"&gt;PgSQL Phriday #005&lt;/a&gt;! This month's topic:
"Is your data relational?"  If I missed any contributions,
or if new ones are published, let me know and I'll try to update this post.
These snippets are in a somewhat random order, loosely threaded together
by sub-topic.&lt;/p&gt;
&lt;h2&gt;Contributing posts&lt;/h2&gt;
&lt;p&gt;Hetti D. wrote &lt;a href="https://hdombrovskaya.wordpress.com/2023/02/03/pgsql-phriday-005-relational-and-non-relational-data/"&gt;a great post&lt;/a&gt;
starting by addressing the bonus question. I put that question last partly
because I have struggled with a succinct definition myself.
I also put it last because I hoped the initial 3 questions would lead us to answer
the bonus question in our own ways.
Hetti also discusses storing blobs and objects, and weighs the
complexities and trade-offs of using more targeted technology.&lt;/p&gt;
</description><pubDate>Fri, 10 Feb 2023 05:01:00 GMT</pubDate><guid isPermaLink="true">https://blog.rustprooflabs.com/2023/02/pgsql-phriday-005--recap</guid></item><item><title>Using v4 of the Postgres H3 extension</title><link>https://blog.rustprooflabs.com/2023/05/postgis-h3-v4-refresh</link><description>&lt;p&gt;I wrote about using the H3 extension last year in
&lt;a href="/2022/04/postgis-h3-intro"&gt;Using Uber's H3 hex grid in PostGIS&lt;/a&gt;
and &lt;a href="/2022/06/h3-indexes-on-postgis-data"&gt;H3 indexes for performance with PostGIS data&lt;/a&gt;.
Naturally, things have changed over the past 12 months; specifically,
version 4 of the H3 Postgres extension was released.
The &lt;a href="https://github.com/zachasme/h3-pg"&gt;H3 Postgres extension (h3-pg)&lt;/a&gt;
closely follows the upstream
&lt;a href="https://h3geo.org/docs"&gt;H3 project&lt;/a&gt;, including
naming conventions.
The changes made in H3 version 4 unfortunately changed
every function name used in my original blog posts.  It seems this
mass renaming was a one-time alignment in the H3 project; hopefully the
functions will not all be renamed again.&lt;/p&gt;
&lt;p&gt;This post covers the changes required to make the code in my
prior two posts work with version 4.x of h3-pg.&lt;/p&gt;
&lt;h2&gt;Create the h3 extension&lt;/h2&gt;
&lt;p&gt;Creating the extension for PostGIS usage now involves installing two (2)
extensions.  Some components have been split out into the &lt;code&gt;h3_postgis&lt;/code&gt;
extension.
I use &lt;code&gt;CASCADE&lt;/code&gt; when installing the &lt;code&gt;h3_postgis&lt;/code&gt; portion since that also
requires &lt;code&gt;postgis_raster&lt;/code&gt; which I do not have installed by default.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;CREATE EXTENSION h3;
CREATE EXTENSION h3_postgis CASCADE;
&lt;/code&gt;&lt;/pre&gt;
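&lt;p&gt;As a quick illustration of the renaming (example coordinates arbitrary; check the h3-pg docs for the point's coordinate order), the v3 &lt;code&gt;h3_geo_to_h3()&lt;/code&gt; call becomes &lt;code&gt;h3_lat_lng_to_cell()&lt;/code&gt; in v4:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;-- v4 naming at resolution 8
SELECT h3_lat_lng_to_cell(POINT(-104.99, 39.74), 8);
&lt;/code&gt;&lt;/pre&gt;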
</description><pubDate>Mon, 22 May 2023 05:01:00 GMT</pubDate><guid isPermaLink="true">https://blog.rustprooflabs.com/2023/05/postgis-h3-v4-refresh</guid></item><item><title>PASS Session: Postgres Extensions Shape the Future</title><link>https://blog.rustprooflabs.com/2023/11/pass-2023-extensions-shape-future</link><description>&lt;p&gt;This post supports my session titled &lt;a href="https://passdatacommunitysummit.com/sessions/2014"&gt;PostgreSQL: Extensions Shape the Future&lt;/a&gt;
at PASS Data Community Summit 2023 on November 15.  Thank you to everyone
who joined this session during PASS.  I believe the audio recording with slides
should be made available at some point in the coming months.&lt;/p&gt;
&lt;h2&gt;Slides&lt;/h2&gt;
&lt;p&gt;The following download is the PDF version of the slide deck.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="/static/docs/pass2023--Extensions-Shape-the-Future.pdf"&gt;Extensions Shape the Future slides&lt;/a&gt; (PDF)&lt;/li&gt;
&lt;/ul&gt;
</description><pubDate>Sun, 19 Nov 2023 05:01:00 GMT</pubDate><guid isPermaLink="true">https://blog.rustprooflabs.com/2023/11/pass-2023-extensions-shape-future</guid></item><item><title>Pre-conference Session Materials: GIS Data, Queries, and Performance</title><link>https://blog.rustprooflabs.com/2023/11/pass-2023-precon--gis-queries-performance</link><description>&lt;p&gt;This post supports our full day pre-conference session,
&lt;a href="https://passdatacommunitysummit.com/sessions/1515/"&gt;PostGIS and PostgreSQL: GIS Data, Queries, and Performance&lt;/a&gt;
at PASS Data Community Summit 2023 on November 13.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Thank you everyone who participated!  This page has been updated with the slide decks used during the session.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;Downloads for session&lt;/h2&gt;
&lt;p&gt;The data, permissions script, and example SQL queries used throughout this session
are available below.&lt;/p&gt;
</description><pubDate>Sun, 12 Nov 2023 05:01:00 GMT</pubDate><guid isPermaLink="true">https://blog.rustprooflabs.com/2023/11/pass-2023-precon--gis-queries-performance</guid></item><item><title>See you at PASS Data Community Summit!</title><link>https://blog.rustprooflabs.com/2023/10/pass-2023-coming-soon</link><description>&lt;p&gt;This year's &lt;a href="https://passdatacommunitysummit.com/"&gt;PASS summit&lt;/a&gt; in Seattle
is only four weeks away!  I am honored that I was selected to provide a full day
&lt;a href="https://passdatacommunitysummit.com/sessions/1515"&gt;pre-conference training&lt;/a&gt;
on PostGIS,
as well as a &lt;a href="https://passdatacommunitysummit.com/sessions/learning-pathways/2014"&gt;general session talk&lt;/a&gt; on extensions.
Both of my topics are focused on the Postgres ecosystem. Of course, that is not
a surprise to my regular readers! It may be a surprise to those who
have been aware of PASS in the past.&lt;/p&gt;
&lt;h2&gt;What is PASS?&lt;/h2&gt;
&lt;p&gt;This year's PASS conference is called "PASS Data Community Summit 2023." 
In the past, PASS was an acronym for Professional Association for SQL Server,
and the conference was very much a Microsoft conference.  When I attended in 2018 it was
because I wanted to learn more about MS SQL Server and PowerBI.
This year, that focus is expanding to include Postgres!&lt;/p&gt;
</description><pubDate>Sun, 15 Oct 2023 05:01:00 GMT</pubDate><guid isPermaLink="true">https://blog.rustprooflabs.com/2023/10/pass-2023-coming-soon</guid></item><item><title>Track performance differences with pg_stat_statements</title><link>https://blog.rustprooflabs.com/2023/05/pg-stat-statements-performance-differences</link><description>&lt;p&gt;This is my entry for PgSQL Phriday #008. It's Saturday, so I guess this is a day
late!
This
&lt;a href="https://www.pgmustard.com/blog/pgsql-phriday-pg-stat-statements"&gt;month's topic&lt;/a&gt;,
chosen by Michael from &lt;a href="https://www.pgmustard.com/"&gt;pgMustard&lt;/a&gt;,
is on the excellent &lt;code&gt;pg_stat_statements&lt;/code&gt; extension.
When I saw Michael was the host this month I knew he'd pick a topic I would want
to contribute on!
Michael's post for his own topic
&lt;a href="https://www.pgmustard.com/blog/queries-for-pg-stat-statements"&gt;provides helpful queries&lt;/a&gt; and good reminders about changes to columns
between Postgres version 12 and 13.&lt;/p&gt;
&lt;p&gt;In this post I show one way I like using &lt;code&gt;pg_stat_statements&lt;/code&gt;: tracking the
impact of configuration changes to a specific workload. I used a contrived change
to configuration to quickly make an obvious impact.&lt;/p&gt;
&lt;h2&gt;Process to test&lt;/h2&gt;
&lt;p&gt;I am using &lt;a href="https://pgosm-flex.com/"&gt;PgOSM Flex&lt;/a&gt; to load Colorado OpenStreetMap
data to PostGIS. PgOSM Flex uses a multi-step ETL that
prepares the database, runs &lt;a href="https://osm2pgsql.org/"&gt;osm2pgsql&lt;/a&gt;, and then runs multiple post-processing steps. This results in 2.4 GB of data in Postgres. That should
be enough activity to show something interesting.&lt;/p&gt;
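&lt;p&gt;The general pattern is to reset the statistics, run the workload, then query the view. A sketch (these column names are for Postgres 13 and newer):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;SELECT pg_stat_statements_reset();

-- ... run the workload to compare ...

SELECT calls, total_exec_time, mean_exec_time, query
    FROM pg_stat_statements
    ORDER BY total_exec_time DESC
    LIMIT 5;
&lt;/code&gt;&lt;/pre&gt;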
</description><pubDate>Sat, 06 May 2023 05:01:00 GMT</pubDate><guid isPermaLink="true">https://blog.rustprooflabs.com/2023/05/pg-stat-statements-performance-differences</guid></item><item><title>Geocode Addresses with PostGIS and Tiger/LINE</title><link>https://blog.rustprooflabs.com/2023/10/geocode-with-postgis</link><description>&lt;p&gt;My previous post,
&lt;a href="/2023/10/geocode-with-postgis-setup"&gt;Setup Geocoder with PostGIS and Tiger/LINE&lt;/a&gt;,
prepared a PostGIS geocoder with TIGER/Line data for Colorado.
This post uses that setup to bulk geocode addresses from
OpenStreetMap buildings to try to determine the accuracy of the geometry
data derived from the input addresses.&lt;/p&gt;
&lt;h2&gt;Quality Expectations&lt;/h2&gt;
&lt;p&gt;Let's get this out of the way:  No geocoding process is going to be perfectly accurate.&lt;/p&gt;
&lt;p&gt;There are a variety of contributing factors to the data quality.
The geocoder data source is pretty decent, coming from the U.S. Census
Bureau's TIGER/Line data set.  The vintage is 2022, and this is as recent
as I can load today using this data source.  We should understand
that this will not contain any new development or changes from the
past year (or so).&lt;/p&gt;
&lt;p&gt;The OpenStreetMap data is also a source of error.  It is certain
that there are typos, outdated addresses, and other inaccuracies
from the OpenStreetMap data.  Some degree of that has to be expected
with nearly any data source.  The region of OpenStreetMap used will also
be an influence; some regions just don't have many boots on the ground
editing and validating OpenStreetMap data.&lt;/p&gt;
</description><pubDate>Tue, 10 Oct 2023 05:01:00 GMT</pubDate><guid isPermaLink="true">https://blog.rustprooflabs.com/2023/10/geocode-with-postgis</guid></item><item><title>Accuracy of Geometry data in PostGIS</title><link>https://blog.rustprooflabs.com/2023/04/postgis-geometry-accuracy</link><description>&lt;p&gt;A common use case with PostGIS data is to calculate things, such as distances
between points, lengths of lines, and the area of polygons. 
The topic of accuracy, or inaccuracy, with &lt;code&gt;GEOMETRY&lt;/code&gt; data comes up often.
The most frequent offenders are generic SRIDs such as 3857 and 4326. In some projects
accuracy is paramount.  Non-negotiable.  On the other hand, plenty of projects
do not need accurate calculations.  Those projects often rely on relationships
between calculations, not the actual values of the calculations themselves.
If Coffee shop Y is 4 times further away than Coffee shop Z, I'll often go to
Coffee shop Z just based on that.&lt;/p&gt;
&lt;p&gt;In most cases, users should still understand how significant the errors are.
This post explores one approach to determine how accurate (or not!)
the calculations of a given SRID are in a particular region, based on latitude (North/South).
The queries used in this post can be adjusted for your specific area.&lt;/p&gt;
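&lt;p&gt;The heart of such a comparison is the distance calculated in a projected SRID versus the &lt;code&gt;GEOGRAPHY&lt;/code&gt; distance used as the reference.  A sketch of the idea, not the exact queries from this post:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;SELECT ST_Distance(ST_Transform(pt_w, 3857),
        ST_Transform(pt_e, 3857)) AS dist_3857,
    ST_Distance(pt_w::GEOGRAPHY, pt_e::GEOGRAPHY) AS dist_geog
FROM (SELECT ST_SetSRID(ST_MakePoint(-120, 40), 4326) AS pt_w,
        ST_SetSRID(ST_MakePoint(-80, 40), 4326) AS pt_e) p;
&lt;/code&gt;&lt;/pre&gt;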
&lt;h2&gt;Set the stage&lt;/h2&gt;
&lt;p&gt;The calculations in this post focus on the distance between two points situated 40 decimal
degrees apart.  The points are created in pairs of west/east points at -120 (W)
and -80 (W).  Those were picked arbitrarily, though intentionally spread far enough
apart to make the errors in distance calculations feel obviously significant.
The point pairs are created in 5 decimal degree intervals of latitude from 80 North to 80 South.
The following screenshot shows how the points frame much of North America.&lt;/p&gt;
&lt;p&gt;While the points on the map using a &lt;a href="https://en.wikipedia.org/wiki/Mercator_projection"&gt;Mercator projection&lt;/a&gt; appear to be equidistant... they are not!&lt;/p&gt;
&lt;p&gt;&lt;img alt="Screenshot showing the points used for distance checks at 5 degree latitude intervals.  The west and east points roughly frame most of North America; the exact longitudes were chosen because they are simple round numbers, not for any other relevance." src="/static/images/geometry-accuracy-points-west-east.png" /&gt;&lt;/p&gt;
</description><pubDate>Sat, 15 Apr 2023 05:01:00 GMT</pubDate><guid isPermaLink="true">https://blog.rustprooflabs.com/2023/04/postgis-geometry-accuracy</guid></item><item><title>PostgreSQL 16 improves infinity:  PgSQLPhriday #012</title><link>https://blog.rustprooflabs.com/2023/09/pgsqlphriday-012--postgres16-improves-infinity</link><description>&lt;p&gt;This month's #pgsqlphriday challenge is the &lt;a href="https://www.softwareandbooz.com/a-year-of-pgsql-phriday-blogging-events/"&gt;12th PgSQL Phriday&lt;/a&gt;, marking the end of the first year
of the event!
Before getting into this month's topic I want to give a shout out to Ryan Booz
for &lt;a href="https://www.softwareandbooz.com/pgsql-phriday-001-invite/"&gt;starting #pgsqlphriday&lt;/a&gt;.
More importantly though, a huge &lt;strong&gt;thank you&lt;/strong&gt; to 
the hosts and contributors from the past year! I looked forward to seeing the topic
each month followed by waiting to see who all would contribute and how they would approach
the topic.&lt;/p&gt;
&lt;p&gt;Check out &lt;a href="https://www.pgsqlphriday.com/"&gt;pgsqlphriday.com&lt;/a&gt; for the full list
of topics, including recaps from each topic to link to contributing posts.
This month is the &lt;a href="/category/pgsqlphriday"&gt;7th topic&lt;/a&gt; I've been able to
contribute to the event.  I even had the honor of
hosting #005 with the topic &lt;a href="/2023/02/pgsql-phriday-005--recap"&gt;Is your data relational&lt;/a&gt;?
I'm really looking forward to another year ahead!&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Now returning to your regularly scheduled PgSQL Phriday content.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This month,
Ryan Booz chose the topic: &lt;a href="https://www.softwareandbooz.com/pgsql-phriday-012/"&gt;What Excites You About PostgreSQL 16&lt;/a&gt;?   With the release of Postgres 16
expected in the near(ish) future, it's starting to get real. It
won't be long until casual users are upgrading their Postgres instances.
To decide what to write about I headed to the
&lt;a href="https://www.postgresql.org/docs/16/release-16.html"&gt;Postgres 16 release notes&lt;/a&gt;
to scan through the documents.  Of all the items, I picked this one,
attributed to Vik Fearing:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Accept the spelling "+infinity" in datetime input&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The rest of this post looks at what this means, and why I think this matters.&lt;/p&gt;
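&lt;p&gt;As a minimal sketch of the change: both of the following casts succeed on Postgres 16, while earlier versions reject the explicit plus sign with an invalid input syntax error.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;SELECT '+infinity'::date;
SELECT '+infinity'::timestamptz;
&lt;/code&gt;&lt;/pre&gt;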
</description><pubDate>Fri, 01 Sep 2023 05:01:00 GMT</pubDate><guid isPermaLink="true">https://blog.rustprooflabs.com/2023/09/pgsqlphriday-012--postgres16-improves-infinity</guid></item><item><title>Postgres 15:  Explain Buffer now with Temp Timings</title><link>https://blog.rustprooflabs.com/2023/06/postgres15-explain-buffer-temp-timings</link><description>&lt;p&gt;This post explores a helpful addition to Postgres 15's &lt;code&gt;EXPLAIN&lt;/code&gt; output when
using &lt;code&gt;BUFFERS&lt;/code&gt;.
The Postgres 15 &lt;a href="https://www.postgresql.org/docs/current/release-15.html"&gt;release notes&lt;/a&gt;
lists this item:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Add &lt;code&gt;EXPLAIN (BUFFERS)&lt;/code&gt; output for temporary file block I/O (Masahiko Sawada)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This improvement adds new detail to the output provided from Postgres 15 when running
&lt;code&gt;EXPLAIN (ANALYZE, BUFFERS) &amp;lt;query&amp;gt;&lt;/code&gt;.  This post explores this feature along with
a couple different ways the reported I/O timing interacts with performance tuning.&lt;/p&gt;
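&lt;p&gt;A sketch of the kind of command involved, coaxing a temporary file spill by lowering &lt;code&gt;work_mem&lt;/code&gt; for the session; the query itself is just a throwaway example:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;SET work_mem = '1MB';
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM generate_series(1, 1000000) s
ORDER BY random();
&lt;/code&gt;&lt;/pre&gt;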
&lt;h2&gt;Getting the feature&lt;/h2&gt;
&lt;p&gt;The first thing you need is to be using at least Postgres 15.
Your instance also needs to have &lt;code&gt;track_io_timing=on&lt;/code&gt; in your Postgres configuration
file, &lt;code&gt;postgresql.conf&lt;/code&gt;. 
Check the value of this setting with &lt;code&gt;SHOW track_io_timing;&lt;/code&gt;.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;SHOW track_io_timing;
┌─────────────────┐
│ track_io_timing │
╞═════════════════╡
│ on              │
└─────────────────┘
&lt;/code&gt;&lt;/pre&gt;
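&lt;p&gt;If the setting reports &lt;code&gt;off&lt;/code&gt;, it can be enabled without editing &lt;code&gt;postgresql.conf&lt;/code&gt; by hand.  A sketch using &lt;code&gt;ALTER SYSTEM&lt;/code&gt;, which requires superuser:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ALTER SYSTEM SET track_io_timing = on;
SELECT pg_reload_conf();
&lt;/code&gt;&lt;/pre&gt;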
&lt;h2&gt;Test Data and Server&lt;/h2&gt;
&lt;p&gt;This post used a Postgres 15 instance on a
Digital Ocean &lt;a href="https://www.digitalocean.com/products/droplets"&gt;droplet&lt;/a&gt;
with 2 AMD CPU and 2 GB RAM.  I loaded Colorado OpenStreetMap data via
&lt;a href="https://pgosm-flex.com"&gt;PgOSM Flex&lt;/a&gt;.
The data loaded to the &lt;code&gt;osm&lt;/code&gt; schema
weighs in at 2.5 GB.  The &lt;code&gt;public&lt;/code&gt; schema, at 3.3 GB, has the raw data needed to enable
PgOSM Flex's &lt;code&gt;--replication&lt;/code&gt; feature via osm2pgsql-replication.  The advantage
of having more data than RAM is that it's pretty easy to show I/O timings, which I need
for this post!&lt;/p&gt;
</description><pubDate>Sat, 24 Jun 2023 05:01:00 GMT</pubDate><guid isPermaLink="true">https://blog.rustprooflabs.com/2023/06/postgres15-explain-buffer-temp-timings</guid></item><item><title>Postgres and Software Development: PGSQL Phriday #004</title><link>https://blog.rustprooflabs.com/2023/01/postgres-software-development</link><description>&lt;p&gt;This blog post is part of PGSQL Phriday #004.
Hettie Dombrovskaya
&lt;a href="https://hdombrovskaya.wordpress.com/2022/12/29/pgsql-phriday-004-postgresql-and-software-development/"&gt;is this month's host&lt;/a&gt;!  I was very excited to
see the topic chosen as Postgres and Software Development.
My contribution to this month's #pgsqlphriday topic covers how I manage
code through our development processes.
Check out &lt;a href="https://hdombrovskaya.wordpress.com/2022/12/29/pgsql-phriday-004-postgresql-and-software-development/"&gt;Hettie's post&lt;/a&gt;
for more about this month's topic.&lt;/p&gt;
&lt;h2&gt;Types of code&lt;/h2&gt;
&lt;p&gt;Before continuing on to the specific questions for the challenge,
I want to define the broad groupings of code I work with.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Mission critical&lt;/li&gt;
&lt;li&gt;Not trivial&lt;/li&gt;
&lt;li&gt;Trivial&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Mission critical code is where most of the "special sauce" is at. Mission
critical SQL code includes DDL commands to create the database structure
such as &lt;code&gt;CREATE TABLE foo&lt;/code&gt; and &lt;code&gt;CREATE VIEW baz&lt;/code&gt;.  This level of code
represents the data structure that enables &lt;strong&gt;everything else&lt;/strong&gt; to function.&lt;/p&gt;
</description><pubDate>Fri, 06 Jan 2023 05:01:00 GMT</pubDate><guid isPermaLink="true">https://blog.rustprooflabs.com/2023/01/postgres-software-development</guid></item><item><title>Relational and Non-relational Data: PGSQL Phriday #005</title><link>https://blog.rustprooflabs.com/2023/01/pgsqlphriday-005--postgres-relational-and-otherwise</link><description>&lt;p&gt;Welcome to the 5th installment of the #PGSQLPhriday blogging series. I am
thrilled to be this month's host!  The topic posts should be published by Friday February 3rd.&lt;/p&gt;
&lt;p&gt;When  Ryan Booz &lt;a href="https://www.pgsqlphriday.com/about/"&gt;proposed the idea&lt;/a&gt;
for #PGSQLPhriday I was immediately excited about it.
Other than our first names, Ryan and I have a few other
things in common. One of these is that we both started our database
careers in the world of MS SQL Server and later found our way to Postgres.
My move to Postgres, and &lt;strong&gt;why&lt;/strong&gt; I discovered Postgres, is at the heart of this
month's topic for PGSQL Phriday 005.&lt;/p&gt;
&lt;h2&gt;Is your data relational?&lt;/h2&gt;
&lt;p&gt;The entire reason I discovered and started using Postgres was
&lt;a href="https://postgis.net/"&gt;PostGIS&lt;/a&gt;.  I needed PostGIS because I had a project in 2011 that
could benefit from the OpenStreetMap data.
The project still needed rock solid support for relational data and the
SQL Standard, which Postgres also provides.
However, it was the spatial support of PostGIS that pulled me into the world of
Postgres.&lt;/p&gt;
</description><pubDate>Mon, 23 Jan 2023 05:01:00 GMT</pubDate><guid isPermaLink="true">https://blog.rustprooflabs.com/2023/01/pgsqlphriday-005--postgres-relational-and-otherwise</guid></item><item><title>Load the Right Amount of OpenStreetMap Data</title><link>https://blog.rustprooflabs.com/2023/08/load-right-amount-of-openstreetmap</link><description>&lt;p&gt;Populating a PostGIS database with OpenStreetMap data is my favorite way to start
a new geospatial project.  Loading a region of OpenStreetMap data provides you
with data ranging from roads and buildings to water features, amenities, and so much more!
The breadth and bulk of data is great, but it can turn into a hindrance, especially
for projects focused on smaller regions.
This post explores how to use &lt;a href="https://pgosm-flex.com/"&gt;PgOSM Flex&lt;/a&gt;
with custom layersets, multiple schemas, and osmium. The goal is to load
limited data for a larger region, while loading detailed data for a smaller,
target region.&lt;/p&gt;
&lt;p&gt;The larger region for this post will be the
&lt;a href="https://download.geofabrik.de/north-america/us/colorado.html"&gt;Colorado extract&lt;/a&gt;
from Geofabrik.  The smaller region will be the Fort Collins area, extracted
from the Colorado file.  The following image shows the data loaded in this post
with two maps side-by-side. The minimal data loaded for all of Colorado is shown
on the left and the full details of Fort Collins are on the right.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Image with two maps. The map on the left is a map showing most of Colorado showing place boundaries and major roadways.  The map on the right is a closer view in Fort Collins, Colorado, showing a portion of Colorado State University's campus and residential areas.  The map on the right has details of minor roadways, sidewalks, buildings, and even trees." src="/static/images/right-amount-of-data--co-and-foco.jpg" /&gt;&lt;/p&gt;
</description><pubDate>Thu, 24 Aug 2023 05:01:00 GMT</pubDate><guid isPermaLink="true">https://blog.rustprooflabs.com/2023/08/load-right-amount-of-openstreetmap</guid></item><item><title>Postgres Events: PgSQLPhriday #014</title><link>https://blog.rustprooflabs.com/2023/12/pgsqlphriday-014-postgres-events</link><description>&lt;p&gt;It is PgSQLPhriday time again! This month's event is &lt;a href="https://www.cybertec-postgresql.com/en/pgsql-phriday-014-postgresql-events/"&gt;PgSQLPhriday (#014)&lt;/a&gt;
and is hosted by Pavlo Golub.  I'm barely making the deadline, but didn't
want to miss this one!  Pavlo chose &lt;strong&gt;PostgreSQL Events&lt;/strong&gt; for the focus
for this month's topic. See &lt;a href="https://www.cybertec-postgresql.com/en/pgsql-phriday-014-postgresql-events/"&gt;his post&lt;/a&gt; for the full details.
As always, I can't wait to read the rest of the contributions this month.
This post addresses roughly three of his questions.&lt;/p&gt;
&lt;h2&gt;Networking&lt;/h2&gt;
&lt;p&gt;It just so happens, I finally got to meet Pavlo in person
at the &lt;a href="/2023/10/pass-2023-coming-soon"&gt;PASS 2023&lt;/a&gt; summit in Seattle, Washington! 👋&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Q: "Discuss the importance of networking. Have you formed valuable connections or partnerships as a result of these events?"&lt;/p&gt;
&lt;/blockquote&gt;
</description><pubDate>Fri, 01 Dec 2023 05:01:00 GMT</pubDate><guid isPermaLink="true">https://blog.rustprooflabs.com/2023/12/pgsqlphriday-014-postgres-events</guid></item><item><title>Postgres is Relational Plus</title><link>https://blog.rustprooflabs.com/2023/02/postgres-relational-plus</link><description>&lt;p&gt;I was &lt;a href="/2023/01/pgsqlphriday-005--postgres-relational-and-otherwise"&gt;the host for this month's #PGSQLPhriday&lt;/a&gt;
topic (#005), and decided on the topic question: &lt;strong&gt;Is your data relational?&lt;/strong&gt;
This is my submission on the topic, and how I use Postgres
for Relational Plus usages.&lt;/p&gt;
&lt;h2&gt;Non-relational data&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Q:&lt;/strong&gt; What non-relational data do you store in Postgres and how do you use it?&lt;/p&gt;
&lt;p&gt;PostGIS is the &lt;a href="/category/postgis"&gt;most prominent&lt;/a&gt;
non-relational data I am involved with.
Pretty much all of the PostGIS data I work with is rooted alongside
solidly relational data.  Want to know the demographics of customers within
10 miles of a specific location?  The location portion is spatial, the
demographic data is relational.&lt;/p&gt;
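&lt;p&gt;That type of question translates naturally to a spatial filter alongside a relational aggregate.  A sketch with hypothetical table and column names:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;SELECT c.age_group, COUNT(*) AS customers
FROM customer c
WHERE ST_DWithin(c.geog,
        ST_SetSRID(ST_MakePoint(-105.08, 40.57), 4326)::GEOGRAPHY,
        16093.4)  -- 10 miles in meters
GROUP BY c.age_group;
&lt;/code&gt;&lt;/pre&gt;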
</description><pubDate>Fri, 03 Feb 2023 05:01:00 GMT</pubDate><guid isPermaLink="true">https://blog.rustprooflabs.com/2023/02/postgres-relational-plus</guid></item><item><title>Getting started with MobilityDB</title><link>https://blog.rustprooflabs.com/2023/08/postgis-mobility-db</link><description>&lt;p&gt;The &lt;a href="https://mobilitydb.com/"&gt;MobilityDB&lt;/a&gt; is an exciting project I have been watching for a while.
The &lt;a href="https://github.com/MobilityDB/MobilityDB"&gt;project's README&lt;/a&gt; explains that MobilityDB
"adds support for temporal and spatio-temporal objects."  The spatio-temporal part
translates as PostGIS plus time which is very interesting stuff to me.
I had briefly experimented with the project in its early days and found
a lot of potential.  MobilityDB 1.0 was released in April 2022; at the time
of writing, the
&lt;a href="https://github.com/MobilityDB/MobilityDB/releases/tag/v1.1.0-alpha"&gt;1.1.0 alpha release&lt;/a&gt;
is available.&lt;/p&gt;
&lt;p&gt;This post explains how I got started with MobilityDB using PostGIS and pgRouting.
I'm using OpenStreetMap roads data with pgRouting to generate trajectories.
If you have gpx traces or other time-aware PostGIS data handy, those could be used
in place of the routes I create with pgRouting.&lt;/p&gt;
&lt;h2&gt;Install MobilityDB&lt;/h2&gt;
&lt;p&gt;When I started working on this post a few weeks ago I had an unexpectedly
difficult time trying to get MobilityDB working.
I was trying to install from the production branch with Postgres 15 and Ubuntu 22.04
and ran into a series of errors.  It turned out the fixes
to allow MobilityDB to work with these latest versions had been in the &lt;code&gt;develop&lt;/code&gt;
branch for more than a year.
After realizing what the problem was I asked a question &lt;a href="https://github.com/MobilityDB/MobilityDB/discussions/355"&gt;and got an answer&lt;/a&gt;. The fixes are now in the &lt;code&gt;master&lt;/code&gt; branch
tagged as &lt;a href="https://github.com/MobilityDB/MobilityDB/releases/tag/v1.1.0-alpha"&gt;1.1.0 alpha&lt;/a&gt;.
Thank you to everyone involved with making that happen!&lt;/p&gt;
&lt;p&gt;To install MobilityDB I'm following the &lt;a href="https://github.com/MobilityDB/MobilityDB#building--installation"&gt;instructions to install from source&lt;/a&gt;. These steps involve &lt;code&gt;git clone&lt;/code&gt;
then using &lt;code&gt;cmake&lt;/code&gt;, &lt;code&gt;make&lt;/code&gt;, and &lt;code&gt;sudo make install&lt;/code&gt;.
&lt;a href="https://github.com/MobilityDB/MobilityDB/discussions/355#discussioncomment-6550678"&gt;Esteban Zimanyi explained&lt;/a&gt;
they are working on getting packaging worked out for providing deb and yum installers.
It looks like work is progressing on those!&lt;/p&gt;
&lt;h2&gt;Update Configuration&lt;/h2&gt;
&lt;p&gt;After installing the extension, the &lt;code&gt;postgresql.conf&lt;/code&gt; needs to be updated to include
PostGIS in the &lt;code&gt;shared_preload_libraries&lt;/code&gt; and increase the &lt;code&gt;max_locks_per_transaction&lt;/code&gt;
to double the default value.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;shared_preload_libraries = 'postgis-3'
max_locks_per_transaction = 128
&lt;/code&gt;&lt;/pre&gt;
</description><pubDate>Tue, 15 Aug 2023 05:01:00 GMT</pubDate><guid isPermaLink="true">https://blog.rustprooflabs.com/2023/08/postgis-mobility-db</guid></item><item><title>Can you use ltree for Nested Place Data?</title><link>https://blog.rustprooflabs.com/2024/02/postgres-ltree-nested-places</link><description>&lt;p&gt;The topic of the &lt;a href="https://www.postgresql.org/docs/current/ltree.html"&gt;&lt;code&gt;ltree&lt;/code&gt; data type&lt;/a&gt;
has come up a few times recently.  This intersects with a common type of query
used in PostGIS: nested geometries. An example of nested geometries
is that the state of Colorado exists within the United States. The PgOSM Flex
project &lt;a href="https://pgosm-flex.com/query.html?#nested-admin-polygons"&gt;calculates and stores&lt;/a&gt;
nested polygon data from OpenStreetMap places into a handful of array
(&lt;code&gt;TEXT[]&lt;/code&gt;, &lt;code&gt;BIGINT[]&lt;/code&gt;) columns.
I decided to explore &lt;code&gt;ltree&lt;/code&gt; to see if it would be a suitable option for
PgOSM Flex nested places.&lt;/p&gt;
&lt;p&gt;Spoiler alert: &lt;code&gt;ltree&lt;/code&gt; is not suitable for OpenStreetMap data in the way I would
want to use it.&lt;/p&gt;
&lt;h2&gt;Nested data in arrays&lt;/h2&gt;
&lt;p&gt;The following is what "Colorado is in the U.S." would look like using a
Postgres &lt;code&gt;TEXT[]&lt;/code&gt; array:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;{"United States","Colorado"}
&lt;/code&gt;&lt;/pre&gt;
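&lt;p&gt;For comparison, the same path as an &lt;code&gt;ltree&lt;/code&gt; value hints at the trouble: &lt;code&gt;ltree&lt;/code&gt; labels only allow letters, numbers, and underscores, so real place names with spaces or punctuation must be rewritten.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;SELECT 'United_States.Colorado'::ltree;  -- works

SELECT 'United States.Colorado'::ltree;  -- fails, space is not a valid label character
&lt;/code&gt;&lt;/pre&gt;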
</description><pubDate>Thu, 29 Feb 2024 05:01:00 GMT</pubDate><guid isPermaLink="true">https://blog.rustprooflabs.com/2024/02/postgres-ltree-nested-places</guid></item><item><title>UUID in Postgres: PgSQLPhriday #015</title><link>https://blog.rustprooflabs.com/2024/01/postgres-uuid</link><description>&lt;p&gt;This month's PgSQLPhriday #015 topic is about UUIDs. Lætitia Avrot is
this month's host, see &lt;a href="https://mydbanotebook.org/post/uuid-fight/"&gt;her post&lt;/a&gt;
for the full challenge text.  The topic is described as a debate between the Database People
and Developers.  I'm not sure there's such a clean divide on people's opinions
on the topic, as I know plenty of Database People that have settled on using
UUIDs as their default.  Similarly, I know even more developer types that have
followed the arguably more conventional choice of using an auto-incrementing ID.&lt;/p&gt;
&lt;h2&gt;TLDR;&lt;/h2&gt;
&lt;p&gt;I avoid UUIDs. The only places I have used UUIDs in production
are the places where a 3rd party system is involved.&lt;/p&gt;
</description><pubDate>Wed, 31 Jan 2024 05:01:00 GMT</pubDate><guid isPermaLink="true">https://blog.rustprooflabs.com/2024/01/postgres-uuid</guid></item><item><title>Hosting a set of Postgres Demo databases</title><link>https://blog.rustprooflabs.com/2024/02/postgres-servers-for-demo</link><description>&lt;p&gt;In April 2023, I submitted my proposal for a
&lt;a href="/2023/11/pass-2023-precon--gis-queries-performance"&gt;full-day pre-conference at PASS 2023&lt;/a&gt;.
My chosen topic was focused on PostGIS, titled &lt;em&gt;GIS Data, Queries, and Performance&lt;/em&gt;.
A key part of my submission was that
the session would be an interactive, follow-along type design. Julie and I believe that
&lt;strong&gt;doing&lt;/strong&gt; is key to learning, so we wanted to reinforce that as much as possible.
The plan was to use real data and queries to teach a nuanced, technical topic to
an audience of unknown size or background.
I &lt;a href="/2023/10/pass-2023-coming-soon"&gt;also knew&lt;/a&gt;
that PASS is very much a Microsoft focused community.&lt;/p&gt;
&lt;p&gt;Knowing these things, I could not assume specific pre-existing knowledge about
Postgres and PostGIS. I also didn't want to assume they had a Postgres 15 instance
with PostGIS immediately available. I decided the best approach was to
provide each participant a demo Postgres database so they didn't have to worry about those steps.
These demo databases would be pre-loaded with the same
data and extensions I used for my demos.  This would allow participants to
run the same queries, on the same data, on the same general hardware.&lt;/p&gt;
&lt;p&gt;Of course, when my proposal was accepted then I realized I had to figure out
how I was actually going to deliver!  This post explains how I deployed
demo databases to the participants of my PASS 2023 pre-con session.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The net result was a reliable, secure (enough), scalable, and affordable setup.&lt;/p&gt;
&lt;/blockquote&gt;
</description><pubDate>Sat, 10 Feb 2024 05:01:00 GMT</pubDate><guid isPermaLink="true">https://blog.rustprooflabs.com/2024/02/postgres-servers-for-demo</guid></item><item><title>Improved Quality in OpenStreetMap Road Network for pgRouting</title><link>https://blog.rustprooflabs.com/2025/12/pgosm-flex-pgrouting-performance-quality-improvements</link><description>&lt;p&gt;Recent changes in the software bundled in &lt;a href="https://pgosm-flex.com/"&gt;PgOSM Flex&lt;/a&gt;
resulted in unexpected improvements when using OpenStreetMap roads data for routing.
The short story: routing with PgOSM Flex 1.2.0 is faster, easier, and
produces higher quality data for routing! I came to this conclusion after completing a
variety of testing with the old and new versions of PgOSM Flex. This post outlines
my testing and findings.&lt;/p&gt;
&lt;p&gt;The concern I had before this testing was that the variety of changes involved in preparing
data for routing in PgOSM Flex 1.2.0 might have degraded routing quality.
I am beyond thrilled with what I found instead. Quality of the generated
network didn't suffer at all; it was a major win!&lt;/p&gt;
&lt;h2&gt;What Changed?&lt;/h2&gt;
&lt;p&gt;The changes started with
&lt;a href="https://github.com/rustprooflabs/pgosm-flex/releases/tag/1.1.1"&gt;PgOSM Flex 1.1.1&lt;/a&gt;
by bumping internal versions used in PgOSM Flex to Postgres 18, PostGIS 3.6, osm2pgsql 2.2.0,
and Debian 13. No significant changes were expected to be bundled in that release.
After v1.1.1 was released, it came to my attention that
&lt;a href="https://github.com/pgRouting/pgrouting/releases/tag/v4.0.0"&gt;pgRouting 4.0 had been released&lt;/a&gt;
and that update broke the &lt;a href="https://pgosm-flex.com/routing.html"&gt;routing instructions&lt;/a&gt;
in PgOSM Flex's documentation. This was thankfully
&lt;a href="https://github.com/rustprooflabs/pgosm-flex/issues/408"&gt;reported by Travis Hathaway&lt;/a&gt;
who also helped verify the updates to the process.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;pgRouting 4 removed the &lt;code&gt;pgr_nodeNetwork&lt;/code&gt;, &lt;code&gt;pgr_createTopology&lt;/code&gt;, and &lt;code&gt;pgr_analyzeGraph&lt;/code&gt; functions.
Removing these functions was the catalyst for the changes made in PgOSM Flex 1.2.0.
I had used those &lt;code&gt;pgr_*&lt;/code&gt; functions as part of my core process in data preparation for routing
for as long as I have used pgRouting.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;After &lt;a href="https://github.com/rustprooflabs/pgosm-flex/pull/411"&gt;adjusting the documentation&lt;/a&gt;
it became clear there were
&lt;a href="https://github.com/rustprooflabs/pgosm-flex/issues/408#issuecomment-3675676507"&gt;performance issues&lt;/a&gt; using the replacement functions in pgRouting 4.0, namely in &lt;code&gt;pgr_separateTouching()&lt;/code&gt;.
The performance issue in the pgRouting function is reported as &lt;a href="https://github.com/pgRouting/pgrouting/issues/3010"&gt;pgrouting#3010&lt;/a&gt;.
Working through the performance challenges resulted in
&lt;a href="https://github.com/rustprooflabs/pgosm-flex/releases/tag/1.1.2"&gt;PgOSM Flex 1.1.2&lt;/a&gt;
and ultimately &lt;a href="https://github.com/rustprooflabs/pgosm-flex/releases/tag/1.2.0"&gt;PgOSM Flex 1.2.0&lt;/a&gt;
that now uses a custom procedure to prepare an edge network far better suited
to OpenStreetMap data.&lt;/p&gt;
</description><pubDate>Sun, 28 Dec 2025 05:01:00 GMT</pubDate><guid isPermaLink="true">https://blog.rustprooflabs.com/2025/12/pgosm-flex-pgrouting-performance-quality-improvements</guid></item><item><title>Local LLM with OpenWeb UI and Ollama</title><link>https://blog.rustprooflabs.com/2026/03/local-llm-openweb-ui-ollama</link><description>&lt;p&gt;Like much of the world, I have been exploring capabilities and realities of LLMs and other
generative tools for a while now. I am focused on using the
technology within the framing of my technology-focused work, along with my usual focus
on data privacy and ethics. I want basic
coding help (SQL, Python, Docker, PowerShell, DAX), ideation, writing boilerplate code,
and leveraging existing procedures. Naturally, I want this available offline
in a private and secure environment. I have been focused on running a local LLM with
&lt;a href="https://en.wikipedia.org/wiki/Retrieval-augmented_generation"&gt;RAG capabilities&lt;/a&gt;
and having control over what data goes where, and how it is used.
Especially data about my conversations with the generative LLM.&lt;/p&gt;
&lt;p&gt;This post collects my notes on my expectations and goals, outlines the components
I am currently using, and shares thoughts on my path forward.&lt;/p&gt;
</description><pubDate>Wed, 18 Mar 2026 05:01:00 GMT</pubDate><guid isPermaLink="true">https://blog.rustprooflabs.com/2026/03/local-llm-openweb-ui-ollama</guid></item></channel></rss>