Psycopg3 Initial Review
If you use Postgres and Python together you are almost certainly familiar with psycopg2. Daniele Varrazzo has been the maintainer of the psycopg project for many years, and in 2020 Daniele started working full-time on creating psycopg3, the successor to psycopg2. Recently, the Beta 1 release of psycopg3 was made available to install via PyPI. This post highlights two pieces of happy news with psycopg3:
- Migration is easy
- The connection pool rocks
As the first section shows, migration from psycopg2 to psycopg3 is quite easy. The majority of this post is dedicated to examining psycopg3's connection pool and the difference this feature can make to your application's performance.
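Before digging into migration, here is a minimal preview sketch of the pool's API using the psycopg_pool package; the connection string and query below are placeholders, not the setup used later in this post.

```python
from psycopg_pool import ConnectionPool

# Placeholder connection string; adjust for your environment.
pool = ConnectionPool("dbname=mydb user=myuser", min_size=2, max_size=10)

# Borrow a connection from the pool; it is returned automatically
# when the with block exits.
with pool.connection() as conn:
    row = conn.execute("SELECT now()").fetchone()
    print(row)

pool.close()
```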
Migration
Easy migration is important for encouraging developers to upgrade. It is frustrating when a "simple upgrade" turns into a cascade of error after error throughout your application. Luckily for us, psycopg3 got this part right! In the past week I fully migrated two projects to psycopg3 and started migrating two more. So far the friction has been very low and confined to edge-case uses.
The following is a simplified example of how my projects have used psycopg2.
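The project code itself isn't reproduced here; a representative sketch of the pattern looks something like the following, with a placeholder connection string and query. With psycopg3, the same function works after swapping the psycopg2 import (and connect() call) for psycopg.

```python
import psycopg2  # after migrating: import psycopg

def get_feature_count(db_string: str) -> int:
    """Run a simple aggregate query and return a single value."""
    # psycopg2.connect() and psycopg.connect() accept the same style of
    # connection string, so only the module name changes after migration.
    conn = psycopg2.connect(db_string)
    try:
        cur = conn.cursor()
        cur.execute("SELECT COUNT(*) FROM some_table;")  # placeholder query
        return cur.fetchone()[0]
    finally:
        conn.close()

if __name__ == "__main__":
    print(get_feature_count("dbname=mydb user=myuser"))
```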
Identify OpenStreetMap changes with Postgres
The data in the main OpenStreetMap database is constantly changing. Folks around the world are almost certainly saving changes via JOSM, iD, and other editors as you read these words. With edits happening all the time, it is often desirable to know what has actually changed. This post explores one approach to tracking changes to the tags attribute data once it has been loaded into Postgres.
The topic of this post surfaced while I was working on refreshing a project involving travel times (routing). In the process I noticed a few instances where the analysis had shifted significantly. My first hunch was that entire segments of road had been added or removed, but that was not the cause. It became apparent that tags in the area had been improved. It was easy to point specifically to the value associated with the highway key, but I also knew there were other changes happening; I just wasn't sure what all was involved and at what scale.
Calculate tag hash
The database I am working in has five (5) Colorado snapshots loaded, spanning back to 2018. The tags data is loaded into a table named osmts.tags; see my post Timescale, Compression and OpenStreetMap Tags for how this table was created. The tags table has one row for every OpenStreetMap feature and stores the full key/value attribute data in a JSONB column (osmts.tags.tags).
A relatively simple way to detect change in data is to calculate a hash of each feature's key/value data. Comparing those hashes identifies rows whose attribute data has changed.
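A minimal sketch of that idea, run through psycopg against the osmts.tags table described above; the osm_id column and the use of Postgres' md5() function here are assumptions made for illustration.

```python
import psycopg

# Hash the JSONB tags for each feature; osm_id is assumed to be the
# feature's identifier column in osmts.tags.
QUERY = """
    SELECT osm_id, md5(tags::TEXT) AS tag_hash
        FROM osmts.tags
        LIMIT 10;
"""

# Placeholder connection string.
with psycopg.connect("dbname=mydb") as conn:
    for osm_id, tag_hash in conn.execute(QUERY):
        print(osm_id, tag_hash)
```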
Timescale, Compression and OpenStreetMap Tags
This post captures my initial exploration with the TimescaleDB extension in Postgres. I have watched Timescale with interest for quite some time but had not really experimented with it before now. I am considering Timescale as another solid option for improving my long-term storage of OpenStreetMap data snapshots. Naturally, I am using PostGIS-enabled databases filled with OpenStreetMap data.
I started looking at restructuring our OpenStreetMap data with my post Why Partition OpenStreetMap data? That post has an overview of the historic use case I need to support. While my first attempt at declarative partitioning ran into a snag, my second attempt worked rather well. This post looks beyond my initial requirements for the project and identifies additional benefits of adding Timescale to our databases.
Timescale benefits
There are two main reasons I am looking into Timescale as an option over Postgres' built-in declarative partitioning:
- No need to manually create partitions
- Compression is tempting
New partitions with Postgres' declarative partitioning must be created manually. The syntax isn't terribly tricky and the process can be automated, but the task still exists and therefore still needs to be managed. When using Timescale's hypertables, new partitions are handled behind the scenes without my direct intervention. The other temptation from Timescale is their columnar-style compression on row-based data. In standard Postgres, compression only kicks in at the row level when a single row exceeds a specified size (default 2 kB); see my post on large text data in Postgres for a discussion of that compression. Timescale has been writing about their compression, so I figured it was time to give it a go. While compression wasn't one of the original goals I had outlined... it would be nice!
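For a concrete picture of both points, a rough sketch via psycopg is below; the osm_date column and the 30-day compression policy are assumptions for illustration, not the configuration used in this post.

```python
import psycopg

# Placeholder connection string.
with psycopg.connect("dbname=mydb") as conn:
    # Define the hypertable once; TimescaleDB then creates new chunks
    # (partitions) automatically as rows arrive. The osm_date column is
    # an assumed partitioning column.
    conn.execute("SELECT create_hypertable('osmts.tags', 'osm_date');")

    # Opt the table in to Timescale's columnar-style compression and
    # compress chunks older than 30 days (an arbitrary choice here).
    conn.execute("ALTER TABLE osmts.tags SET (timescaledb.compress);")
    conn.execute(
        "SELECT add_compression_policy('osmts.tags', INTERVAL '30 days');"
    )
```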
Postgres Permissions and Materialized Views
Materialized views in Postgres are a handy way to persist the result of a query to disk. This is helpful when the underlying query is expensive and slow, yet high-performance SELECT queries are required.
Materialized views need to be explicitly refreshed to show changes to the underlying table. This is done with the REFRESH MATERIALIZED VIEW <name>; syntax.
Keeping materialized views regularly refreshed is often delegated to a scheduled cron job. There is also often a need for database users to manually refresh the data on demand. At this point many users stub their toe on permissions, because refreshing a materialized view can only be done by the owner of the materialized view. This post uses a simple example to illustrate how to delegate refresh permissions to other Postgres roles.
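One common pattern, sketched here through psycopg, is to make a dedicated group role the owner of the materialized view and grant that role to anyone who needs to refresh it; the role, view, and user names below are placeholders.

```python
import psycopg

STATEMENTS = [
    # Dedicated group role that will own the materialized view.
    "CREATE ROLE mv_refresh NOLOGIN;",
    # Transfer ownership of the (placeholder) materialized view.
    "ALTER MATERIALIZED VIEW report_mv OWNER TO mv_refresh;",
    # Members of mv_refresh can now run REFRESH MATERIALIZED VIEW report_mv;
    "GRANT mv_refresh TO report_user;",
]

# Run as a superuser or the current owner of the materialized view.
with psycopg.connect("dbname=mydb user=postgres") as conn:
    for statement in STATEMENTS:
        conn.execute(statement)
```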
Use BIGINT in Postgres
This post examines a common database design decision involving the choice of using BIGINT versus INT data types. You may already know that the BIGINT data type uses twice the storage on disk (8 bytes per value) compared to the INT data type (4 bytes per value). Knowing this, a common decision is to use INT wherever possible, only resorting to BIGINT when it is obvious* that the column will store values greater than 2.147 billion (the max of INT).
That's what I did too, until 2-3 years ago! I started changing my default mindset to using BIGINT over INT, reversing my long-held habit. This post explains why I default to using BIGINT and examines the performance impacts of that decision.
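In practice, "defaulting to BIGINT" simply means reaching for it in new table definitions; a hypothetical example via psycopg:

```python
import psycopg

# Hypothetical table; the identity column uses BIGINT (8 bytes per value)
# instead of INT (4 bytes per value) to avoid ever hitting the INT ceiling.
CREATE_TABLE = """
    CREATE TABLE events (
        id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        event_time TIMESTAMPTZ NOT NULL DEFAULT now(),
        details JSONB
    );
"""

# Placeholder connection string.
with psycopg.connect("dbname=mydb") as conn:
    conn.execute(CREATE_TABLE)
```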
TL;DR
As I conclude at the end:
The tests I ran here show that a production-scale database with properly sized hardware can handle that slight overhead with no problem.
Why default to BIGINT?
The main reason to default to BIGINT is to avoid INT to BIGINT migrations. The need for an INT to BIGINT migration comes up at the least opportune time, and the task is time-consuming. This type of migration typically involves at least one column used as a PRIMARY KEY, which is often referenced elsewhere as a FOREIGN KEY in other table(s) that must also be migrated.
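To make the pain concrete, a simplified sketch of such a migration via psycopg is below; the table and column names are hypothetical, and a real migration on a large table also has to contend with long table rewrites, locking, and downtime windows.

```python
import psycopg

# Each ALTER rewrites the entire table under an ACCESS EXCLUSIVE lock,
# which is what makes this so disruptive on large production tables.
# Table and column names are hypothetical.
MIGRATION = [
    "ALTER TABLE orders ALTER COLUMN id TYPE BIGINT;",
    # Any table referencing orders.id as a FOREIGN KEY must be migrated too.
    "ALTER TABLE order_items ALTER COLUMN order_id TYPE BIGINT;",
]

# Placeholder connection string.
with psycopg.connect("dbname=mydb") as conn:
    for statement in MIGRATION:
        conn.execute(statement)
```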
In the spirit of defensive database design, BIGINT is the safest choice. Remember the *obvious part mentioned above? Planning and estimating are difficult, and people (myself included) get it wrong all the time!
Yes, there is overhead for using BIGINT, but I believe the overhead associated with the extra 4 bytes is trivial for the majority of production databases.