Timescale, Compression and OpenStreetMap Tags
This post captures my initial exploration with the Timescale DB extension in Postgres. I have watched Timescale with interest for quite some time but had not really experimented with it before now. I am considering Timescale as another solid option for improving my long-term storage of OpenStreetMap data snapshots. Naturally, I am using PostGIS enabled databases filled with OpenStreetMap data.
I started looking at restructuring our OpenStreetMap data with my post Why Partition OpenStreetMap data? That post has an overview of the historic use case I need to support. While my 1st attempt at declarative partitioning ran into a snag, my 2nd attempt worked rather well. This post looks beyond my initial requirements for the project and establishes additional benefits from adding Timescale into our databases.
There are two main reasons I am looking into Timescale as an option over Postgres' built-in declarative partitioning:
- No need to manually create partitions
- Compression is tempting
New partitions with Postgres' declarative partitioning must be created manually. The syntax isn't terribly tricky and the process can be automated, but it still exists therefore it still needs to be managed. When using Timescale's hypertables new partitions are handled behind the scenes without my direct intervention. The other temptation from Timescale is their columnar-style compression on row-based data. In standard Postgres, the only time compression kicks in is at the row level when a single row will exceed a specified size (default 2kb). See my post on large text data in Postgres that discusses compression in Postgres. Timescale has been writing about their compression so I figured it was time to give it a go. While compression wasn't one of the original goals I had outlined... it would be nice!!
Postgres Permissions and Materialized Views
Materialized views in Postgres are a handy way to persist the result
of a query to disk. This is helpful when the underlying query
is expensive and slow yet high performance
SELECT queries are required.
When using materialized views
they need to be explicitly refreshed to show changes to the underlying
table. This is done through the
REFRESH MATERIALIZED VIEW <name>;
Keeping materialized views regularly refreshed is often a delegated to a cron job on a schedule. There is also often a need for database users to manually refresh the data on demand. At this point many users stub their toe on permissions because refreshing a materialized view can only be done by the owner of the materialized view. This post uses a simple example to illustrate how to delegate refresh permissions to other Postgres roles.
Use BIGINT in Postgres
This post examines a common database design decision
involving the choice of using
INT data types.
You may already know that the
BIGINT data type uses
twice the storage on disk (8 bytes per value) compared to
INT data type (4 bytes per value).
Knowing this, a common
decision is to use
INT wherever possible, only resorting to using
BIGINT when it was obvious*
that the column will be storing
values greater than 2.147 Billion (the
That's what I did too, until 2-3 years ago!
I started changing my default mindset to using
reversing my long-held habit.
This post explains why I default to using
and examine the performance impacts of the decision.
As I conclude at the end:
The tests I ran here show that a production-scale database with properly sized hardware can handle that slight overhead with no problem.
Why default to
The main reason to default to
BIGINT is to avoid
BIGINT migrations. The need to do an
BIGINT migration comes up at the
least opportune time and the task is time consuming.
This type of migration typically involves at least one column used
PRIMARY KEY and that is often used elsewhere as a
on other table(s) that must also be migrated.
In the spirit of defensive database design,
is the safest choice. Remember the *obvious part mentioned
above? Planning and estimating is a difficult topic and
people (myself included) get it wrong all the time!
Yes, there is overhead for using
but I believe the overhead associated with the extra 4 bytes
is trivial for the majority of production databases.
OpenStreetMap to PostGIS is getting lighter
If you have ever wanted OpenStreetMap data in Postgres/PostGIS, you are probably familiar with the osm2pgsql tool. Lately I have been writing about the osm2pgsql developments with the new Flex output and how it is enabling improved data quality. This post changes focus away from the flex output and examines the performance of the osm2pgsql load itself.
One challenge with osm2pgsql over the years has been generic
recommendations have been difficult to make. The safest recommendation
for nearly any combination of hardware and source data size was
osm2pgsql --slim --drop to put most of the intermediate data
into Postgres instead of relying directly on RAM, which it needed a lot of.
This choice has offsetting costs of putting all that data into Postgres (only to be deleted) in terms of disk usage and I/O performance.
A few days ago, a pull request from Jochen Topf to create a new RAM middle caught my eye. The text that piqued my interest (emphasis mine):
When not using two-stage processing the memory requirements are much much smaller than with the old ram middle. Rule of thumb is, you'll need about 1GB plus 2.5 times the size of the PBF file as memory. This makes it possible to import even continent-sized data on reasonably-sized machines.
Wait... what?! Is this for real??
(Webinar) OpenStreetMap to PostGIS: Easier and Better!
This page has the resources and recording for the OpenStreetMap to PostGIS: Easier and Better! webinar from Wednesday March 31, 2021.
Downloads for session
Scripts used for the demo: