Postgres 15: Explain Buffer now with Temp Timings
This post explores a helpful addition to Postgres 15's EXPLAIN output when using BUFFERS.
The Postgres 15 release notes list this item:
- Add EXPLAIN (BUFFERS) output for temporary file block I/O (Masahiko Sawada)
This improvement adds new detail to the output Postgres 15 provides when running EXPLAIN (ANALYZE, BUFFERS) <query>. This post explores the feature along with a couple of ways the reported I/O timing interacts with performance tuning.
Getting the feature
The first thing you need is to be running at least Postgres 15. Your instance also needs track_io_timing=on in your Postgres configuration file, postgresql.conf. Check the value of this setting with SHOW track_io_timing;.
SHOW track_io_timing;
┌─────────────────┐
│ track_io_timing │
╞═════════════════╡
│ on │
└─────────────────┘
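If it shows off, you can turn it on without hand-editing postgresql.conf. A minimal sketch using ALTER SYSTEM (requires superuser; track_io_timing takes effect with a configuration reload, no restart needed):

ALTER SYSTEM SET track_io_timing = on; -- persisted to postgresql.auto.conf
SELECT pg_reload_conf(); -- reload so the change takes effect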
Test Data and Server
This post used a Postgres 15 instance on a Digital Ocean droplet with 2 AMD CPUs and 2 GB RAM. I loaded Colorado OpenStreetMap data via PgOSM Flex. The data loaded to the osm schema weighs in at 2.5 GB. The public schema, at 3.3 GB, has the raw data needed to enable PgOSM Flex's --replication feature via osm2pgsql-replication. The advantage of having more data than RAM is that it's pretty easy to show I/O timings, which I need for this post!
SELECT s_name, table_count, view_count, size_plus_indexes
FROM dd.schemas
WHERE s_name IN ('osm', 'public')
;
┌────────┬─────────────┬────────────┬───────────────────┐
│ s_name │ table_count │ view_count │ size_plus_indexes │
╞════════╪═════════════╪════════════╪═══════════════════╡
│ public │ 8 │ 5 │ 3419 MB │
│ osm │ 41 │ 4 │ 2576 MB │
└────────┴─────────────┴────────────┴───────────────────┘
The dd.schemas view used above is part of the PgDD extension. PgDD is an easy-to-query data dictionary extension for Postgres built with the pgrx framework.
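For readers without PgDD installed, a rough equivalent of the size column can be pulled straight from the system catalogs. A minimal sketch that sums tables and materialized views, including indexes and TOAST:

SELECT n.nspname AS s_name,
        pg_size_pretty(SUM(pg_total_relation_size(c.oid))) AS size_plus_indexes
    FROM pg_class c
    INNER JOIN pg_namespace n ON n.oid = c.relnamespace
    WHERE n.nspname IN ('osm', 'public')
        AND c.relkind IN ('r', 'm') -- ordinary tables and materialized views
    GROUP BY n.nspname
;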
Timing Output from BUFFERS
The following query is an easy example for illustrating the new I/O timings output. The osm.building_polygon table has about 1.3 million rows and weighs in at 342 MB, which makes it pretty easy to get uncached data on this small server. Sorting is always expensive, and sorting by a GEOMETRY column of polygons ends up being far more expensive than sorting BIGINTs. This query causes a hefty chunk of data to be written to disk to perform the sort.
EXPLAIN (ANALYZE, BUFFERS, SETTINGS)
SELECT *
FROM osm.building_polygon
ORDER BY geom
;
The output below was taken after running this query a few times. There are 2 lines in this query plan reporting I/O Timings. The outermost detail reads:
I/O Timings: shared/local read=117.889, temp read=152.325 write=553.295
The I/O timings included in Postgres 14 and prior are now prefixed by shared/local in Postgres 15. This portion relates to reading data from disk, such as tables or indexes. The timings new to Postgres 15 follow temp, and represent where temp files are used. The temp file portion in this example is temp read=152.325 write=553.295. The query overall takes 3.4 seconds to return.
┌────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
╞════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
│ Sort (cost=417509.05..420756.57 rows=1299005 width=323) (actual time=2559.751..3202.282 rows=1293207 loops=1) │
│ Sort Key: geom │
│ Sort Method: external merge Disk: 318696kB │
│ Buffers: shared hit=566 read=43214, temp read=79656 written=79695 │
│ I/O Timings: shared/local read=117.889, temp read=152.325 write=553.295 │
│ -> Seq Scan on building_polygon (cost=0.00..56752.05 rows=1299005 width=323) (actual time=0.101..411.451 rows=1293207 loops=1) │
│ Buffers: shared hit=548 read=43214 │
│ I/O Timings: shared/local read=117.889 │
│ Settings: jit = 'off', max_parallel_workers_per_gather = '0', random_page_cost = '1.1', work_mem = '10MB' │
│ Planning Time: 0.174 ms │
│ Execution Time: 3356.681 ms │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
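It can help to translate the Buffers counts into sizes. Buffers are counted in blocks, 8 kB each with Postgres' default block size, so the counts in the plan above work out like this (a quick sanity-check query, not part of the tuning session):

SELECT pg_size_pretty(43214 * 8192::BIGINT) AS shared_read, -- table data read from disk
    pg_size_pretty(79695 * 8192::BIGINT) AS temp_written -- temp file blocks written
;

That's roughly 338 MB of table data read and 623 MB written to temp files. The temp figure is about double the 311 MB on the Sort Method line; an external merge rewrites data while merging runs, so total blocks written can exceed the final on-disk sort size.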
After running the same query a few more times, Postgres has cached the entire table into shared_buffers. The next output no longer has the shared/local timings; all that remains is the temp timings. We can see this has dropped the overall query time from 3.4 seconds to 2.7 seconds, a difference of 673 ms. The shared/local timing was reported as 118 ms, so the reported I/O timing represents less than 20% of the total timing reduction gained by removing the I/O in that step.
┌────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
╞════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
│ Sort (cost=417509.05..420756.57 rows=1299005 width=323) (actual time=2041.855..2561.969 rows=1293207 loops=1) │
│ Sort Key: geom │
│ Sort Method: external merge Disk: 318696kB │
│ Buffers: shared hit=43780, temp read=79656 written=79695 │
│ I/O Timings: temp read=110.007 write=430.901 │
│ -> Seq Scan on building_polygon (cost=0.00..56752.05 rows=1299005 width=323) (actual time=0.015..257.855 rows=1293207 loops=1) │
│ Buffers: shared hit=43762 │
│ Settings: jit = 'off', max_parallel_workers_per_gather = '0', random_page_cost = '1.1', work_mem = '10MB' │
│ Planning Time: 0.107 ms │
│ Execution Time: 2683.554 ms │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
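As an aside, repeatedly running the query is not the only way to warm the cache. The pg_prewarm extension from contrib can load a table into shared_buffers directly; a minimal sketch, assuming contrib is available on the instance:

CREATE EXTENSION IF NOT EXISTS pg_prewarm;

-- Returns the number of blocks loaded into shared_buffers.
SELECT pg_prewarm('osm.building_polygon');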
Interpreting the output
I end up talking with a lot of people about I/O timing while performance tuning. If I were presented with this query to tune on Postgres 14 or earlier, I would have focused on this line:
Sort Method: external merge Disk: 318696kB
That's telling us the ORDER BY geom requires about 311 MB of disk to perform the sort. With Postgres 14 and prior, that was all the information we got about that step. To "fix" that you'd need to set work_mem to at least that much in order to sort in memory. Remember, this instance only has 2 GB RAM, so that is not practical without increasing the instance's available RAM.
Let's assume adjusting configuration values is not an option for now. In a performance tuning scenario, the best we could do was make wild guesses about how much of the timing impact was due to that sort on disk. Quite a few people have heard me ask: "Well, how fast can your server read/write 300+ MB?" Often the answer ranges well into the GBs per second.
Starting with Postgres 15, we can pair that on-disk sort size with the new timing detail:
I/O Timings: temp read=110.007 write=430.901
Now we know that 541 milliseconds were spent writing and reading that temp file. That I/O time represents 20% of the total time (2.7 seconds), and we can assume that removing the I/O should typically remove at least that much time.
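The arithmetic behind those numbers, as a quick sanity check using values copied from the plan above:

SELECT 110.007 + 430.901 AS temp_io_ms, -- 540.908
    ROUND((110.007 + 430.901) / 2683.554 * 100) AS pct_of_total -- 20
;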
This new information enables more informed conversations about the tradeoffs between the changes we need to make and the performance we expect to gain. Is setting work_mem significantly higher worth reducing a 2.7 second query to 2.2 seconds? We might get a bigger benefit than that, but that is not guaranteed.
Enter a commonly heard phrase when asking for help in this area: "you have to test it." The exact combination of your data, your hardware, your configuration, and your query all matter.
Pushing up work_mem
Warning: This section makes changes that could lead to bad things happening with only 2 GB RAM available. Do not change work_mem in this extreme manner without being sure of what you are doing.
I first tried the query with SET work_mem='350MB'; (311 MB would be the bare minimum), yet it still used the same sort on disk. When I bumped work_mem up to 400 MB it finally did the trick. The plan below reports Memory: 389079kB, about 380 MB; the in-memory sort needs more space than the 311 MB written to disk, which explains why 350 MB was not enough.
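A sketch of the session-level experiment; SET only affects the current session, and RESET restores the configured value:

SET work_mem = '400MB';

EXPLAIN (ANALYZE, BUFFERS, SETTINGS)
SELECT *
FROM osm.building_polygon
ORDER BY geom
;

RESET work_mem;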
The output from the same query with 400 MB available finally does a quicksort in memory, shown in the following plan. Notice the timing dropped to 1.2 seconds from the previous 2.7 seconds. That's 1.5 seconds faster, not just the 541 ms reported as I/O timing! In this specific instance, the I/O timing represented a little more than a third of the timing difference this configuration change effected. The other important difference is the disk sort uses a merge algorithm while the in-memory sort uses quicksort.
┌────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
╞════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
│ Sort (cost=303084.20..306331.72 rows=1299005 width=323) (actual time=1000.621..1144.861 rows=1293207 loops=1) │
│ Sort Key: geom │
│ Sort Method: quicksort Memory: 389079kB │
│ Buffers: shared hit=43768 │
│ -> Seq Scan on building_polygon (cost=0.00..56752.05 rows=1299005 width=323) (actual time=0.014..230.784 rows=1293207 loops=1) │
│ Buffers: shared hit=43762 │
│ Settings: jit = 'off', max_parallel_workers_per_gather = '0', random_page_cost = '1.1', work_mem = '400MB' │
│ Planning Time: 0.096 ms │
│ Execution Time: 1233.101 ms │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
Even faster, don't sort by geometry
As I mentioned earlier, sorting by GEOMETRY is far more expensive than sorting by a BIGINT column. The following query changes ORDER BY geom to ORDER BY osm_id.
EXPLAIN (ANALYZE, BUFFERS, SETTINGS)
SELECT *
FROM osm.building_polygon
ORDER BY osm_id
;
This EXPLAIN output shows that, with work_mem=10MB, the query sorting by BIGINT (which is also the primary key) returns in 615 ms.
┌───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
╞═══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
│ Index Scan using pk_osm_building_polygon_osm_id on building_polygon (cost=0.43..71488.94 rows=1299005 width=323) (actual time=0.025..537.393 rows=1293207 loops=1) │
│ Buffers: shared hit=617087 │
│ Settings: jit = 'off', max_parallel_workers_per_gather = '0', random_page_cost = '1.1', work_mem = '10MB' │
│ Planning Time: 0.091 ms │
│ Execution Time: 615.988 ms │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
Summary
This post explored the new temp timing output included when using EXPLAIN (ANALYZE, BUFFERS). The new detail provides extra context to the performance tuning equation and can help focus tuning efforts.
For further reading behind this feature, see the discussion on the pgsql-hackers list.
Need help with your PostgreSQL servers or databases? Contact us to start the conversation!