Postgres 15: Explain Buffer now with Temp Timings
This post explores a helpful addition to Postgres 15's EXPLAIN output when using BUFFERS.
The Postgres 15 release notes list this item:
- Add EXPLAIN (BUFFERS) output for temporary file block I/O (Masahiko Sawada)
This improvement adds new detail to the output Postgres 15 provides when running EXPLAIN (ANALYZE, BUFFERS) <query>. This post explores the feature along with a couple of ways the reported I/O timing interacts with performance tuning.
Getting the feature
The first thing you need is to be running at least Postgres 15. Your instance also needs track_io_timing=on in your Postgres configuration file, postgresql.conf. Check the value of this setting with SHOW track_io_timing;.
SHOW track_io_timing;
┌─────────────────┐
│ track_io_timing │
╞═════════════════╡
│ on │
└─────────────────┘
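If it shows off, you can turn it on without hand-editing postgresql.conf. A minimal sketch using ALTER SYSTEM (requires superuser; track_io_timing takes effect with a configuration reload, no restart needed):

ALTER SYSTEM SET track_io_timing = on; -- persisted to postgresql.auto.conf
SELECT pg_reload_conf(); -- reload so the change takes effect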
Test Data and Server
This post used a Postgres 15 instance on a Digital Ocean droplet with 2 AMD CPUs and 2 GB RAM. I loaded Colorado OpenStreetMap data via PgOSM Flex. The data loaded to the osm schema weighs in at 2.5 GB. The public schema, at 3.3 GB, has the raw data needed to enable PgOSM Flex's --replication feature via osm2pgsql-replication. The advantage of having more data than RAM is that it's pretty easy to show I/O timings, which I need for this post!
SELECT s_name, table_count, view_count, size_plus_indexes
FROM dd.schemas
WHERE s_name IN ('osm', 'public')
;
┌────────┬─────────────┬────────────┬───────────────────┐
│ s_name │ table_count │ view_count │ size_plus_indexes │
╞════════╪═════════════╪════════════╪═══════════════════╡
│ public │ 8 │ 5 │ 3419 MB │
│ osm │ 41 │ 4 │ 2576 MB │
└────────┴─────────────┴────────────┴───────────────────┘
The dd.schemas view used above is part of the PgDD extension. PgDD is an easy-to-query data dictionary extension for Postgres built with the pgrx framework.
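For readers without PgDD installed, a rough equivalent of the size column can be pulled straight from the system catalogs. A minimal sketch that sums tables and materialized views, including indexes and TOAST:

SELECT n.nspname AS s_name,
        pg_size_pretty(SUM(pg_total_relation_size(c.oid))) AS size_plus_indexes
    FROM pg_class c
    INNER JOIN pg_namespace n ON n.oid = c.relnamespace
    WHERE n.nspname IN ('osm', 'public')
        AND c.relkind IN ('r', 'm') -- ordinary tables and materialized views
    GROUP BY n.nspname
;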
Timing Output from BUFFERS
The following query is an easy example for illustrating the new I/O timings output. The osm.building_polygon table has about 1.3 million rows and weighs in at 342 MB, which makes it pretty easy to get uncached data on this small server. Sorting is always expensive, and sorting by a GEOMETRY column of polygons ends up being far more expensive than sorting BIGINTs. This query causes a hefty chunk of data to be written to disk to perform the sort.
EXPLAIN (ANALYZE, BUFFERS, SETTINGS)
SELECT *
FROM osm.building_polygon
ORDER BY geom
;
The output below was taken after running this query a few times. There are 2 lines in this query plan reporting I/O Timings. The outermost detail reads:
I/O Timings: shared/local read=117.889, temp read=152.325 write=553.295
The I/O timings included in Postgres 14 and prior are now prefixed by shared/local in Postgres 15. This portion relates to reading data from disk, such as tables or indexes. The timings new to Postgres 15 follow temp, and represent where temp files are used. The temp file portion in this example is temp read=152.325 write=553.295. The query overall takes 3.4 seconds to return.
┌────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
╞════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
│ Sort (cost=417509.05..420756.57 rows=1299005 width=323) (actual time=2559.751..3202.282 rows=1293207 loops=1) │
│ Sort Key: geom │
│ Sort Method: external merge Disk: 318696kB │
│ Buffers: shared hit=566 read=43214, temp read=79656 written=79695 │
│ I/O Timings: shared/local read=117.889, temp read=152.325 write=553.295 │
│ -> Seq Scan on building_polygon (cost=0.00..56752.05 rows=1299005 width=323) (actual time=0.101..411.451 rows=1293207 loops=1) │
│ Buffers: shared hit=548 read=43214 │
│ I/O Timings: shared/local read=117.889 │
│ Settings: jit = 'off', max_parallel_workers_per_gather = '0', random_page_cost = '1.1', work_mem = '10MB' │
│ Planning Time: 0.174 ms │
│ Execution Time: 3356.681 ms │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
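It can help to translate the Buffers counts into sizes. Buffers are counted in blocks, 8 kB each with Postgres' default block size, so the counts in the plan above work out like this (a quick sanity-check query, not part of the tuning session):

SELECT pg_size_pretty(43214 * 8192::BIGINT) AS shared_read, -- table data read from disk
    pg_size_pretty(79695 * 8192::BIGINT) AS temp_written -- temp file blocks written
;

That's roughly 338 MB of table data read and 623 MB written to temp files. The temp figure is about double the 311 MB on the Sort Method line; an external merge rewrites data while merging runs, so total blocks written can exceed the final on-disk sort size.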
After running the same query a few more times, Postgres has cached the entire table into shared_buffers. The next output no longer has the shared/local timings; all that remains is the temp timings. We can see this has dropped the overall query time from 3.4 seconds to 2.7 seconds, a difference of 673 ms. The shared/local timing was reported as 118 ms, so the reported I/O timing represents less than 20% of the total timing reduction gained by removing the I/O in that step.
┌────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
╞════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
│ Sort (cost=417509.05..420756.57 rows=1299005 width=323) (actual time=2041.855..2561.969 rows=1293207 loops=1) │
│ Sort Key: geom │
│ Sort Method: external merge Disk: 318696kB │
│ Buffers: shared hit=43780, temp read=79656 written=79695 │
│ I/O Timings: temp read=110.007 write=430.901 │
│ -> Seq Scan on building_polygon (cost=0.00..56752.05 rows=1299005 width=323) (actual time=0.015..257.855 rows=1293207 loops=1) │
│ Buffers: shared hit=43762 │
│ Settings: jit = 'off', max_parallel_workers_per_gather = '0', random_page_cost = '1.1', work_mem = '10MB' │
│ Planning Time: 0.107 ms │
│ Execution Time: 2683.554 ms │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
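As an aside, repeatedly running the query is not the only way to warm the cache. The pg_prewarm extension from contrib can load a table into shared_buffers directly; a minimal sketch, assuming contrib is available on the instance:

CREATE EXTENSION IF NOT EXISTS pg_prewarm;

-- Returns the number of blocks loaded into shared_buffers.
SELECT pg_prewarm('osm.building_polygon');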
Interpreting the output
I end up talking with a lot of people about I/O timing while performance tuning. If I were presented with this query to tune on Postgres 14 or earlier, I would have focused on this line:
Sort Method: external merge Disk: 318696kB
That's telling us the ORDER BY geom requires about 311 MB of disk to perform the sort. With Postgres 14 and prior, that was all the information we got about that step. To "fix" that you'd need to set work_mem to at least that much in order to sort in memory. Remember, this instance only has 2 GB RAM, so that is not practical without increasing the instance's available RAM.
Let's assume adjusting configuration values is not an option for now. In a performance tuning scenario, the best we could do was make wild guesses about how much of the timing impact was due to that sort on disk. Quite a few people have heard me ask: "Well, how fast can your server read/write 300+ MB?" Often the answer ranges well into the GBs per second.
Starting with Postgres 15, we can pair that on-disk sort size with the new timing detail:
I/O Timings: temp read=110.007 write=430.901
Now we know that 541 milliseconds were spent writing and reading that temp file. That I/O time represents 20% of the total time (2.7 seconds), and we can assume that removing the I/O should typically remove at least that much time.
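The arithmetic behind those numbers, as a quick sanity check using values copied from the plan above:

SELECT 110.007 + 430.901 AS temp_io_ms, -- 540.908
    ROUND((110.007 + 430.901) / 2683.554 * 100) AS pct_of_total -- 20
;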
This new information enables more informed conversations about the tradeoffs between the changes we need to make and the performance we expect to gain. Is setting work_mem significantly higher worth reducing a 2.7 second query to 2.2 seconds? We might get a bigger benefit than that, but that is not guaranteed.
Enter a commonly heard phrase when asking for help in this area: "you have to test it." The exact combination of your data, your hardware, your configuration, and your query all matter.
Pushing up work_mem
Warning: This section makes changes that could lead to bad things happening with only 2 GB RAM available. Do not change work_mem in this extreme manner without being sure of what you are doing.
I first tried the query with SET work_mem='350MB'; (311 MB would be the bare minimum), yet it still used the same sort on disk. When I bumped work_mem up to 400 MB it finally did the trick. The plan below reports Memory: 389079kB, about 380 MB; the in-memory sort needs more space than the 311 MB written to disk, which explains why 350 MB was not enough.
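A sketch of the session-level experiment; SET only affects the current session, and RESET restores the configured value:

SET work_mem = '400MB';

EXPLAIN (ANALYZE, BUFFERS, SETTINGS)
SELECT *
FROM osm.building_polygon
ORDER BY geom
;

RESET work_mem;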
The output from the same query with 400 MB available finally does a quicksort in memory, shown in the following plan. Notice the timing dropped to 1.2 seconds from the previous 2.7 seconds. That's 1.5 seconds faster, not just the 541 ms reported as I/O timing! In this specific instance, the I/O timing represented a little more than a third of the timing difference this configuration change effected. The other important difference is the disk sort uses a merge algorithm while the in-memory sort uses quicksort.
┌────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
╞════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
│ Sort (cost=303084.20..306331.72 rows=1299005 width=323) (actual time=1000.621..1144.861 rows=1293207 loops=1) │
│ Sort Key: geom │
│ Sort Method: quicksort Memory: 389079kB │
│ Buffers: shared hit=43768 │
│ -> Seq Scan on building_polygon (cost=0.00..56752.05 rows=1299005 width=323) (actual time=0.014..230.784 rows=1293207 loops=1) │
│ Buffers: shared hit=43762 │
│ Settings: jit = 'off', max_parallel_workers_per_gather = '0', random_page_cost = '1.1', work_mem = '400MB' │
│ Planning Time: 0.096 ms │
│ Execution Time: 1233.101 ms │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
Even faster, don't sort by geometry
As I mentioned earlier, sorting by GEOMETRY is far more expensive than sorting by a BIGINT column. The following query changes ORDER BY geom to ORDER BY osm_id.
EXPLAIN (ANALYZE, BUFFERS, SETTINGS)
SELECT *
FROM osm.building_polygon
ORDER BY osm_id
;
This EXPLAIN output shows that, with work_mem=10MB, the query sorting by BIGINT (which is also the primary key) returns in 615 ms.
┌───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
╞═══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
│ Index Scan using pk_osm_building_polygon_osm_id on building_polygon (cost=0.43..71488.94 rows=1299005 width=323) (actual time=0.025..537.393 rows=1293207 loops=1) │
│ Buffers: shared hit=617087 │
│ Settings: jit = 'off', max_parallel_workers_per_gather = '0', random_page_cost = '1.1', work_mem = '10MB' │
│ Planning Time: 0.091 ms │
│ Execution Time: 615.988 ms │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
Summary
This post explored the new temp timing output included when using EXPLAIN (ANALYZE, BUFFERS). The new detail provides extra context to the performance tuning equation and can help focus tuning efforts.
For further reading behind this feature, see the discussion on the pgsql-hackers list.
Need help with your PostgreSQL servers or databases? Contact us to start the conversation!