Postgres Data Dictionary for everyone
A data dictionary is an important tool for anyone that stores and consumes data. The PgDD extension makes it easy to inspect and explore your data structures in Postgres. This post shows how PgDD provides access to current and accurate information about your databases for a variety of users:
- Analysts
- DBAs and Developers
- The Business
This data dictionary information from PgDD is made available using standard SQL by querying a small set of views.
Background
Relational databases, including Postgres, track the majority of the information
needed for a data dictionary. This is done in the underlying
system catalogs;
Postgres' system catalogs are in the pg_catalog
schema.
The challenge with using the system catalogs is they are not very
user friendly to query for the type of details commonly needed.
PgDD does not do anything magical, it is simply
a wrapper around the Postgres system catalogs!
Similar information can be found using the
psql
client with its slash commands. These slash commands (e.g.\dt
,\dn
, etc.) are handy when working withinpsql
but limits use topsql
.
The interface to PgDD is simply using standard SQL queries
against views in the dd
schema.
The basic idea of what's available in PgDD can be found by querying the
dd.views
view for the views in the dd
schema.
This shows the five (5) PgDD views currently available, columns
, functions
, schemas
, tables
and views
.
SELECT s_name, v_name, description
FROM dd.views
WHERE s_name = 'dd'
ORDER BY v_name
;
┌────────┬───────────┬────────────────────────────────────────────────────────────────────┐
│ s_name │ v_name │ description │
╞════════╪═══════════╪════════════════════════════════════════════════════════════════════╡
│ dd │ columns │ Data dictionary view: Lists columns, excluding system columns. │
│ dd │ functions │ Data dictionary view: Lists functions, excluding system functions. │
│ dd │ schemas │ Data dictionary view: Lists schemas, excluding system schemas. │
│ dd │ tables │ Data dictionary view: Lists tables, excluding system tables. │
│ dd │ views │ Data dictionary view: Lists views, excluding system views. │
└────────┴───────────┴────────────────────────────────────────────────────────────────────┘
The current version is PgDD 0.4.0, for more about the history and progression of PgDD read here and here.
For the Analyst
Analysts need to understand the data they are querying and using, ideally within the tools they are most comfortable with. Their understanding of the data enables them to accurately analyze and disseminate data. Some of the tools I use with my analyst hat on are psql, DBeaver, Python (Jupyter Notebooks) and QGIS. PgDD can be queried from any of these tools with ease.
Describe a column
To illustrate a common scenario for an analyst,
I pulled out a query from my post
Find missing crossings in OpenStreetMap with PostGIS,
the query below is from Step 4a. Imagine you are reviewing this code
and found yourself wondering "What does t.osm_type
mean?"
-- Crossing aggregates
WITH d AS (
SELECT p.osm_id, COUNT(*) AS crossings
FROM foot_points p
INNER JOIN osm.traffic_point t
ON t.geom = p.geom AND t.osm_type = 'crossing'
GROUP BY p.osm_id
)
UPDATE foot_roads r
SET crossings = d.crossings
FROM d
WHERE r.osm_id = d.osm_id
;
I was the author of this query, and I created the data structures used by PgOSM Flex. Even still, I only have a basic remembering of what that particular column means. For the exact, nuanced details I still need to check the documentation.
A quick query using the dd.columns
view can tell us the data type (text
)
and provide the description for the column. This description informs
about specific values in osm_type
and their interaction
with the osm_subtype
column. The best part is no matter where I was writing
the SQL query that prompted the question, I can simply add a query
against the dd
view to answer the question. Then get back to my
queries, all without leaving my tool of choice.
SELECT c_name, data_type, description
FROM dd.columns
WHERE s_name = 'osm'
AND t_name = 'traffic_point'
AND c_name = 'osm_type'
;
┌──────────┬───────────┬──────────────────────────────────────────────────────────────────────────────┐
│ c_name │ data_type │ description │
╞══════════╪═══════════╪══════════════════════════════════════════════════════════════════════════════╡
│ osm_type │ text │ Value of the main key associated with traffic details. If osm_subtype IS NU…│
│ │ │…LL then key = "highway" or key = "noexit". Otherwise the main key is the va…│
│ │ │…lue stored in osm_type while osm_subtype has the value for the main key. │
└──────────┴───────────┴──────────────────────────────────────────────────────────────────────────────┘
The
description
of database objects are added using Postgres'COMMENT ON
syntax. Database objects are ideally documented as they are created. Though, if your objects aren't well documented today that is something easy to start improving!
Finding data
Another question I have found myself asking is "where else can I find this
data?" Sometimes I know a column exists but do not know exactly which tables
or views it is in. For example, I know some of the tables loaded by
PgOSM Flex have the address
column and I want to know which ones.
An easy query with PgDD and I have a list of tables, views and
materialized views, each with an address
column.
SELECT s_name, t_name, source_type, data_type
FROM dd.columns
WHERE s_name = 'osm'
AND c_name = 'address'
;
┌────────┬──────────────────┬───────────────────┬───────────┐
│ s_name │ t_name │ source_type │ data_type │
╞════════╪══════════════════╪═══════════════════╪═══════════╡
│ osm │ poi_line │ table │ text │
│ osm │ poi_point │ table │ text │
│ osm │ poi_polygon │ table │ text │
│ osm │ amenity_line │ table │ text │
│ osm │ amenity_point │ table │ text │
│ osm │ amenity_polygon │ table │ text │
│ osm │ building_point │ table │ text │
│ osm │ building_polygon │ table │ text │
│ osm │ shop_point │ table │ text │
│ osm │ shop_polygon │ table │ text │
│ osm │ vbuilding_all │ view │ text │
│ osm │ vshop_all │ view │ text │
│ osm │ vpoi_all │ materialized view │ text │
└────────┴──────────────────┴───────────────────┴───────────┘
For the DBA and Developer
The examples above for the analyst use cases can easily apply to the DBA and developer roles as well. These more technical roles, responsible for creating and maintaining the data structures, also care about details regarding the size and shape of the data contained.
The dd.schemas
view gives a nice overview of the contents within a database
grouped by schema name (s_name
).
This view includes the size on disk, and provides counts of
tables, views and functions.
This view is useful to get a big picture view of a database, and is
especially helpful if the database is well organized into logical schemas.
The following query filters for three (3) specific schemas,
the results show that the osm
schema has the bulk of the tables (41)
and is approaching 2 GB in size.
SELECT s_name, size_plus_indexes,
table_count, view_count, function_count,
description
FROM dd.schemas
WHERE s_name IN ('osm', 'osm_routing', 'public')
;
┌────────────┬───────────────────┬─────────────┬────────────┬────────────────┬────────────────────────────────────────┐
│ s_name │ size_plus_indexes │ table_count │ view_count │ function_count │ description │
╞════════════╪═══════════════════╪═════════════╪════════════╪════════════════╪════════════════════════════════════════╡
│ osm │ 1763 MB │ 41 │ 6 │ 3 │ Schema populated by PgOSM-Flex. SELEC…│
│ │ │ │ │ │…T * FROM osm.pgosm_flex; for details. │
│ osm_routin…│ 239 MB │ 3 │ 0 │ 0 │ ¤ │
│…g │ │ │ │ │ │
│ public │ 7544 kB │ 3 │ 2 │ 1051 │ standard public schema │
└────────────┴───────────────────┴─────────────┴────────────┴────────────────┴────────────────────────────────────────┘
Tables
Another common task is tracking tables based on their size. Table size
is commonly discussed using row count, but the size on disk is also important.
The next query returns the five (5) largest
tables by size on disk along with their row counts and bytes_per_row
estimate. Saving snapshots of this data over time is helpful to track and manage data growth.
SELECT s_name, t_name, rows, size_pretty, size_plus_indexes, bytes_per_row
FROM dd.tables
ORDER BY size_bytes DESC
LIMIT 5
;
┌─────────────┬──────────────────┬─────────┬─────────────┬───────────────────┬───────────────┐
│ s_name │ t_name │ rows │ size_pretty │ size_plus_indexes │ bytes_per_row │
╞═════════════╪══════════════════╪═════════╪═════════════╪═══════════════════╪═══════════════╡
│ osm │ tags │ 3076011 │ 538 MB │ 631 MB │ 184 │
│ osm │ road_line │ 770203 │ 220 MB │ 284 MB │ 300 │
│ osm │ building_polygon │ 771073 │ 203 MB │ 261 MB │ 276 │
│ osm │ water_line │ 434721 │ 180 MB │ 216 MB │ 435 │
│ osm_routing │ roads_noded │ 193652 │ 107 MB │ 163 MB │ 578 │
└─────────────┴──────────────────┴─────────┴─────────────┴───────────────────┴───────────────┘
Find columns by type
When I go exploring in a new database, one of the things I check for are
INTEGER
(int4
) columns used as primary/foreign keys.
I recommend
using BIGINT
(int8
) instead of INT
for these scenarios. The following query helps me start building a
list of potential candidates for a type change to BIGINT
.
SELECT c_name, data_type
FROM dd.columns
WHERE data_type = 'int4'
AND s_name = 'osm'
GROUP BY c_name, data_type
;
┌─────────────┬───────────┐
│ c_name │ data_type │
╞═════════════╪═══════════╡
│ admin_level │ int4 │
│ ele │ int4 │
│ layer │ int4 │
│ levels │ int4 │
│ maxspeed │ int4 │
└─────────────┴───────────┘
If any columns from the above query had id
in their name (c_name
)
I would be sure to give them a closer look and assess if an INT
to BIGINT
migration is recommended.
In the case of this database, the int4
columns shown in the above results store attribute values
from OpenStreetMap. Each of these columns are scoped to remain under
the 2.1 billion threshold of int4
and are fine to leave as is.
The value in the
maxspeed
column is in km/hr, making 2.1 B km/hr the maximum allowed value (2,147,483,647 km/hr to be exact). That's 1.99x the speed of light! Considering this data is about roads on Earth from OpenStreetMap... I think we're safe with this limit.
Functions
This database's public
schema reported having 1,051 functions.
In my databases, having a lot of functions in the public
schema
is a result of having PostGIS and pgRouting installed.
Speaking of PostGIS, what if you were curious to find which PostGIS
functions support the GEOGRAPHY
data type? That is easy with PgDD!
The following query
shows how the argument_data_types
column can be filtered to include
geography
. It also filters for function names (f_name
) that begin
with st
to limit results to the main PostGIS functions.
SELECT DISTINCT f_name
FROM dd.functions
WHERE s_name = 'public'
AND argument_data_types LIKE '%geography%'
AND LEFT(f_name, 2) = 'st'
;
This query returns the 25 main PostGIS functions that provide
support for GEOGRAPHY
as an input. A sample of the results are shown below.
┌─────────────────┐
│ f_name │
╞═════════════════╡
│ st_area │
│ st_asgeojson │
│ st_buffer │
│ st_distance │
│ st_intersects │
│ st_length │
│ st_setsrid │
│ st_srid │
└─────────────────┘
For the Business
Businesses are comprised of people. The people of a business need to understand the data they work with. From the DBAs and developers that create and maintain the data structures, to the analysts that consume, aggregate, and report on the data, to the executives making decisions and the people implementing them... understanding the data is key! A good data dictionary improves the data stewardship abilities of the entire data team, removes confusion about the data, and enables the business to better support its databases long-term.
Beyond the core data team(s), the results from PgDD can be brought out of the database and shared with the organization in a format appropriate for your specific needs. The goal here is to remove the idea of black box data where users throughout the business know they have data but don't feel they understand it. The PgDD UI project is a rough proof of concept of an offline, in-browser data dictionary. This type of documentation can be shared internally providing non-database users more detailed information about the data their organization has available.
Summary
This post has taken a look at the PgDD extension for Postgres and how it can help users at multiple levels benefit from an in-database data dictionary. Providing helpful details through a limited number of views, PgDD is easy to learn, easy to use, and easy to share. Enabling multiple types of users with the same information improves cross-team understanding of the data.
Need help with your PostgreSQL servers or databases? Contact us to start the conversation!