RustProof Labs: blogging for education (logo)

Postgres Data Dictionary for everyone

By Ryan Lambert -- Published January 04, 2022

A data dictionary is an important tool for anyone that stores and consumes data. The PgDD extension makes it easy to inspect and explore your data structures in Postgres. This post shows how PgDD provides access to current and accurate information about your databases for a variety of users:

This data dictionary information from PgDD is made available using standard SQL by querying a small set of views.

Background

Relational databases, including Postgres, track the majority of the information needed for a data dictionary. This is done in the underlying system catalogs; Postgres' system catalogs are in the pg_catalog schema. The challenge with using the system catalogs is they are not very user friendly to query for the type of details commonly needed. PgDD does not do anything magical, it is simply a wrapper around the Postgres system catalogs!

Similar information can be found using the psql client with its slash commands. These slash commands (e.g. \dt, \dn, etc.) are handy when working within psql but limits use to psql.

The interface to PgDD is simply using standard SQL queries against views in the dd schema. The basic idea of what's available in PgDD can be found by querying the dd.views view for the views in the dd schema. This shows the five (5) PgDD views currently available, columns, functions, schemas, tables and views.

SELECT s_name, v_name, description
    FROM dd.views
    WHERE s_name = 'dd'
    ORDER BY v_name
;

┌────────┬───────────┬────────────────────────────────────────────────────────────────────┐
│ s_name │  v_name   │                            description                             │
╞════════╪═══════════╪════════════════════════════════════════════════════════════════════╡
│ dd     │ columns   │ Data dictionary view: Lists columns, excluding system columns.     │
│ dd     │ functions │ Data dictionary view: Lists functions, excluding system functions. │
│ dd     │ schemas   │ Data dictionary view: Lists schemas, excluding system schemas.     │
│ dd     │ tables    │ Data dictionary view: Lists tables, excluding system tables.       │
│ dd     │ views     │ Data dictionary view: Lists views, excluding system views.         │
└────────┴───────────┴────────────────────────────────────────────────────────────────────┘

The current version is PgDD 0.4.0, for more about the history and progression of PgDD read here and here.

For the Analyst

Analysts need to understand the data they are querying and using, ideally within the tools they are most comfortable with. Their understanding of the data enables them to accurately analyze and disseminate data. Some of the tools I use with my analyst hat on are psql, DBeaver, Python (Jupyter Notebooks) and QGIS. PgDD can be queried from any of these tools with ease.

Describe a column

To illustrate a common scenario for an analyst, I pulled out a query from my post Find missing crossings in OpenStreetMap with PostGIS, the query below is from Step 4a. Imagine you are reviewing this code and found yourself wondering "What does t.osm_type mean?"

-- Crossing aggregates
WITH d AS (
SELECT p.osm_id, COUNT(*) AS crossings
    FROM foot_points p
    INNER JOIN osm.traffic_point t
        ON t.geom = p.geom AND t.osm_type = 'crossing'
    GROUP BY p.osm_id
)
UPDATE foot_roads r
    SET crossings = d.crossings
    FROM d 
    WHERE r.osm_id = d.osm_id
;

I was the author of this query, and I created the data structures used by PgOSM Flex. Even still, I only have a basic remembering of what that particular column means. For the exact, nuanced details I still need to check the documentation.

A quick query using the dd.columns view can tell us the data type (text) and provide the description for the column. This description informs about specific values in osm_type and their interaction with the osm_subtype column. The best part is no matter where I was writing the SQL query that prompted the question, I can simply add a query against the dd view to answer the question. Then get back to my queries, all without leaving my tool of choice.

SELECT c_name, data_type, description
    FROM dd.columns
    WHERE s_name = 'osm'
        AND t_name = 'traffic_point'
        AND c_name = 'osm_type'
;

┌──────────┬───────────┬──────────────────────────────────────────────────────────────────────────────┐
│  c_name  │ data_type │                                 description                                  │
╞══════════╪═══════════╪══════════════════════════════════════════════════════════════════════════════╡
│ osm_type │ text      │ Value of the main key associated with traffic details.  If osm_subtype IS NU…│
│          │           │…LL then key = "highway" or key = "noexit".  Otherwise the main key is the va…│
│          │           │…lue stored in osm_type while osm_subtype has the value for the main key.     │
└──────────┴───────────┴──────────────────────────────────────────────────────────────────────────────┘

The description of database objects are added using Postgres' COMMENT ON syntax. Database objects are ideally documented as they are created. Though, if your objects aren't well documented today that is something easy to start improving!

Finding data

Another question I have found myself asking is "where else can I find this data?" Sometimes I know a column exists but do not know exactly which tables or views it is in. For example, I know some of the tables loaded by PgOSM Flex have the address column and I want to know which ones. An easy query with PgDD and I have a list of tables, views and materialized views, each with an address column.

SELECT s_name, t_name, source_type, data_type
    FROM dd.columns
    WHERE s_name = 'osm'
        AND c_name = 'address'
;

┌────────┬──────────────────┬───────────────────┬───────────┐
│ s_name │      t_name      │    source_type    │ data_type │
╞════════╪══════════════════╪═══════════════════╪═══════════╡
│ osm    │ poi_line         │ table             │ text      │
│ osm    │ poi_point        │ table             │ text      │
│ osm    │ poi_polygon      │ table             │ text      │
│ osm    │ amenity_line     │ table             │ text      │
│ osm    │ amenity_point    │ table             │ text      │
│ osm    │ amenity_polygon  │ table             │ text      │
│ osm    │ building_point   │ table             │ text      │
│ osm    │ building_polygon │ table             │ text      │
│ osm    │ shop_point       │ table             │ text      │
│ osm    │ shop_polygon     │ table             │ text      │
│ osm    │ vbuilding_all    │ view              │ text      │
│ osm    │ vshop_all        │ view              │ text      │
│ osm    │ vpoi_all         │ materialized view │ text      │
└────────┴──────────────────┴───────────────────┴───────────┘

For the DBA and Developer

The examples above for the analyst use cases can easily apply to the DBA and developer roles as well. These more technical roles, responsible for creating and maintaining the data structures, also care about details regarding the size and shape of the data contained.

The dd.schemas view gives a nice overview of the contents within a database grouped by schema name (s_name). This view includes the size on disk, and provides counts of tables, views and functions. This view is useful to get a big picture view of a database, and is especially helpful if the database is well organized into logical schemas. The following query filters for three (3) specific schemas, the results show that the osm schema has the bulk of the tables (41) and is approaching 2 GB in size.

SELECT s_name, size_plus_indexes,
        table_count, view_count, function_count,
        description
    FROM dd.schemas
    WHERE s_name IN ('osm', 'osm_routing', 'public')
;

┌────────────┬───────────────────┬─────────────┬────────────┬────────────────┬────────────────────────────────────────┐
│   s_name   │ size_plus_indexes │ table_count │ view_count │ function_count │              description               │
╞════════════╪═══════════════════╪═════════════╪════════════╪════════════════╪════════════════════════════════════════╡
│ osm        │ 1763 MB           │          41 │          6 │              3 │ Schema populated by PgOSM-Flex.  SELEC…│
│            │                   │             │            │                │…T * FROM osm.pgosm_flex; for details.  │
│ osm_routin…│ 239 MB            │           3 │          0 │              0 │ ¤                                      │
│…g          │                   │             │            │                │                                        │
│ public     │ 7544 kB           │           3 │          2 │           1051 │ standard public schema                 │
└────────────┴───────────────────┴─────────────┴────────────┴────────────────┴────────────────────────────────────────┘

Tables

Another common task is tracking tables based on their size. Table size is commonly discussed using row count, but the size on disk is also important. The next query returns the five (5) largest tables by size on disk along with their row counts and bytes_per_row estimate. Saving snapshots of this data over time is helpful to track and manage data growth.

SELECT s_name, t_name, rows, size_pretty, size_plus_indexes, bytes_per_row
    FROM dd.tables 
    ORDER BY size_bytes DESC
    LIMIT 5
;

┌─────────────┬──────────────────┬─────────┬─────────────┬───────────────────┬───────────────┐
│   s_name    │      t_name      │  rows   │ size_pretty │ size_plus_indexes │ bytes_per_row │
╞═════════════╪══════════════════╪═════════╪═════════════╪═══════════════════╪═══════════════╡
│ osm         │ tags             │ 3076011 │ 538 MB      │ 631 MB            │           184 │
│ osm         │ road_line        │  770203 │ 220 MB      │ 284 MB            │           300 │
│ osm         │ building_polygon │  771073 │ 203 MB      │ 261 MB            │           276 │
│ osm         │ water_line       │  434721 │ 180 MB      │ 216 MB            │           435 │
│ osm_routing │ roads_noded      │  193652 │ 107 MB      │ 163 MB            │           578 │
└─────────────┴──────────────────┴─────────┴─────────────┴───────────────────┴───────────────┘

Find columns by type

When I go exploring in a new database, one of the things I check for are INTEGER (int4) columns used as primary/foreign keys. I recommend using BIGINT (int8) instead of INT for these scenarios. The following query helps me start building a list of potential candidates for a type change to BIGINT.

SELECT c_name, data_type
    FROM dd.columns
    WHERE data_type = 'int4'
        AND s_name = 'osm'
    GROUP BY c_name, data_type
;

┌─────────────┬───────────┐
│   c_name    │ data_type │
╞═════════════╪═══════════╡
│ admin_level │ int4      │
│ ele         │ int4      │
│ layer       │ int4      │
│ levels      │ int4      │
│ maxspeed    │ int4      │
└─────────────┴───────────┘

If any columns from the above query had id in their name (c_name) I would be sure to give them a closer look and assess if an INT to BIGINT migration is recommended. In the case of this database, the int4 columns shown in the above results store attribute values from OpenStreetMap. Each of these columns are scoped to remain under the 2.1 billion threshold of int4 and are fine to leave as is.

The value in the maxspeed column is in km/hr, making 2.1 B km/hr the maximum allowed value (2,147,483,647 km/hr to be exact). That's 1.99x the speed of light! Considering this data is about roads on Earth from OpenStreetMap... I think we're safe with this limit.

Functions

This database's public schema reported having 1,051 functions. In my databases, having a lot of functions in the public schema is a result of having PostGIS and pgRouting installed. Speaking of PostGIS, what if you were curious to find which PostGIS functions support the GEOGRAPHY data type? That is easy with PgDD!

The following query shows how the argument_data_types column can be filtered to include geography. It also filters for function names (f_name) that begin with st to limit results to the main PostGIS functions.

SELECT DISTINCT f_name
    FROM dd.functions
    WHERE s_name = 'public'
        AND argument_data_types LIKE '%geography%'
        AND LEFT(f_name, 2) = 'st'
;

This query returns the 25 main PostGIS functions that provide support for GEOGRAPHY as an input. A sample of the results are shown below.

┌─────────────────┐
│     f_name      │
╞═════════════════╡
│ st_area         │
│ st_asgeojson    │
│ st_buffer       │
│ st_distance     │
│ st_intersects   │
│ st_length       │
│ st_setsrid      │
│ st_srid         │
└─────────────────┘

For the Business

Businesses are comprised of people. The people of a business need to understand the data they work with. From the DBAs and developers that create and maintain the data structures, to the analysts that consume, aggregate, and report on the data, to the executives making decisions and the people implementing them... understanding the data is key! A good data dictionary improves the data stewardship abilities of the entire data team, removes confusion about the data, and enables the business to better support its databases long-term.

Beyond the core data team(s), the results from PgDD can be brought out of the database and shared with the organization in a format appropriate for your specific needs. The goal here is to remove the idea of black box data where users throughout the business know they have data but don't feel they understand it. The PgDD UI project is a rough proof of concept of an offline, in-browser data dictionary. This type of documentation can be shared internally providing non-database users more detailed information about the data their organization has available.

Summary

This post has taken a look at the PgDD extension for Postgres and how it can help users at multiple levels benefit from an in-database data dictionary. Providing helpful details through a limited number of views, PgDD is easy to learn, easy to use, and easy to share. Enabling multiple types of users with the same information improves cross-team understanding of the data.

Need help with your PostgreSQL servers or databases? Contact us to start the conversation!

By Ryan Lambert
Published January 04, 2022
Last Updated January 04, 2022