RustProof Labs: blogging for education (logo)

PgDD - PostgreSQL Data Dictionary Extension

By Ryan Lambert -- Published November 17, 2019

The PgDD project is now an extension for Postgres! This post is about the PgDD (PostgreSQL Data Dictionary) project, why it exists, and how to install the extension. I have found the ability to query the most important database structures a huge benefit, hopefully others will find this project useful too.

Why PgDD is helpful

PgDD makes it possible to easily explore your database structure with standard SQL syntax. You do not need to use a specific tool or learn a set of new commands, simply install the extension and start querying. To find the 5 largest tables by size on disk, simply query the dd.tables view.

SELECT *
    FROM dd.tables
    ORDER BY size_bytes DESC
    LIMIT 5;

The project README lists the data dictionary views made available by the PgDD extension.

Documentation is key for any system, this is especially true with databases. Part of the reason good documentation is so important is because databases are often the most long-lived portion of any system or application. As I wrote before:

"A good data dictionary provides insights into a database's structure, constraints, relationships, and sources of data found inside a system."

History of PgDD

PgDD started a few years ago as a few helpful queries derived from reverse engineering the psql commands. Those commands (i.e. \dt+) had the output I needed, I just needed to get the queries used so I could extract the same information without requiring psql. Before long I had a variety of versions of views in different databases, and my own inconsistencies were not helping me out.

Last year (fall 2018) I standardized the project and made it open source. I first wrote about PgDD last November. The initial releases used Sqitch to install the project into each database. I picked Sqitch because it was easy for me with our other internal projects using Sqitch already.

The decision to make PgDD an extension was to make adoption by the Postgres community easier. While I think Sqitch is great and it did technically work, PgDD can leverage the benefits of being a proper extension.

Install

This post assumes you have PostgreSQL installed and running. There are a few simple prereqs for building extensions from source, the Postgres -dev packages plus git and make. The following command installs dev packages on Ubuntu for all versions of Postgres, you can replace all with your specific version, e.g. 12 for a lighter footprint.

sudo apt install git make postgresql-server-dev-all

Make PgDD

Clone the repository (I put them under the ~/git directory) and install.

mkdir ~/git
cd ~/git
git clone https://github.com/rustprooflabs/pgdd.git
cd ~/git/pgdd
sudo make install

This makes the extension available to Postgres by adding it to the contrib directory.

Create extension

Now connect to your database and create the extension. At this point you can start using the PgDD views without any additional steps, or you can add additional details to the data dictionary's meta tables.

CREATE EXTENSION pgdd;

Using dd.meta_* tables

The data dictionary's meta tables allow adding more specific details than just simple open-ended comments. The two additional attributes I added for our needs were:

These attributes can be saved at three levels of detail, each having its own dd.meta_ table.

Data sources

I use the data_source column in two ways. First, attribution of data sources is a critical component producing data-heavy reports for end users. Tracking attribution in the database makes it easy for analysts to add that detail directly from the database.

Another benefit, is it makes it easier to find and check data sources for updated data sets. It's typically when I'm querying data that I realize there may be a new year of data available, and I have to hunt down the source of the data. Having the attribution in the DB layer makes it straight-forward to figure out where to look. Notice the table unified_school_districts below includes the URL to where the source data. Nice, I can go straight there from the database without have to search vague terms to find the procedures/Google, and hopefully the needed link. This makes it much easier to keep the various data sources up-to-date.

SELECT t_name, data_source
    FROM dd.meta_table 
    WHERE t_name IN ('campus_location', 'school_district', 'unified_school_districts')
    ORDER BY t_name;
┌──────────────────────────┬──────────────────────────────────────────────────────────┐
│          t_name          │                       data_source                        │
╞══════════════════════════╪══════════════════════════════════════════════════════════╡
│ campus_location          │ Created by RustProof Labs via QGIS.                      │
│ school_district          │ Manually maintained.  Initial load used manually combine…│
│                          │…d data sources from U.S. Census Bureau and Colorado Depa…│
│                          │…rtment of Education.                                     │
│ unified_school_districts │ U.S. Census Bureau.  https://www.census.gov/geo/maps-dat…│
│                          │…a/data/cbf/cbf_sd.html                                   │
└──────────────────────────┴──────────────────────────────────────────────────────────┘

Sensitive flag

The benefits from the ability of identifying data as sensitive should be apparent. Breaching sensitive data is bad. Knowing where your sensitive data resides is the first step to properly protecting data.

SELECT t_name, c_name
    FROM dd.meta_column 
    WHERE sensitive;
┌────────┬──────────┐
│ t_name │ c_name   │
╞════════╪══════════╡
│ person │ dob      │
│ person │ address  │
└────────┴──────────┘

Next steps

I know the underlying system views change over time so that will need to be managed, but I do not know the best approach to take yet. I'm sure there is a proper "extension" way of things and lucky for me, I already know the Postgres community will help point me in the right path. When the need arises, I can ask on Slack or the mailing lists and someone will point me to what I need.

But, psql...

Yes, psql is fantastic and its "slash commands" (\l, \dt, \dn, etc) are helpful built-in tools... If you are using psql. The limitation is you have to be using psql to take advantage of those commands.

Unfortunately, command line tools are not helpful to an analyst with PostGIS data where a spatial viewer is essential (see DBeaver and PgAdmin4). Pretty much all of our Postgres databases have PostGIS data, so I am typically querying them using DBeaver. I can even bring the data dictionary into a QGIS project as an attribute table for quickly accessing information without having to switch tools.

Summary

With PgDD now as an extension I will start working on improvements and additions moving forward. I have a few thoughts of where it will go but only time will tell. I hope others will find this to be a useful tool in their databases. Check out the repo and give it a try!

Need help with your PostgreSQL servers or databases? Contact us to start the conversation!

By Ryan Lambert
Published November 17, 2019
Last Updated November 17, 2019