PgDD - PostgreSQL Data Dictionary Extension
The PgDD project is now an extension for Postgres! This post is about the PgDD (PostgreSQL Data Dictionary) project, why it exists, and how to install the extension. I have found the ability to query the most important database structures a huge benefit, hopefully others will find this project useful too.
This post is outdated, PgDD has been rewritten using the pgx framework. Read PgDD extension moves to Pgx and Postgres Data Dictionary for everyone for more details!
Why PgDD is helpful
PgDD makes it possible to easily explore your database structure
with standard SQL syntax. You do not need to use a specific tool or learn
a set of new commands, simply install the extension and start querying.
To find the 5 largest tables by size on disk, simply query the
dd.tables
view.
SELECT *
FROM dd.tables
ORDER BY size_bytes DESC
LIMIT 5;
The project README lists the data dictionary views made available by the PgDD extension.
Documentation is key for any system, this is especially true with databases. Part of the reason good documentation is so important is because databases are often the most long-lived portion of any system or application. As I wrote before:
"A good data dictionary provides insights into a database's structure, constraints, relationships, and sources of data found inside a system."
History of PgDD
PgDD started a few years ago as a few helpful queries derived from
reverse engineering the psql commands.
Those commands (i.e. \dt+
) had the output I needed, I just needed to get
the queries used so I could extract the same information without
requiring psql. Before long I had a variety of versions of views in different
databases, and my own inconsistencies were not helping me out.
Last year (fall 2018) I standardized the project and made it open source. I first wrote about PgDD last November. The initial releases used Sqitch to install the project into each database. I picked Sqitch because it was easy for me with our other internal projects using Sqitch already.
The decision to make PgDD an extension was to make adoption by the Postgres community easier. While I think Sqitch is great and it did technically work, PgDD can leverage the benefits of being a proper extension.
Install
This post assumes you have PostgreSQL installed and running. There are
a few simple prereqs for building extensions from source,
the Postgres -dev
packages plus git
and make
.
The following command installs dev packages on Ubuntu for all versions of
Postgres, you can replace all
with your specific version,
e.g. 12
for a lighter footprint.
sudo apt install git make postgresql-server-dev-all
Make PgDD
Clone the repository (I put them under the ~/git
directory) and install.
mkdir ~/git
cd ~/git
git clone https://github.com/rustprooflabs/pgdd.git
cd ~/git/pgdd
sudo make install
This makes the extension available to Postgres by adding it to the contrib directory.
Create extension
Now connect to your database and create the extension. At this point you can start using the PgDD views without any additional steps, or you can add additional details to the data dictionary's meta tables.
CREATE EXTENSION pgdd;
Using dd.meta_*
tables
The data dictionary's meta tables allow adding more specific details than just simple open-ended comments. The two additional attributes I added for our needs were:
- Data source
- Sensitive
These attributes can be saved at three levels of detail, each having its own
dd.meta_
table.
- Schema
- Table
- Column
Data sources
I use the data_source
column in two ways. First, attribution of data sources
is a critical component producing data-heavy reports for end users. Tracking
attribution in the database makes it easy for analysts to add that detail directly
from the database.
Another benefit, is it makes it easier to find and check data sources for updated
data sets. It's typically when I'm querying data that I realize there may be a new year
of data available, and I have to hunt down the source of the data. Having the
attribution in the DB layer makes it straight-forward to figure out where to look.
Notice the table unified_school_districts
below includes the URL to
where the source data. Nice, I can go straight there from the database without
have to search vague terms to find the procedures/Google, and hopefully the needed link. This makes it much easier to keep the various data sources
up-to-date.
SELECT t_name, data_source
FROM dd.meta_table
WHERE t_name IN ('campus_location', 'school_district', 'unified_school_districts')
ORDER BY t_name;
┌──────────────────────────┬──────────────────────────────────────────────────────────┐
│ t_name │ data_source │
╞══════════════════════════╪══════════════════════════════════════════════════════════╡
│ campus_location │ Created by RustProof Labs via QGIS. │
│ school_district │ Manually maintained. Initial load used manually combine…│
│ │…d data sources from U.S. Census Bureau and Colorado Depa…│
│ │…rtment of Education. │
│ unified_school_districts │ U.S. Census Bureau. https://www.census.gov/geo/maps-dat…│
│ │…a/data/cbf/cbf_sd.html │
└──────────────────────────┴──────────────────────────────────────────────────────────┘
Sensitive flag
The benefits from the ability of identifying data as sensitive should be apparent. Breaching sensitive data is bad. Knowing where your sensitive data resides is the first step to properly protecting data.
SELECT t_name, c_name
FROM dd.meta_column
WHERE sensitive;
┌────────┬──────────┐
│ t_name │ c_name │
╞════════╪══════════╡
│ person │ dob │
│ person │ address │
└────────┴──────────┘
Next steps
I know the underlying system views change over time so that will need to be managed, but I do not know the best approach to take yet. I'm sure there is a proper "extension" way of things and lucky for me, I already know the Postgres community will help point me in the right path. When the need arises, I can ask on Slack or the mailing lists and someone will point me to what I need.
But, psql...
Yes, psql is fantastic and its "slash commands"
(\l
, \dt
, \dn
, etc) are helpful built-in tools...
If you are using psql.
The limitation is you have to be using psql to take advantage of those commands.
Unfortunately, command line tools are not helpful to an analyst with PostGIS data where a spatial viewer is essential (see DBeaver and PgAdmin4). Pretty much all of our Postgres databases have PostGIS data, so I am typically querying them using DBeaver. I can even bring the data dictionary into a QGIS project as an attribute table for quickly accessing information without having to switch tools.
Summary
With PgDD now as an extension I will start working on improvements and additions moving forward. I have a few thoughts of where it will go but only time will tell. I hope others will find this to be a useful tool in their databases. Check out the repo and give it a try!
Need help with your PostgreSQL servers or databases? Contact us to start the conversation!