Postgres Data Dictionary for everyone
A data dictionary is an important tool for anyone that stores and consumes data. The PgDD extension makes it easy to inspect and explore your data structures in Postgres. This post shows how PgDD provides access to current and accurate information about your databases for a variety of users:
- DBAs and Developers
- The Business
This data dictionary information from PgDD is made available using standard SQL by querying a small set of views.
Relational databases, including Postgres, track the majority of the information
needed for a data dictionary. This is done in the underlying
Postgres' system catalogs are in the
The challenge with using the system catalogs is they are not very
user friendly to query for the type of details commonly needed.
PgDD does not do anything magical, it is simply
a wrapper around the Postgres system catalogs!
Similar information can be found using the
psqlclient with its slash commands. These slash commands (e.g.
\dn, etc.) are handy when working within
psqlbut limits use to
The interface to PgDD is simply using standard SQL queries
against views in the
The basic idea of what's available in PgDD can be found by querying the
dd.views view for the views in the
This shows the five (5) PgDD views currently available,
SELECT s_name, v_name, description FROM dd.views WHERE s_name = 'dd' ORDER BY v_name ; ┌────────┬───────────┬────────────────────────────────────────────────────────────────────┐ │ s_name │ v_name │ description │ ╞════════╪═══════════╪════════════════════════════════════════════════════════════════════╡ │ dd │ columns │ Data dictionary view: Lists columns, excluding system columns. │ │ dd │ functions │ Data dictionary view: Lists functions, excluding system functions. │ │ dd │ schemas │ Data dictionary view: Lists schemas, excluding system schemas. │ │ dd │ tables │ Data dictionary view: Lists tables, excluding system tables. │ │ dd │ views │ Data dictionary view: Lists views, excluding system views. │ └────────┴───────────┴────────────────────────────────────────────────────────────────────┘
For the Analyst
Analysts need to understand the data they are querying and using, ideally within the tools they are most comfortable with. Their understanding of the data enables them to accurately analyze and disseminate data. Some of the tools I use with my analyst hat on are psql, DBeaver, Python (Jupyter Notebooks) and QGIS. PgDD can be queried from any of these tools with ease.
Describe a column
To illustrate a common scenario for an analyst,
I pulled out a query from my post
Find missing crossings in OpenStreetMap with PostGIS,
the query below is from Step 4a. Imagine you are reviewing this code
and found yourself wondering "What does
-- Crossing aggregates WITH d AS ( SELECT p.osm_id, COUNT(*) AS crossings FROM foot_points p INNER JOIN osm.traffic_point t ON t.geom = p.geom AND t.osm_type = 'crossing' GROUP BY p.osm_id ) UPDATE foot_roads r SET crossings = d.crossings FROM d WHERE r.osm_id = d.osm_id ;
I was the author of this query, and I created the data structures used by PgOSM Flex. Even still, I only have a basic remembering of what that particular column means. For the exact, nuanced details I still need to check the documentation.
A quick query using the
dd.columns view can tell us the data type (
and provide the description for the column. This description informs
about specific values in
osm_type and their interaction
osm_subtype column. The best part is no matter where I was writing
the SQL query that prompted the question, I can simply add a query
dd view to answer the question. Then get back to my
queries, all without leaving my tool of choice.
SELECT c_name, data_type, description FROM dd.columns WHERE s_name = 'osm' AND t_name = 'traffic_point' AND c_name = 'osm_type' ; ┌──────────┬───────────┬──────────────────────────────────────────────────────────────────────────────┐ │ c_name │ data_type │ description │ ╞══════════╪═══════════╪══════════════════════════════════════════════════════════════════════════════╡ │ osm_type │ text │ Value of the main key associated with traffic details. If osm_subtype IS NU…│ │ │ │…LL then key = "highway" or key = "noexit". Otherwise the main key is the va…│ │ │ │…lue stored in osm_type while osm_subtype has the value for the main key. │ └──────────┴───────────┴──────────────────────────────────────────────────────────────────────────────┘
descriptionof database objects are added using Postgres'
COMMENT ONsyntax. Database objects are ideally documented as they are created. Though, if your objects aren't well documented today that is something easy to start improving!
Another question I have found myself asking is "where else can I find this
data?" Sometimes I know a column exists but do not know exactly which tables
or views it is in. For example, I know some of the tables loaded by
PgOSM Flex have the
address column and I want to know which ones.
An easy query with PgDD and I have a list of tables, views and
materialized views, each with an
SELECT s_name, t_name, source_type, data_type FROM dd.columns WHERE s_name = 'osm' AND c_name = 'address' ; ┌────────┬──────────────────┬───────────────────┬───────────┐ │ s_name │ t_name │ source_type │ data_type │ ╞════════╪══════════════════╪═══════════════════╪═══════════╡ │ osm │ poi_line │ table │ text │ │ osm │ poi_point │ table │ text │ │ osm │ poi_polygon │ table │ text │ │ osm │ amenity_line │ table │ text │ │ osm │ amenity_point │ table │ text │ │ osm │ amenity_polygon │ table │ text │ │ osm │ building_point │ table │ text │ │ osm │ building_polygon │ table │ text │ │ osm │ shop_point │ table │ text │ │ osm │ shop_polygon │ table │ text │ │ osm │ vbuilding_all │ view │ text │ │ osm │ vshop_all │ view │ text │ │ osm │ vpoi_all │ materialized view │ text │ └────────┴──────────────────┴───────────────────┴───────────┘
For the DBA and Developer
The examples above for the analyst use cases can easily apply to the DBA and developer roles as well. These more technical roles, responsible for creating and maintaining the data structures, also care about details regarding the size and shape of the data contained.
dd.schemas view gives a nice overview of the contents within a database
grouped by schema name (
This view includes the size on disk, and provides counts of
tables, views and functions.
This view is useful to get a big picture view of a database, and is
especially helpful if the database is well organized into logical schemas.
The following query filters for three (3) specific schemas,
the results show that the
osm schema has the bulk of the tables (41)
and is approaching 2 GB in size.
SELECT s_name, size_plus_indexes, table_count, view_count, function_count, description FROM dd.schemas WHERE s_name IN ('osm', 'osm_routing', 'public') ; ┌────────────┬───────────────────┬─────────────┬────────────┬────────────────┬────────────────────────────────────────┐ │ s_name │ size_plus_indexes │ table_count │ view_count │ function_count │ description │ ╞════════════╪═══════════════════╪═════════════╪════════════╪════════════════╪════════════════════════════════════════╡ │ osm │ 1763 MB │ 41 │ 6 │ 3 │ Schema populated by PgOSM-Flex. SELEC…│ │ │ │ │ │ │…T * FROM osm.pgosm_flex; for details. │ │ osm_routin…│ 239 MB │ 3 │ 0 │ 0 │ ¤ │ │…g │ │ │ │ │ │ │ public │ 7544 kB │ 3 │ 2 │ 1051 │ standard public schema │ └────────────┴───────────────────┴─────────────┴────────────┴────────────────┴────────────────────────────────────────┘
Another common task is tracking tables based on their size. Table size
is commonly discussed using row count, but the size on disk is also important.
The next query returns the five (5) largest
tables by size on disk along with their row counts and
estimate. Saving snapshots of this data over time is helpful to track and manage data growth.
SELECT s_name, t_name, rows, size_pretty, size_plus_indexes, bytes_per_row FROM dd.tables ORDER BY size_bytes DESC LIMIT 5 ; ┌─────────────┬──────────────────┬─────────┬─────────────┬───────────────────┬───────────────┐ │ s_name │ t_name │ rows │ size_pretty │ size_plus_indexes │ bytes_per_row │ ╞═════════════╪══════════════════╪═════════╪═════════════╪═══════════════════╪═══════════════╡ │ osm │ tags │ 3076011 │ 538 MB │ 631 MB │ 184 │ │ osm │ road_line │ 770203 │ 220 MB │ 284 MB │ 300 │ │ osm │ building_polygon │ 771073 │ 203 MB │ 261 MB │ 276 │ │ osm │ water_line │ 434721 │ 180 MB │ 216 MB │ 435 │ │ osm_routing │ roads_noded │ 193652 │ 107 MB │ 163 MB │ 578 │ └─────────────┴──────────────────┴─────────┴─────────────┴───────────────────┴───────────────┘
Find columns by type
When I go exploring in a new database, one of the things I check for are
int4) columns used as primary/foreign keys.
int8) instead of
for these scenarios. The following query helps me start building a
list of potential candidates for a type change to
SELECT c_name, data_type FROM dd.columns WHERE data_type = 'int4' AND s_name = 'osm' GROUP BY c_name, data_type ; ┌─────────────┬───────────┐ │ c_name │ data_type │ ╞═════════════╪═══════════╡ │ admin_level │ int4 │ │ ele │ int4 │ │ layer │ int4 │ │ levels │ int4 │ │ maxspeed │ int4 │ └─────────────┴───────────┘
If any columns from the above query had
id in their name (
I would be sure to give them a closer look and assess if an
migration is recommended.
In the case of this database, the
int4 columns shown in the above results store attribute values
from OpenStreetMap. Each of these columns are scoped to remain under
the 2.1 billion threshold of
int4 and are fine to leave as is.
The value in the
maxspeedcolumn is in km/hr, making 2.1 B km/hr the maximum allowed value (2,147,483,647 km/hr to be exact). That's 1.99x the speed of light! Considering this data is about roads on Earth from OpenStreetMap... I think we're safe with this limit.
public schema reported having 1,051 functions.
In my databases, having a lot of functions in the
is a result of having PostGIS and pgRouting installed.
Speaking of PostGIS, what if you were curious to find which PostGIS
functions support the
GEOGRAPHY data type? That is easy with PgDD!
The following query
shows how the
argument_data_types column can be filtered to include
geography. It also filters for function names (
f_name) that begin
st to limit results to the main PostGIS functions.
SELECT DISTINCT f_name FROM dd.functions WHERE s_name = 'public' AND argument_data_types LIKE '%geography%' AND LEFT(f_name, 2) = 'st' ;
This query returns the 25 main PostGIS functions that provide
GEOGRAPHY as an input. A sample of the results are shown below.
┌─────────────────┐ │ f_name │ ╞═════════════════╡ │ st_area │ │ st_asgeojson │ │ st_buffer │ │ st_distance │ │ st_intersects │ │ st_length │ │ st_setsrid │ │ st_srid │ └─────────────────┘
For the Business
Businesses are comprised of people. The people of a business need to understand the data they work with. From the DBAs and developers that create and maintain the data structures, to the analysts that consume, aggregate, and report on the data, to the executives making decisions and the people implementing them... understanding the data is key! A good data dictionary improves the data stewardship abilities of the entire data team, removes confusion about the data, and enables the business to better support its databases long-term.
Beyond the core data team(s), the results from PgDD can be brought out of the database and shared with the organization in a format appropriate for your specific needs. The goal here is to remove the idea of black box data where users throughout the business know they have data but don't feel they understand it. The PgDD UI project is a rough proof of concept of an offline, in-browser data dictionary. This type of documentation can be shared internally providing non-database users more detailed information about the data their organization has available.
This post has taken a look at the PgDD extension for Postgres and how it can help users at multiple levels benefit from an in-database data dictionary. Providing helpful details through a limited number of views, PgDD is easy to learn, easy to use, and easy to share. Enabling multiple types of users with the same information improves cross-team understanding of the data.
Need help with your PostgreSQL servers or databases? Contact us to start the conversation!
Published January 04, 2022
Last Updated January 04, 2022