From Idea to Database: Define, Design, Repeat
The database is one of the most critical components of most software projects. Planning for your database project should not be taken lightly. Failure rates in IT projects are unfortunately high; much of that problem is due to a lack of proper planning and design. Most of the database anti-patterns I previously wrote about can be avoided by understanding the system you need to build and designing it appropriately.
This post uses the PiWS project to illustrate the Define and Design approach I use throughout the initial development phases of any new database project. Most projects go through multiple iterations of define and design.
Project management, not optional
Anyone who has worked on a project with more than one person to design and build anything has heard this: "I forgot to tell you..." followed by some various change requirements.
Gotcha restoring XML data from pg_dump
Recently, I ran into a problem when moving a database from a production PostGIS-enabled PostgreSQL
database to a local development virtual machine. This database includes a table
many QGIS users may have encountered:
public.layer_styles table is used by QGIS software to allow saving the
styling used for PostGIS spatial layers. Style information is stored in
XML format in a column
styleqml that includes a
document type declaration
XML data with
<!DOCTYPE> is at the core of this problem. If your PostGIS databases uses
public.layer_styles table, the workaround and other considerations will be helpful.
The bug: Invalid
pg_dump file w/ XML
The problem encountered is filed under bug #15342. I like how Tom Lane summarized the issue:
"There are two problems here:
pg_dumpneglects to force a safe value of
xmloptionfor the restore step, plus there doesn't seem to be a safe value for it to force :-(."
The rest of this post explores the factors involved with this problem.
Who does this affect?
If you have a PostGIS database with QGIS users storing style information
public.layer_styles table, you are affected. More specifically, this affects PostgreSQL databases with columns using the
XML data type
data that includes
<!DOCTYPE> blocks. These databases
will experience headaches when restoring dump files saved using
The Check for XML columns section below provides a query to help determine if your databases are affected.
PostgreSQL Crash Course
PostgreSQL is a powerful, open-source relational database (RMDS). It is the database of choice here at RustProof Labs, I wrote previously why I chose PostgreSQL over MySQL. This post provides a 10-mile view of how PostgreSQL is structured, tools to use, and other important concepts. The goal is to provide a broad view of PostgreSQL with resources to more detail on each topic, without getting caught up in too much technical jargon.
This post is part of the series PostgreSQL: From Idea to Database.
PostgreSQL will run on most any modern hardware and operating systems. PostgreSQL has installation guides for a large number of operating systems, supporting most Linux/Unix, Mac, and Windows. As for hardware, it will run on nearly anything modern, including a Raspberry Pi (ARM processor). See the PostgreSQL supported platforms for more details.
I have ran PostgreSQL on Linux servers of all sizes, from tiny to powerful. I previously wrote about load testing PostgreSQL that gives more detail about some of the systems it can run on and what type of performance to expect.
Every RDMS has its own way to organize the elements required for a reliable database system. PostgreSQL has the following hierarchy to organize the required elements.
Cluster > Database > Schema > Table > Data
Data is stored as rows, contained in tables, within a schema, within a database, within a cluster. The PostgreSQL Concepts page provides a nice explanation of this overall structure.
Security is a feature. Choose it!
Nearly every modern organization, large or small, has a significant amount of IT infrastructure reliant on 3rd parties. For the purpose of this post, 3rd parties includes anything you think of as "the cloud" (e.g. not my servers) or any service that you don't fully control on your hardware. Web hosting, email, social media, and file storage are all examples of services commonly outsourced.
I wrote previously about the responsibilities associated with managing data securely. A solid cybersecurity posture should require evaluating 3rd parties for proper security practices. This post provides a loose framework for evaluating a 3rd party's overall cybersecurity posture. This evaluation can then be used in your ongoing decision making to pick new vendors and reevaluate existing vendors. This post does not go in depth to any specific technology or vendor specific details, instead I intend to focus more on the culture of security.
Security as a feature
Security as a feature in this model has 3 overarching components:
- Problem solving
- Learn from mistakes
The quality of cybersecurity is an aggregate of these three components. An organization with good security practices is well prepared, knows how to react when something goes wrong, and they learn from their mistakes.
Plan for the worst, hope for the best.
The reality is, until more organizations make cybersecurity a higher priority, the never-ending stream of data breaches will continue.
PostGIS: Tame your spatial data (Part 1)
The goal of this post is to examine and fix one common headache when using spatial data: its size. Spatial data (commonly called "GIS data") can become painfully slow because spatial data gets big in a hurry. I discussed this concept before by reducing the OSM roads layer to provide low-overhead thematic layers. This post uses the same basic philosophy to reduce another common bottleneck with polygon layers covering large areas, such as the US county boundaries.
This post goes into detail about using
PostGIS to simplify large
polygon objects, the effects this has on storage, performance
and the accuracy of spatial analysis when using simplified geometries.
The solution provided is not suitable for all data and/or use cases,
particularly if a high level of accuracy and precision is required.
This post is an advanced topic of the series PostgreSQL: From Idea to Database.
A numeric example
This post assumes basic knowledge of SQL syntax, PostgreSQL and PostGIS.
To begin this conversation, let's examine a bit about what we already know about storing numbers. Let's start with Pi (π).
Pi is a mathematical constant, roughly represented as
though the decimals go on and on into infinity. The following example shows how to create a table with a single row storing two
different representations of Pi. The first column,
pi_long, stores 31 decimals of Pi. The
pi_short column stores only two decimal
DROP TABLE IF EXISTS cool_numbers; CREATE TEMP TABLE cool_numbers AS SELECT 3.1415926535897932384626433832795::NUMERIC(32,31) AS pi_long, 3.14::NUMERIC(3,2) AS pi_short;
As you may expect, storing a number with more digits takes more disk space than storing a number with fewer digits. This is called
precision. The following
query uses the
pg_column_size() function to illustrate that the number with more decimal spaces takes up more disk space.