Ode to "The Database"
I'm a big fan of relational databases. Over the years I have tried out various ways to work with and manage data without utilizing a database. Every time I try to do that, I am reminded why I love databases so much... they do one job, and do it well. Manage your data. It's hard to argue with decades of success!
In this post I will be referencing PostgreSQL 9.4 specifically, but most of my arguments and examples should translate to any of the major RDMSs, such as MS SQL Server, Oracle Database, or MySQL.
Large-ish Data, Small-ish Hardware
In one of my recent posts, I found that with a small virtual machine I could load just over 4 million rows of data to HDF5 storage with Python and PyTables in about six minutes. Not bad, right? On the same VM using PostgreSQL, I loaded the same data in just over a minute and a half. Not only that, but Postgres also stored the data in 1/4 of the space taken by the HDF5 storage.
One of the major issues I've ran into when trying to avoid using a database is running out of RAM. For example, if you love the Python module Pandas like I do, you'll be disappointed to see how much work it is if you need to work with large chunks of data that won't fit in your available memory. Check out this thread on StackOverflow giving some advice on working with this size of dataset. It's not impossible, it's just not simple.
For me, faster load times, reduced storage space, and not having to manually size your batches for processing are good arguments to let a proper database do its thing.
Dirty Data (Needs Cleaning)
All data is dirty. Dirty Data can mean various things, but the result is always data that is hard to work with, interpret and analyze. If you have two data sources with one containing the state as a two character code ("CO") and the other with the full state's name ("Colorado"), you have dirty data. If you have log files you want to analyze, it's dirty data. If you have data that could be misinterpreted... it's dirty. Now that we know that every data source is dirty, it is reasonable to expect that every new source of data will require work before it is usable. Since every source of data needs to be cleaned first, it is vital that this task quick and easy.
As I did in my PyTables vs. PostgreSQL post, I will be referencing NOAA weather data to give an example of how data can be cleaned and loaded with a bit of SQL. This time I will be working with the 201412daily.txt file from the extracted zip file. The raw file has roughly 60 columns that are mostly numeric, but every column seems to have some various text values as well which makes things tricky. Of those 60 columns I will only keep a handful, and I want to have pure numeric data in the columns.
If I were working in Python + Pandas for this cleanup, I would need to load the data from the source file in batches of ~25,000 records, then loop though and clean my data, then save the cleaned data to a new file. A good example of RBAR (Row By Agonizing Row).
Back to the world of databases and all of a sudden I'm working on the whole table at once, no need to fit the whole data set into memory. No batches or looping required. A small amount of SQL gives a major advantage when processing data, namely the SELECT DISTINCT clause and CASE statements.
Cleaning a Column
Total precipitation is stored in one column of the data, but, like most columns in the data it is mostly numeric data with some occasional text values thrown in. Using SELECT DISTINCT is a good way to see the type of data you have to work with.
SELECT DISTINCT preciptotal FROM noaa_daily_staging ORDER BY preciptotal DESC
I've sorted descending (ORDER BY preciptotal DESC) in order to see the text values among the numeric and to get an idea of the max value in my current data set.
Total Precipitation sorted descending
You can see the "T" (trace precipitation) and "M" (malfunction?) values in that field. The code below will ensure that any of the text values I have found will be converted to NULL, including cells that are empty, but aren't actually NULL either.
SELECT CASE WHEN TRIM(preciptotal) IN ('T','M','err','') THEN NULL
ELSE CAST(preciptotal AS NUMERIC(5,2)) END AS preciptotal
Using this method for each field I want to keep, I'm able to quickly build a SELECT statement that will clean and load the data from the staging table to my production table. The code below is a portion of the function I created that handles the entire import process. Given 12 months of daily weather data, I'm able to import, process and load the 438,202 records in less than two minutes!
INSERT INTO noaa_daily (wban, dateID, tmax, tmin, tavg,
preciptotal, stnpressure, sealevel, snowfall)
CASE WHEN TRIM(staging.tmax) IN ('M','') THEN NULL
ELSE CAST(staging.tmax AS SMALLINT) END AS tmax,
CASE WHEN TRIM(staging.tmin) IN ('M','') THEN NULL
ELSE CAST(staging.tmin AS SMALLINT) END AS tmin,
CASE WHEN TRIM(staging.tavg) IN ('M','') THEN NULL
ELSE CAST(staging.tavg AS SMALLINT) END AS tavg,
CASE WHEN TRIM(staging.preciptotal) IN ('T','M','err','') THEN NULL
ELSE CAST(staging.preciptotal AS NUMERIC(5,2)) END AS preciptotal,
CASE WHEN TRIM(staging.stnpressure) IN ('M','') THEN NULL
ELSE CAST(staging.stnpressure AS NUMERIC(5,2)) END AS stnpressure,
CASE WHEN TRIM(staging.sealevel) IN ('M','') THEN NULL
ELSE CAST(staging.sealevel AS NUMERIC(5,2)) END AS sealevel,
CASE WHEN TRIM(staging.snowfall) IN ('M','T','') THEN NULL
ELSE CAST(staging.snowfall AS NUMERIC(5,2)) END AS snowfall
FROM noaa_daily_staging staging
LEFT JOIN reporting.calendar cal
ON TO_DATE(staging.yearmonthday,'yyyymmdd') = cal.date
LEFT JOIN noaa_daily prod
ON staging.wban = prod.wban AND cal.dateID = prod.dateID
WHERE prod.dateID IS NULL AND cal.dateID IS NOT NULL
TRUNCATE TABLE noaa_daily_staging;
One final note on the query above is to look at the last LEFT JOIN which I have aliased "prod". That's the same table that I'm inserting data to and joining to it during the INSERT command provides a very efficient method of ensuring that I don't load duplicate data. I've never found an efficient method of accomplishing this directly in a programming language.
No matter how small your data set is, you will likely want to join, or combine, data sets together. This is one of the biggest strengths of relational databases, since joins are all about relationships. Let's say you have a database filled with 400,000 records of daily weather data but you only want to work with weather data from the state of Alaska. Here's a quick way to accomplish this in the database. With the full year of 2014 loaded to the table reporting.daily_weather, this query returns 20,714 results in about three seconds.
SELECT *queries should never be used in production code!
FROM reporting.daily_weather w
INNER JOIN reporting.station s ON w.stationid = s.stationid
WHERE s.state = 'AK'
Using a programming language to do the same thing is just unbearable to me. I tried using Pandas in Python for tasks like this and while I can get the job done, it takes me an order of magnitude longer to write the code, and then because I don't have enough RAM to load the entire dataset at once it means that I'm looping through batches, and that stinks badly.
I regularly work in three database platforms (MS SQL, PostgreSQL and MySQL) and four programming languages (Python, PHP, C#, VB). I can easily and happily switch between the three databases knowing that 99% of my code follows the ANSI SQL standard. This doesn't mean there aren't any headaches, for example: ISNULL() vs. NVL() vs. COALESCE(). (Do you know which one you should use? COALESCE is the ANSI standard, use it!)
Programming languages all have their own dialect, rules and built in functions. Some lean heavily on libraries or 3rd party modules/plugins for their functionality. That's all good and well, but it does complicate things when all you want to do is work with data quickly. To illustrate this, let's look at a few "Hello, World!" examples.
Ever seen a database tutorial for Hello World? Here you go:
SELECT 'Hello, World!';
SELECT 'Hello, World!';
(I think you get the idea?)
While this might be a ridiculous example, it does illustrate the common querying language. Since querying the database is one of the most common operations in the database... :)
The same is not true in programming languages.
print ('Hello, World!')
<?php echo 'Hello, World!'; ?>
puts 'Hello, World!'
You can see that even some as diluted as a Hello, World example shows the incredible syntactical variance across languages. This isn't a drawback to any one of the languages. After all, I can't build a user interface in the database. I don't want to either, because the database has one job: Data.
I'm firmly in the camp of letting a proper database manage your data. Yes, you can manage data sets using other methods but they always seem clunky and slow to me. Outside of the database I feel like I'm fighting with the data, instead of bending it to my will.
Let the database manage the data. Let your other languages do all the other cool stuff, just stop trying to take the database out of the database.
Published February 01, 2015
Last Updated July 28, 2016