Data Processing in Python: PyTables vs PostgreSQL
One of the challenges when working with data is processing large amounts of it. Parsing out the data you really want, cleaning it up so you can work with it, and then effectively being able to work with it are key components to consider. In this post I'm going to use try out PyTables, which utilizes HDF5 storage, and compare it with a popular relational database, PostgreSQL. I will be looking at how long it takes to load the data from raw form (csv format in txt file), how much space it takes on the server, and ease of processing and querying the data once it is loaded.
The data I will be using for this test is weather data from NOAA. I am using data included in the QCLCD201312.zip archive, specifically the 201312hourly.txt and 201312station.txt files.
Planning for the Data
I need to load the 4,271,305 records from the csv data into a format I can easily query, run statistics, and chart the data. I don't plan on only loading one month of data either, I plan on loading multiple years of data. To get a quick estimate of size I will round up to 4.5 million rows per file: 4.5 million X 12 months X 5 years = 270 million records to load and process. Or, if we go off the rough file size of 525 MB we have: 525 X 12 X 5 = 31.5 GB of raw data. Not huge in our world of cheap TB drives, but still not tiny either.
Setting Up For Testing
I am using a Debian virtual machine running in VirtualBox with all the necessary Python components for PyTables and PostgreSQL 9.3 installed. For convenience, I have configured PSQL to allow my user to connect without a password. The VM has one core and 1 GB RAM with a 670MB swap file just in case. All of this is running inside my MacBook Pro with a 5800 RPM drive. In the future I want to try this out on a VPS with SSDs, more RAM and more processors to see what kind of performance increase that will give me.
import datetime as dt import os import tables import pandas as pd import sqlalchemy as sa from pandas import HDFStore conn = sa.create_engine('postgresql://127.0.0.1/ryanlambert', echo=False) year = '2013' month = '12' yearmonth = str(year) + str(month) base_path = '/home/ryanlambert/data/weather/QCLCD' + yearmonth + '/' stations_path = base_path + yearmonth + 'station.txt' weather_path = base_path + yearmonth + 'hourly.txt'
Loading to PyTables
With the initial setup for this testing out of the way, I'm going to jump into loading the text file into HDF5 using pandas and PyTables. We will need an HDF5 database file to write to, so I will create one using HDFStore (from pandas) in the local directory. The table we'll be writing to is named weather_hourly so I set it to a variable for easy reference later.
hdf = HDFStore('weather_noaa.h5') table_name = 'weather_hourly'
The next block of code handles the work of loading the data from the text file. First, I use the pandas.read_csv() method to load the data in chunks of 50,000 records with each field as string format. If you do not specify a chunk size, it will attempt to process the whole file at once which will not work if your machine doesn't have enough RAM. With this chunk size and the file I'm using there will be 86 chunks of data that it loops through to process.
weather_data = pd.read_csv(weather_path,sep=",",chunksize=50000, dtype=str) for weather_chunk in weather_data: weather_chunk = weather_chunk[['WBAN','Date','Time', 'SkyCondition', 'DryBulbFarenheit', 'RelativeHumidity', 'WindSpeed','WindDirection', 'ValueForWindCharacter', 'StationPressure','PressureTendency', 'PressureChange','RecordType','HourlyPrecip' ]] hdf.append(table_name, weather_chunk, format='table', data_columns=True, min_itemsize = 40) hdf.flush()
This is able to load the entire file to the HDF5 database from the source text file in about 6 minutes. So this was writing just under 12,000 records per second using this method.
The next consideration after load speed is how much space it takes on disk. I know HDF5 is optimized for handling large amounts of semi-structured data so I expected the file size to be larger than the source file as part of the tradeoff to gain the performance... I didn't expect it to go from 521 MB to 2.4 GB though! Thinking about 5 years of data would mean that I have to plan for 150 GB just to store the data that came from 32 GB of raw data. (This should also remind you to be wary of someone bragging that they have xx GB of data.... always ask in what format!)
Loading to PostgreSQL
Let's load the data again, but this time to a PostgreSQL database. Remember, I setup the connection string using SQLAlchemy back in the first block of code.
My expectations are that the load process will be slower than to HDF5, but the file size of the resulting table in the database should be smaller than the raw text file since I'm only loading a sub-set of the columns. The code I'm using to load to PostgreSQL is almost identical to my original for loading to HDF5. The only difference is that instead of appending to an HDF table I'm using the pandas.to_sql() method to append the data to PSQL.
weather_data = pd.read_csv(weather_path,sep=",",chunksize=50000, dtype=str) for weather_chunk in weather_data: weather_chunk = weather_chunk[['WBAN','Date','Time', 'SkyCondition', 'DryBulbFarenheit', 'RelativeHumidity', 'WindSpeed','WindDirection', 'ValueForWindCharacter', 'StationPressure','PressureTendency', 'PressureChange','RecordType','HourlyPrecip' ]] weather_chunk.to_sql('hourly_noaa', conn, if_exists='append')
Loading a meger 4,050 records per second, this method of importing took a whopping 35 minutes! The table size is a only 381 MB which is good, especially considering the data should only shrink from here as I cleanup the data and improve the datatypes used where possible.
Loading to PostgreSQL - Take 2
I'm not happy with the 5x longer load times, but I suspect that the timing above is related to the overhead of loading the data through pandas instead of using the strengths of the RDMS. I'm going to load the data to PSQL again, but this time using the native COPY command. One of the aspects I like about using Pandas was not having to define the table in advance, but this time we'll need to create the empty table first. Also, since this doesn't allow for preprocessing before loading I will accept all the columns in the file.
CREATE TABLE hourly_noaa (WBAN text, Date text, Time text, StationType text, SkyCondition text, SkyConditionFlag text, Visibility text, VisibilityFlag text, WeatherType text, WeatherTypeFlag text, DryBulbFarenheit text, DryBulbFarenheitFlag text, DryBulbCelsius text, DryBulbCelsiusFlag text, WetBulbFarenheit text, WetBulbFarenheitFlag text, WetBulbCelsius text, WetBulbCelsiusFlag text, DewPointFarenheit text, DewPointFarenheitFlag text, DewPointCelsius text, DewPointCelsiusFlag text, RelativeHumidity text, RelativeHumidityFlag text, WindSpeed text, WindSpeedFlag text, WindDirection text, WindDirectionFlag text, ValueForWindCharacter text, ValueForWindCharacterFlag text, StationPressure text, StationPressureFlag text, PressureTendency text, PressureTendencyFlag text, PressureChange text, PressureChangeFlag text, SeaLevelPressure text, SeaLevelPressureFlag text, RecordType text, RecordTypeFlag text, HourlyPrecip text, HourlyPrecipFlag text,Altimeter text ,AltimeterFlag text);
Obviously the unlimited length text fields are not ideal, but since this data hasn't been sanitized yet I'll go ahead and allow that for now. Fixing data types will be part of my data cleansing process later on along with removing the unused columns. After dropping and recreating the empty table I run the COPY command to load from CSV one last time with my fingers crossed....
psql COPY weather_noaa FROM '/data/201312hourly.txt' DELIMITER ',' CSV;
1 minute, 35 seconds!! Awesome!
That's 45,000 records per second compared to 12,000/sec for PyTables. I also checked the table size and now that I'm loading all the columns the table has grown to 654 MB. But, that is still considerably smaller than the 2.4 GB with reduced columns in the .h5 file.
Now with some data loaded I'm going to get a simple row count from each system. I run queries like this all the time against various databases while I'm exploring the data, so if PyTables isn't good with this it's going to be a big downside for me.
PyTables Row Count Code
Unless I've missed something, there isn't an easy way to count records in a PyTables table if you can't load the whole thing into RAM. Since that's the premise of this test... I don't have enough RAM and therefore I must loop through the data in batches. For this part I'm going to use a chunk size of 100,000. The following code took 1 minute, 43 seconds to count 4,271,305 records.
test_data = hdf.select(table_name, chunksize=100000) rows = 0 for chunk in test_data: rows += len(chunk) print (rows)
Note: I did experiment with varying batch sizes from 10,000 to 250,000 on both loading and retrieving data with no discernible changes.
PSQL Row Count Code
This is something that relational databases are great at, so this shouldn't take anywhere close to 2 minutes. In fact, it only took 7.5 seconds.
s = 'SELECT COUNT(*) FROM hourly_noaa' result = conn.execute(s) for row in result: print (row)
Before I can really work with the data and start to analyze it, I need to sanitize it and prepare it for use. Until I get the numeric fields down to only numeric values, I can't do much with this data. This post has already grown long enough so I will cover those steps in a future post.
Published November 24, 2014
Last Updated March 22, 2015