PostgreSQL Setup on Debian
Warning: This post is outdated. It is here for reference purposes only.
This post covers how to install, set up, and configure a functional PostgreSQL database on a Debian 7 server. I will be doing this on a virtual machine in VirtualBox, but these steps should be valid for any Debian server. If you have never set up a server like this, first read my 3-part series that covers setting up a basic Debian 7 virtual machine. The first step is to install Postgres, which is very easy. Then we will move on to configuring the server, creating databases and users, and making it possible to connect to the server remotely.
I can't go over every possible option in this post, but I will do my best to show a reasonable way to set up a database server. You should evaluate all of my recommendations with your needs and policies in mind.
Data Processing in Python: PyTables vs PostgreSQL
One of the challenges when working with data is processing large amounts of it. Parsing out the data you really want, cleaning it up, and then being able to work with it effectively are the key components to consider. In this post I'm going to try out PyTables, which uses HDF5 storage, and compare it with a popular relational database, PostgreSQL. I will be looking at how long it takes to load the data from its raw form (CSV format in a .txt file), how much space it takes on the server, and how easy it is to process and query the data once it is loaded.
The data I will be using for this test is weather data from NOAA. I am using data included in the QCLCD201312.zip archive, specifically the 201312hourly.txt and 201312station.txt files.
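To give a feel for the load-time measurement described above, here is a minimal sketch of a timing harness. It uses only the Python standard library and loads rows into a plain list as a stand-in for a real PyTables or PostgreSQL loader. The function names, the sample rows, and the column names (WBAN, Date, DryBulbCelsius) are assumptions made for illustration and are not taken from the original post or checked against the NOAA files.

```python
import csv
import io
import time


def timed_load(text_stream, loader):
    """Parse CSV rows from text_stream, pass them to loader, and
    return (row_count, elapsed_seconds)."""
    start = time.perf_counter()
    rows = csv.DictReader(text_stream)
    count = loader(rows)
    elapsed = time.perf_counter() - start
    return count, elapsed


def load_into_list(rows):
    """Hypothetical stand-in loader: keeps every row in memory.
    A real test would write to PyTables or PostgreSQL here."""
    store = [row for row in rows]
    return len(store)


# Made-up sample data; the column names are illustrative assumptions.
SAMPLE = (
    "WBAN,Date,DryBulbCelsius\n"
    "00001,20131201,1.5\n"
    "00001,20131202,2.0\n"
)

count, seconds = timed_load(io.StringIO(SAMPLE), load_into_list)
print(f"loaded {count} rows in {seconds:.6f}s")
```

Swapping a different loader function into the same harness would keep the timing comparable, since the CSV parsing cost is included in both runs.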
Python and Databases
With all the cool things I recently discovered with Python, and some headaches with one of our systems at work, I wanted to make a case for setting up a dedicated Python Notebook server for my department. Before I get into the fun details of testing the setup, I should probably explain our needs. My goal is to replace our (very expensive) SPSS licenses with something less expensive. It just so happens that Python, Pandas, Matplotlib, IPython and all the other goodies come at the correct price of free. The only cost that should come out of this is the virtual server needed to run it.
To evaluate whether Python could provide a viable alternative, I had to ask: what do we currently use SPSS for? Well, we load data from CSV format into MS SQL and run an occasional statistical analysis. That's about it.
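As a rough illustration of that workflow (read a CSV, then run a basic statistical summary), here is a minimal sketch using only the Python standard library. It does not use Pandas or MS SQL, and the function name, the sample data, and the column names are hypothetical.

```python
import csv
import io
import statistics


def summarize_column(text_stream, column):
    """Read a CSV from text_stream and return basic descriptive
    statistics for one numeric column, skipping blank cells."""
    values = [
        float(row[column])
        for row in csv.DictReader(text_stream)
        if row[column] != ""
    ]
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }


# Made-up survey-style data; the last row has a missing score.
SAMPLE_CSV = "respondent,score\n1,10\n2,20\n3,30\n4,\n"

summary = summarize_column(io.StringIO(SAMPLE_CSV), "score")
print(summary)
```

For anything beyond simple summaries, a library such as Pandas would likely be the more practical choice. The standard library version is used here only to keep the sketch self-contained.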
Fun With Python
Warning: This post is outdated. It is here for reference purposes only.
I think most people who write code have a favorite go-to language. For years my language of choice has been PHP. I always defaulted to PHP because I had done so much in it that it came naturally, and I could focus on my task, which is writing code. I think it's a good thing to use familiar tools because it makes you more efficient at solving the problem at hand, but it's also vital to see what else is out there. I haven't been unhappy with PHP, but I had been wanting an excuse to dive into Python for a while, and a few weeks ago a reason to seriously give Python a try came up. I had a project that needed to get done, and it required working with a lot of data, doing some statistical analysis, and generating charts.
Watch out for Shellshock
Just in case you haven't heard yet... there's another security concern to worry about, called Shellshock, and it's a big one! This bug affects Bash, and since plenty of people are already talking about it, I won't spend time elaborating further here. Learn how to secure your Linux servers against it. The patch is only partial at this time, but it's better than nothing.