An Introduction to PostgreSQL for Data Scientists

Mar 11, 2017 · 765 words · 4 minutes read · data processing, postgres

Data should be stored in databases. I often talk with data scientists who use databases, but only through high-level interfaces. There’s a lot of value in knowing how to stand up your own database and learning about some of the things that go on behind the scenes. Even if you won’t be creating databases at your job, it’s worthwhile to spend some time thinking about data storage from an engineering perspective.

PostgreSQL

There are a lot of database options to choose from, but my flavor of choice is PostgreSQL, an open source, enterprise-class object-relational database system.

It comes with a lot of great features: it’s ACID compliant, highly conformant with the ANSI SQL standard, and supports extensions. Unlike many other database systems, PostgreSQL has an incredibly open license, which means you can use PostgreSQL without worrying about breaking license agreements.

Installing PostgreSQL

I’m installing PostgreSQL on a Mac, so I’ll use homebrew. Check out the PostgreSQL download page if you’re not familiar with homebrew or are on a different OS.

Update homebrew and upgrade any unpinned brews, then install PostgreSQL. Optionally, you may want to clean up old, unused formulae.

brew update && brew upgrade
brew install postgresql
brew cleanup # optional

You can verify that PostgreSQL is installed by checking the version.

postgres -V

Creating your first database

The first thing we’ll do is create a new database cluster. A cluster is just a set of databases that are centrally managed by a single server. The database cluster is stored in a subdirectory of my user directory. Note that I’m not using ~ for path expansion; it’s better here to specify the full path.

pg_ctl initdb --pgdata=/Users/ellisvalentiner/postgres_demo

After creating the database cluster, we need to start the server, which will allow us to create databases within it. I’m specifying the location of the database cluster (-D or --pgdata) and where to capture the server log (-l or --log).

pg_ctl -D /Users/ellisvalentiner/postgres_demo -l logfile start

Using the PostgreSQL interactive terminal, psql, we can list all available databases. By default, there are databases named postgres, template0, and template1.

psql -l
#    Name    |      Owner      | Encoding |   Collate   |    Ctype    |          Access privileges          
# -----------+-----------------+----------+-------------+-------------+-------------------------------------
#  postgres  | ellisvalentiner | UTF8     | en_US.UTF-8 | en_US.UTF-8 |
#  template0 | ellisvalentiner | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/ellisvalentiner                 +
#            |                 |          |             |             | ellisvalentiner=CTc/ellisvalentiner
#  template1 | ellisvalentiner | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/ellisvalentiner                 +
#            |                 |          |             |             | ellisvalentiner=CTc/ellisvalentiner
# (3 rows)

You could use any of these databases out of the box, but it is better to create a new one with a meaningful name for your project. We can either connect to the database cluster and execute a CREATE DATABASE statement or use the createdb convenience command from the shell.

createdb project_db
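If you’d rather connect to the cluster and issue the statement yourself, the equivalent SQL looks like this (createdb is just a wrapper around this statement):

```sql
-- From psql, connected to the default "postgres" database:
CREATE DATABASE project_db;
```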

Now we’ve got a database cluster running locally, with a database for our project. Awesome.

Connecting to the database

Over in R, load dplyr and use the src_postgres function to connect to the database:

# install.packages("RPostgreSQL")
library(dplyr)
project_db <- src_postgres(dbname = "project_db", host = "127.0.0.1",
                           port = "5432", user = "ellisvalentiner")

or in Python, use psycopg2:

import psycopg2
conn = psycopg2.connect("dbname=project_db user=ellisvalentiner")

then proceed as you usually would.
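For instance, a minimal round trip with psycopg2 might look like the following sketch; it assumes the server from above is running locally, and the query is just an illustration (it only asks the server for its version string):

```python
import psycopg2

# Connect to the project database created earlier (hypothetical local setup)
conn = psycopg2.connect("dbname=project_db user=ellisvalentiner")
cur = conn.cursor()

# Run a trivial query to confirm the connection works
cur.execute("SELECT version();")
print(cur.fetchone()[0])  # prints the server's version string

cur.close()
conn.close()
```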

Enabling external connections

Running a database cluster locally is fine, but chances are you’ll want to run it on AWS or will be collaborating with others. In either case, you’ll need to tweak some configuration settings to enable external connections.

Navigate to your PostgreSQL database cluster directory (e.g. /Users/ellisvalentiner/postgres_demo). There are two files that you’ll need to edit: pg_hba.conf and postgresql.conf.

pg_hba.conf is the client authentication configuration file – by default, only connections from the localhost are allowed. You can allow external access to the database by editing this file. Generally you’d want to be restrictive about which hosts can connect, but you can also generously allow connections from anyone (note that the trust method lets clients connect without a password):

host    project_db  all 0.0.0.0/0   trust
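A more restrictive entry might limit access to a single database, a single user, and a specific subnet, and require password authentication; the user and subnet below are just examples:

```
# TYPE  DATABASE    USER             ADDRESS         METHOD
host    project_db  ellisvalentiner  192.168.1.0/24  md5
```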

Next you need to edit the postgresql.conf file, which contains the server configuration. Scroll down to about line 60 and add the IP addresses you want the server to listen on. You can simply listen on all addresses if you aren’t too worried about security:

listen_addresses = '*'
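If you do care about security, you can instead list specific addresses to bind to. These are the addresses of the server’s own network interfaces, not of the clients; the IP below is hypothetical:

```
listen_addresses = 'localhost, 192.168.1.10'
```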

Finally you need to restart the database cluster for the server configuration changes to take effect:

pg_ctl -D /Users/ellisvalentiner/postgres_demo restart
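You can then check that the server is reachable over the network, for example by listing its databases from another machine (the host address below is hypothetical — substitute your server’s actual IP):

```shell
psql -h 192.168.1.10 -p 5432 -U ellisvalentiner -l
```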

Wrap up

Standing up a PostgreSQL database cluster is pretty straightforward. You can use it to store data locally, or configure it for external connections and share data with colleagues. If you start using databases for your projects, you’ll start thinking more about the data storage and architecture decisions that an engineer considers.