An Introduction to PostgreSQL for Data Scientists
Mar 11, 2017 · 765 words · 4 minutes read
Data should be stored in databases. Often I talk with data scientists who use databases but only through high-level interfaces. There’s a lot of value in knowing how to standup your own database and learning about some of the things that go on behind-the-scenes. Even if you won’t be creating databases at your job, it’s worthwhile to spend some time thinking about data storage from an engineering perspective.
PostgreSQL
There’s a lot of database options to choose from but my flavor of choice is PostgreSQL, an open source, enterprise class object-relational database system.
It comes with a lot of great features, it’s fully compatible with the ANSI standard, ACID compliant, and has extensions. Unlike other database systems, PostgreSQL has an incredibly open license. This means you can use PostgreSQL without worrying about breaking license agreements.
Installing PostgreSQL
I’m installing PostgreSQL on a Mac, so I’ll use homebrew. Check out the PostgreSQL download page if you’re not familiar with homebrew or are on a difference OS.
Update homebrew and upgrade any unpinned brews, then install PostgreSQL. Optionally, you may want to clean up old, unused formulae.
brew update && brew upgrade
brew install postgresql
brew cleanup # optional
You can verify that PostgreSQL is installed by checking the version.
postgres -V
Creating your first database
The first thing we’ll do is create a new database cluster. A cluster is just a set of databases that are centrally managed by a single server. The database cluster is stored in a subdirectory of my user directory. Note that I’m not using ~
for path expansion, it’s better here to specify the full path.
pg_ctl initdb --pgdata=/Users/ellisvalentiner/postgres_demo
After creating the database cluster, we need to start the server, which will allow us to create databases within it. I’m specifying the location of the database cluster (-D
or --pgdata
) and where to capture the server log (-l
or --log
).
pg_ctl -D /Users/ellisvalentiner/postgres_demo -l logfile start
Using the PostgreSQL interactive terminal, psql
, we can list all available databases. By default, there are databases named postgres
, template0
, and template1
.
psql -l
# Name | Owner | Encoding | Collate | Ctype | Access privileges
# -----------+-----------------+----------+-------------+-------------+-------------------------------------
# postgres | ellisvalentiner | UTF8 | en_US.UTF-8 | en_US.UTF-8 |
# template0 | ellisvalentiner | UTF8 | en_US.UTF-8 | en_US.UTF-8 | =c/ellisvalentiner +
# | | | | | ellisvalentiner=CTc/ellisvalentiner
# template1 | ellisvalentiner | UTF8 | en_US.UTF-8 | en_US.UTF-8 | =c/ellisvalentiner +
# | | | | | ellisvalentiner=CTc/ellisvalentiner
# (3 rows)
You could use any of these database out of the box, but it is better to create a new one with a meaningful name for your project. We can either connect to the database cluster and execute a CREATE DATABASE
statement or use the createdb
convenience function from bash.
createdb project_db
Now we’ve got database cluster, running locally, with a database for our project. Awesome.
Connection to the database
Over in R, load dplyr
and use the src_postgres
function to connect to the database:
# install.packages("RPostgreSQL")
library(dplyr)
project_db <- src_postgres(dbname = "project_db", host = "127.0.0.1",
port = "5432", user = "ellisvalentiner")
or in python, use psycopg2
:
import psycopg2
conn = psycopg2.connect("dbname=project_db user=ellisvalentiner")
then proceed as you usually would.
Enabling external connections
Running a database cluster locally is fine, but chances are you’ll want to run it out of AWS or will be collaborating with others. In any case, you’ll need to tweak some configuration settings enable external connections.
Navigate to your PostgreSQL database cluster directory (e.g. /Users/ellisvalentiner/postgres_demo
). There are two files that you’ll need to edit: pg_hba.conf
and postgresql.conf
pg_hba.conf
is the client authorization configuration file – by default only connections from the localhost are allowed. You can allow access to the database by editing the config file. Generally you’d want to be restrictive in allowing connections but you can also generously allow connections from anyone:
host project_db all 0.0.0.0/0 trust
Next you need to edit the postgresql.conf
file. It contains the server configuration. Scroll down to about line 60 and add the IP addresses you want the server to listen on. You can simply listen to all addresses if you aren’t too worried about security:
listen_addresses = '*'
Finally you need to restart the database cluster for the server configuration changes to take effect:
pg_ctl -D postgres_demo restart
Wrap up
Standing up a PostgreSQL database cluster is pretty straightforward. You can use it to store data locally or configure it for external connections and share data with collegues. If you start using databases for your projects, you’ll start thinking more about the data storage and architecture decisions that an engineer considers.