Getting Data into R Part 2 - Databases
Feb 23, 2017 · 598 words · 3 minutes read
Flat files are great, except they’re not. There is an RFC specification for CSVs (RFC 4180), but pretty much no one follows it. So every CSV is its own unique snowflake, and different applications will sometimes parse the same file differently. In practice it’s common (and much better) to leverage databases to store and access data. There are several good packages for accessing databases from R.
DBI
The DBI package (database interface) provides the scaffolding for interfacing with database management systems (DBMS). DBI defines methods for general tasks, such as
- managing connections (connecting/disconnecting)
- executing statements
- fetching results (including error/exception handling)
Managing connections
Managing database connections is pretty straightforward. Just use the dbConnect and dbDisconnect functions. Be sure to disconnect from the database when you’re finished so you don’t leave a connection hanging! Inside a function, on.exit guarantees the disconnect happens even if an error occurs:
conn <- dbConnect(...)
on.exit(dbDisconnect(conn))
Executing statements
Executing statements, such as queries, is similarly easy using the dbSendQuery function. The function only submits and synchronously executes the SQL statement; it does not return the results.
res <- dbSendQuery(conn, "SELECT * FROM tbl;")
Fetching results
Fetch the results from the last executed statement using dbFetch. The result set is not closed automatically, so be sure to call dbClearResult afterward.
dbFetch(res)
dbClearResult(res)
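dbFetch also takes an n argument, which is handy for processing large result sets in chunks instead of loading everything into memory at once. A minimal sketch (the chunk size here is arbitrary):

res <- dbSendQuery(conn, "SELECT * FROM tbl;")
while (!dbHasCompleted(res)) {
  chunk <- dbFetch(res, n = 500)  # fetch at most 500 rows per iteration
  # ... process the chunk ...
}
dbClearResult(res)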
Made even simpler
The above steps can be combined using the dbGetQuery function, a wrapper around dbSendQuery, dbFetch, and dbClearResult.
res <- dbGetQuery(conn, "SELECT * FROM tbl;")
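dbSendQuery also pairs nicely with parameterized queries, which keep user input out of the SQL string itself. A hedged sketch using DBI’s dbBind (placeholder syntax varies by driver; ? is common with ODBC, and the column name here is made up):

res <- dbSendQuery(conn, "SELECT * FROM tbl WHERE id = ?")
dbBind(res, list(42))  # bind the parameter value safely
dbFetch(res)
dbClearResult(res)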
The general paradigm is that DBI exposes a common API, while DBMS-specific drivers (e.g. Postgres, SQL Server) implement those methods for a given system. This means you also need to specify a driver to set up your database connection.
ODBC
Open Database Connectivity (ODBC) is a low-level, high-performance interface that enables access to a variety of DBMS. The odbc R package builds on the nanodbc C++ library and supplies a DBI-compliant interface to ODBC drivers.
Now you can connect to your database by specifying the driver and database connection information.
library(DBI)
conn <- dbConnect(odbc::odbc(),
                  driver = "PostgreSQL Driver",
                  database = "psql_db",
                  uid = "postgres",
                  pwd = "password",
                  host = "localhost",
                  port = 5432)
Alternatively, pass a full connection string (built with paste0 here so no stray newlines end up in the string):

conn <- dbConnect(odbc::odbc(),
                  .connection_string = paste0("Driver={PostgreSQL Driver};",
                                              "Uid=postgres;Pwd=password;",
                                              "Host=localhost;Port=5432;",
                                              "Database=psql_db;"))
In practice you wouldn’t want to expose your database credentials in your code, so you might use a data source name (DSN) configuration file or store credentials as secure environment variables.
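For example, a minimal sketch assuming a DSN named my_dsn is defined in your ODBC configuration, and that you’ve set DB_USER and DB_PASS environment variables yourself:

conn <- dbConnect(odbc::odbc(),
                  dsn = "my_dsn",               # hypothetical DSN
                  uid = Sys.getenv("DB_USER"),  # credentials kept out of code
                  pwd = Sys.getenv("DB_PASS"))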
Note that odbc requires some system-level setup. On macOS:
# Install the unixODBC library
brew install unixodbc
# SQL Server ODBC Drivers (Free TDS)
brew install freetds --with-unixodbc
# PostgreSQL ODBC Drivers
brew install psqlodbc
# MySQL ODBC Drivers (and database)
brew install mysql
# SQLite ODBC Drivers
brew install sqliteodbc
and then be sure to configure your ~/.odbcinst.ini file:
[PostgreSQL]
Driver = /usr/local/lib/psqlodbcw.so
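If you go the DSN route, you can also define one in ~/.odbc.ini; the name and settings below are hypothetical:

[my_dsn]
Driver     = PostgreSQL
Servername = localhost
Port       = 5432
Database   = psql_db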
pool
But wait, there’s more!
It’s important to manage your database connections and prevent connection leaks.
The pool package introduces a method for caching database connections so they can be reused. Establishing a connection is usually the performance bottleneck, relative to executing the actual query, so connection pools improve performance by maintaining a set of live connections that can grow or shrink with demand.
How does all of this work in practice?
Let’s assume you’re running a local Postgres database. Just set up the connection as you normally would, but use dbPool instead of dbConnect.
library(DBI)
library(odbc)
library(pool)
pool <- dbPool(odbc::odbc(),
               driver = "PostgreSQL",
               database = "mydb")
res <- dbGetQuery(pool, "SELECT * FROM ...")
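Queries made through the pool check out a connection and return it automatically. If you need a dedicated connection (say, for a transaction), you can borrow one explicitly, and you should close the pool when you’re finished. A short sketch using pool’s checkout/return helpers:

conn <- poolCheckout(pool)  # borrow a dedicated connection
# ... run statements that must share one connection ...
poolReturn(conn)            # hand it back to the pool
poolClose(pool)             # shut down the pool when done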
Databases are ubiquitous. They impose structure and organization on your data storage – there’s no need for horridly named data files cluttering up your project directory! Databases are generally easier to scale, yield better performance, and can be more secure.