Docker for R package management

Oct 28, 2016 · 474 words · 3 minutes read docker • package management • R

I’ll be honest: I didn’t get the Docker hype for a long time. I’m not even sure I really get it now, but let’s dive in.

Package management in R is abysmal, although it is getting better as more packages are made available on GitHub. Revolution Analytics created checkpoint and RStudio developed packrat in attempts to solve the problem, but adoption has been relatively poor. Neither does a great job of supporting private packages, although packrat technically does (but only if your code is on GitHub).

[I’ve complained about this before.]({% post_url 2016-08-02-dpendency-hell %})

Introducing a Docker solution

Docker is a service-oriented containerization solution. Put simply, Docker lets you run a minimal service (some code) with minimal overhead. There’s a lot of great information coming out about using Docker from places such as Digital Ocean.

The first thing we need to do is install Docker and some helpers.

brew install docker
brew install docker-machine
brew install docker-compose # optional

Next, install VirtualBox, which the virtualbox driver depends on. You can use other drivers if you want (e.g. generic, Digital Ocean, AWS, etc.).

brew cask install virtualbox

Now we have to create a docker machine using the virtualbox driver and give it a name. I’m calling mine dev. We also want to start the docker machine and set environment variables for connecting to dev.

docker-machine create --driver virtualbox dev
docker-machine start dev
eval $(docker-machine env dev)
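
If later docker commands cannot reach the daemon, the usual culprit is that the eval line was not run in the current shell. A minimal sketch of a sanity check follows; the function name check_docker_env is my own, and it assumes docker-machine env exports the four variables listed in the loop.

```shell
# Report which of the variables exported by `docker-machine env` are missing.
# Returns 0 when all four are set and non-empty, 1 otherwise.
# (check_docker_env is a hypothetical helper name, not part of Docker.)
check_docker_env() {
  missing=0
  for name in DOCKER_HOST DOCKER_CERT_PATH DOCKER_TLS_VERIFY DOCKER_MACHINE_NAME; do
    # Indirect lookup: copy the value of the variable named by $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name"
      missing=1
    fi
  done
  return "$missing"
}
```

Calling check_docker_env right after the eval line should print nothing; any "missing:" line means the shell is not yet pointed at dev.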

Finally, you should be able to run the r-base container. Note that the first run has to pull the image, which may take a while.

docker run -it --memory=4g --memory-swap=4g \
  --entrypoint=/bin/bash \
  rocker/r-base:latest

This dumps you into a bash shell. Change the entrypoint to /usr/bin/R if you want to enter directly into an R session.
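
The same entrypoint trick works for one-off, non-interactive runs. Here is a rough sketch of a wrapper that evaluates a single R expression in a throwaway container with the current directory mounted. The function name run_r, the /work mount point, and the DRY_RUN switch are my own conventions, and it assumes Rscript lives at /usr/bin/Rscript in the image, next to /usr/bin/R.

```shell
# Evaluate one R expression in a throwaway rocker/r-base container.
# The current directory is mounted at /work (an arbitrary choice).
# With DRY_RUN=1 the docker command is printed instead of executed.
# (run_r and DRY_RUN are hypothetical names used for illustration.)
run_r() {
  # Build the full command as the positional parameters.
  set -- docker run --rm --entrypoint=/usr/bin/Rscript \
    -v "$PWD:/work" -w /work rocker/r-base:latest -e "$1"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}
```

For example, run_r 'cat(R.version.string)' should report the R version inside the container, and prefixing the call with DRY_RUN=1 shows the command without needing a running docker machine.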

How does this help?

Now that we can run docker containers, we can create our own images. The image we create will serve as the base for developing our packages (or other images).

The image for my project, dbUtil, is based on the rocker/rstudio image and depends on dplyr 0.4.3.

Here is the Dockerfile:

FROM rocker/rstudio:latest

MAINTAINER "Ellis Valentiner" ellis.valentiner@gmail.com

# Install devtools, then pin dplyr to a specific commit before installing dbUtil.
RUN R -e 'install.packages("devtools")' \
      -e 'devtools::install_github("hadley/dplyr", ref="1405946")' \
      -e 'devtools::install_github("ellisvalentiner/dbUtil")'

I can now use that Docker image for new projects and my dependencies will be installed. This works great as a reproducible environment for development and production.

In the future, when I have a new project, I can update the dependencies but keep existing projects pointed at the prior image version.
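
Keeping old projects on the prior image only works if each build gets its own tag rather than overwriting latest. One way to do that, sketched below, is to encode the pinned dependency in the tag. The image name dbutil-base, the name:package-version tag scheme, and both helper functions are assumptions of mine, not something Docker prescribes.

```shell
# Compose an image reference such as "dbutil-base:dplyr-0.4.3".
# (image_ref and build_image are hypothetical helpers.)
image_ref() {
  printf '%s:%s-%s\n' "$1" "$2" "$3"
}

# Build the Dockerfile in the current directory under that reference.
# With DRY_RUN=1 the docker command is printed instead of executed.
build_image() {
  ref="$(image_ref "$1" "$2" "$3")"
  set -- docker build -t "$ref" .
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}
```

Running build_image dbutil-base dplyr 0.4.3 next to the Dockerfile above would tag the result dbutil-base:dplyr-0.4.3. An existing project can keep FROM dbutil-base:dplyr-0.4.3 in its own Dockerfile while a newer project builds and uses a different tag.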

Resolution

Use Docker to create a reproducible data science environment where important dependencies are managed.

Packrat might be useful for managing the dependencies at the higher-level Docker image. However, this would require a way to pre-build the packrat.lock file and populate the version requirements. It would probably be easier to identify the package version requirements locally (or in an interactive session) and then hard-code those steps in the Dockerfile.
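
As a rough sketch of that last idea, the version requirements could live in a small text file of "package version" pairs and be turned into Dockerfile lines by a script. The pins file format and the function name pins_to_dockerfile are made up for illustration. The generated lines call devtools::install_version, so they assume devtools is already installed earlier in the Dockerfile.

```shell
# Read "package version" pairs on stdin and emit one Dockerfile RUN line
# per pair. Blank lines and lines starting with "#" are skipped.
# (pins_to_dockerfile and the pins format are hypothetical.)
pins_to_dockerfile() {
  while read -r pkg version; do
    case "$pkg" in
      ''|'#'*) continue ;;
    esac
    printf "RUN R -e 'devtools::install_version(\"%s\", version = \"%s\")'\n" \
      "$pkg" "$version"
  done
}
```

Piping a line such as "dplyr 0.4.3" through pins_to_dockerfile produces a RUN line that can be pasted into the Dockerfile. The versions themselves can be read off locally with packageVersion("dplyr") in an R session.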