Docker for R package management
Oct 28, 2016 · 474 words · 3 minutes read
I’ll be honest: I didn’t get the Docker hype for a long time. I’m not even sure I really get it now, but let’s dive in.
Package management in R is abysmal, although it is getting better as more packages are made available on GitHub. Revolution Analytics created checkpoint and RStudio developed packrat in attempts to solve the problem, but adoption has been relatively poor. Neither does a great job of supporting private packages, although packrat technically does (but only if your code is on GitHub).
[I’ve complained about this before.]({% post_url 2016-08-02-dpendency-hell %})
Introducing a Docker solution
Docker is a service-oriented containerization solution. Simply put, Docker lets you run a minimal service (some code) with minimal overhead. There’s a lot of great information about using Docker coming out of places such as Digital Ocean.
The first thing we need to do is install Docker and some helpers.
brew install docker
brew install docker-machine
brew install docker-compose # optional
Next install the virtualbox driver. You can use other drivers if you want (e.g. generic, Digital Ocean, AWS, etc.).
brew install --cask virtualbox
Now we have to create a Docker machine using the virtualbox driver and give it a name. I’m calling mine dev. We also want to start the Docker machine and set the environment variables for connecting to dev.
docker-machine create --driver virtualbox dev
docker-machine start dev
eval $(docker-machine env dev)
Finally, you should be able to run the r-base container. Note that the first run has to pull the image, which may take a while.
docker run -it --memory=4g --memory-swap=4g \
--entrypoint=/bin/bash \
rocker/r-base:latest
This dumps you into a bash shell. Change the entrypoint to /usr/bin/R if you want to drop directly into an R session.
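As a rough sketch, the same command can be wrapped in a small shell function so that starting R takes a single word. The function name r_session is only an illustrative choice; the flags, image, and entrypoint path are the ones already used above.

```shell
# Hypothetical helper (not part of Docker or rocker): start an interactive
# R session in the r-base container with the same memory limits as above.
r_session() {
  docker run -it --memory=4g --memory-swap=4g \
    --entrypoint=/usr/bin/R \
    rocker/r-base:latest "$@"
}
```

Calling r_session --vanilla passes --vanilla through to R, since anything after the image name becomes an argument to the entrypoint.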
How does this help?
Now that we can run docker containers, we can create our own images. The image we create will serve as the base for developing our packages (or other images).
The image for my project, dbUtil, is based on the rocker/rstudio image and depends on dplyr 0.4.3.
Here is the Dockerfile:
FROM rocker/rstudio:latest
MAINTAINER "Ellis Valentiner" ellis.valentiner@gmail.com
RUN R -e 'install.packages("devtools")' \
-e 'devtools::install_github("hadley/dplyr.git", ref="1405946")' \
-e 'devtools::install_github("ellisvalentiner/dbUtil.git")'
I can now use that Docker image for new projects and my dependencies will be installed. This works great as a reproducible environment for development and production.
In the future, when I have a new project, I can update the dependencies but keep existing projects pointed at the prior image version.
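One way to keep projects pinned, sketched here under a few assumptions, is to tag each build of the base image with an explicit version and reference that tag in each project's Dockerfile. The image name dbutil-base, the version numbers, the directory names, and both helper functions are placeholders for illustration.

```shell
# Placeholder image name; substitute your own.
IMAGE="dbutil-base"

# Build the Dockerfile in the current directory and tag it with a version,
# e.g. `build_base 0.2.0` after updating dependencies.
build_base() {
  docker build -t "$IMAGE:$1" .
}

# Each project's Dockerfile names the exact tag it was developed against.
write_project_dockerfile() {
  mkdir -p "$1"
  printf 'FROM %s:%s\n' "$IMAGE" "$2" > "$1/Dockerfile"
}

write_project_dockerfile existing-project 0.1.0  # stays on the old image
write_project_dockerfile new-project 0.2.0       # picks up updated dependencies
```

Because the latest tag moves whenever an image is rebuilt, an explicit tag is what keeps the older project reproducible.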
Resolution
Use Docker to create a reproducible data science environment where important dependencies are managed.
Packrat might be useful for managing the dependencies at the higher-level Docker image. However, this would require a way to pre-build the packrat.lock file and populate the version requirements. It would probably be easier to identify the package version requirements locally (or in an interactive session) and then hard-code those steps in the Dockerfile.
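A minimal sketch of that last approach, assuming R is installed locally and devtools is available in the image: list the locally installed versions, then print a pinned RUN line for each package you care about. The helper names list_versions and pin_line are hypothetical, and devtools::install_version is swapped in here as an alternative to the install_github call used earlier.

```shell
# Print "package version" for every locally installed R package (requires Rscript).
list_versions() {
  Rscript -e 'ip <- installed.packages()[, c("Package", "Version")]' \
          -e 'write.table(ip, quote = FALSE, row.names = FALSE, col.names = FALSE)'
}

# Print a Dockerfile RUN line that installs one package at a fixed version.
pin_line() {
  printf "RUN R -e 'devtools::install_version(\"%s\", version = \"%s\")'\n" "$1" "$2"
}

# Example: the dplyr version this post's image depends on.
pin_line dplyr 0.4.3
```

Running list_versions | grep '^dplyr ' shows what is installed locally; the line printed by pin_line can then be pasted into the Dockerfile.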