I’ll be honest: I didn’t get the Docker hype for a long time. I’m not even sure I really get it now, but let’s dive in.
Package management in R is abysmal, although it is getting better as more packages are made available on GitHub. Revolution Analytics created checkpoint and RStudio developed packrat in attempts to solve the problem, but adoption has been relatively poor. Neither does a great job of supporting private packages, although packrat technically does (but only if your code is on GitHub).
I’ve complained about this before.
Introducing a Docker solution
Docker is a service-oriented containerization solution. Simply put, Docker lets you run a minimal service (some code) with minimal overhead. There’s a lot of great information coming out about using Docker from places such as Digital Ocean.
The first thing we need to do is install Docker and some helpers.
brew install docker
brew install docker-machine
brew install docker-compose # optional
Next install the virtualbox driver. You can use other drivers if you want (e.g. generic, Digital Ocean, AWS, etc.).
brew cask install virtualbox
Now we have to create a docker machine using the virtualbox driver and give it a name. I’m calling mine dev. We also want to start the docker machine and set the environment variables for connecting to it.
docker-machine create --driver virtualbox dev
docker-machine start dev
eval $(docker-machine env dev)
Finally, you should be able to run the r-base container. Note that the first run has to pull the image, which may take a while.
docker run -it --memory=4g --memory-swap=4g \
  --entrypoint=/bin/bash \
  rocker/r-base:latest
This dumps you into a bash shell. Change the entrypoint to /usr/bin/R if you want to enter directly into an R session.
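Switching the entrypoint can be sketched as a small wrapper. The function name run_r_base and the DRY_RUN switch are my own additions for illustration, not part of Docker or rocker; the docker run flags are the same ones used above.

```shell
# Start the rocker/r-base container with a chosen entrypoint.
# run_r_base and DRY_RUN are hypothetical names used only in this sketch.
# Set DRY_RUN=1 to print the command instead of running it.
run_r_base() {
  entrypoint="${1:-/bin/bash}"   # default to a bash shell
  set -- docker run -it --memory=4g --memory-swap=4g \
    --entrypoint="$entrypoint" rocker/r-base:latest
  if [ -n "${DRY_RUN:-}" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Print the command that would open an R session directly.
DRY_RUN=1 run_r_base /usr/bin/R
```

Drop DRY_RUN=1 to actually start the session; calling run_r_base with no argument falls back to the bash shell.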
How does this help?
Now that we can run docker containers, we can create our own images. The image we create will serve as the base for developing our packages (or other images).
The image for my project, dbUtil, is based on the rocker/rstudio image and depends on a pinned commit of dplyr. Here is the Dockerfile:
FROM rocker/rstudio:latest
MAINTAINER "Ellis Valentiner" email@example.com
RUN R -e 'install.packages("devtools")' \
      -e 'devtools::install_github("hadley/dplyr.git", ref="1405946")' \
      -e 'devtools::install_github("ellisvalentiner/dbUtil.git")'
I can now use that Docker image for new projects and my dependencies will be installed. This works great as a reproducible environment for development and production.
In the future, when I have a new project, I can update the dependencies but keep existing projects pointed to the prior image version.
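One way to keep old projects pinned is to build each revision of the image under its own tag instead of reusing latest. This is only a sketch: the image name dbutil-base, the tag 2016.01, and the helpers image_ref and build_image are placeholders I made up, and DRY_RUN is just a switch for printing the command.

```shell
# Hypothetical image name and version tag; neither comes from the project itself.
IMAGE="dbutil-base"
TAG="2016.01"

# Compose a full image reference such as dbutil-base:2016.01.
image_ref() {
  printf '%s:%s\n' "$1" "$2"
}

# Build the Dockerfile in the current directory under a versioned tag.
# Set DRY_RUN=1 to print the command instead of running it.
build_image() {
  ref="$(image_ref "$1" "$2")"
  if [ -n "${DRY_RUN:-}" ]; then
    echo "docker build -t $ref ."
  else
    docker build -t "$ref" .
  fi
}

# Show the build command for this revision of the image.
DRY_RUN=1 build_image "$IMAGE" "$TAG"
```

An existing project would then start its own Dockerfile with FROM dbutil-base:2016.01, so building a newer tag later leaves that project untouched.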
Use Docker to create a reproducible data science environment where important dependencies are managed.
Packrat might be useful for managing the dependencies at the higher-level Docker image. However, this would require a way to pre-build the packrat.lock file and populate the version requirements. It would probably be easier to identify the package version requirements locally (or in an interactive session) and then hard-code those steps in the Dockerfile.
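Hard-coding the versions could look something like the sketch below, which turns a list of "package version" pairs into Dockerfile RUN lines. The helper name emit_install_lines and the two example versions are illustrative assumptions; devtools::install_version is a real devtools function, and I am assuming the CRAN cloud mirror as the repository.

```shell
# Turn "package version" pairs on stdin into Dockerfile RUN lines.
# emit_install_lines is a hypothetical helper written for this sketch.
emit_install_lines() {
  while read -r pkg ver; do
    # Skip blank lines and comment lines.
    case "$pkg" in ''|'#'*) continue ;; esac
    printf "RUN R -e 'devtools::install_version(\"%s\", version = \"%s\", repos = \"https://cloud.r-project.org\")'\n" "$pkg" "$ver"
  done
}

# Example input; these versions are placeholders, not the project's real requirements.
printf 'dplyr 0.4.3\nDBI 0.3.1\n' | emit_install_lines
```

The printed lines can be pasted into the Dockerfile after the step that installs devtools, so every rebuild of the image installs the same versions.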