R dependency hell
Aug 2, 2016 · 726 words · 4 minutes read
Package management in R is terrible. Don’t get me wrong, R has a lot of other great features but when it comes to package management it is far behind other languages.
In R, packages are installed in libraries. Libraries are just directories in the file system with subdirectories for each package installed there. Packages are installed with install.packages("pkg") and loaded with the library(pkg) function.
You can specify a number of options, like the repository to install the package from (e.g. CRAN, Bioconductor) and the local library location, but you can’t (easily) declare exact package versions.
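To make the "libraries are just directories" point concrete, here is a minimal sketch using only base R. The `project_lib` location is a throwaway directory made up for illustration, and the install call is left commented out because it needs network access.

```r
# List the library directories R currently searches, in order.
print(.libPaths())

# Create a throwaway project-specific library (hypothetical location).
project_lib <- file.path(tempdir(), "project-lib")
dir.create(project_lib, recursive = TRUE, showWarnings = FALSE)

# Put it first on the search path so installs and loads prefer it.
.libPaths(c(project_lib, .libPaths()))

# Installing into it would look like this ("pkg" is a placeholder name):
# install.packages("pkg", lib = project_lib,
#                  repos = "https://cloud.r-project.org")

# Report which version of each package is installed, and in which library.
installed <- as.data.frame(
  installed.packages()[, c("Package", "Version", "LibPath")],
  stringsAsFactors = FALSE
)
print(head(installed))
```

Note that nothing here lets you ask for a particular version: install.packages() takes whatever the repository currently serves.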
Now there is another way to go about installing packages: the devtools package by Hadley Wickham provides several convenience functions for installing packages from external repositories (e.g. GitHub, Bitbucket). Installing a package from a git repository supports git references (commits, tags, or branches).
devtools::install_github("usr/repo@ref")
This works sort of fine. There is also an install_version() function in devtools to install specific versions from CRAN, but both of these functions require a lot of manual setup. By this I mean that it isn’t a very transportable solution.
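The kind of manual setup this involves might look like the sketch below. The `pins` vector, the two helper functions, and the package versions are all illustrative assumptions, not a recommended set; only devtools::install_version() and base R's packageVersion() are real APIs here. The final call is commented out because it needs devtools and network access.

```r
# Hypothetical pins for illustration; substitute the versions your project needs.
pins <- c(ggplot2 = "2.1.0", dplyr = "0.5.0")

# TRUE when `pkg` is installed at exactly `version`.
has_exact_version <- function(pkg, version) {
  installed <- tryCatch(utils::packageVersion(pkg), error = function(e) NULL)
  !is.null(installed) && installed == package_version(version)
}

# Install any pinned package that is missing or at a different version.
install_pins <- function(pins, repos = "https://cloud.r-project.org") {
  for (pkg in names(pins)) {
    if (!has_exact_version(pkg, pins[[pkg]])) {
      devtools::install_version(pkg, version = pins[[pkg]], repos = repos)
    }
  }
  invisible(pins)
}

# install_pins(pins)  # needs devtools and network access
```

Every collaborator would have to keep and run a script like this by hand, and it says nothing about the versions of the dependencies those packages pull in.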
Enter Packrat and Checkpoint.
Packrat and Checkpoint are package management packages from RStudio and Revolution Analytics (respectively). They both try to overcome the shortcomings of package management in R but do so in different ways.
Packrat
Packrat sets up a library as a subdirectory of the project. This contains the package source and binary files for the project. A separate file, packrat.lock, lists all of the packages and their versions, including all of their dependencies. This file is created and updated by Packrat automatically or when you call the packrat::snapshot() function.
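A typical Packrat session can be sketched roughly as follows. The wrapper function names are made up for this example; packrat::init(), packrat::snapshot(), and packrat::restore() are the calls doing the work. Everything is wrapped in functions so that nothing touches your project until you call it.

```r
# A sketch only; requires the packrat package when the functions are called.

# Run once in a new project: creates the private library and the lockfile.
start_project <- function(project = ".") {
  packrat::init(project = project)
}

# After installing or removing packages: record the current state.
record_state <- function(project = ".") {
  packrat::snapshot(project = project)
}

# On another machine: reinstall the versions recorded in the lockfile.
rebuild_library <- function(project = ".") {
  packrat::restore(project = project)
}
```

In practice you would call start_project() once, record_state() after changing packages, and rebuild_library() after checking the project out somewhere else.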
Despite all of the above and integration with RStudio, Packrat has some problems. Namely, it is still difficult to install exact versions of packages, especially because Packrat may upgrade dependencies without asking. I’ve found this to be a problem when installing packages that aren’t on CRAN.
If you are installing packages from private GitHub repositories, be sure to generate an auth token and set GITHUB_PAT in your .Renviron file. Unfortunately, auth tokens for private repositories in Bitbucket aren’t supported yet (but probably will be soon).
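A small guard like the one below can save some head-scratching when the token is missing. This is a sketch: `has_github_pat()` and `install_private()` are hypothetical helpers, and "usr/repo@ref" is the same placeholder used earlier in this post. The .Renviron entry itself is a single line of the form GITHUB_PAT=<your token>.

```r
# TRUE when a GitHub token is available to the current R session.
has_github_pat <- function() {
  nzchar(Sys.getenv("GITHUB_PAT"))
}

# Install from a private repository, failing early with a clear message
# if the token is missing. Needs devtools and network access when called.
install_private <- function(repo = "usr/repo@ref") {
  if (!has_github_pat()) {
    stop("GITHUB_PAT is not set; add it to ~/.Renviron and restart R.")
  }
  devtools::install_github(repo, auth_token = Sys.getenv("GITHUB_PAT"))
}
```

R reads .Renviron at startup, so a newly added token only shows up after restarting the session.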
Packrat can work really well, especially if you start your project with Packrat enabled. However, turning it on mid-project or using packages from private repositories can make things complicated. This isn’t entirely Packrat’s fault, as it uses functions from devtools to install packages from GitHub and Bitbucket – and these functions greedily upgrade dependencies (even when it isn’t necessary) by default.
Checkpoint
Checkpoint uses a daily snapshot of CRAN. A library is created for each snapshot date and can be easily shared across projects. This simplifies package management quite a bit because you only have to specify a date. However this means Checkpoint can’t mix and match packages from different dates.
Checkpoint also only works with CRAN, so packages stored elsewhere (e.g. Bioconductor, GitHub, etc.) have to be managed separately. This can especially lead to complications with their dependencies.
One trick is to manually install packages from non-CRAN repositories using devtools and explicitly not install or upgrade dependencies.
library(checkpoint)
checkpoint("2016-07-30")
devtools::install_github("usr/repo@ref",
                         dependencies = FALSE,
                         upgrade_dependencies = FALSE)
However, the whole point of package management tools is to reduce the manual effort.
Someone had better figure this out soon
R has many other tools that make it great for reproducible research (e.g. R Markdown). More and more researchers are publishing their R code. But without better package management, researchers will be disappointed.
Other languages have pretty much figured this out. Everyone is familiar with using pip and requirement files in Python. Julia ships with a Pkg module with functions for package management.
Going forward it is hard to tell what the end result will be. Researchers and data scientists may leave R as other languages develop more sophisticated analytical capabilities. SciPy is already robust and Julia’s statistical libraries continue to grow.
The existing package management tools may be extended or new tools may be developed. For some R users the existing tools may already be sufficient, but still many do not use them. In part this is due to undefined best practices. As these tools mature their adoption may become more widespread.
Finally issues of package management may continue to be largely ignored. Researchers may continue to rely on R. Reproducible research may continue to suffer. The world may continue to turn.