Using RStan to analyze the Old Faithful data

Sep 21, 2015 · 1724 words · 9 minutes read · bayesian statistics · Rstan

Stan is a probabilistic programming language used to fit full Bayesian models. It is relatively new (the initial release was in August 2012) but the language has a lot of support. It is a successor to WinBUGS, OpenBUGS, and JAGS. A notable feature of Stan is that it uses the No-U-Turn Sampler (NUTS), an adaptive variant of Hamiltonian Monte Carlo (HMC) sampling.

While I am familiar with WinBUGS and INLA, I have not fit many models using Stan. This is partly due to not working on problems that lend themselves to a full Bayesian analysis, and partly due to not investing the time to learn Stan when I do have a problem for which Bayesian modeling is appropriate. Your standard chicken-and-egg problem.

Bob Carpenter recently spoke at the Ann Arbor R User Group about Stan. His talk focused on the motivation for Stan and the computational advantages of Stan over its predecessors. Unfortunately we did not have time to hear much about actually using Stan.

Fortunately there is good documentation, and those familiar with WinBUGS should have little trouble converting. However, not all of the examples in the manual are self-contained. For instance, those who are not familiar with Andrew Gelman’s schools example should read Bayesian Data Analysis (Gelman, Carlin, Stern, Dunson, Vehtari, and Rubin, 2014) or Data Analysis Using Regression and Multilevel/Hierarchical Models (Gelman and Hill, 2006) if they want to follow along with the manual. These are both good books.

One challenge for those learning Bayesian statistics is that they are often also learning hierarchical models (which need not be Bayesian but often are) and, at the same time, learning how to fit those models. Here I fit a “simple” linear regression model using Stan. The model is simple in that it consists of one dependent variable and one independent variable. For this problem Bayesian modeling is certainly not needed; a Frequentist model works just fine. However, starting with an example where Frequentist and Bayesian methods agree lets us comfortably conclude that we got the right answer, and it lets us explore the advantages of taking a Bayesian approach.

Analysis

The dataset is the base R variant of the Old Faithful dataset, containing 272 observations of eruption durations and the waiting times between eruptions. The data are not nested (nor do they have any other hierarchical or structured relationship), so there are two levels in the model. Level one contains the data: observations of eruption duration and waiting time. Level two contains the prior distributions for each model parameter.
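For reference, the Frequentist version of this model is just an ordinary least-squares regression of waiting time on eruption duration, which gives a baseline to compare the posterior means against. A quick sketch using base R:

# Frequentist baseline: OLS regression of waiting time on eruption duration
data(faithful)
ols <- lm(waiting ~ eruptions, data = faithful)
summary(ols)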

Stan Installation and Setup

Installing RStan is very easy but requires a few more steps than most other packages on CRAN. Detailed instructions are on the RStan GitHub page.

Preamble

Begin by loading the library and running the recommended setup code. This tells RStan to automatically write a compiled instance of the stanmodel-class to the hard drive and to run the Markov chains in parallel (by setting the number of CPU cores). This is not much extra code, but I would prefer that it were run automatically because it is sensible default behavior.

library(rstan)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())

Stan Model

A Stan model can be saved as a standalone text file or as a character object in R, and consists of several code blocks. Each block is fairly self-explanatory. The only block that is required is the model block, although it would be unusual not to have blocks for at least the data and parameters.

The data block describes the data that will be passed in from R. Here I declare the number of observations, N, and the number of predictors, p, as integers greater than or equal to zero, and the response, y, and predictor, x, as arrays of real numbers. Matrices are supported as well, so I could instead have declared the predictor data, x, as an $N \times p$ matrix.

data {
    int<lower=0> N;
    int<lower=0> p;
    real y[N];
    real x[N];
}
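For reference, a declaration using a design matrix (an alternative I did not use in the rest of this post) might look like:

data {
    int<lower=0> N;
    int<lower=0> p;
    vector[N] y;
    matrix[N, p] x;   // design matrix instead of a single predictor array
}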

Next I declare the model parameters: the intercept, the coefficient for eruption duration, and the error standard deviation. By default these parameters are given improper uniform priors. The uniform distribution is improper because it spans all real numbers (although I have specified a lower bound on the error term). Stan allows this and has some smart ways of handling uniform priors with unspecified bounds. In practice it is good to normalize your variables (which I didn’t do here).

parameters {
    real beta0;
    real beta1;
    real<lower=0> sigma;
}

The transformed parameters block lets us declare and define the predicted values, mu, for each observation.

transformed parameters {
    real mu[N];
    for (i in 1:N){
        mu[i] <- beta0 + beta1*x[i];
    }
}
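If x and mu were declared as vector[N] rather than arrays of reals, the loop could be replaced with a single vectorized statement. A sketch of that alternative (not what I fit below):

transformed parameters {
    vector[N] mu;                 // requires vector[N] x in the data block
    mu <- beta0 + beta1 * x;      // vectorized: scalar plus scalar times vector
}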

In the model block I use a vectorized sampling statement to fit the model.

model {
    y ~ normal(mu, sigma);
}
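If we wanted proper priors rather than the implicit improper uniforms, they would also go in the model block. A minimal sketch with weakly informative priors (illustrative choices on my part, not part of the model fit below):

model {
    // illustrative weakly informative priors; not used in the fit below
    beta0 ~ normal(0, 100);
    beta1 ~ normal(0, 100);
    sigma ~ cauchy(0, 5);
    y ~ normal(mu, sigma);
}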

Now that the Stan code has been written, the blocks can be saved to a text file or as an object in R. I saved my code as an R object to keep all of my project code in a single .Rmd file.
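Concretely, that just means pasting the blocks above into a single character string; the object name mod matches the one used when fitting the model below:

# the Stan program as a single character string in R
mod <- "
data {
    int<lower=0> N;
    int<lower=0> p;
    real y[N];
    real x[N];
}
parameters {
    real beta0;
    real beta1;
    real<lower=0> sigma;
}
transformed parameters {
    real mu[N];
    for (i in 1:N){
        mu[i] <- beta0 + beta1*x[i];
    }
}
model {
    y ~ normal(mu, sigma);
}
"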

Data

In the data block I described the data that R needs to pass to Stan. RStan does not use the formula interface available with other R modeling functions (e.g. glm). There are convenience functions in other packages for this, and Bob indicated that it may be supported natively in the future. Instead we have to set up the data in a list containing an element for each variable (data and constants). These are the same variables declared in the data block of the Stan code.

faithful_dat <- list(N = nrow(faithful),
                     p = 1,
                     y = faithful$waiting,
                     x = faithful$eruptions)

Fitting the model

With the Stan code and the list of data we can now fit the model. The Stan code is contained in an R object called mod, so I pass it to the model_code argument. We can also specify arguments such as the number of iterations and chains, the length of the warmup period, and the degree of chain thinning.

fit <- stan(model_code = mod,    # stan model code
            data = faithful_dat, # list containing parameter data
            iter = 1000,         # number of iterations
            chains = 3)          # number of chains

Results

The model runs pretty quickly and returns an S4 object containing the model name, the names of the parameters, the parameter dimensions, the mode of the fitted model, the simulations (samples from the model), the initial parameter values, the arguments, the Stan model code, an object creation timestamp, and miscellaneous information (e.g. the environment). Note that the mode of the fitted model refers to one of three types: sampling mode, test gradient mode, or error mode (no samples included).

> str(fit, max.level = 2)
Formal class 'stanfit' [package "rstan"] with 10 slots
  ..@ model_name: chr "da39be20ba67896bc71593083ad71b8c"
  ..@ model_pars: chr [1:5] "beta0" "beta1" "sigma" "mu" ...
  ..@ par_dims  :List of 5
  ..@ mode      : int 0
  ..@ sim       :List of 12
  ..@ inits     :List of 3
  ..@ stan_args :List of 3
  ..@ stanmodel :Formal class 'stanmodel' [package "rstan"] with 4 slots
  ..@ date      : chr "Sun Sep 20 22:11:56 2015"
  ..@ .MISC     :<environment: 0x119435580>

We can see posterior summaries by printing the stanfit object. By default the first half of each chain is treated as warmup and excluded, so these results reflect the 1500 post-warmup samples (500 per chain).

> print(fit)
Inference for Stan model: da39be20ba67896bc71593083ad71b8c.
3 chains, each with iter=1000; warmup=500; thin=1;
post-warmup draws per chain=500, total post-warmup draws=1500.

           mean se_mean   sd    2.5%     25%     50%     75%   97.5% n_eff Rhat
beta0     33.46    0.05 1.11   31.28   32.69   33.49   34.18   35.71   413 1.00
beta1     10.73    0.01 0.30   10.13   10.53   10.74   10.93   11.30   416 1.00
sigma      5.95    0.01 0.25    5.48    5.77    5.94    6.12    6.46   420 1.00
mu[1]     72.10    0.01 0.36   71.39   71.85   72.11   72.33   72.83  1500 1.00
mu[2]     52.78    0.03 0.63   51.57   52.37   52.78   53.19   54.11   472 1.00
mu[3]     69.24    0.01 0.36   68.55   68.98   69.24   69.48   69.94  1500 1.00
...
mu[270]   80.87    0.01 0.45   79.98   80.57   80.86   81.17   81.77  1112 1.00
mu[271]   52.96    0.03 0.62   51.76   52.55   52.96   53.37   54.28   473 1.00
mu[272]   81.41    0.01 0.46   80.48   81.10   81.39   81.72   82.33  1062 1.00
lp__    -618.09    0.06 1.14 -620.83 -618.63 -617.81 -617.25 -616.75   350 1.01

Samples were drawn using NUTS(diag_e) at Sun Sep 20 23:41:31 2015.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).

Warmup

For a model this simple, excluding the first half of each chain might be overly aggressive. We can extract the simulations using extract and examine them. Note that extract is the name of both a magrittr function and an rstan function, so I am explicit here in calling the function from the rstan package. Also, I keep the samples in their original order and include the warmup iterations (since that is what I am interested in seeing here anyway).

library(magrittr)    # %>%
library(reshape2)    # melt
library(data.table)

simulations <- rstan::extract(fit, pars=c('beta0', 'beta1', 'sigma'),
                              permuted=FALSE, inc_warmup=TRUE) %>%
    melt %>%
    data.table

We can then plot the values of each parameter for each chain at each iteration.
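A sketch of the plotting code (assuming the melted columns are named iterations, chains, parameters, and value, which is what the dimnames of the extracted array produce):

library(ggplot2)

# traceplots for each parameter across all chains, warmup included
ggplot(simulations, aes(x = iterations, y = value, colour = chains)) +
    geom_line(alpha = 0.7) +
    geom_vline(xintercept = 500, linetype = "dashed") +   # end of warmup
    facet_wrap(~ parameters, ncol = 1, scales = "free_y") +
    labs(x = "Iteration", y = "Sampled value", colour = "Chain")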

[Figure: traceplots of each parameter, including warmup iterations]

Through visual inspection we can conclude that the chains converge well before 500 iterations. Since we have the simulation values, we could easily test this (by calculating split R-hat or another convergence statistic); I am skipping a full treatment of that here.
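If you do want to check, the split R-hat values for the main parameters can be read straight off the fitted object:

# split R-hat for the regression parameters; values near 1 indicate convergence
summary(fit, pars = c("beta0", "beta1", "sigma"))$summary[, "Rhat"]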

Post-warmup

We can also plot the post-warmup iterations. The chains have converged and are mixing well.

[Figure: traceplots of each parameter for the post-warmup iterations]

At this point we could examine diagnostics, such as checking for autocorrelation.
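For example, a quick look at the autocorrelation of the post-warmup draws for the slope (extract with its default arguments returns permuted post-warmup samples as a named list):

# autocorrelation of the post-warmup draws for the slope coefficient
beta1_draws <- rstan::extract(fit, pars = "beta1")$beta1
acf(beta1_draws)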

Wrap-up

It is fairly easy to use Stan to fit full Bayesian models, although this was a very simple one. The results are consistent with what we expect from a Frequentist analysis (see my previous post fitting a Frequentist model using Julia), but the Bayesian approach gives us richer inference: full posterior distributions for every parameter rather than point estimates and standard errors. Extending this code to more sophisticated models, such as multilevel models with nested groupings, is relatively straightforward. The purpose of this exercise was to start with the bare minimum, which can now be expanded on.

Obligatory R session information:

R version 3.2.2 (2015-08-14)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.5 (Yosemite)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] fontcm_1.1       extrafont_0.17   rstan_2.7.0-1    inline_0.3.14   
 [5] Rcpp_0.12.0      magrittr_1.5     reshape2_1.4.1   data.table_1.9.4
 [9] ggplot2_1.0.1    raster_2.4-18    sp_1.1-1        

loaded via a namespace (and not attached):
 [1] Rttf2pt1_1.3.3   knitr_1.11       MASS_7.3-43      munsell_0.4.2   
 [5] colorspace_1.2-6 lattice_0.20-33  stringr_1.0.0    plyr_1.8.3      
 [9] tools_3.2.2      parallel_3.2.2   grid_3.2.2       gtable_0.1.2    
[13] extrafontdb_1.0  fortunes_1.5-2   digest_0.6.8     codetools_0.2-14
[17] labeling_0.3     stringi_0.5-5    scales_0.3.0     stats4_3.2.2    
[21] chron_2.3-47     proto_0.3-10