Stan Puzzle 1

Nov 27, 2015

Back in September, Bob Carpenter posted a Stan puzzle to Andrew Gelman’s blog. I didn’t notice until Bob posted a second puzzle in November. I’m still learning to use Stan and thought these puzzles would be great exercises, so I decided to take a look and try to solve the first one myself (although the solution has already been posted).

Free Throws

Here is the puzzle as posted on Andrew Gelman’s blog:

Suppose a player is shooting free throws, but rather than recording for each attempt whether it was successful, she or he instead reports the length of her or his streaks of consecutive successes. For the sake of this puzzle, assume that the player makes a sequence of free throw attempts, $z = (z_1, z_2, \ldots)$, assumed to be i.i.d. Bernoulli trials with chance of success $\theta$, until $N$ streaks are recorded. The data recorded is only the length of the streaks, $y = (y_1, \ldots, y_N)$.

Puzzle: Write a Stan program to estimate $p(\theta | y)$.

Example: Suppose a player sets out to record 4 streaks and makes shots

$z = (0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0)$.

This produces the observed data

$N = 4$

$y = (3, 1, 2, 5)$.

Any number of initial misses (0 values) in $z$ would lead to the same $y$. Also, the sequence $z$ always ends with the first failure after the $N$-th streak.

Hint: Feel free to assume a uniform prior $p(\theta)$. The trick is working out the likelihood $p(y | \theta, N)$, after which it is trivial to use Stan to compute $p(\theta | y)$ via sampling.

Another Hint: Marginalize the unobserved failures (0 values) out of $z$. This is non-trivial because we don’t know the length of $z$.

Extra Credit: Is anything lost in observing $y$ rather than $z$?

The data we have are the number of streaks, $N$, and the lengths of the streaks, $y$. Note that we aren’t actually allowed to use the complete data, $z$, and can only make use of $y$. The goal of the puzzle is to estimate $\theta$, the probability of success.

The last part of the problem contains an important piece of information: there is no distinction between a single miss (i.e. $0$) and a “streak” of misses (e.g. $0, 0, \ldots$). This is really helpful because it reduces the problem to a single kind of transition (i.e. $1$ to $0$). In fact, since we observe $N$ streaks and have collapsed any run of misses into a single event, the probability of the transition from 0 to 1 can be considered 1. And with the misses marginalized out, we only need to model the subsequent successes (if any): given that the player has already made the one shot required to begin a streak, we are interested in the probability of making another shot.
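To make this concrete, here is a sketch of the likelihood this reasoning suggests (my own working, following the hint to marginalize out the misses): conditional on a streak having started, a streak of length $y_i$ requires $y_i - 1$ further successes followed by the miss that ends it, so each streak contributes a factor $\theta^{y_i - 1}(1 - \theta)$ and

$p(y \mid \theta, N) = \prod_{i=1}^{N} \theta^{y_i - 1} (1 - \theta) = \theta^{\sum_i y_i - N} (1 - \theta)^N$.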

Since we are taking a Bayesian approach, we phrase the problem as “what value of $\theta$ is most likely, given the data?”, or stated another way, “given what we’ve observed, what is the probability the player will make another shot?”.

Although our data are recorded as counts, we are interested in the chance of making a shot, so this is a binomial problem. Because we are only given count data, it is tempting to reach for the Poisson or negative binomial distributions instead; given how closely these distributions are related and how the problem is presented, that would be an understandable mistake. However, we are squarely tasked with estimating $p(\theta | y)$.

R Code

library(rstan)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())

model = "
data {
  int n;     // number of streaks
  int y[n];  // successes in each streak beyond the leading shot
}
parameters {
  real<lower=0,upper=1> theta;
}
model {
  theta ~ beta(1, 1);               // uniform prior on theta
  sum(y) - n ~ binomial(n, theta);  // sum(y) - n successes in n trials
}"

z = c(0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0)
x = rle(z)                       # run-length encoding of the shot sequence
y = x$lengths[x$value == 1] - 1  # streak lengths, minus the leading success
n = length(y)
data = list("n"=n, "y"=y)

fit = stan(model_code = model, data=data, iter = 5000)

Printing the object displays the simulation results. The simulation consisted of 4 chains, each with 5000 iterations. The first 2500 iterations were discarded from each chain to allow the sampler to warm up, yielding 2500 post-warmup iterations per chain. No thinning of the chains was performed.

Inference for Stan model: 441b9b22174fb849c0ecdf0100d53155.
4 chains, each with iter=5000; warmup=2500; thin=1;
post-warmup draws per chain=2500, total post-warmup draws=10000.

       mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
theta  0.67    0.00 0.18  0.29  0.55  0.68  0.80  0.95  3508    1
lp__  -4.36    0.01 0.76 -6.55 -4.53 -4.06 -3.87 -3.82  2787    1

Samples were drawn using NUTS(diag_e) at Thu Nov 26 17:27:15 2015.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).

Based on 3508 effective samples, the posterior mean for $\theta$ is 0.67, with a 95% posterior interval from 0.29 to 0.95. We can interpret this as follows: given the data we’ve observed, the probability the basketball player will make another shot, $\theta$, is about 0.67. However, this is qualified by a wide 95% posterior interval (0.29, 0.95), which reflects a large amount of uncertainty in this estimate.
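For reference, the same summaries can be computed directly from the posterior draws; here is a minimal sketch using rstan’s extract(), with fit being the object created above:

theta_draws = extract(fit)$theta        # vector of posterior draws for theta
mean(theta_draws)                       # posterior mean, about 0.67
quantile(theta_draws, c(0.025, 0.975))  # 95% posterior interval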

Retrospective

This problem was trickier than I first thought, but not for the reasons I would have expected. As it turns out, Stan has pretty helpful error messages, provided you take the time to read them and think about what they mean. In early attempts I forgot to remove the leading success from the streak lengths, and I admit it took much longer to fix this problem than it should have. I should have spent less time Googling the error and more time thinking about what caused it.

As mentioned above, it is tempting to think about this as a Poisson or negative binomial problem. In fact, the negative binomial distribution might be a reasonable way to frame it: you make draws with some probability of success until you’ve observed some number of failures, and here we stop observing after a single failure. I decided to try this out and compare the results.

data {
  int n;     // number of streaks (matches the data list above)
  int y[n];  // successes in each streak beyond the leading shot
}
parameters {
  real<lower=0,upper=1> theta;
}
model {
  theta ~ beta(1, 1);               // uniform prior on theta
  sum(y) ~ neg_binomial(1, theta);  // negative binomial likelihood on the total count
}
generated quantities {
  real pr_success;
  pr_success <- 1 - theta;  // theta is treated as the failure probability
}

The code is pretty similar, with small changes to the model block and the addition of a generated quantities block. Somewhat surprisingly, it ran fine on the first try, but I guess Stan isn’t actually that difficult. Since we are using the negative binomial, $\theta$ is the probability of failure. Using the generated quantities block we can easily get the probability of success instead.
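Assuming the Stan program above is stored in a string, say model_nb (a name chosen here for illustration), it can be fit against the same data list as before:

fit_nb = stan(model_code = model_nb, data = data, iter = 5000)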

Inference for Stan model: c3b39edc8cf7553e3b565397c19f76c7.
4 chains, each with iter=5000; warmup=2500; thin=1;
post-warmup draws per chain=2500, total post-warmup draws=10000.

            mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
theta       0.33    0.00 0.22  0.04  0.15  0.28  0.46  0.87  3371    1
pr_success  0.67    0.00 0.22  0.13  0.54  0.72  0.85  0.96  3371    1
lp__       -5.44    0.01 0.80 -7.72 -5.67 -5.12 -4.91 -4.84  2882    1

Samples were drawn using NUTS(diag_e) at Fri Nov 27 09:41:40 2015.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).

The posterior mean is consistent with our estimate from the binomial model, but the 95% posterior interval is wider (0.13, 0.96) and we have fewer effective samples, so sampling with the negative binomial distribution is slightly less efficient.
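To check the efficiency comparison directly, the effective sample sizes can be read off the fit summaries (using the fit_nb name assumed in the sketch above):

summary(fit)$summary[, "n_eff"]     # per-parameter n_eff for the binomial model
summary(fit_nb)$summary[, "n_eff"]  # per-parameter n_eff for the negative binomial model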

This was a fun exercise and gave me an excuse to use Stan. I’ll have to attempt the second Stan puzzle next. Hopefully Bob and Andrew keep putting these together, because they are a great way to get familiar with using Stan to solve Bayesian problems.