Method of the month: custom likelihoods with Stan

Once a month we discuss a particular research method that may be of interest to people working in health economics. We’ll consider widely used key methodologies, as well as more novel approaches. Our reviews are not designed to be comprehensive but provide an introduction to the method, its underlying principles, some applied examples, and where to find out more. If you’d like to write a post for this series, get in touch. This month’s method is custom likelihoods with Stan.

Principles

Regular readers of this blog will know that I am a fan of Bayesian methods. The exponential growth in personal computing power has opened up a whole new range of Bayesian models that can be estimated at home. WinBUGS and JAGS were long the go-to pieces of software for estimating Bayesian models, both using Markov chain Monte Carlo (MCMC) methods. Theoretically, an MCMC chain will explore the posterior distribution. But MCMC has flaws. For example, if the target distribution has a high degree of curvature, as many hierarchical models do, then MCMC chains can have trouble exploring it. To compensate, the chains linger in the ‘difficult’ parts of the space before moving elsewhere, so their averages oscillate around the true value. Asymptotically these oscillations balance out, but in real, finite time they lead to bias. Furthermore, MCMC chains can be very slow to converge to the target distribution, and for complex models this can take a literal lifetime. An alternative, Hamiltonian Monte Carlo (HMC), provides a solution to these issues. Michael Betancourt’s introduction to HMC is great for anyone interested in the topic.

Stan is a ‘probabilistic programming language’ that implements HMC. A huge range of probability distributions are already implemented in the software; check out the manual for more information. There is also an R package, rstanarm, that estimates a number of standard models using normal R code, which means you can use these tools without learning the Stan language at all. However, Stan may not have the necessary distributions for more complex econometric or statistical models. It used to be the case that you would have to build your own MCMC sampler, but given the problems with MCMC this is now strongly advised against in favour of HMC. Fortunately, we can implement our own probability density functions in Stan. So, if you can write down the (log) likelihood for your model, you can estimate it in Stan!
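
For instance, with rstanarm a Bayesian logistic regression can be fitted with the same formula syntax as glm(). The sketch below assumes a hypothetical data frame dat with an outcome and two covariates, so treat it as illustrative rather than definitive:

require(rstanarm)
# Bayesian logistic regression estimated with HMC, written like an ordinary glm() call
fit <- stan_glm(outcome ~ age + treatment, family = binomial(), data = dat)
summary(fit)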

The aim of this post is to provide an example of implementing a custom probability function in Stan from the likelihood of our model. We will look at the nested logit model. These models have been widely used for multinomial choice problems. An area of interest among health economists is the choice of hospital provider. A basic multinomial choice model, such as the multinomial logit, requires an independence of irrelevant alternatives (IIA) assumption, which says that the odds of choosing one option over another are independent of any other alternative. For example, it would assume that the odds of me choosing the pharmacy in town over the hospital in town would be unaffected by a pharmacy opening on my own road. This is likely too strong. There are many ways to relax this assumption, the nested logit being one. The nested logit is useful when choices can be ‘nested’ in groups and assumes there is a correlation among choices within each group. For example, we can group health care providers into pharmacies, primary care clinics, hospitals, and so on, so that the pharmacy in town and the pharmacy on my road sit in the same ‘pharmacy’ branch.

Implementation

Econometric model

Firstly, we need a nesting structure for our choices, like that described above. We’ll consider a 2-level nesting structure with $T$ branches and $R_t$ choices within each branch $t$. As with most choice models, we start from an additive random utility model (ARUM), which is, for individual i=1,…,N choosing option r in branch t: $U_{itr} = V_{itr} + \epsilon_{itr}$

Then the chosen option is the one with the highest utility. The motivating feature of the nested logit is that the hierarchical structure allows us to factorise the joint probability of choosing branch t and option r into a conditional and marginal model: $p_{itr} = p_{it} \times p_{ir|t}$

Multinomial choice models arise when the errors are assumed to have a generalised extreme value (GEV) distribution, which gives us the multinomial logit model. We will model the deterministic part of utility with branch-varying and option-varying variables: $V_{itr} = Z_{it}'\alpha + X_{itr}'\beta_t$

Then the model can be written as: $p_{itr} = p_{it} \times p_{ir|t} = \frac{exp(Z_{it}'\alpha + \rho_t I_{it})}{\sum_{k \in T} exp(Z_{ik}'\alpha + \rho_k I_{ik})} \times \frac{exp(X_{itr}'\beta_t/\rho_t)}{\sum_{m \in R_t} exp( X_{itm}'\beta_t/\rho_t) }$

where $\rho_t$ is variously called a scale parameter, correlation parameter, etc. and defines the within-branch correlation (arising from the GEV distribution). We also have the log-sum, which is also called the inclusive value: $I_{it} = log \left( \sum_{m \in R_t} exp( X_{itm}'\beta_t/\rho_t) \right)$.
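
Taking logs of the factorised probability above, and noting that the log of the conditional model’s denominator is just $I_{it}$, the log probability of a single chosen option is $log(p_{itr}) = Z_{it}'\alpha + \rho_t I_{it} - log \left( \sum_{k \in T} exp(Z_{ik}'\alpha + \rho_k I_{ik}) \right) + X_{itr}'\beta_t/\rho_t - I_{it}$, which is the term that appears inside the sum of the log likelihood below.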

Now that we have our model set up, the log likelihood over all individuals is $\sum_{i=1}^N \sum_{t \in T} \sum_{r \in R_t} y_{itr} \left[ Z_{it}'\alpha + \rho_t I_{it} - log \left( \sum_{k \in T} exp(Z_{ik}'\alpha + \rho_k I_{ik}) \right) + X_{itr}'\beta_t/\rho_t - log \left( \sum_{m \in R_t} exp( X_{itm}'\beta_t/\rho_t) \right) \right]$, where $y_{itr}$ is an indicator equal to one if individual $i$ chose option $r$ in branch $t$ and zero otherwise.

As a further note, for the model to be compatible with an ARUM specification, a number of conditions need to be satisfied. One of these is that $0<\rho_t \leq 1$, so we will impose that restriction. We have also only included alternative-varying variables, but we are often interested in individual-varying variables and in allowing parameters to vary over alternatives; these can simply be added to the specification, but we will leave them out for now to keep things “simple”. We will also use basic weakly informative priors, and leave prior specification as a separate issue we won’t consider further: $\alpha \sim normal(0,5)$, $\beta_t \sim normal(0,5)$, $\rho_t \sim uniform(0,1)$

Software

DISCLAIMER: This code is unlikely to be the most efficient, nor can I guarantee it is 100% correct – use at your peril!

The following assumes a familiarity with Stan and R.

Stan programs are divided into blocks including data, parameters, and model. The functions block allows us to define custom (log) probability density functions. These take a form something like:

real xxx_lpdf(real y, ...){}

which says that the function returns a real-valued variable and takes a real-valued variable, y, as one of its arguments. The _lpdf suffix allows the function to act as a log probability density function in the program (the equivalent suffix for log probability mass functions for discrete variables is _lpmf). Now we just have to convert the log likelihood above into a function. But first, let’s consider what data we will be passing to the program:

• N, the number of observations;
• T, the number of branches;
• P, the number of branch-varying variables;
• Q, the number of choice-varying variables;
• R, a T x 1 vector with the number of choices in each branch, from which we can also derive the total number of options as sum(R). We will call the total number of options Rk for now;
• Y, an N x Rk matrix, where Y[i,j] = 1 if individual i=1,…,N chose option j=1,…,Rk;
• Z, an N x T x P array of branch-varying variables;
• X, an N x Rk x Q array of choice-varying variables.

And the parameters:

• $\rho$, a T x 1 vector of correlation parameters;
• $\alpha$, a P x 1 vector of parameters for the branch-varying variables;
• $\beta$, a Q x T matrix of parameters for the choice-varying variables.

Now, to develop the code, we will specify the function for individual observations of Y, rather than the whole matrix, and then perform the sum over all individuals in the model block. So we only need to feed each individual’s observations into the function rather than the whole data set. The program is specified in blocks as follows (with all the data and parameters as arguments to the function):

functions{
  real nlogit_lpdf(real[] y, real[,] Z, real[,] X, int[] R,
                   vector alpha, matrix beta, vector tau){
    //first define our additional local variables
    real lprob; //variable to hold log prob
    int count1; //keep track of which option in the loops
    int count2; //keep track of which option in the loops
    vector[size(R)] I_val; //inclusive values
    real denom; //sum denominator of marginal model
    //for the variables appearing in sum loops, set them to zero
    lprob = 0;
    count1 = 0;
    count2 = 0;
    denom = 0;

    //determine the log-sum for each conditional model, p_ir|t,
    //i.e. the inclusive value
    for(k in 1:size(R)){
      I_val[k] = 0;
      for(m in 1:R[k]){
        count1 = count1 + 1;
        I_val[k] = I_val[k] + exp(to_row_vector(X[count1,])*beta[,k]/tau[k]);
      }
      I_val[k] = log(I_val[k]);
    }

    //determine the sum for the marginal model, p_it, denominator
    for(k in 1:size(R)){
      denom = denom + exp(to_row_vector(Z[k,])*alpha + tau[k]*I_val[k]);
    }

    //put everything together in the log likelihood
    for(k in 1:size(R)){
      for(m in 1:R[k]){
        count2 = count2 + 1;
        lprob = lprob + y[count2]*(to_row_vector(Z[k,])*alpha +
                 tau[k]*I_val[k] - log(denom) +
                 to_row_vector(X[count2,])*beta[,k]/tau[k] - I_val[k]);
      }
    }
    //return the log likelihood value
    return lprob;
  }
}
data{
  int N; //number of observations
  int T; //number of branches
  int R[T]; //number of options per branch
  int P; //dim of Z
  int Q; //dim of X
  real y[N,sum(R)]; //outcomes array
  real Z[N,T,P]; //branch-varying variables array
  real X[N,sum(R),Q]; //option-varying variables array
}
parameters{
  vector<lower=0, upper=1>[T] rho; //scale parameters
  vector[P] alpha; //branch-varying parameters
  matrix[Q,T] beta; //option-varying parameters
}
model{
  //specify priors
  for(p in 1:P) alpha[p] ~ normal(0,5);
  for(q in 1:Q) for(t in 1:T) beta[q,t] ~ normal(0,5);
  //rho gets an implicit uniform(0,1) prior from its declared bounds

  //loop over all observations in the data
  for(i in 1:N){
    y[i] ~ nlogit(Z[i,,],X[i,,],R,alpha,beta,rho);
  }
}

Simulation model

To see whether our model is doing what we hope, we can run a simple test with simulated data. It may be useful to compare the results we get to those from other estimators; the nested logit is most frequently estimated using the FIML estimator. But neither Stata nor R provides a package that estimates a model with branch-varying variables – another reason why we sometimes need to program our own models.

The code we’ll use to simulate the data is:

#### simulate 2-level nested logit data ###

N <- 300 #number of people
P <- 2 #number of branch variant variables
Q <- 2 #number of option variant variables
R <- c(2,2,2) #vector with number of options per branch
T <- length(R) #number of branches
Rk <- sum(R) #number of options

#simulate data

Z <- array(rnorm(N*T*P,0,0.5),dim = c(N,T,P))
X <- array(rnorm(N*Rk*Q,0,0.5), dim = c(N,Rk,Q))

#parameters
rho <- runif(3,0.5,1)
beta <- matrix(rnorm(T*Q,0,1), nrow = Q, ncol = T)
alpha <- rnorm(P,0,1)

#option models #change beta and rho indexing as required
vals_opt <- cbind(exp(X[,1,]%*%beta[,1]/rho[1]),exp(X[,2,]%*%beta[,1]/rho[1]),
                  exp(X[,3,]%*%beta[,2]/rho[2]),exp(X[,4,]%*%beta[,2]/rho[2]),
                  exp(X[,5,]%*%beta[,3]/rho[3]),exp(X[,6,]%*%beta[,3]/rho[3]))

incl_val <- cbind(vals_opt[,1]+vals_opt[,2],vals_opt[,3]+vals_opt[,4],vals_opt[,5]+vals_opt[,6])

vals_branch <- cbind(exp(Z[,1,]%*%alpha + rho[1]*log(incl_val[,1])),
                     exp(Z[,2,]%*%alpha + rho[2]*log(incl_val[,2])),
                     exp(Z[,3,]%*%alpha + rho[3]*log(incl_val[,3])))

sum_branch <- rowSums(vals_branch)

probs <- cbind((vals_opt[,1]/incl_val[,1])*(vals_branch[,1]/sum_branch),
(vals_opt[,2]/incl_val[,1])*(vals_branch[,1]/sum_branch),
(vals_opt[,3]/incl_val[,2])*(vals_branch[,2]/sum_branch),
(vals_opt[,4]/incl_val[,2])*(vals_branch[,2]/sum_branch),
(vals_opt[,5]/incl_val[,3])*(vals_branch[,3]/sum_branch),
(vals_opt[,6]/incl_val[,3])*(vals_branch[,3]/sum_branch))

Y = t(apply(probs, 1, rmultinom, n = 1, size = 1))
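
Before fitting the model, a couple of quick sanity checks on the simulated objects can catch indexing mistakes (this sketch just uses the probs and Y objects created above):

# each row of probs should be a proper probability distribution over the 6 options
summary(rowSums(probs))
# each person should have chosen exactly one option
all(rowSums(Y) == 1)
# how often each option was chosen
colSums(Y)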

Then we’ll put the data into a list and run the Stan program with 500 iterations and 3 chains:

data <- list(
y = Y,
X = X,
Z = Z,
R = R,
T = T,
N = N,
P = P,
Q = Q
)

require(rstan)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())

fit <- stan("C:/Users/Samuel/Dropbox/Code/nlogit.stan",
data = data,
chains = 3,
iter = 500)

Which gives the following results (with the 25th percentile column dropped to fit on screen):

> print(fit)
Inference for Stan model: nlogit.
3 chains, each with iter=500; warmup=250; thin=1;
post-warmup draws per chain=250, total post-warmup draws=750.

             mean se_mean   sd    2.5%     50%     75%   97.5% n_eff Rhat
rho[1]       1.00    0.00 0.00    0.99    1.00    1.00    1.00   750    1
rho[2]       0.87    0.00 0.10    0.63    0.89    0.95    1.00   750    1
rho[3]       0.95    0.00 0.04    0.84    0.97    0.99    1.00   750    1
alpha[1]    -1.00    0.01 0.17   -1.38   -0.99   -0.88   -0.67   750    1
alpha[2]    -0.56    0.01 0.16   -0.87   -0.56   -0.45   -0.26   750    1
beta[1,1]   -3.65    0.01 0.32   -4.31   -3.65   -3.44   -3.05   750    1
beta[1,2]   -0.28    0.01 0.24   -0.74   -0.27   -0.12    0.15   750    1
beta[1,3]    0.99    0.01 0.25    0.48    0.98    1.15    1.52   750    1
beta[2,1]   -0.15    0.01 0.25   -0.62   -0.16    0.00    0.38   750    1
beta[2,2]    0.28    0.01 0.24   -0.16    0.28    0.44    0.75   750    1
beta[2,3]    0.58    0.01 0.24    0.13    0.58    0.75    1.07   750    1
lp__      -412.84    0.14 2.53 -418.56 -412.43 -411.05 -409.06   326    1

Samples were drawn using NUTS(diag_e) at Sun May 06 14:16:43 2018.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).

Which we can compare to the original parameters:

> beta
           [,1]       [,2]      [,3]
[1,] -3.9381389 -0.3476054 0.7191652
[2,] -0.1182806  0.2736159 0.5237470
> alpha
[1] -0.9654045 -0.6505002
> rho
[1] 0.9503473 0.9950653 0.5801372

You can see that the posterior means and quantiles of the distribution provide pretty good estimates of the original parameters. Convergence diagnostics such as Rhat and traceplots (not reproduced here) show good convergence of the chains. But, of course, this is not enough for us to rely on it completely – you would want to investigate further to ensure that the chains were actually exploring the posterior of interest.
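
If you want to line the posterior summaries up against the simulated values more directly, something like the following should work (a sketch using rstan’s helper functions, not part of the original analysis):

# posterior means for comparison with the true rho, alpha and beta
get_posterior_mean(fit, pars = c("rho", "alpha", "beta"))
# the convergence checks mentioned above
traceplot(fit, pars = "rho")   # visual check that the chains mix well
summary(fit)$summary[, "Rhat"] # Rhat should be close to 1 for all parameters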

Applications

I am not aware of any examples in health economics of using custom likelihoods in Stan. There are not even many examples of Bayesian nested logit models, one exception being a paper by Lahiri and Gao, who ‘analyse’ the nested logit using MCMC. But given the limitations of MCMC discussed above, one might prefer the implementation in this post to the MCMC samplers of that paper. It’s also a testament to computing advances and to Stan that in 2001 an MCMC sampler and analysis could fill a paper in a decent econometrics journal, and now we can knock one out for a blog post.

In terms of nested logit models in general in health economics, there are many examples going back 30 years (e.g. this article from 1987). More recent papers have preferred “mixed” or “random parameters” logit or probit specifications, which are much more flexible than the nested logit. We would advise these sorts of models for this reason. The nested logit was used as an illustrative example of estimating custom likelihoods for this post.

Credit

Method of the month: Semiparametric models with penalised splines

Once a month we discuss a particular research method that may be of interest to people working in health economics. We’ll consider widely used key methodologies, as well as more novel approaches. Our reviews are not designed to be comprehensive but provide an introduction to the method, its underlying principles, some applied examples, and where to find out more. If you’d like to write a post for this series, get in touch. This month’s method is semiparametric models with penalised splines.

Principles

A common assumption of regression models is that effects are linear and additive. However, nothing is ever really that simple. One might respond that all models are wrong, but some are useful, as George Box once said. And the linear, additive regression model has coefficients that can be interpreted as average treatment effects under the right assumptions. Sometimes though we are interested in conditional average treatment effects and how the impact of an intervention varies according to the value of some variable of interest. Often this relationship is not linear and we don’t know its functional form. Splines provide a way of estimating curves (or surfaces) of unknown functional form and are a widely used tool for semiparametric regression models. The term ‘spline’ was derived from the tool shipbuilders and drafters used to construct smooth edges: a bendable piece of material that when fixed at a number of points would relax into the desired shape.

Implementation

Our interest lies in estimating the unknown function m: $y_i = m(x_i) + e_i$

A ‘spline’ in the mathematical sense is a function constructed piecewise from polynomial functions. The places where the pieces meet are known as knots, and the spline has order equal to one more than the degree of the underlying polynomial terms. Basis splines, or B-splines, are the typical starting point for spline functions. These are curves defined recursively as a sum of ‘basis functions’, which depend only on the polynomial degree and the knots. A spline function can be represented as a linear combination of B-splines, and the parameters dictating this combination can be estimated using standard regression estimation techniques. If we have $K$ B-splines then our regression function can be estimated as: $y_i = \sum_{j=1}^K ( \alpha_j B_j(x_i) ) + e_i$

by minimising $\sum_{i=1}^N \{ y_i - \sum_{j=1}^K ( \alpha_j B_j(x_i) ) \} ^2$, where the $B_j$ are the B-splines and the $\alpha_j$ are coefficients to be estimated.
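
To make the idea concrete, here is a minimal sketch of an unpenalised B-spline fit in R using the splines package that ships with base R; x and y are simulated purely for illustration:

require(splines)
set.seed(1)
x <- runif(200)
y <- sin(2 * pi * x) + rnorm(200, 0, 0.3)
B <- bs(x, df = 10, degree = 3)  # cubic B-spline basis with 10 columns
fit_bs <- lm(y ~ B)              # the alpha_j are estimated by least squares
plot(x, y)
lines(sort(x), fitted(fit_bs)[order(x)], col = "red")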

Useful technical explainers of splines and B-splines can be found here [PDF] and here [PDF].

One issue with fitting splines to data is that we run the risk of ‘overfitting’. Outliers might distort the curve we fit, damaging the external validity of any conclusions we draw. To deal with this, we can enforce a certain level of smoothness using so-called penalty functions. The smoothness (or conversely the ‘roughness’) of a curve is often defined by the integral of the square of its second derivative. Penalised splines, or P-splines, add this roughness term, multiplied by a smoothing parameter $\lambda$, to the least squares criterion. In this case, we minimise: $\sum_{i=1}^N \{ y_i - \sum_{j=1}^K ( \alpha_j B_j(x_i) ) \}^2 + \lambda\int m''(x)^2 dx$

to estimate our parameters. Many other variations on this penalty have been proposed. This article provides a good explanation of P-splines.
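
As a rough sketch of what a penalised fit looks like in practice, the mgcv package will fit a P-spline and choose the smoothing parameter $\lambda$ automatically (here by REML); this reuses the x and y simulated in the B-spline sketch above:

require(mgcv)
fit_ps <- gam(y ~ s(x, bs = "ps", k = 20), method = "REML")  # bs = "ps" requests a P-spline basis
summary(fit_ps)
plot(fit_ps)  # the estimated smooth with a confidence band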

One type of spline that has become particularly attractive is the ‘low rank thin plate spline’. This type of spline is defined by its penalty, which has a physical analogy with the resistance that a thin sheet of metal puts up when it is bent. It avoids the problem thin plate splines have of too many parameters to estimate by taking a ‘low rank’ approximation, and it is generally insensitive to the choice of knots, which other penalised spline regression models are not.

Crainiceanu and colleagues show how the low rank thin plate smooth splines can be represented as a generalised linear mixed model. In particular, our model can be represented as: $m(x_i) = \beta_0 + \beta_1x_i + \sum_{k=1}^K u_k |x_i - \kappa_k|^3$

where $\kappa_k$, $k=1,...,K$, are the knots. The parameters, $\theta = (\beta_0,\beta_1,u_k)'$, can be estimated by minimising $\sum_{i=1}^N \{ y_i - m(x_i) \} ^2 + \frac{1}{\lambda} \theta ^T D \theta$ .

This is shown to give the mixed model $y_i = \beta_0 + \beta_1 x_i + Z_i'b + u_i$

where each random coefficient in the vector $b$ is distributed as $N(0,\sigma^2_b)$ and $Z$ and $D$ are given in the paper cited above.

As a final note, we have discussed splines in one dimension, but they can be extended to more dimensions. A two-dimensional spline can be generated by taking the tensor product of the two one dimensional spline functions. I leave this as an exercise for the reader.
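
For readers who would rather not take on the exercise, a rough sketch of a two-dimensional tensor product smooth in mgcv is given below; x1, x2 and y2d are simulated purely for illustration:

require(mgcv)
set.seed(2)
n <- 400
x1 <- runif(n); x2 <- runif(n)
y2d <- sin(2 * pi * x1) * cos(pi * x2) + rnorm(n, 0, 0.2)
fit_2d <- gam(y2d ~ te(x1, x2), method = "REML")  # te() builds the tensor product basis
vis.gam(fit_2d, plot.type = "contour")            # visualise the fitted surface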

Software

R

• The package gamm4 provides the tools necessary for a frequentist analysis along the lines described in this post. It uses restricted maximum likelihood estimation with the package lme4 to estimate the parameters of the thin plate spline model.
• A Bayesian version of this functionality is implemented in the package rstanarm, which uses gamm4 to produce the matrices for thin plate spline models and Stan for the estimation through the stan_gamm4 function.
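
A minimal calling sketch for the two packages is below; the data frame df, with outcome y and covariate x, is hypothetical, and priors and sampler settings are left at their defaults:

require(gamm4)
fit_freq <- gamm4(y ~ s(x), data = df)  # REML estimation via lme4
summary(fit_freq$gam)

require(rstanarm)
fit_bayes <- stan_gamm4(y ~ s(x), data = df, chains = 3, iter = 1000)  # Bayesian version via Stan
print(fit_bayes)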

If you wanted to implement these models for yourself from scratch, Crainiceanu and colleagues provide the R code to generate the matrices necessary to estimate the spline function:

n<-length(covariate)
X<-cbind(rep(1,n),covariate)
knots<-quantile(unique(covariate),
seq(0,1,length=(num.knots+2))[-c(1,(num.knots+2))])
Z_K<-(abs(outer(covariate,knots,"-")))^3
OMEGA_all<-(abs(outer(knots,knots,"-")))^3
svd.OMEGA_all<-svd(OMEGA_all)
sqrt.OMEGA_all<-t(svd.OMEGA_all$v %*% (t(svd.OMEGA_all$u)*sqrt(svd.OMEGA_all$d)))
Z<-t(solve(sqrt.OMEGA_all,t(Z_K)))

Stata

I will temper this advice by cautioning that I have never estimated a spline-based semi-parametric model in Stata, so what follows may be hopelessly incorrect. The only implementation of penalised splines in Stata is the package and associated function pspline. However, I cannot find any information about the penalty function it uses, so I would advise some caution. An alternative is to program the model yourself by converting the above R code into Mata to generate the matrix Z; the parameters could then be estimated with xtmixed.

Applications

Applications of these semiparametric models in the world of health economics have tended to appear more in technical or statistical journals than in health economics or general economics journals. Recent examples include Li et al, who use penalised splines to estimate the relationship between disease duration and health care costs; Wunder and co, who look at how reported well-being varies over the course of the lifespan; and Stollenwerk and colleagues, who use splines to estimate flexible predictive models for cost-of-illness studies with ‘big data’.

Credit

Sam Watson’s journal round-up for 23rd January 2017

Every Monday our authors provide a round-up of some of the most recently published peer reviewed articles from the field. We don’t cover everything, or even what’s most important – just a few papers that have interested the author. Visit our Resources page for links to more journals or follow the HealthEconBot. If you’d like to write one of our weekly journal round-ups, get in touch.

Short-term and long-term effects of GDP on traffic deaths in 18 OECD countries, 1960–2011. Journal of Epidemiology and Community Health [PubMed] Published February 2017

Understanding the relationships between different aspects of the economy or society in the aggregate can reveal knowledge about the world. However, such analyses are more complicated than analyses of individuals who either did or did not receive an intervention, as the objects of aggregate analyses don’t ‘exist’ per se but are rather descriptions of the average behaviour of the system. To make sense of these analyses, an understanding of the system is therefore required. On these grounds I am a little unsure of the results of this paper, which estimates the effect of GDP on road traffic fatalities in OECD countries over time. It is noted that previous studies have shown that in the short run road traffic deaths are procyclical, but in the long run they have declined, likely as a result of improved road and car safety. Indeed, this is what they find with their data and models. But what does this result mean in the long run? Have they picked up anything more than a correlation with time? Time is not included in the otherwise carefully specified models, so is the conclusion to policy makers, ‘just keep doing what you’re doing, whatever that is…’? Models of aggregate phenomena can be among the most interesting, but also among the least convincing (my own included!). That being said, this is better than most.

Sources of geographic variation in health care: Evidence from patient migration. Quarterly Journal of Economics [RePEc] Published November 2016

There are large geographic differences in health care utilisation, both between countries and within countries. In the US, for example, the average Medicare enrollee spent around $14,400 in 2010 in Miami, Florida compared with around $7,800 in Minneapolis, Minnesota, even after adjusting for demographic differences. However, higher health care spending is generally not associated with better health outcomes. There is therefore an incentive for policy makers to legislate to reduce this disparity, but what will be effective depends on the causes of the variation. On one side, doctors may be dispensing treatments differently; for example, we previously featured a paper looking at the variation in overuse of medical testing by doctors. On the other side, patients may be sicker or have differing preferences on the intensity of their treatment. To try to distinguish between these two possible sources of variation, this paper uses geographical migration to look at utilisation among people who move from one area to another. They find that (a very specific) 47% of the difference in use of health care is attributable to patient characteristics. However, I (as ever) remain skeptical: a previous post brought up the challenge of ‘transformative treatments’, which may apply here as this paper has to rely on the assumption that patient preferences remain the same when they move. If moving from one city to another changes your preferences over healthcare, then their identification strategy no longer works well.

Seeing beyond 2020: an economic evaluation of contemporary and emerging strategies for elimination of Trypanosoma brucei gambiense. Lancet Global Health Published November 2016

African sleeping sickness, or human African trypanosomiasis, is targeted for eradication in the next decade. However, the strategy to do so has not been determined, nor whether any such strategy would be a cost-effective use of resources. This paper models the different strategies to estimate incremental cost-effectiveness ratios (ICERs). Infectious disease presents an interesting challenge for health economic evaluation, as the disease transmission dynamics need to be captured over time, which the authors achieve here with a ‘standard’ epidemiological model using ordinary differential equations. They find that, to reach elimination targets, an approach incorporating case detection, treatment, and vector control would be required.

A conceptual introduction to Hamiltonian Monte Carlo. ArXiv Published 10th January 2017

It is certainly possible to drive a car without understanding how the engine works. But if we want to get more out of the car or modify its components then we will have to start learning some mechanics. The same is true of statistical software. We can knock out a simple logistic regression without ever really knowing the theory or what the computer is doing. But this ‘black box’ approach to statistics has clear problems. How do we know the numbers on the screen mean what we think they mean? If it doesn’t work, or if it is running slowly, how do we diagnose the problem? Programs for Bayesian inference can sometimes seem even more opaque than others: one might well ask what those chains are actually exploring, and whether it is even the distribution of interest. Well, over the last few years a new piece of kit, Stan, has become a brilliant and popular tool for Bayesian inference. It achieves fast convergence with low autocorrelation within chains, and so achieves a high effective sample size from relatively few iterations. This is due to its implementation of Hamiltonian Monte Carlo. But it is founded in the mathematics of differential geometry, which has restricted the understanding of how it works to a limited few. This paper provides an excellent account of Hamiltonian Monte Carlo: how it works and when it fails, all replete with figures. While it’s not necessary to become a theoretical or computational statistician, it is important, I think, to have a grasp of what the engine is doing if we’re going to play around with it.

Credits