Method of the month: custom likelihoods with Stan

Once a month we discuss a particular research method that may be of interest to people working in health economics. We’ll consider widely used key methodologies, as well as more novel approaches. Our reviews are not designed to be comprehensive but provide an introduction to the method, its underlying principles, some applied examples, and where to find out more. If you’d like to write a post for this series, get in touch. This month’s method is custom likelihoods with Stan.

Principles

Regular readers of this blog will know that I am a fan of Bayesian methods. The exponential growth in personal computing power has opened up a whole new range of Bayesian models that can be estimated at home. WinBUGS and JAGS were the go-to pieces of software for estimating Bayesian models, both using Markov Chain Monte Carlo (MCMC) methods. Theoretically, an MCMC chain will explore the posterior distribution. But MCMC has flaws. For example, if the target distribution has a high degree of curvature, such as many hierarchical models exhibit, then MCMC chains can have trouble exploring it. To compensate, the chains stay in the 'difficult' parts of the space for longer before moving elsewhere, so their average oscillates around the true value. Asymptotically, these oscillations balance out, but in real, finite time they lead to bias. Furthermore, MCMC chains can be very slow to converge to the target distribution, and for complex models this can take a literal lifetime. An alternative, Hamiltonian Monte Carlo (HMC), provides a solution to these issues. Michael Betancourt's introduction to HMC is great for anyone interested in the topic.

Stan is a 'probabilistic programming language' that implements HMC. A huge range of probability distributions are already implemented in the software; check out the manual for more information. There is also an R package, rstanarm, that estimates a number of standard models using normal R syntax, which means you can use these tools without learning the Stan language at all. However, Stan may not have the necessary distributions for more complex econometric or statistical models. It used to be the case that you would have to build your own MCMC sampler – but given the problems with MCMC, this is now strongly advised against in favour of HMC. Fortunately, we can implement our own probability density functions in Stan. So, if you can write down the (log) likelihood for your model, you can estimate it in Stan!
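As a quick illustration of the rstanarm route, here is a minimal sketch of a Bayesian logistic regression; the data frame and variable names (df, chose_hospital, distance, cost) are invented purely for illustration:

# Fitting a standard model with rstanarm: ordinary R formula syntax, HMC underneath
library(rstanarm)

fit <- stan_glm(
 chose_hospital ~ distance + cost, # hypothetical binary outcome and covariates
 family = binomial(link = "logit"),
 data = df,
 chains = 4, iter = 2000 # HMC settings, just as with rstan
)
summary(fit)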

The aim of this post is to provide an example of implementing a custom probability function in Stan from the likelihood of our model. We will look at the nested logit model. These models have been widely used for multinomial choice problems. An area of interest among health economists is the choice of hospital provider. A basic multinomial choice model, such as a multinomial logit, requires an independence of irrelevant alternatives (IIA) assumption, which says that the odds of choosing one option over another are independent of any other alternative. For example, it would assume that the odds of me choosing the pharmacy in town over the hospital in town would be unaffected by a pharmacy opening on my own road. This is likely too strong. There are many ways to relax this assumption, the nested logit being one. The nested logit is useful when choices can be 'nested' in groups and assumes there is a correlation among choices within each group. For example, we can group health care providers into pharmacies, primary care clinics, hospitals, etc., such as this:

[Figure: an example nesting structure – health care providers grouped into branches (pharmacies, primary care clinics, hospitals), with the individual providers as the options within each branch]
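To make the IIA point concrete before moving on, here is a small numerical sketch (the utility values are invented purely for illustration) showing that, under a plain multinomial logit, opening a new pharmacy leaves the odds between the existing options unchanged – exactly the property the nested logit relaxes:

# Multinomial logit probabilities are the softmax of the deterministic utilities
# The utility values below are made up for illustration only
mnl_probs <- function(v) exp(v) / sum(exp(v))

v_before <- c(hospital = 1.0, pharmacy_town = 0.5)
v_after <- c(v_before, pharmacy_own_road = 0.8) # a similar alternative opens nearby

p_before <- mnl_probs(v_before)
p_after <- mnl_probs(v_after)

p_before["pharmacy_town"] / p_before["hospital"] # odds before
p_after["pharmacy_town"] / p_after["hospital"] # identical odds after: IIA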

Implementation

Econometric model

Firstly, we need a nesting structure for our choices, like that described above. We'll consider a 2-level nesting structure, with T branches and \sum_t R_t total choices, with R_t choices in each branch t. As with most choice models, we start from an additive random utility model, which is, for individual i=1,…,N, and with choice over branch t and option r:

U_{itr} = V_{itr} + \epsilon_{itr}

Then the chosen option is the one with the highest utility. The motivating feature of the nested logit is that the hierarchical structure allows us to factorise the joint probability of choosing branch t and option r into a conditional and a marginal model:

p_{itr} = p_{it} \times p_{ir|t}

Multinomial choice models arise when the errors are assumed to have a generalised extreme value (GEV) distribution, which gives us the multinomial logit model. We will model the deterministic part of the utility with branch-varying and option-varying variables:

V_{itr} = Z_{it}'\alpha + X_{itr}'\beta_t

Then the model can be written as:

p_{itr} = p_{it} \times p_{ir|t} = \frac{exp(Z_{it}'\alpha + \rho_t I_{it})}{\sum_{k \in T} exp(Z_{ik}'\alpha + \rho_k I_{ik})} \times \frac{exp(X_{itr}'\beta_t/\rho_t)}{\sum_{m \in R_t} exp( X_{itm}'\beta_t/\rho_t) }

where \rho_t is variously called a scale parameter, correlation parameter, etc., and captures the within-branch correlation (arising from the GEV distribution). We also have the log-sum, which is also called the inclusive value:

I_{it} = log \left(  \sum_{m \in R_t} exp( X_{itm}'\beta_t/\rho_t)  \right).
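As a quick sanity check on this parameterisation (not part of the original derivation, but a standard property of the nested logit), if \rho_t = 1 for every branch then the conditional and marginal parts multiply back into an ordinary multinomial logit over all options, so the model nests the simpler specification:

p_{itr} \big|_{\rho_t = 1} = \frac{exp(Z_{it}'\alpha) \sum_{m \in R_t} exp(X_{itm}'\beta_t)}{\sum_{k \in T} exp(Z_{ik}'\alpha) \sum_{m \in R_k} exp(X_{ikm}'\beta_k)} \times \frac{exp(X_{itr}'\beta_t)}{\sum_{m \in R_t} exp(X_{itm}'\beta_t)} = \frac{exp(Z_{it}'\alpha + X_{itr}'\beta_t)}{\sum_{k \in T} \sum_{m \in R_k} exp(Z_{ik}'\alpha + X_{ikm}'\beta_k)}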

Now that we have our model set up, the log-likelihood over all individuals is

\sum_{i=1}^N \sum_{t \in T} \sum_{r \in R_t} y_{itr} \left[ Z_{it}'\alpha + \rho_t I_{it} - log \left( \sum_{k \in T} exp(Z_{ik}'\alpha + \rho_k I_{ik}) \right) + X_{itr}'\beta_t/\rho_t - log \left(  \sum_{m \in R_t} exp( X_{itm}'\beta_t/\rho_t) \right) \right]

As a further note, for the model to be compatible with an ARUM specification, a number of conditions need to be satisfied. One of these is that 0<\rho_t \leq 1, so we will impose that restriction. We have also only included alternative-varying variables, but we are often interested in individual-varying variables and in allowing parameters to vary over alternatives; these can simply be added to this specification, but we will leave them out for now to keep things "simple". We will also use basic weakly informative priors and leave prior specification as a separate issue we won't consider further:

\alpha \sim normal(0,5), \beta_t \sim normal(0,5), \rho_t \sim Uniform(0,1)
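Before we get to Stan, it can be helpful to have a plain R version of the per-individual log-likelihood to sanity-check the Stan function against later. The sketch below is my own translation of the likelihood above (the function and argument names are mine), assuming y, Z and X hold a single individual's data laid out as in the Stan program further down:

# Plain R version of the per-individual nested logit log-likelihood
# y: length-sum(R) 0/1 vector; Z: T x P matrix; X: sum(R) x Q matrix;
# R: options per branch; alpha: length P; beta: Q x T; rho: length T
nlogit_loglik <- function(y, Z, X, R, alpha, beta, rho) {
 T_br <- length(R)
 idx <- split(seq_len(sum(R)), rep(seq_len(T_br), R)) # options in each branch
 # inclusive values I_t = log sum_m exp(X_m' beta_t / rho_t)
 I_val <- sapply(seq_len(T_br), function(t)
   log(sum(exp(X[idx[[t]], , drop = FALSE] %*% beta[, t] / rho[t]))))
 # denominator of the marginal (branch) model
 denom <- sum(exp(Z %*% alpha + rho * I_val))
 # add up the log-probability of the chosen option
 ll <- 0
 for (t in seq_len(T_br)) {
   for (r in idx[[t]]) {
     ll <- ll + y[r] * (Z[t, ] %*% alpha + rho[t] * I_val[t] - log(denom) +
       X[r, ] %*% beta[, t] / rho[t] - I_val[t])
   }
 }
 as.numeric(ll)
}

Summing nlogit_loglik over individuals at the true parameter values gives a useful reference point for checking the Stan implementation below.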

Software

DISCLAIMER: This code is unlikely to be the most efficient, nor can I guarantee it is 100% correct – use at your peril!

The following assumes a familiarity with Stan and R.

Stan programs are divided into blocks including data, parameters, and model. The functions block allows us to define custom (log) probability density functions. These take a form something like:

real xxx_lpdf(real y, ...){}

which says that the function outputs a real-valued variable and takes a real-valued variable, y, as one of its arguments. The _lpdf suffix allows the function to act as a log probability density function in the program (and, equivalently, _lpmf is used for log probability mass functions for discrete variables). Now we just have to convert the log-likelihood above into a function. But first, let's consider what data we will be passing to the program:

  • N, the number of observations;
  • T, the number of branches;
  • P, the number of branch-varying variables;
  • Q, the number of choice-varying variables;
  • R, a T x 1 vector with the number of choices in each branch, from which we can also derive the total number of options as sum(R). We will call the total number of options Rk for now;
  • Y, an N x Rk matrix, where Y[i,j] = 1 if individual i=1,…,N chose option j=1,…,Rk;
  • Z, an N x T x P array of branch-varying variables;
  • X, an N x Rk x Q array of choice-varying variables.

And the parameters:

  • \rho, a T x 1 vector of correlation parameters;
  • \alpha, a P x 1 vector of coefficients on the branch-varying variables;
  • \beta, a Q x T matrix of coefficients on the choice-varying variables.

Now, to develop the code, we will specify the function for an individual observation of Y, rather than the whole matrix, and then perform the sum over all individuals in the model block. So we only need to feed each individual's observations into the function rather than the whole data set. The model is specified in blocks as follows (with all the data and parameters as arguments to the function):

functions{
 real nlogit_lpdf(real[] y, real[,] Z, real[,] X, int[] R, 
   vector alpha, matrix beta, vector tau){
//first define our additional local variables
 real lprob; //variable to hold log prob
 int count1; //keep track of which option in the loops
 int count2; //keep track of which option in the loops
 vector[size(R)] I_val; //inclusive values
 real denom; //sum denominator of marginal model
//for the variables appearing in sum loops, set them to zero
 lprob = 0;
 count1 = 0;
 count2 = 0;
 denom = 0;
 
 // determine the log-sum for each conditional model, p_ir|t, 
 //i.e. inclusive value
 for(k in 1:size(R)){
    I_val[k] = 0;
    for(m in 1:R[k]){
       count1 = count1 + 1;
       I_val[k] = I_val[k] + exp(to_row_vector(X[count1,])*
          beta[,k] /tau[k]);
       }
    I_val[k] = log(I_val[k]);
 }
 
 //determine the sum for the marginal model, p_it, denominator
 for(k in 1:size(R)){
    denom = denom + exp(to_row_vector(Z[k,])*alpha + tau[k]*I_val[k]);
 }
 
 //put everything together in the log likelihood
 for(k in 1:size(R)){
    for(m in 1:R[k]){
       count2 = count2 + 1;
        //note the division by tau[k] here, matching the X'beta_t/rho_t term in the likelihood
        lprob = lprob + y[count2]*(to_row_vector(Z[k,])*alpha + 
          tau[k]*I_val[k] - log(denom) + 
          to_row_vector(X[count2,])*beta[,k]/tau[k] - I_val[k]);
       }
    }
// return the log likelihood value
 return lprob;
 }
}
data{
 int N; //number of observations
 int T; //number of branches
 int R[T]; //number of options per branch
 int P; //dim of Z
 int Q; //dim of X
 real y[N,sum(R)]; //outcomes array
 real Z[N,T,P]; //branch-varying variables array
 real X[N,sum(R),Q]; //option-varying variables array
}
parameters{
 vector<lower=0, upper=1>[T] rho; //scale-parameters
 vector[P] alpha; //branch-varying parameters
 matrix[Q,T] beta; //option-varying parameters
}
model{
//specify priors
 for(p in 1:P) alpha[p] ~ normal(0,5); 
 for(q in 1:Q) for(t in 1:T) beta[q,t] ~ normal(0,5);

//loop over all observations with the data 
 for(i in 1:N){
    y[i] ~ nlogit(Z[i,,],X[i,,],R,alpha,beta,rho);
 }
}
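Before wiring the program up to the full simulation below, it can be worth checking that it compiles on its own; a minimal sketch, assuming the program above has been saved as nlogit.stan:

# Compile the Stan program on its own to catch syntax and type errors early
library(rstan)
sm <- stan_model("nlogit.stan") # parser and compiler errors are reported here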

Simulation model

To see whether our model is doing what we hope it's doing, we can run a simple test with simulated data. It may be useful to compare the results we get to those from other estimators; the nested logit is most frequently estimated using a full information maximum likelihood (FIML) estimator. But neither Stata nor R provides a package that estimates a model with branch-varying variables – another reason why we sometimes need to program our own models.

The code we’ll use to simulate the data is:

#### simulate 2-level nested logit data ###

N <- 300 #number of people
P <- 2 #number of branch variant variables
Q <- 2 #number of option variant variables
R <- c(2,2,2) #vector with number of options per branch
T <- length(R) #number of branches
Rk <- sum(R) #number of options

#simulate data

Z <- array(rnorm(N*T*P,0,0.5),dim = c(N,T,P))
X <- array(rnorm(N*Rk*Q,0,0.5), dim = c(N,Rk,Q))

#parameters
rho <- runif(3,0.5,1)
beta <- matrix(rnorm(T*Q,0,1), nrow = Q, ncol = T)
alpha <- rnorm(P,0,1)

#option models #change beta indexing as required
vals_opt <- cbind(exp(X[,1,]%*%beta[,1]/rho[1]),exp(X[,2,]%*%beta[,1]/rho[1]),exp(X[,3,]%*%beta[,2]/rho[2]),
 exp(X[,4,]%*%beta[,2]/rho[2]),exp(X[,5,]%*%beta[,3]/rho[3]),exp(X[,6,]%*%beta[,3]/rho[3]))

incl_val <- cbind(vals_opt[,1]+vals_opt[,2],vals_opt[,3]+vals_opt[,4],vals_opt[,5]+vals_opt[,6])

vals_branch <- cbind(exp(Z[,1,]%*%alpha + rho[1]*log(incl_val[,1])),
 exp(Z[,2,]%*%alpha + rho[2]*log(incl_val[,2])),
 exp(Z[,3,]%*%alpha + rho[3]*log(incl_val[,3])))

sum_branch <- rowSums(vals_branch)

probs <- cbind((vals_opt[,1]/incl_val[,1])*(vals_branch[,1]/sum_branch),
 (vals_opt[,2]/incl_val[,1])*(vals_branch[,1]/sum_branch),
 (vals_opt[,3]/incl_val[,2])*(vals_branch[,2]/sum_branch),
 (vals_opt[,4]/incl_val[,2])*(vals_branch[,2]/sum_branch),
 (vals_opt[,5]/incl_val[,3])*(vals_branch[,3]/sum_branch),
 (vals_opt[,6]/incl_val[,3])*(vals_branch[,3]/sum_branch))

Y = t(apply(probs, 1, rmultinom, n = 1, size = 1))
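As a quick check that the simulation behaves as intended (my addition, not needed for estimation), each row of probs should sum to one and the observed choice frequencies should sit close to the average intended probabilities:

range(rowSums(probs)) # should all be (numerically) equal to 1
round(colMeans(probs), 3) # intended average choice probabilities
round(colMeans(Y), 3) # observed choice frequencies; should be similar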

Then we’ll put the data into a list and run the Stan program with 500 iterations and 3 chains:

data <- list(
 y = Y,
 X = X,
 Z = Z,
 R = R,
 T = T,
 N = N,
 P = P,
 Q = Q
)

require(rstan)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())

fit <- stan("C:/Users/Samuel/Dropbox/Code/nlogit.stan",
 data = data,
 chains = 3,
 iter = 500)

Which gives the following results (with the 25th percentile column dropped to fit on screen):

> print(fit)
Inference for Stan model: nlogit.
3 chains, each with iter=500; warmup=250; thin=1; 
post-warmup draws per chain=250, total post-warmup draws=750.

             mean se_mean   sd    2.5%     50%     75%   97.5% n_eff Rhat
rho[1]       1.00    0.00 0.00    0.99    1.00    1.00    1.00   750  1
rho[2]       0.87    0.00 0.10    0.63    0.89    0.95    1.00   750  1
rho[3]       0.95    0.00 0.04    0.84    0.97    0.99    1.00   750  1
alpha[1]    -1.00    0.01 0.17   -1.38   -0.99   -0.88   -0.67   750  1
alpha[2]    -0.56    0.01 0.16   -0.87   -0.56   -0.45   -0.26   750  1
beta[1,1]   -3.65    0.01 0.32   -4.31   -3.65   -3.44   -3.05   750  1
beta[1,2]   -0.28    0.01 0.24   -0.74   -0.27   -0.12    0.15   750  1
beta[1,3]    0.99    0.01 0.25    0.48    0.98    1.15    1.52   750  1
beta[2,1]   -0.15    0.01 0.25   -0.62   -0.16    0.00    0.38   750  1
beta[2,2]    0.28    0.01 0.24   -0.16    0.28    0.44    0.75   750  1
beta[2,3]    0.58    0.01 0.24    0.13    0.58    0.75    1.07   750  1
lp__      -412.84    0.14 2.53 -418.56 -412.43 -411.05 -409.06   326  1

Samples were drawn using NUTS(diag_e) at Sun May 06 14:16:43 2018.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

Which we can compare to the original parameters:

> beta
           [,1]       [,2]      [,3]
[1,] -3.9381389 -0.3476054 0.7191652
[2,] -0.1182806  0.2736159 0.5237470
> alpha
[1] -0.9654045 -0.6505002
> rho
[1] 0.9503473 0.9950653 0.5801372

You can see that the posterior means and quantiles of the distribution provide pretty good estimates of the original parameters. Convergence diagnostics such as Rhat and traceplots (not reproduced here) show good convergence of the chains. But, of course, this is not enough for us to rely on it completely – you would want to investigate further to ensure that the chains were actually exploring the posterior of interest.
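For those further checks, rstan has some convenient helpers, for example (parameter names as in the program above; check_hmc_diagnostics needs a reasonably recent version of rstan):

# Visual and numerical convergence checks on the fitted model
traceplot(fit, pars = c("rho", "alpha")) # chains should mix and overlap
print(fit, pars = "beta", probs = c(0.025, 0.5, 0.975))
check_hmc_diagnostics(fit) # divergences, tree depth and E-BFMI warnings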

Applications

I am not aware of any examples in health economics of using custom likelihoods in Stan. There are not even many examples of Bayesian nested logit models, one exception being a paper by Lahiri and Gao, who 'analyse' the nested logit using MCMC. But given the limitations of MCMC discussed above, one should prefer the implementation in this post to the MCMC samplers of that paper. It's also a testament to computing advances and Stan that in 2001 an MCMC sampler and analysis could fill a paper in a decent econometrics journal, whereas now we can knock one out for a blog post.

In terms of nested logit models in health economics more generally, there are many examples going back 30 years (e.g. this article from 1987). More recent papers have preferred 'mixed' or 'random parameters' logit or probit specifications, which are much more flexible than the nested logit, and we would generally advise those sorts of models for that reason. The nested logit was used here as an illustrative example of estimating a custom likelihood.

Credit

 

Sam Watson’s journal round-up for 23rd January 2017

Every Monday our authors provide a round-up of some of the most recently published peer reviewed articles from the field. We don’t cover everything, or even what’s most important – just a few papers that have interested the author. Visit our Resources page for links to more journals or follow the HealthEconBot. If you’d like to write one of our weekly journal round-ups, get in touch.

Short-term and long-term effects of GDP on traffic deaths in 18 OECD countries, 1960–2011. Journal of Epidemiology and Community Health [PubMed] Published February 2017

Understanding the relationships between different aspects of the economy or society in the aggregate can reveal knowledge about the world. However, such analyses are more complicated than those of individuals who either did or did not receive an intervention, as the objects of aggregate analyses don't 'exist' per se but are rather descriptions of the average behaviour of the system. To make sense of these analyses, an understanding of the system is therefore required. On these grounds I am a little unsure of the results of this paper, which estimates the effect of GDP on road traffic fatalities in OECD countries over time. It is noted that previous studies have shown that in the short run road traffic deaths are procyclical, but that in the long run they have declined, likely as a result of improved road and car safety. Indeed, this is what they find with their data and models. But what does this result mean in the long run? Have they picked up anything more than a correlation with time? Time is not included in the otherwise carefully specified models, so is the conclusion to policy makers, 'just keep doing what you're doing, whatever that is…'? Models of aggregate phenomena can be among the most interesting, but also among the least convincing (my own included!). That being said, this is better than most.

Sources of geographic variation in health care: Evidence from patient migration. Quarterly Journal of Economics [RePEc] Published November 2016

There are large geographic differences in health care utilisation both between countries and within countries. In the US, for example, the average Medicare enrollee spent around $14,400 in 2010 in Miami, Florida compared with around $7,800 in Minneapolis, Minnesota, even after adjusting for demographic differences. However, higher health care spending is generally not associated with better health outcomes. There is therefore an incentive for policy makers to legislate to reduce this disparity, but what will be effective depends on the causes of the variation. On one side, doctors may be dispensing treatments differently; for example, we previously featured a paper looking at the variation in overuse of medical testing by doctors. On the other side, patients may be sicker or have differing preferences on the intensity of their treatment. To try and distinguish between these two possible sources of variation, this paper uses geographical migration to look at utilisation among people who move from one area to another. They find that (a very specific) 47% of the difference in use of health care is attributable to patient characteristics. However, I (as ever) remain skeptical: a previous post brought up the challenge of ‘transformative treatments’, which may apply here as this paper has to rely on the assumption that patient preferences remain the same when they move. If moving from one city to another changes your preferences over healthcare, then their identification strategy no longer works well.

Seeing beyond 2020: an economic evaluation of contemporary and emerging strategies for elimination of Trypanosoma brucei gambiense. Lancet Global Health Published November 2016

African sleeping sickness, or human African trypanosomiasis, is targeted for elimination in the next decade. However, the strategy to do so has not been determined, nor has it been established whether any such strategy would be a cost-effective use of resources. This paper aims to model all of these different strategies to estimate incremental cost-effectiveness ratios (ICERs). Infectious disease presents an interesting challenge for health economic evaluation, as the disease transmission dynamics need to be captured over time, which they achieve here with a 'standard' epidemiological model using ordinary differential equations. They find that, to reach elimination targets, an approach incorporating case detection, treatment, and vector control would be required.

A conceptual introduction to Hamiltonian Monte Carlo. ArXiv Published 10th January 2017

It is certainly possible to drive a car without understanding how the engine works. But if we want to get more out of the car or modify its components, then we will have to start learning some mechanics. The same is true of statistical software. We can knock out a simple logistic regression without ever really knowing the theory or what the computer is doing. But this 'black box' approach to statistics has clear problems. How do we know the numbers on the screen mean what we think they mean? And if it doesn't work, or if it is running slowly, how do we diagnose the problem? Programs for Bayesian inference can sometimes seem even more opaque than others: one might well ask what those chains are actually exploring, and whether it is even the distribution of interest. Well, over the last few years a new piece of kit, Stan, has become a brilliant and popular tool for Bayesian inference. It achieves fast convergence with less autocorrelation within chains, and so attains a high effective sample size from relatively few iterations. This is due to its implementation of Hamiltonian Monte Carlo. But it is founded in the mathematics of differential geometry, which has restricted the understanding of how it works to a limited few. This paper provides an excellent account of Hamiltonian Monte Carlo, how it works, and when it fails, all replete with figures. While it's not necessary to become a theoretical or computational statistician, it is important, I think, to have a grasp of what the engine is doing if we're going to play around with it.

Credits