Method of the month: custom likelihoods with Stan

Once a month we discuss a particular research method that may be of interest to people working in health economics. We’ll consider widely used key methodologies, as well as more novel approaches. Our reviews are not designed to be comprehensive but provide an introduction to the method, its underlying principles, some applied examples, and where to find out more. If you’d like to write a post for this series, get in touch. This month’s method is custom likelihoods with Stan.


Regular readers of this blog will know that I am a fan of Bayesian methods. The exponential growth in personal computing power has opened up a whole new range of Bayesian models that can be estimated at home. WinBUGS and JAGS have been the go-to pieces of software for estimating Bayesian models, both using Markov chain Monte Carlo (MCMC) methods. In theory, an MCMC chain will explore the posterior distribution. But MCMC has flaws. For example, if the target distribution has a high degree of curvature, as many hierarchical models do, then MCMC chains can have trouble exploring it. To compensate, the chains stay in the 'difficult' part of the space for longer before moving elsewhere, so that their average oscillates around the true value. Asymptotically these oscillations balance out, but in real, finite time they lead to bias. Furthermore, MCMC chains can be very slow to converge to the target distribution, and for complex models this can take a literal lifetime. An alternative, Hamiltonian Monte Carlo (HMC), provides a solution to these issues. Michael Betancourt's introduction to HMC is great for anyone interested in the topic.

Stan is a 'probabilistic programming language' that implements HMC. A huge range of probability distributions is already implemented in the software; check out the manual for more information. There is also an R package, rstanarm, which estimates a number of standard models using normal R code, meaning you can use these tools without learning the Stan language at all. However, Stan may not have the necessary distributions for more complex econometric or statistical models. It used to be the case that you would have to build your own MCMC sampler, but given the problems with MCMC this is now strongly advised against in favour of HMC. Fortunately, we can implement our own probability density functions in Stan. So, if you can write down the (log) likelihood for your model, you can estimate it in Stan!

The aim of this post is to provide an example of implementing a custom probability function in Stan from the likelihood of our model. We will look at the nested logit model. These models have been widely used for multinomial choice problems. An area of interest among health economists is the choice of hospital provider. A basic multinomial choice model, such as a multinomial logit, requires an independence of irrelevant alternatives (IIA) assumption that says the odds of choosing one option over another are independent of any other alternative. For example, it would assume that the odds of me choosing the pharmacy in town over the hospital in town would be unaffected by a pharmacy opening on my own road. This is likely too strong. There are many ways to relax this assumption, the nested logit being one. The nested logit is useful when choices can be 'nested' in groups, and it assumes there is a correlation among choices within each group. For example, we can group health care providers into pharmacies, primary care clinics, hospitals, etc., such as this:



Econometric model

Firstly, we need a nesting structure for our choices, like that described above. We'll consider a 2-level nesting structure with T branches and R_t choices in each branch t, so that there are \sum_t R_t total choices. As with most choice models, we start from an additive random utility model, which is, for individual i=1,…,N choosing option r in branch t:

U_{itr} = V_{itr} + \epsilon_{itr}

Then the chosen option is the one with the highest utility. The motivating feature of the nested logit is that the hierarchical structure allows us to factorise the joint probability of choosing branch t and option r into a conditional and a marginal model:

p_{itr} = p_{it} \times p_{ir|t}

Multinomial choice models arise when the errors are assumed to have a generalised extreme value (GEV) distribution, which gives us the multinomial logit model. We will model the deterministic part of the equation with branch-varying and option-varying variables:

V_{itr} = Z_{it}'\alpha + X_{itr}'\beta_t

Then the model can be written as:

p_{itr} = p_{it} \times p_{ir|t} = \frac{exp(Z_{it}'\alpha + \rho_t I_{it})}{\sum_{k \in T} exp(Z_{ik}'\alpha + \rho_k I_{ik})} \times \frac{exp(X_{itr}'\beta_t/\rho_t)}{\sum_{m \in R_t} exp( X_{itm}'\beta_t/\rho_t) }

where \rho_t is variously called a scale parameter, correlation parameter, etc. and defines the within branch correlation (arising from the GEV distribution). We also have the log-sum, which is also called the inclusive value:

I_{it} = log \left(  \sum_{m \in R_t} exp( X_{itm}'\beta_t/\rho_t)  \right).

Now that we have our model set up, the log likelihood over all individuals is

\sum_{i=1}^N \sum_{t \in T} \sum_{r \in R_t} y_{itr} \left[ Z_{it}'\alpha + \rho_t I_{it} - log \left( \sum_{k \in T} exp(Z_{ik}'\alpha + \rho_k I_{ik}) \right) + X_{itr}'\beta_t/\rho_t - log \left(  \sum_{m \in R_t} exp( X_{itm}'\beta_t/\rho_t) \right) \right]
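To make the algebra concrete, here is a small pure-Python sketch (the utilities and correlation parameters are made up for illustration) that computes the joint probabilities p_{itr} from the conditional-marginal factorisation above and checks that they sum to one:

```python
import math

def nested_logit_probs(V_branch, V_opt, rho):
    """Joint probabilities p_itr for one individual, given branch-level
    utilities V_branch[t] = Z_it'alpha, option-level utilities
    V_opt[t][r] = X_itr'beta_t, and correlation parameters rho[t]."""
    T = len(V_branch)
    # inclusive values: I_it = log(sum_m exp(X_itm'beta_t / rho_t))
    I = [math.log(sum(math.exp(u / rho[t]) for u in V_opt[t])) for t in range(T)]
    # marginal branch probabilities p_it
    denom = sum(math.exp(V_branch[k] + rho[k] * I[k]) for k in range(T))
    p_branch = [math.exp(V_branch[t] + rho[t] * I[t]) / denom for t in range(T)]
    # joint probability p_itr = p_it * p_ir|t
    joint = []
    for t in range(T):
        s = sum(math.exp(u / rho[t]) for u in V_opt[t])
        joint.append([p_branch[t] * math.exp(u / rho[t]) / s for u in V_opt[t]])
    return joint

probs = nested_logit_probs([0.5, -0.2], [[0.1, 0.4], [0.0, -0.3, 0.2]], [0.8, 0.9])
total = sum(sum(row) for row in probs)  # should be 1 by construction
```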

As a further note, for the model to be compatible with an ARUM specification, a number of conditions need to be satisfied. One of these is that 0<\rho_t \leq 1, so we will impose that restriction. We have also only included alternative-varying variables, but we are often interested in individual-varying variables and in allowing parameters to vary over alternatives; these can simply be added to this specification, but we will leave them out for now to keep things "simple". We will also use basic weakly informative priors and leave prior specification as a separate issue we won't consider further:

\alpha \sim normal(0,5), \beta_t \sim normal(0,5), \rho_t \sim Uniform(0,1)


DISCLAIMER: This code is unlikely to be the most efficient, nor can I guarantee it is 100% correct – use at your peril!

The following assumes a familiarity with Stan and R.

Stan programs are divided into blocks including data, parameters, and model. The functions block allows us to define custom (log) probability density functions. These take a form something like:

real xxx_lpdf(real y, ...){}

which says that the function outputs a real-valued variable and takes a real-valued variable, y, as one of its arguments. The _lpdf suffix allows the function to act as a density function in the program (and equivalently _lpmf for log probability mass functions for discrete variables). Now we just have to convert the log likelihood above into a function. But first, let's just consider what data we will be passing to the program:

  • N, the number of observations;
  • T, the number of branches;
  • P, the number of branch-varying variables;
  • Q, the number of choice-varying variables;
  • R, a T x 1 vector with the number of choices in each branch, from which we can also derive the total number of options as sum(R). We will call the total number of options Rk for now;
  • Y, a N x Rk matrix, where Y[i,j] = 1 if individual i=1,…,N chose choice j=1,…,Rk;
  • Z, a N x T x P array of branch-varying variables;
  • X, a N x Rk x Q array of choice-varying variables.

And the parameters:

  • \rho , a T x 1 vector of correlation parameters;
  • \alpha , a P x 1 vector of coefficients on the branch-varying variables;
  • \beta , a Q x T matrix of coefficients on the choice-varying variables.

Now, to develop the code, we will specify the function for an individual observation of Y, rather than the whole matrix, and then perform the sum over all individuals in the model block. So we only need to feed each individual's observations into the function rather than the whole data set. The model is specified in blocks as follows (with all the data and parameters as arguments to the function):

functions{
 real nlogit_lpdf(real[] y, real[,] Z, real[,] X, int[] R, 
   vector alpha, matrix beta, vector tau){
//first define our additional local variables
 real lprob; //variable to hold log prob
 int count1; //keep track of which option in the loops
 int count2; //keep track of which option in the loops
 vector[size(R)] I_val; //inclusive values
 real denom; //sum denominator of marginal model
//for the variables appearing in sum loops, set them to zero
 lprob = 0;
 count1 = 0;
 count2 = 0;
 denom = 0;
 //determine the log-sum for each conditional model, p_ir|t,
 //i.e. inclusive value
 for(k in 1:size(R)){
    I_val[k] = 0;
    for(m in 1:R[k]){
       count1 = count1 + 1;
       I_val[k] = I_val[k] + exp(to_row_vector(X[count1,])*
          beta[,k]/tau[k]);
    }
    I_val[k] = log(I_val[k]);
 }
 //determine the sum for the marginal model, p_it, denominator
 for(k in 1:size(R)){
    denom = denom + exp(to_row_vector(Z[k,])*alpha + tau[k]*I_val[k]);
 }
 //put everything together in the log likelihood
 for(k in 1:size(R)){
    for(m in 1:R[k]){
       count2 = count2 + 1;
       lprob = lprob + y[count2]*(to_row_vector(Z[k,])*alpha + 
         tau[k]*I_val[k] - log(denom) + 
         to_row_vector(X[count2,])*beta[,k]/tau[k] - I_val[k]);
    }
 }
 //return the log likelihood value
 return lprob;
 }
}
data{
 int N; //number of observations
 int T; //number of branches
 int R[T]; //number of options per branch
 int P; //dim of Z
 int Q; //dim of X
 real y[N,sum(R)]; //outcomes array
 real Z[N,T,P]; //branch-varying variables array
 real X[N,sum(R),Q]; //option-varying variables array
}
parameters{
 vector<lower=0, upper=1>[T] rho; //scale parameters
 vector[P] alpha; //branch-varying parameters
 matrix[Q,T] beta; //option-varying parameters
}
model{
//specify priors
 for(p in 1:P) alpha[p] ~ normal(0,5); 
 for(q in 1:Q) for(t in 1:T) beta[q,t] ~ normal(0,5);

//loop over all observations with the data 
 for(i in 1:N){
    y[i] ~ nlogit(Z[i,,],X[i,,],R,alpha,beta,rho);
 }
}

Simulation model

To see whether our model is doing what we're hoping it's doing, we can run a simple test with simulated data. It may be useful to compare the results we get to those from other estimators; the nested logit is most frequently estimated using the full information maximum likelihood (FIML) estimator. But neither Stata nor R provides a package that estimates a model with branch-varying variables – another reason why we sometimes need to program our own models.

The code we’ll use to simulate the data is:

#### simulate 2-level nested logit data ###

N <- 300 #number of people
P <- 2 #number of branch variant variables
Q <- 2 #number of option variant variables
R <- c(2,2,2) #vector with number of options per branch
T <- length(R) #number of branches
Rk <- sum(R) #number of options

#simulate data

Z <- array(rnorm(N*T*P,0,0.5),dim = c(N,T,P))
X <- array(rnorm(N*Rk*Q,0,0.5), dim = c(N,Rk,Q))

rho <- runif(3,0.5,1)
beta <- matrix(rnorm(T*Q,0,1), nrow = Q, ncol = T)
alpha <- rnorm(P,0,1)

#option models #change beta indexing as required
vals_opt <- cbind(exp(X[,1,]%*%beta[,1]/rho[1]),exp(X[,2,]%*%beta[,1]/rho[1]),
 exp(X[,3,]%*%beta[,2]/rho[2]),exp(X[,4,]%*%beta[,2]/rho[2]),
 exp(X[,5,]%*%beta[,3]/rho[3]),exp(X[,6,]%*%beta[,3]/rho[3]))

incl_val <- cbind(vals_opt[,1]+vals_opt[,2],vals_opt[,3]+vals_opt[,4],vals_opt[,5]+vals_opt[,6])

vals_branch <- cbind(exp(Z[,1,]%*%alpha + rho[1]*log(incl_val[,1])),
 exp(Z[,2,]%*%alpha + rho[2]*log(incl_val[,2])),
 exp(Z[,3,]%*%alpha + rho[3]*log(incl_val[,3])))

sum_branch <- rowSums(vals_branch)

probs <- cbind((vals_opt[,1]/incl_val[,1])*(vals_branch[,1]/sum_branch),
 (vals_opt[,2]/incl_val[,1])*(vals_branch[,1]/sum_branch),
 (vals_opt[,3]/incl_val[,2])*(vals_branch[,2]/sum_branch),
 (vals_opt[,4]/incl_val[,2])*(vals_branch[,2]/sum_branch),
 (vals_opt[,5]/incl_val[,3])*(vals_branch[,3]/sum_branch),
 (vals_opt[,6]/incl_val[,3])*(vals_branch[,3]/sum_branch))

Y = t(apply(probs, 1, rmultinom, n = 1, size = 1))
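For readers unfamiliar with R's rmultinom, a hedged pure-Python analogue of this final step (with a made-up probability row) shows what is going on: applying rmultinom with n = 1 and size = 1 row by row just draws one option per person from that person's probability vector and one-hot encodes it.

```python
import random

random.seed(42)

def one_hot_draw(probs):
    """Draw one option from a probability vector and one-hot encode it."""
    u, cum = random.random(), 0.0
    for j, p in enumerate(probs):
        cum += p
        if u < cum:
            return [int(k == j) for k in range(len(probs))]
    # guard against floating-point shortfall in the cumulative sum
    return [int(k == len(probs) - 1) for k in range(len(probs))]

probs = [0.10, 0.20, 0.05, 0.25, 0.30, 0.10]   # one illustrative row of probs
Y = [one_hot_draw(probs) for _ in range(300)]   # N = 300 simulated choices
```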

Then we’ll put the data into a list and run the Stan program with 500 iterations and 3 chains:

data <- list(
 y = Y,
 X = X,
 Z = Z,
 R = R,
 T = T,
 N = N,
 P = P,
 Q = Q
)

rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())

fit <- stan("C:/Users/Samuel/Dropbox/Code/nlogit.stan",
 data = data,
 chains = 3,
 iter = 500)

Which gives results (with the 25th and 75th percentiles dropped to fit on screen):

> print(fit)
Inference for Stan model: nlogit.
3 chains, each with iter=500; warmup=250; thin=1; 
post-warmup draws per chain=250, total post-warmup draws=750.

             mean se_mean   sd    2.5%     50%     75%   97.5% n_eff Rhat
rho[1]       1.00    0.00 0.00    0.99    1.00    1.00    1.00   750  1
rho[2]       0.87    0.00 0.10    0.63    0.89    0.95    1.00   750  1
rho[3]       0.95    0.00 0.04    0.84    0.97    0.99    1.00   750  1
alpha[1]    -1.00    0.01 0.17   -1.38   -0.99   -0.88   -0.67   750  1
alpha[2]    -0.56    0.01 0.16   -0.87   -0.56   -0.45   -0.26   750  1
beta[1,1]   -3.65    0.01 0.32   -4.31   -3.65   -3.44   -3.05   750  1
beta[1,2]   -0.28    0.01 0.24   -0.74   -0.27   -0.12    0.15   750  1
beta[1,3]    0.99    0.01 0.25    0.48    0.98    1.15    1.52   750  1
beta[2,1]   -0.15    0.01 0.25   -0.62   -0.16    0.00    0.38   750  1
beta[2,2]    0.28    0.01 0.24   -0.16    0.28    0.44    0.75   750  1
beta[2,3]    0.58    0.01 0.24    0.13    0.58    0.75    1.07   750  1
lp__      -412.84    0.14 2.53 -418.56 -412.43 -411.05 -409.06   326  1

Samples were drawn using NUTS(diag_e) at Sun May 06 14:16:43 2018.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

We can compare these to the original parameters:

> beta
           [,1]       [,2]      [,3]
[1,] -3.9381389 -0.3476054 0.7191652
[2,] -0.1182806  0.2736159 0.5237470
> alpha
[1] -0.9654045 -0.6505002
> rho
[1] 0.9503473 0.9950653 0.5801372

You can see that the posterior means and quantiles of the distribution provide pretty good estimates of the original parameters. Convergence diagnostics such as Rhat and traceplots (not reproduced here) show good convergence of the chains. But, of course, this is not enough for us to rely on it completely – you would want to investigate further to ensure that the chains were actually exploring the posterior of interest.


I am not aware of any examples in health economics of using custom likelihoods in Stan. There are not even many examples of Bayesian nested logit models, one exception being a paper by Lahiri and Gao, who 'analyse' the nested logit using MCMC. But given the limitations of MCMC discussed above, the implementation in this post should be preferred to the MCMC samplers of that paper. It's also a testament to advances in computing, and to Stan, that in 2001 an MCMC sampler and analysis could fill a paper in a decent econometrics journal, whereas now we can knock one out for a blog post.

In terms of nested logit models in general in health economics, there are many examples going back 30 years (e.g. this article from 1987). More recent papers have preferred “mixed” or “random parameters” logit or probit specifications, which are much more flexible than the nested logit. We would advise these sorts of models for this reason. The nested logit was used as an illustrative example of estimating custom likelihoods for this post.



Bad science in health economics: complementary medicine, costs and mortality

By Chris Sampson, David Whitehurst and Andrew Street

In December 2012, an article was published in The European Journal of Health Economics with the title ‘Patients whose GP knows complementary medicine tend to have lower costs and live longer’. We spotted a number of shortcomings in the analysis and reporting, to which we felt a response was worthwhile. Subsequently the authors of the original piece, Professor Peter Kooreman and Dr Erik Baars, wrote a reply. In this blog post we summarise the debate and offer some concluding thoughts.

The study

The study employed a large dataset (n~150,000) from a Dutch health insurer. The objective of the study was “to explore the cost-effectiveness of CAM compared with conventional medicine”. The study sought to find out whether different levels of cost or mortality were observed depending on whether or not an individual’s general practitioner (GP) was trained in complementary and alternative medicine (CAM). The authors specifically looked at GPs trained in anthroposophy, homeopathy and acupuncture.

The authors implemented both a linear and log-linear regression model to estimate the cost differences associated with different types of CAM-training. Separate regressions were carried out for each type of CAM, for four different age groups and for five different cost categories. This gave a total of 120 different coefficients (2 (models) x 3 (CAM approaches) x 4 (age groups) x 5 (cost categories)) for the cost difference associated with CAM-training. Eighteen (15%) of these coefficients were negative (indicating positive findings attributable to CAM training) and statistically significant at the 5% level. Three (2.5%) coefficients showed a greater cost associated with CAM training.

For mortality effects, the authors implemented both a fixed effects logit and a fixed effects linear probability model (LPM). In this case the groups were split by sex and, again, by type of CAM-training; additionally an overall effect of CAM-training was included. This gave a total of 24 different coefficients for the mortality difference associated with CAM-training. Four (16.7%) of these were lower and statistically significant at the 5% level; all from the LPM.

The authors concluded that “patients whose GP has additional CAM training have 0–30% lower healthcare costs and mortality rates, depending on age groups and type of CAM”; adding that “since the differences are obtained while controlling for confounders… the lower costs and longer lives are unlikely to be related to differences in socioeconomic status.”

The study’s faults

A major problem with the study is one of selection. Selection operates in two ways here: individuals select whether or not to register with CAM-trained GPs, and GPs select whether to pursue CAM training. Patients who register with CAM-trained GPs may have different characteristics from those who do not, and may exhibit different levels of cost and mortality as a result of these characteristics rather than of CAM itself. The risk-adjustment the authors perform is the only way they deal with selection, and the set of risk-adjusters is very small, including only age, gender and postal code. The authors defend their position by citing a paper suggesting that selection bias might operate in the other direction. Neither we nor the authors can prove this one way or another. To thoroughly address selection, a larger set of risk-adjusters should be included, and an approach such as propensity score matching would have been superior to the model adopted by the authors.

In reporting and reflecting upon their analyses, the authors do not recognise the problems associated with multiple testing. The authors appear to misunderstand the familywise error rate and the implications of this for the results that are currently shown as statistically significant. The authors should have accounted for this, using a method such as the Bonferroni correction.
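The scale of the multiple-testing problem is easy to see with a back-of-the-envelope calculation. The 120 tests and 5% level come from the study itself; treating the tests as independent is a simplifying assumption for illustration.

```python
# familywise error when running many tests at a fixed per-test level
alpha, n_tests = 0.05, 120

# expected number of 'significant' results even if no effect exists at all
expected_false_positives = alpha * n_tests        # 6 of the 120

# probability of at least one false positive across the family of tests
fwer = 1 - (1 - alpha) ** n_tests                 # very close to 1

# Bonferroni: shrink the per-test threshold to hold the familywise rate at 5%
bonferroni_threshold = alpha / n_tests
```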

The primary claims of the study are that patients with CAM-trained GPs had “0–30% lower costs” and “0–30% lower mortality rates”. These claims can be found throughout the original study, including the title, and in the authors’ subsequent dealings with the media. We believe that the first claim is a ‘cherry-picked’ finding; the second is simply false.

With regard to costs, as identified in the authors’ reply, the 30% relates specifically to patients “aged 75 and above with an anthroposophic GP-CAM”. But there are some coefficients that show a greater cost associated with CAM-trained GPs. Yet the paper’s title and publicity statements focus on this significant result alone. This is not an accurate reflection of the cost implications for patients in general, and highlighting this cherry-picked result is a misleading representation of the overall effects. A more appropriate way of reporting the results would have been to present the expected cost differences across the whole sample.

The analysis of mortality is simply incorrect. Mortality risk is bounded between 0 and 1, but the linear probability model is unbounded, making it inappropriate for modelling mortality data. The logit model is designed for binary outcomes, and when this is employed the mortality differences are no longer statistically significant at the 5% level. But even the logit is inappropriate for these data because mortality is an infrequent event (around 3% of the sample died). A probit model would be preferable, and we suspect that, had a probit been employed, no significant differences would be found. In short, the 'significant' effects that the authors identify are due to incorrect model specification.
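The unboundedness problem is easy to demonstrate by simulation. The sketch below uses entirely hypothetical data with a rare outcome (a few percent of events, loosely echoing the mortality rate above) and fits the LPM by ordinary least squares; some of its fitted 'probabilities' fall below zero.

```python
import math
import random

random.seed(0)

# hypothetical data: a rare binary outcome driven by one covariate
n = 2000
x = [random.gauss(0, 1) for _ in range(n)]
y = [1 if random.random() < 1 / (1 + math.exp(-(-3.5 + 1.5 * xi))) else 0
     for xi in x]

# fit the linear probability model y = a + b*x by OLS
xbar, ybar = sum(x) / n, sum(y) / n
b = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
     / sum((xi - xbar) ** 2 for xi in x))
a = ybar - b * xbar

preds = [a + b * xi for xi in x]
out_of_bounds = sum(1 for p in preds if p < 0 or p > 1)  # negative 'probabilities'
```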

In their responses, the authors retreated from their original emphasis on the significance of the mortality results saying that “our results do not show any evidence that patients of GP-CAMs have higher mortality rates”. We agree with this re-statement. Nevertheless, the title of the paper remains “Patients whose GP knows complementary medicine tend to … live longer”, which the authors now appear to admit is false.

Closing remarks

The study was available in its current form, as well as in earlier versions, long before it was published in the EJHE. As a result, the study's inaccurate claims have been repeated in a number of papers that cite the work in relation to herbal medicine and CAM in primary care. The publicity surrounding these claims, and the authors' conduct with the media, has been discussed elsewhere (English translation).

We believe that the original study and the response pieces might be used as a case study to aid teaching. To this end we have provided material to the Health Economics Education website. In addition, please do consider commenting below to develop the discussion – whatever your thoughts on the matter. Do you see other flaws in the study design? Or maybe you think some of our comments are unfounded? Are there better ways of studying important questions such as these?

To ‘y’ or not to ‘y’: Dichotomous outcomes and endogenous regressors

Empirical research into health and health-related outcomes is characterised by a predominance of binary outcomes. Mortality, for example, is one of the most popular outcomes (in a statistical sense). This type of outcome warrants its own type of model in which the outcome is observed with probability p – the outcome is a draw from a Bernoulli distribution with this probability, which is allowed to vary across individuals. One distinction arises between the economics literature and the medical literature in that the linear probability model (LPM) is fairly popular in the former whereas it is never seen in the latter. The LPM is simply OLS estimation of the binary outcome on the regressors and is popular due to its easily interpretable marginal effects (equal to the estimated coefficient) and the ease with which other procedures, such as instrumental variables estimation, can be used. However, the LPM may predict probabilities outside of the zero to one range and may even be inconsistent in many cases (see here). The probit model is also popular in the economics literature but is likewise rarely seen in medical studies. It is the logit which is ubiquitous in these analyses. But, in the medical literature, studies don't usually try to address endogeneity, whereas in economics we often do. Usually instrumental variables are employed to tackle this problem, but how do we use them in logit models?
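The interpretability contrast above can be made concrete with a short sketch (the coefficient values are hypothetical): in the LPM the marginal effect is the coefficient itself, whereas in a logit it is \beta p(1-p), which must be evaluated at a chosen value of the covariate.

```python
import math

beta0, beta1 = -2.0, 0.5   # hypothetical logit intercept and slope

def p(x):
    """P(y = 1 | x) under the logit model."""
    return 1 / (1 + math.exp(-(beta0 + beta1 * x)))

def marginal_effect(x):
    """dP/dx = beta1 * p(x) * (1 - p(x)): varies with x, unlike in the LPM."""
    return beta1 * p(x) * (1 - p(x))

me_at_zero = marginal_effect(0.0)   # about 0.052, well below beta1 = 0.5
```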

In a linear model the two stage least squares (2SLS) estimator can be used. The endogenous variables are regressed on all the exogenous variables including the instruments, then the predicted value of the endogenous variable is used in place of its actual value in a regression. But in a model where the outcome is a nonlinear function of the regressors, such as a logit, this method would be inconsistent. To see why, note that we are trying to estimate a model that assumes:

E(y|\textbf{x},\textbf{z},c)=m(\textbf{x}_1'\boldsymbol{\beta}+\textbf{z}'\gamma+c) (1)

where \textbf{x} is assumed to be exogenous, of which \textbf{x}_1 is a subset; c is unobserved; and \textbf{z} is allowed to be correlated with c, so that it is potentially endogenous. We model the endogenous variables and the unobserved term as:

\textbf{z} = \textbf{x}'\boldsymbol{\Pi} + \textbf{v}

c = \textbf{v}'\boldsymbol{\rho} + e

where e is independent of \textbf{x} and \textbf{v}.
The issue is that if ρ≠0 then z is endogenous. For our estimates to be consistent we essentially require the conditional mean in (1) to be correctly specified. We have two options: we can estimate \hat{\textbf{z}} and substitute it for \textbf{z} in (1), or we can eliminate c. In the latter case, assuming E(e)=0, we can rewrite (1) as:

E(y|\textbf{x},\textbf{z},\textbf{v})=m(\textbf{x}_1'\boldsymbol{\beta}+\textbf{z}'\gamma+\textbf{v}'\boldsymbol{\rho}) (2)

We do not observe \textbf{v}; however, we can consistently estimate it as \hat{\textbf{v}}=\textbf{z}-\textbf{x}'\hat{\boldsymbol{\Pi}} and include these values in our regression. This method is known as two stage residual inclusion (2SRI).
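A minimal sketch of 2SRI, using an entirely hypothetical data-generating process (a single instrument w, one exogenous covariate x1, one endogenous regressor z, and c set equal to \rho v exactly, omitting the extra error e for simplicity). The first stage is OLS; the second-stage logit is fitted by plain Newton-Raphson rather than any packaged estimator.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# hypothetical DGP: z is endogenous because it shares the error v with c
x1 = rng.normal(size=n)
w = rng.normal(size=n)                   # the instrument
v = rng.normal(size=n)
z = 0.5 * x1 + 1.0 * w + v               # first stage: z = x'Pi + v
c = 0.8 * v                              # unobserved term correlated with z
p = 1 / (1 + np.exp(-(0.5 * x1 + 1.0 * z + c)))
y = rng.binomial(1, p)

# Stage 1: regress z on all exogenous variables (x1 and the instrument w)
X1 = np.column_stack([np.ones(n), x1, w])
pi_hat, *_ = np.linalg.lstsq(X1, z, rcond=None)
v_hat = z - X1 @ pi_hat                  # estimated first-stage residuals

# Stage 2: logit of y on x1, z and v_hat, via Newton-Raphson (IRLS) steps
X2 = np.column_stack([np.ones(n), x1, z, v_hat])
beta = np.zeros(X2.shape[1])
for _ in range(25):
    mu = 1 / (1 + np.exp(-X2 @ beta))
    W = mu * (1 - mu)
    beta += np.linalg.solve(X2.T @ (X2 * W[:, None]), X2.T @ (y - mu))
# beta[2] should be close to the true effect of z (1.0 here), and beta[3]
# estimates rho, the coefficient on the included residuals
```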

Our other method, however, is inconsistent: we can use estimates \hat{\textbf{z}} in place of \textbf{z}, but this does not eliminate c since, even though E(c)=0, the expectation does not 'pass through' the nonlinear function m(.).

A further useful feature follows from the fact that \textbf{z} is exogenous only if ρ=0. We can test this empirically when we estimate (2); this is equivalent to a Hausman test for the exogeneity of \textbf{z}.

The standard errors won't be correct when estimating (2), since \hat{\textbf{v}} is itself an estimate. Calculating the correct standard errors is not too difficult. But, often in health econometric applications, we want to adjust for clustering within hospitals or regions. Accommodating this in the analytical standard errors makes the calculation considerably more difficult, if not intractable. In this case, bootstrapping is the preferred solution.
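The clustered bootstrap idea can be sketched in a few lines (hypothetical data and a placeholder estimator standing in for the full 2SRI fit): resample whole clusters with replacement and re-estimate each time, so within-cluster dependence is preserved in every bootstrap replicate.

```python
import random
import statistics

random.seed(3)

# hypothetical data: 10 clusters of 20 observations each
clusters = {g: [random.gauss(0.1 * g, 1) for _ in range(20)] for g in range(10)}

def estimate(data):
    # placeholder 'estimator': the grand mean of the outcome
    return statistics.mean(x for cluster in data for x in cluster)

ids = list(clusters)
boot = []
for _ in range(500):
    draw = [clusters[random.choice(ids)] for _ in ids]  # resample whole clusters
    boot.append(estimate(draw))

cluster_se = statistics.stdev(boot)  # cluster bootstrap standard error
```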

I have noted in previous posts that endogeneity could be a serious issue in health econometrics, particularly when these types of studies are used to inform healthcare policy. Clearly there are methods for dealing with this, though having a suitable methodology is only the first hurdle. The next one is convincing non-economists why you are using it.

See more

2SRI in health econometrics is discussed in this paper. For a more thorough discussion see Wooldridge Chapter 12.