# Method of the month: custom likelihoods with Stan

Once a month we discuss a particular research method that may be of interest to people working in health economics. We’ll consider widely used key methodologies, as well as more novel approaches. Our reviews are not designed to be comprehensive but provide an introduction to the method, its underlying principles, some applied examples, and where to find out more. If you’d like to write a post for this series, get in touch. This month’s method is custom likelihoods with Stan.

# Principles

Regular readers of this blog will know that I am a fan of Bayesian methods. The exponential growth in personal computing power has opened up a whole new range of Bayesian models at home. WinBUGS and JAGS were the go-to pieces of software for estimating Bayesian models, both using Markov Chain Monte Carlo (MCMC) methods.  Theoretically, an MCMC chain will explore the posterior distribution. But MCMC has flaws. For example, if the target distribution has a high degree of curvature, such as many hierarchical models might exhibit, then MCMC chains can have trouble exploring it. To compensate, the chains stay in the ‘difficult’ bit of the space for longer before leaving to go elsewhere so its average oscillates around the true value. Asymptotically, these oscillations balance out, but in real, finite time, they ultimately lead to bias. And further, MCMC chains are very slow to converge to the target distribution, and for complex models can take a literal lifetime. An alternative, Hamiltonian Monte Carlo (HMC), provides a solution to these issues. Michael Betancourt’s introduction to HCM is great for anyone interested in the topic.

Stan is a ‘probabilistic programming language’ that implements HMC. A huge range of probability distributions are already implemented in the software, check out the manual for more information. And there is an R package, rstanarm, that estimates a number of standard models using normal R code that even means you can use these tools without learning the code. However, Stan may not have the necessary distributions for more complex econometric or statistical models. It used to be the case that you would have to build your own MCMC sampler – but given the problems with MCMC, this is now strongly advised against in lieu of HMC. Fortunately, we can implement our own probability density functions in Stan. So, if you can write down the (log) likelihood for your model, you can estimate it in Stan!

The aim of this post is to provide an example of implementing a custom probability function in Stan from the likelihood of our model. We will look at the nested logit model. These models have been widely used for multinomial choice problems. An area of interest among health economists is the choice of hospital provider. A basic multinomial choice model, such as a multinomial logit, requires an independence of irrelevant alternatives (IIA) assumption that says the odds of choosing one option over another is independent of any other alternative. For example, it would assume that the odds of me choosing the pharmacy in town over the hospital in town would be unaffected by a pharmacy opening on my own road. This is likely too strong. There are many ways to relax this assumption, the nested logit being one. The nested logit is useful when choices can be ‘nested’ in groups and assumes there is a correlation among choices with each group. For example, we can group health care providers into pharmacies, primary care clinics, hospitals, etc. such as this:

# Implementation

## Econometric model

Firstly, we need a nesting structure for our choices, like that described above. We’ll consider a 2-level nesting structure, with branches and total choices, with Rt choices in each branch t. Like with most choice models we start from an additive random utility model, which is, for individual i=1,…,N, and with choice over branch and option:

$U_{itr} = V_{itr} + \epsilon_{itr}$

Then the chosen option is the one with the highest utility. The motivating feature of the nested logit is that the hierarchical structure allows us to factorise the joint probability of choosing branch and option r into a conditional and marginal model:

$p_{itr} = p_{it} \times p_{ir|t}$

Multinomial choice models arise when the errors are assumed to have a generalised extreme value (GEV) distribution, which gives use the multinomial logit model. We will model the deterministic part of the equation with branch-varying and option-varying variables:

$V_{itr} = Z_{it}'\alpha + X_{itr}'\beta_t$

Then the model can be written as:

$p_{itr} = p_{it} \times p_{ir|t} = \frac{exp(Z_{it}'\alpha + \rho_t I_{it})}{\sum_{k \in T} exp(Z_{ik}'\alpha + \rho_k I_{ik})} \times \frac{exp(X_{itr}'\beta_t/\rho_t)}{\sum_{m \in R_t} exp( X_{itm}'\beta_t/\rho_t) }$

where $\rho_t$ is variously called a scale parameter, correlation parameter, etc. and defines the within branch correlation (arising from the GEV distribution). We also have the log-sum, which is also called the inclusive value:

$I_{it} = log \left( \sum_{m \in R_t} exp( X_{itm}'\beta_t/\rho_t) \right)$.

Now we have our model setup, the log likelihood over all individuals is

$\sum_{i=1}^N \sum_{k \in T} \sum_{m \in R_t} y_{itr} \left[ Z_{it}'\alpha + \rho_t I_{it} - log \left( \sum_{k \in T} exp(Z_{ik}'\alpha + \rho_k I_{ik}) \right) + X_{itr}'\beta_t/\rho_t - log \left( \sum_{m \in R_t} exp( X_{itm}'\beta_t/\rho_t) \right) \right]$

As a further note, for the model to be compatible with an ARUM specification, a number of conditions need to be satisfied. One of these is satisfied is $0<\rho_t \leq 1$, so we will make that restriction. We have also only included alternative-varying variables, but we are often interested in individual varying variables and allowing parameters to vary over alternatives, which can be simply added to this specification, but we will leave them out for now to keep things “simple”. We will also use basic weakly informative priors and leave prior specification as a separate issue we won’t consider further:

$\alpha \sim normal(0,5), \beta_t \sim normal(0,5), \rho_t \sim Uniform(0,1)$

## Software

DISCLAIMER: This code is unlikely to be the most efficient, nor can I guarantee it is 100% correct – use at your peril!

The following assumes a familiarity with Stan and R.

Stan programs are divided into blocks including data, parameters, and model. The functions block allows us to define custom (log) probability density functions. These take a form something like:

real xxx_lpdf(real y, ...){}

which says that the function outputs a real valued variable and take a real valued variable, y, as one of its arguments. The _lpdf suffix allows the function to act as a density function in the program (and equivalently _lpmf for log probability mass functions for discrete variables). Now we just have to convert the log likelihood above into a function. But first, let’s just consider what data we will be passing to the program:

• N, the number of observations;
• T, the number of branches;
• P, the number of branch-varying variables;
• Q, the number of choice-varying variables;
• R, a T x 1 vector with the number of choices in each branch, from which we can also derive the total number of options as sum(R). We will call the total number of options Rk for now;
• Y, a N x Rk vector, where Y[i,j] = 1 if individual i=1,…,N chose choice j=1,…,Rk;
• Z, a N x T x P array of branch-varying variables;
• X, a N x Rk x Q array of choice-varying variables.

And the parameters:

• $\rho$, a T x 1 vector of correlation parameters;
• $\alpha$, a P x 1 vector of branch-level covariates;
• $\beta$, a P x T matrix of choice-varying covariates.

Now, to develop the code, we will specify the function for individual observations of Y, rather than the whole matrix, and then perform the sum over all the individuals in the model block. So we only need to feed in each individual’s observations into the function rather than the whole data set. The model is specified in blocks as follows (with all the data and parameter as arguments to the function):

functions{
real nlogit_lpdf(real[] y, real[,] Z, real[,] X, int[] R,
vector alpha, matrix beta, vector tau){
//first define our additional local variables
real lprob; //variable to hold log prob
int count1; //keep track of which option in the loops
int count2; //keep track of which option in the loops
vector[size(R)] I_val; //inclusive values
real denom; //sum denominator of marginal model
//for the variables appearing in sum loops, set them to zero
lprob = 0;
count1 = 0;
count2 = 0;
denom = 0;

// determine the log-sum for each conditional model, p_ir|t,
//i.e. inclusive value
for(k in 1:size(R)){
I_val[k] = 0;
for(m in 1:R[k]){
count1 = count1 + 1;
I_val[k] = I_val[k] + exp(to_row_vector(X[count1,])*
beta[,k] /tau[k]);
}
I_val[k] = log(I_val[k]);
}

//determine the sum for the marginal model, p_it, denomininator
for(k in 1:size(R)){
denom = denom + exp(to_row_vector(Z[k,])*alpha + tau[k]*I_val[k]);
}

//put everything together in the log likelihood
for(k in 1:size(R)){
for(m in 1:R[k]){
count2 = count2 + 1;
lprob = lprob + y[count2]*(to_row_vector(Z[k,])*alpha +
tau[k]*I_val[k] - log(denom) +
to_row_vector(X[count2,])*beta[,k] - I_val[k]);
}
}
// return the log likelihood value
return lprob;
}
}
data{
int N; //number of observations
int T; //number of branches
int R[T]; //number of options per branch
int P; //dim of Z
int Q; //dim of X
real y[N,sum(R)]; //outcomes array
real Z[N,T,P]; //branch-varying variables array
real X[N,sum(R),Q]; //option-varying variables array
}
parameters{
vector<lower=0, upper=1>[T] rho; //scale-parameters
vector[P] alpha; //branch-varying parameters
matrix[Q,T] beta; //option-varying parameters
}
model{
//specify priors
for(p in 1:P) alpha[p] ~ normal(0,5);
for(q in 1:Q) for(t in 1:T) beta[q,t] ~ normal(0,5);

//loop over all observations with the data
for(i in 1:N){
y[i] ~ nlogit(Z[i,,],X[i,,],R,alpha,beta,rho);
}
}

## Simulation model

To see whether our model is doing what we’re hoping it’s doing, we can run a simple test with simulated data. It may be useful to compare the result we get to those from other estimators; the nested logit is most frequently estimated using the FIML estimator. But, neither Stata nor R provide packages that estimate a model with branch-varying variables – another reason why we sometimes need to program our own models.

The code we’ll use to simulate the data is:

#### simulate 2-level nested logit data ###

N <- 300 #number of people
P <- 2 #number of branch variant variables
Q <- 2 #number of option variant variables
R <- c(2,2,2) #vector with number of options per branch
T <- length(R) #number of branches
Rk <- sum(R) #number of options

#simulate data

Z <- array(rnorm(N*T*P,0,0.5),dim = c(N,T,P))
X <- array(rnorm(N*Rk*Q,0,0.5), dim = c(N,Rk,Q))

#parameters
rho <- runif(3,0.5,1)
beta <- matrix(rnorm(T*Q,0,1),c(Q,T))
alpha <- rnorm(P,0,1)

#option models #change beta indexing as required
vals_opt <- cbind(exp(X[,1,]%*%beta[,1]/rho[1]),exp(X[,2,]%*%beta[,1]/rho[1]),exp(X[,3,]%*%beta[,2]/rho[2]),
exp(X[,4,]%*%beta[,2]/rho[2]),exp(X[,5,]%*%beta[,3]/rho[3]),exp(X[,6,]%*%beta[,3]/rho[3]))

incl_val <- cbind(vals_opt[,1]+vals_opt[,2],vals_opt[,3]+vals_opt[,4],vals_opt[,5]+vals_opt[,6])

vals_branch <- cbind(exp(Z[,1,]%*%alpha + rho[1]*log(incl_val[,1])),
exp(Z[,2,]%*%alpha + rho[2]*log(incl_val[,2])),
exp(Z[,3,]%*%alpha + rho[3]*log(incl_val[,3])))

sum_branch <- rowSums(vals_branch)

probs <- cbind((vals_opt[,1]/incl_val[,1])*(vals_branch[,1]/sum_branch),
(vals_opt[,2]/incl_val[,1])*(vals_branch[,1]/sum_branch),
(vals_opt[,3]/incl_val[,2])*(vals_branch[,2]/sum_branch),
(vals_opt[,4]/incl_val[,2])*(vals_branch[,2]/sum_branch),
(vals_opt[,5]/incl_val[,3])*(vals_branch[,3]/sum_branch),
(vals_opt[,6]/incl_val[,3])*(vals_branch[,3]/sum_branch))

Y = t(apply(probs, 1, rmultinom, n = 1, size = 1))

Then we’ll put the data into a list and run the Stan program with 500 iterations and 3 chains:

data <- list(
y = Y,
X = X,
Z = Z,
R = R,
T = T,
N = N,
P = P,
Q = Q
)

require(rstan)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())

fit <- stan("C:/Users/Samuel/Dropbox/Code/nlogit.stan",
data = data,
chains = 3,
iter = 500)

Which gives results (with 25t and 75th percentiles dropped to fit on screen):

> print(fit)
Inference for Stan model: nlogit.
3 chains, each with iter=500; warmup=250; thin=1;
post-warmup draws per chain=250, total post-warmup draws=750.

mean se_mean   sd    2.5%     50%     75%   97.5% n_eff Rhat
rho[1]       1.00    0.00 0.00    0.99    1.00    1.00    1.00   750  1
rho[2]       0.87    0.00 0.10    0.63    0.89    0.95    1.00   750  1
rho[3]       0.95    0.00 0.04    0.84    0.97    0.99    1.00   750  1
alpha[1]    -1.00    0.01 0.17   -1.38   -0.99   -0.88   -0.67   750  1
alpha[2]    -0.56    0.01 0.16   -0.87   -0.56   -0.45   -0.26   750  1
beta[1,1]   -3.65    0.01 0.32   -4.31   -3.65   -3.44   -3.05   750  1
beta[1,2]   -0.28    0.01 0.24   -0.74   -0.27   -0.12    0.15   750  1
beta[1,3]    0.99    0.01 0.25    0.48    0.98    1.15    1.52   750  1
beta[2,1]   -0.15    0.01 0.25   -0.62   -0.16    0.00    0.38   750  1
beta[2,2]    0.28    0.01 0.24   -0.16    0.28    0.44    0.75   750  1
beta[2,3]    0.58    0.01 0.24    0.13    0.58    0.75    1.07   750  1
lp__      -412.84    0.14 2.53 -418.56 -412.43 -411.05 -409.06   326  1

Samples were drawn using NUTS(diag_e) at Sun May 06 14:16:43 2018.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).

Which we can compare to the original parameters:

> beta
[,1]       [,2]      [,3]
[1,] -3.9381389 -0.3476054 0.7191652
[2,] -0.1182806  0.2736159 0.5237470
> alpha
[1] -0.9654045 -0.6505002
> rho
[1] 0.9503473 0.9950653 0.5801372

You can see that the posterior means and quantiles of the distribution provide pretty good estimates of the original parameters. Convergence diagnostics such as Rhat and traceplots (not reproduced here) show good convergence of the chains. But, of course, this is not enough for us to rely on it completely – you would want to investigate further to ensure that the chains were actually exploring the posterior of interest.

# Applications

I am not aware of any examples in health economics of using custom likelihoods in Stan. There are not even many examples of Bayesian nested logit models, one exception being a paper by Lahiri and Gao, who ‘analyse’ the nested logit using MCMC. But given the limitations of MCMC discussed above, one should prefer this implementation in the post rather than the MCMC samplers of that paper. It’s also a testament to computing advances and Stan that in 2001 an MCMC sampler and analysis could fill a paper in a decent econometrics journal and now we can knock one out for a blog post.

In terms of nested logit models in general in health economics, there are many examples going back 30 years (e.g. this article from 1987). More recent papers have preferred “mixed” or “random parameters” logit or probit specifications, which are much more flexible than the nested logit. We would advise these sorts of models for this reason. The nested logit was used as an illustrative example of estimating custom likelihoods for this post.

Credit