Method of the month: Semiparametric models with penalised splines

Once a month we discuss a particular research method that may be of interest to people working in health economics. We’ll consider widely used key methodologies, as well as more novel approaches. Our reviews are not designed to be comprehensive but provide an introduction to the method, its underlying principles, some applied examples, and where to find out more. If you’d like to write a post for this series, get in touch. This month’s method is semiparametric models with penalised splines.

Principles

A common assumption of regression models is that effects are linear and additive. However, nothing is ever really that simple. One might respond that all models are wrong, but some are useful, as George Box once said. And the linear, additive regression model has coefficients that can be interpreted as average treatment effects under the right assumptions. Sometimes, though, we are interested in conditional average treatment effects: how the impact of an intervention varies according to the value of some variable of interest. Often this relationship is not linear and we don’t know its functional form. Splines provide a way of estimating curves (or surfaces) of unknown functional form and are a widely used tool for semiparametric regression models. The term ‘spline’ derives from the tool shipbuilders and drafters used to construct smooth edges: a bendable piece of material that, when fixed at a number of points, would relax into the desired shape.

Implementation

Our interest lies in estimating the unknown function m:

y_i = m(x_i) + e_i

A ‘spline’ in the mathematical sense is a function constructed piece-wise from polynomial functions. The places where the pieces meet are known as knots, and the spline has order equal to one more than the degree of the underlying polynomial terms. Basis splines, or B-splines, are the typical starting point for spline functions. These are curves defined recursively as a sum of ‘basis functions’, which depend only on the polynomial degree and the knots. Any spline function can be represented as a linear combination of B-splines, and the parameters dictating this combination can be estimated using standard regression techniques. If we have K B-splines then our regression function can be estimated as:

y_i = \sum_{j=1}^K ( \alpha_j B_j(x_i) ) + e_i

by minimising \sum_{i=1}^N \{ y_i - \sum_{j=1}^K ( \alpha_j B_j(x_i) ) \} ^2, where the B_j are the B-splines and the \alpha_j are coefficients to be estimated.

Useful technical explainers of splines and B-splines can be found here [PDF] and here [PDF].
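
As a concrete illustration, here is a minimal sketch of an unpenalised B-spline fit in R using the splines package; the simulated x and y and the knot positions are purely illustrative:

library(splines)

set.seed(1)
x <- runif(200)
y <- sin(2 * pi * x) + rnorm(200, sd = 0.3)

# cubic B-spline basis with interior knots at the quartiles of x
B <- bs(x, knots = quantile(x, c(0.25, 0.5, 0.75)), degree = 3)

# the alpha_j are just ordinary least squares coefficients on the basis columns
fit <- lm(y ~ B)

plot(x, y)
lines(sort(x), fitted(fit)[order(x)], col = "red")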

One issue with fitting splines to data is the risk of ‘overfitting’. Outliers might distort the curve we fit, damaging the external validity of any conclusions we might draw. To deal with this, we can enforce a certain level of smoothness using so-called penalty functions. The smoothness (or conversely the ‘roughness’) of a curve is often measured by the integral of the square of its second derivative. Penalised splines, or P-splines, were therefore proposed; these add this roughness term, multiplied by a smoothing parameter \lambda, to the least squares objective. In this case, we look to minimise:

\sum_{i=1}^N \{ y_i - \sum_{j=1}^K ( \alpha_j B_j(x_i) ) \}^2 + \lambda\int m''(x)^2 dx

to estimate our parameters. Many other different variations on this penalty have been proposed. This article provides a good explanation of P-splines.
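
In practice the smoothing parameter is rarely chosen by hand. A minimal sketch of a P-spline fit with the mgcv package (assumed installed), continuing with the simulated x and y from the sketch above: bs = "ps" requests a penalised B-spline basis and \lambda is selected by REML.

library(mgcv)

# penalised B-spline (P-spline) smooth; k sets the size of the basis and the
# smoothing parameter lambda is estimated from the data by REML
fit_ps <- gam(y ~ s(x, bs = "ps", k = 20), method = "REML")
summary(fit_ps)
plot(fit_ps, shade = TRUE)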

The ‘low rank thin plate spline’ has become an attractive choice. This type of spline is defined by its penalty, which has a physical analogy with the resistance that a thin sheet of metal puts up when it is bent. The ‘low rank’ approximation removes the problem of thin plate splines having too many parameters to estimate, and the resulting fit is generally insensitive to the choice of knots, which other penalised spline regression models are not.

Crainiceanu and colleagues show how low rank thin plate splines can be represented as a generalised linear mixed model. In particular, our model can be represented as:

m(x_i) = \beta_0 + \beta_1x_i  + \sum_{k=1}^K u_k |x_i - \kappa_k|^3

where \kappa_k, k=1,...,K, are the knots. The parameters, \theta = (\beta_0, \beta_1, u_1, \ldots, u_K)', can be estimated by minimising

\sum_{i=1}^N \{ y_i - m(x_i) \} ^2 + \frac{1}{\lambda} \theta ^T D \theta .

This is shown to give the mixed model

y_i = \beta_0 + \beta_1 x_i + Z_i'b + e_i

where Z_i' is the i-th row of the matrix Z, each random coefficient in the vector b is distributed as N(0,\sigma^2_b), and Z and D are given in the paper cited above.
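
As a hedged illustration of this representation, mgcv's gamm() function (assumed available) fits a low rank thin plate regression spline by calling lme under the hood, so the amount of smoothing appears as a variance component; x and y are again the illustrative simulated data from above.

library(mgcv)

# the spline coefficients are estimated as random effects in an underlying lme fit
fit_mm <- gamm(y ~ s(x, bs = "tp", k = 10))
summary(fit_mm$lme)  # variance components correspond to the smoothing parameter
summary(fit_mm$gam)  # the fitted smooth itself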

As a final note, we have discussed splines in one dimension, but they can be extended to more dimensions. A two-dimensional spline basis can be generated by taking the tensor product of two one-dimensional spline bases. I leave this as an exercise for the reader.
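
For those who would rather not do the exercise by hand, here is a minimal sketch of a two-dimensional tensor product smooth with mgcv's te(); the data frame dat with covariates x1, x2 and outcome y is assumed purely for illustration.

library(mgcv)

# tensor product smooth built from marginal one-dimensional bases
fit_2d <- gam(y ~ te(x1, x2), data = dat, method = "REML")
vis.gam(fit_2d, plot.type = "contour")  # contour plot of the fitted surface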

Software

R

  • The package gamm4 provides the tools necessary for a frequentist analysis along the lines described in this post. It uses restricted maximum likelihood estimation via the package lme4 to estimate the parameters of the thin plate spline model.
  • A Bayesian version of this functionality is implemented in the package rstanarm, which uses gamm4 to produce the matrices for the thin plate spline models and Stan for the estimation, through the stan_gamm4 function. Minimal calls for both packages are sketched below.
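
The following sketch shows what minimal calls might look like; the data frame dat with outcome y and covariate x is an assumption for illustration and is not taken from the cited paper.

library(gamm4)
fit_freq <- gamm4(y ~ s(x), data = dat)           # REML estimation via lme4
summary(fit_freq$gam)

library(rstanarm)
fit_bayes <- stan_gamm4(y ~ s(x), data = dat,
                        chains = 4, iter = 2000)  # full Bayesian fit via Stan
summary(fit_bayes)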

If you wanted to implement these models for yourself from scratch, Crainiceanu and colleagues provide the R code to generate the matrices necessary to estimate the spline function:

# covariate: the observed x values; num.knots: the chosen number of knots
n<-length(covariate)
X<-cbind(rep(1,n),covariate)              # fixed effects design matrix (intercept, x)
knots<-quantile(unique(covariate),        # knots at equally spaced quantiles of x
 seq(0,1,length=(num.knots+2))[-c(1,(num.knots+2))])
Z_K<-(abs(outer(covariate,knots,"-")))^3  # radial basis terms |x_i - kappa_k|^3
OMEGA_all<-(abs(outer(knots,knots,"-")))^3
svd.OMEGA_all<-svd(OMEGA_all)             # matrix square root of the penalty via SVD
sqrt.OMEGA_all<-t(svd.OMEGA_all$v %*%
 (t(svd.OMEGA_all$u)*sqrt(svd.OMEGA_all$d)))
Z<-t(solve(sqrt.OMEGA_all,t(Z_K)))        # random effects design matrix
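
One way to estimate the model from these matrices is the usual mixed model smoothing idiom with nlme: treat the columns of Z as random effects sharing a single common variance. This is a sketch under the assumption that the outcome is stored in a vector y.

library(nlme)

# a single grouping level so that all observations share the same random effects
group <- factor(rep(1, n))
fit <- lme(y ~ -1 + X, random = list(group = pdIdent(~ -1 + Z)))
summary(fit)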

Stata

I will temper this advice by cautioning that I have never estimated a spline-based semiparametric model in Stata, so what follows may be hopelessly incorrect. The only implementation of penalised splines in Stata is the package and associated function pspline. However, I cannot find any information about the penalty function it uses, so I would advise some caution when implementing it. An alternative is to program the model yourself, by converting the above R code into Mata to generate the matrix Z; the parameters could then be estimated with xtmixed.

Applications

Applications of these semiparametric models in health economics have tended to appear in technical or statistical journals rather than in health economics journals or economics more generally. Recent examples include Li et al., who use penalised splines to estimate the relationship between disease duration and health care costs; Wunder and colleagues, who look at how reported well-being varies over the course of the lifespan; and Stollenwerk and colleagues, who use splines to estimate flexible predictive models for cost-of-illness studies with ‘big data’.


STAN: A tool for Bayesian inference and MCMC

This is a short post just to bring to your attention STAN, a tool for Bayesian statistical inference. It’s been around for a few years now but, like many of these things, it can be slow to percolate down to end users. Like many people who use MCMC and do Bayesian statistical inference, I was a user of BUGS, either OpenBUGS or WinBUGS, which is a great piece of kit. But it can be slow, it throws up indecipherable error messages, and it sometimes struggles with convergence. Then I discovered STAN, which is named after Stanislaw Ulam, a co-inventor of the Monte Carlo method. It’s generally faster than BUGS, especially if you use vectors rather than loops; the error messages make it easy to debug code; and it’s easy to install and use with R. The code is also simple to pick up – the website has equivalent STAN code for many BUGS models and there is a detailed manual. There’s a simple example at the bottom of the page for a random effects meta-analysis.

STAN is not better than BUGS in every respect. If BUGS works, it works; why change? But for complex multilevel models, handling large amounts of data, and more flexibility with prior distributions, it’s definitely worth making the change. Hopefully, I’ll post some more health economic examples in the future.

Random Effects Meta-analysis model example

BUGS
I’ve omitted the data, params, and inits parts, as BUGS users should be familiar with those.

model{
   for(i in 1:N) {
      y[i] ~ dnorm(theta[i], prec2[i])  # likelihood for each study estimate
      theta[i] ~ dnorm(mu, prec.tau2)   # study-specific effects around the pooled mean
      prec2[i] <- (1/se[i])*(1/se[i])   # precision from the known standard errors
   }

   prec.tau2 ~ dgamma(0.001, 0.001)     # prior on the between-study precision
   mu ~ dnorm(0.0, 1.0E-6)              # vague prior on the pooled mean
   tau <- sqrt(1 / prec.tau2)           # between-study standard deviation
}

STAN
When calling STAN from R (via the rstan package), you put the data in a named list and then pass it to the stan() function along with the model code.

data {
 int<lower=0> N; //number of studies
 vector[N] y; //estimated treatment effects
 vector<lower=0>[N] se; //s.e. of treatment effects
}
parameters {
 vector[N] theta; //study-specific effects
 real mu; //pooled mean
 real<lower=0> tau; //between-study standard deviation
}
model {
 y ~ normal(theta, se); //likelihood, using the known standard errors
 theta ~ normal(mu, tau);
 mu ~ normal(0, 1000);
}
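
A minimal sketch of the R side of the call using the rstan package (assumed installed); the file name and the data values are placeholders for illustration.

library(rstan)

meta_data <- list(N = 5,
                  y = c(0.20, -0.10, 0.40, 0.10, 0.30),
                  se = c(0.15, 0.20, 0.25, 0.10, 0.30))

fit <- stan(file = "meta_analysis.stan", data = meta_data,
            chains = 4, iter = 2000)
print(fit, pars = c("mu", "tau"))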

Solution for Trap 66 (Postcondition violated)

WinBUGS is a widely used free software program within health economics. It allows for Bayesian statistical modelling using Gibbs sampling (hence the name: the Windows version of Bayesian inference Using Gibbs Sampling). One of the drawbacks of WinBUGS is the notoriously uninformative error messages it can produce. While Google is usually a fountain of knowledge on solving errors, where WinBUGS is concerned it often only serves up other people asking the same question, and hardly any answers. This post is about one such error message, the solution that is sometimes offered (which I think only partly solves the problem), and the solution I found, which solves it completely.

The error message itself is “Trap 66 (postcondition violated)”. Variance priors have been identified as the culprits. The suggested solutions I could find (for example here, here and here) all point towards those priors being too big. The standard advice is then to reduce the priors (for example from dunif(0,100) to dunif(0,10)) and rerun the model. This usually solves the problem.

However, to me, this doesn’t make a whole lot of sense theoretically. And, in a rare case of the two aligning, it also didn’t solve my problems in practice. I have been performing a simulation study, in which I have about 8,000 similar, but different, data sets (8 scenarios, 1,000 repetitions in each). They all represent mixed treatment comparisons (MTC), which are analysed by WinBUGS. I used SAS to create the data, send it to WinBUGS, and collect and analyse the outcomes. When I started the random effects MTC, the “Trap 66 (postcondition violated)” popped up around data set 45. Making the priors smaller, as suggested, solved the problem for this particular data set, but it came back on data set 95. The funny thing is that making the priors larger also solved the problem for the original data set, but once again the same problem arose at a different data set (this time number 16).

Whenever I tried to recreate the problem, it would give the same error message at exactly the same point, even though it’s a random sampler. From this, it seems that the reason the suggested solution works for one data set is that the generated ‘chains’, as they are called in WinBUGS, are the same given the same priors and initial values. Defining a smaller prior will give a different chain, which is likely not to cause problems; but so will a larger prior or a different initial value. It doesn’t really solve the underlying problem.

The solution I have found to work for all 8,000 data sets is to look not at the maximum value of the prior, but at the minimum value. The prior given for a standard error usually looks something like dunif(0,X). In my case, I did an extra step, with a uniform prior on a variable called tau. The precision (one divided by the variance) that goes into the link function is defined by

prec <- pow(tau,-2)

This does not make any difference to the problem or the solution. My hypothesis is that when Trap 66 comes up, the chain has generated a tau (or standard error, if that’s what you modelled directly) equal to 0, which resolves into a precision equal to 1 divided by 0, or infinity. The solution is to let the prior start not at 0 but at a small epsilon. I used dunif(0.001,10), which solved all my problems.

This solution is related to a different problem I once had when modelling a probability. I had mistakenly used a dunif(0,1) prior. Every now and then the chain will generate exactly 0 or 1, which does not sit well with the binomial link function. The error message is different (“Undefined real result”), but the solution is again to use a prior that does not include the extremes. In my case, using a flat dbeta prior instead (which I should have done to begin with) solved that problem.

Any suggestions and comments are much appreciated. You can download WinBUGS, the free immortality key and lots of examples from the BUGS project website here. It also contains a list of common error messages and their solutions. Not Trap 66, obviously.