*Once a month we discuss a particular research method that may be of interest to people working in health economics. We’ll consider widely used key methodologies, as well as more novel approaches. Our reviews are not designed to be comprehensive but provide an introduction to the method, its underlying principles, some applied examples, and where to find out more. If you’d like to write a post for this series, get in touch. This month’s method is permutation tests.*

### Principle

One of the main objections to the use of *p*-values for statistical inference is that they are often misunderstood. They can be interpreted as the probability a null hypothesis is true, which they are not. Part of the cause of this problem is the black-box approach to statistical software. You can plug in data to Stata, R, or any other package, ask it to run a regression or test with ease, and it will spit out a load of p-values. Many people will just take it on trust that the software is returning the test of the hypothesis of interest and that the method has the correct properties, like type one error rate. But if one had to go through the process to obtain a test-statistic and the relevant distribution to compare it to, perhaps then the p-value wouldn’t be so misunderstood. For trials involving randomisation, permutation tests (or randomisation tests or exact tests) are just such a process.

Permutation tests were first proposed by Ronald Fisher in the early 20th Century. The basic principle is that to test differences between two groups assigned at random we can determine the exact distribution of a test statistic (such as a difference in means) under the null hypothesis by calculating the value of the test statistic for all possible ways of arranging our units into the two groups. The value of the test statistic for the actual assignment can be compared to this distribution to determine the p-value.

The simplest example would be to test a difference in means for a continuous outcome between two groups assigned in a randomised controlled trial. Let’s generate some data (we’ll do this in R) from a simple trial with 50 individuals per arm. In the control arm the data come from a distribution and in the treatment arm they come from a distribution:

n <- 100 #number of individuals

D <- sample(c(rep(0,n/2),rep(1,n/2)),n) #treatment assignment

y <- rnorm(n,mean=D*0.5,sd=1) #generate normal outcomes

T.diff <- mean(y[(D*1:n)])-mean(y[(-1*(D-1)*1:n)]) #actual difference in means

Now let’s add a function to randomly re-assign units to treatment and control and calculate the difference in means we would observe under the null of no difference based on our generated data. We will then plot this distribution and add a line showing our where the actual difference in means lies.

#function to generate differences in means

permStat <- function(n,y){

D <- sample(c(rep(0,n/2),rep(1,n/2)),n) #generate new assignment

T.diff <- mean(y[(D*1:n)])-mean(y[(-1*(D-1)*1:n)])

return(T.diff)

}

T.dist <- sapply(1:500,function(i)permStat(n,y)) #apply it 500 times

qplot(T.dist)+geom_vline(xintercept=T.diff,col="red") #plot

Our 2-sided p-value here is 0.04, i.e. the proportion of values at least as extreme as our test statistic.

For a more realistic example we can consider a cluster randomised trial with a binary outcome. The reason for choosing this example is that estimating non-linear mixed models is difficult. Calculating test statistics, especially when the number of clusters is relatively small, is even harder. The methods used in most statistics packages have inflated type one errors, unbeknownst to many. So let’s set up the following trial: two-arms with 8 clusters per arm, and 100 patients per cluster, which is representative of trials of, say, hospitals. The data generating mechanism is for patient in cluster

So no individual level covariates, four Bernoulli(0.3) covariates with , and a treatment indicator with treatment effect (to look at type one error rates). The cluster effect is and is chosen to give an intraclass correlation coefficient of 0.05. We’ll simulate data from this model and then estimate the model above and test the null hypothesis in two ways. First, we’ll use the popular R package lme4 and the command glmer, which uses adaptive Gaussian quadrature to estimate the parameters and covariance matrix; the built in p-values are derived from standard Wald t-statistics. Second, we’ll use our permutation tests.

Gail et al. (1996) examine permutation tests for these kinds of models. They propose the following residual-based test (although one can use other tests based on the likelihood): (i) estimate the simple model under the null with no treatment effect and no hierarchical effect, i.e. ; (ii) for each individual generate their predicted values and residuals ; (iii) generate the cluster average residuals . Then the test statistic is

Under the null and given equal cluster sizes, the residual means are exchangeable. So the exact distribution of can be obtained by calculating it under all possible randomisation schemes. The p-value is then the quantile of this distribution under which the test statistic falls for the actual randomisation scheme. For larger numbers of clusters, it is not feasible to permute every possible randomisation scheme, so we approximate the distribution of using 500 randomly generated schemes. The following figure shows the estimated type one error rates using the two different methods (and 200 simulations):

The figure clearly shows an inflated type one error rates for the standard p-values reported by glmer especially for smaller numbers of clusters per arm. By contrast the residual permutation test shows approximately correct type one error rates (given more simulations there should be less noise in these estimates).

### Implementation

Implementation of these tests is straightforward in different software packages. In Stata, one can use the command permute, for which you specify the different groups, number of permutations and command to estimate the treatment effect. In R, there are various packages, like coin, that perform a similar function. For more complex models particular non-linear ones and ones involving adjustment, one has to be careful about how to specify the appropriate test statistic and model under the null hypothesis, which may involve a little programming, but it is relatively straightforward to do so.

### Applications

These methods have widespread applications for anyone looking to use null hypothesis significance testing. So a complete overview of the literature is not possible. Instead, we highlight a few uses of these methods.

In a previous post in this series we covered synthetic control methods; one of the ways of computing test statistics for this method has been called ‘placebo tests’, which are an exact parallel to the permutation tests discussed here. Krief and others discuss the use of these methods for evaluating health policies. Another example from a regression-based analysis is provided by Dunn and Shapiro. And Jacob, Ludwig, and Miller examine the impact of a lottery for vouchers to move to another area and employ these tests.

Sugar et al derive health states for depression from the SF-12 and use permutation test methods to validate the health states. Barber and Thompson use these tests to examine costs data from an RCT.

**Credit **