Method of the month: Permutation tests

Once a month we discuss a particular research method that may be of interest to people working in health economics. We’ll consider widely used key methodologies, as well as more novel approaches. Our reviews are not designed to be comprehensive but provide an introduction to the method, its underlying principles, some applied examples, and where to find out more. If you’d like to write a post for this series, get in touch. This month’s method is permutation tests.


One of the main objections to the use of p-values for statistical inference is that they are often misunderstood. They can be interpreted as the probability a null hypothesis is true, which they are not. Part of the cause of this problem is the black-box approach to statistical software. You can plug in data to Stata, R, or any other package, ask it to run a regression or test with ease, and it will spit out a load of p-values. Many people will just take it on trust that the software is returning the test of the hypothesis of interest and that the method has the correct properties, like type one error rate. But if one had to go through the process to obtain a test-statistic and the relevant distribution to compare it to, perhaps then the p-value wouldn’t be so misunderstood. For trials involving randomisation, permutation tests (or randomisation tests or exact tests) are just such a process.

Permutation tests were first proposed by Ronald Fisher in the early 20th Century. The basic principle is that to test differences between two groups assigned at random we can determine the exact distribution of a test statistic (such as a difference in means) under the null hypothesis by calculating the value of the test statistic for all possible ways of arranging our units into the two groups. The value of the test statistic for the actual assignment can be compared to this distribution to determine the p-value.

The simplest example would be to test a difference in means for a continuous outcome between two groups assigned in a randomised controlled trial. Let’s generate some data (we’ll do this in R) from a simple trial with 50 individuals per arm. In the control arm the data come from a N(0,1) distribution and in the treatment arm they come from a N(0.5,1) distribution:

n <- 100 #number of individuals
D <- sample(c(rep(0,n/2),rep(1,n/2)),n) #treatment assignment
y <- rnorm(n,mean=D*0.5,sd=1) #generate normal outcomes
T.diff <- mean(y[(D*1:n)])-mean(y[(-1*(D-1)*1:n)]) #actual difference in means

Now let’s add a function to randomly re-assign units to treatment and control and calculate the difference in means we would observe under the null of no difference based on our generated data. We will then plot this distribution and add a line showing our where the actual difference in means lies.

#function to generate differences in means
permStat <- function(n,y){
D <- sample(c(rep(0,n/2),rep(1,n/2)),n) #generate new assignment
T.diff <- mean(y[(D*1:n)])-mean(y[(-1*(D-1)*1:n)])
T.dist <- sapply(1:500,function(i)permStat(n,y)) #apply it 500 times
qplot(T.dist)+geom_vline(xintercept=T.diff,col="red") #plot
Exact distribution for test of a difference in means

Our 2-sided p-value here is 0.04, i.e. the proportion of values at least as extreme as our test statistic.

For a more realistic example we can consider a cluster randomised trial with a binary outcome. The reason for choosing this example is that estimating non-linear mixed models is difficult. Calculating test statistics, especially when the number of clusters is relatively small, is even harder. The methods used in most statistics packages have inflated type one errors, unbeknownst to many. So let’s set up the following trial: two-arms with 8 clusters per arm, and 100 patients per cluster, which is representative of trials of, say, hospitals. The data generating mechanism is for patient i in cluster j

y_{ij} = Bernoulli(p_{ij})

p_{ij} = Logit(\alpha_j + x_j'\beta + D_{j}\gamma)

So no individual level covariates, four Bernoulli(0.3) covariates x_j with \beta = [1,1,1,1], and a treatment indicator D_j with treatment effect \gamma=0 (to look at type one error rates). The cluster effect is \alpha_j \sim N(0,\sigma^2_\alpha) and \sigma^2_\alpha is chosen to give an intraclass correlation coefficient of 0.05. We’ll simulate data from this model and then estimate the model above and test the null hypothesis H_0:\gamma=0 in two ways. First, we’ll use the popular R package lme4 and the command glmer, which uses adaptive Gaussian quadrature to estimate the parameters and covariance matrix; the built in p-values are derived from standard Wald t-statistics. Second, we’ll use our permutation tests.

Gail et al. (1996) examine permutation tests for these kinds of models. They propose the following residual-based test (although one can use other tests based on the likelihood): (i) estimate the simple model under the null with no treatment effect and no hierarchical effect, i.e. p_{ij}=Logit(\alpha+x_{ij}'\beta); (ii) for each individual generate their predicted values and residuals r_{ij}; (iii) generate the cluster average residuals \bar{r}_{.j}=N_j^{-1}\sum_{i=1}^{N_j} r_{ij}. Then the test statistic is

U=N^{-1}_{j} \left( \sum_{j=1}^{2N_j}D_{jg}\bar{r}_{.j} - \sum_{j=1}^{2N_j}(D_{j}-1)\bar{r}_{.j} \right) = N^{-1}_{j} \sum_{j=1}^{2N_j}(2D_{j}-1)\bar{r}_{.j}

Under the null and given equal cluster sizes, the residual means are exchangeable. So the exact distribution of U can be obtained by calculating it under all possible randomisation schemes. The p-value is then the quantile of this distribution under which the test statistic falls for the actual randomisation scheme. For larger numbers of clusters, it is not feasible to permute every possible randomisation scheme, so we approximate the distribution of U using 500 randomly generated schemes. The following figure shows the estimated type one error rates using the two different methods (and 200 simulations):

The figure clearly shows an inflated type one error rates for the standard p-values reported by glmer especially for smaller numbers of clusters per arm. By contrast the residual permutation test shows approximately correct type one error rates (given more simulations there should be less noise in these estimates).


Implementation of these tests is straightforward in different software packages. In Stata, one can use the command permute, for which you specify the different groups, number of permutations and command to estimate the treatment effect. In R, there are various packages, like coin, that perform a similar function. For more complex models particular non-linear ones and ones involving adjustment, one has to be careful about how to specify the appropriate test statistic and model under the null hypothesis, which may involve a little programming, but it is relatively straightforward to do so.


These methods have widespread applications for anyone looking to use null hypothesis significance testing. So a complete overview of the literature is not possible. Instead, we highlight a few uses of these methods.

In a previous post in this series we covered synthetic control methods; one of the ways of computing test statistics for this method has been called ‘placebo tests’, which are an exact parallel to the permutation tests discussed here. Krief and others discuss the use of these methods for evaluating health policies. Another example from a regression-based analysis is provided by Dunn and Shapiro. And Jacob, Ludwig, and Miller examine the impact of a lottery for vouchers to move to another area and employ these tests.

Sugar et al derive health states for depression from the SF-12 and use permutation test methods to validate the health states. Barber and Thompson use these tests to examine costs data from an RCT.


Widespread misuse of statistical significance in health economics

Despite widespread cautionary messages, p-values and claims of statistical significance are continuously misused. One of the most common errors is to mistake statistical significance for economic, clinical, or political significance. This error may manifest itself by authors interpreting only ‘statistically significant’ results as important, or even neglecting to examine the magnitude of estimated coefficients. For example, we’ve written previously about a claim of how statistically insignificant results are ‘meaningless’. Another common error is to ‘transpose the conditional’, that is to interpret the p-value as the posterior probability of a null hypothesis. For example, in an exchange on Twitter recently, David Colquhoun, whose discussions of p-values we’ve also previously covered, made the statement:

However, the p-value does not provide probability/evidence of a null hypothesis (that an effect ‘exists’). P-values are correlated with the posterior probability of the null hypothesis in a way that depends on statistical power, choice of significance level, and prior probability of the null. But observing a significant p-value only means that the data were unlikely to be produced by a particular model, not that the alternative hypothesis is true. Indeed, the null hypothesis may be a poor explanation for the observed data, but that does not mean it is a better explanation than the alternative. This is the essence of Lindley’s paradox.

So what can we say about p-values? The six principles of the ASA’s statement on p-values are:

  1. P-values can indicate how incompatible the data are with a specified statistical model.
  2. P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
  3. Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.
  4. Proper inference requires full reporting and transparency.
  5. A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.
  6. By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.


In 1996, Deirdre McClosky and Stephen Ziliak surveyed economics papers published in the American Economic Review in the 1980s for p-value misuse. Overall, 70% did not distinguish statistical from economic significance and 96% misused a test statistic in some way. Things hadn’t improved when they repeated the study ten years later. Unfortunately, these problems are not exclusive to the AER. A quick survey of a top health economics journal, Health Economics, finds similar misuse as we discuss below. This journal is not singled out for any particular reason beyond that it’s one of the key journals in the field covered by this blog, and frequently features in our journal round-ups. Similarly, no comment is made on the quality of the studies or authors beyond the claims and use of statistical significance. Nevertheless, where there are p-values, there are problems. For such a pivotal statistic, one that careers can be made or broken on, we should at least get it right!

Nine studies were published in the May 2017 issue of Health Economics. The list below shows some examples of p-value errors in the text of the articles. The most common issue was using the p-value to interpret whether an effect exists or not, or using it as the (only) evidence to support or reject a particular hypothesis. As described above, the statistical significance of a coefficient does not imply the existence of an effect. Some of the statements claimed below to be erroneous may be contentious as, in the broader context of the paper, they may make sense. For example, claiming that a statistically significant estimate is evidence of an effect may be correct where the broader totality of the evidence suggests that any observed data would be incompatible with a particular model. However, this is generally not the way the p‘s are used.

Examples of p-value (mis-)statements

Even the CMI has no statistically significant effect on the facilitation ratio. Thus, the diversity and complexity of treated patients do not play a role for the subsidy level of hospitals.

the coefficient for the baserate is statistically significant for PFP hospitals in the FE model, indicating that a higher price level is associated with a lower level of subsidies.

Using the GLM we achieved nine significant effects, including, among others, Parkinson’s disease and osteoporosis. In all components we found more significant effects compared with the GLM approach. The number of significant effects decreases from component 2 (44 significant effects) to component 4 (29 significant effects). Although the GLM lead to significant results for intestinal diverticulosis, none of the component showed equivalent results. This might give a hint that taking the component based heterogeneity into account, intestinal diverticulosis does not significantly affect costs in multimorbidity patients. Besides this, certain coefficients are significant in only one component.

[It is unclear what ‘significant’ and ‘not significant’ refer to or how they are calculated but appear to refer to t>1.96. Not clear if corrections for multiple comparisons.]

There is evidence of upcoding as the coefficient of spreadp_posis statistically significant.

Neither [variable for upcoding] is statistically significant. The incentive for upcoding is, according to these results, independent of the statutory nature of hospitals.

The checkup significantly raises the willingness to pay any positive amount, although it does not significantly affect the amount reported by those willing to pay some positive amount.

[The significance is with reference to statistical significance].

Similarly, among the intervention group, there were lower probabilities of unhappiness or depression (−0.14, p = 0.045), being constantly under strain (0.098, p = 0.013), and anxiety or depression (−0.10, p = 0.016). There was no difference between the intervention group and control group 1 (eligible non-recipients) in terms of the change in the likelihood of hearing problems (p = 0.64), experiencing elevate blood pressure (p = 0.58), and the number of cigarettes smoked (p = 0.26).

The ∆CEs are also statistically significant in some educational categories. At T + 1, the only significant ∆CE is observed for cancer survivors with a university degree for whom the cancer effect on the probability of working is 2.5 percentage points higher than the overall effect. At T + 3, the only significant ∆CE is observed for those with no high school diploma; it is 2.2 percentage points lower than the overall cancer effect on the probability of working at T + 3.

And, just for balance, here is a couple from this year’s winner of the Arrow prize at iHEA, which gets bonus points for the phrase ‘marginally significant’, which can be used both to confirm and refute a hypothesis depending on the inclination of the author:

Our estimated net effect of waiting times for high-income patients (i.e., adding the waiting time coefficient and the interaction of waiting times and high income) is positive, but only marginally significant (p-value 0.055).

We find that patients care about distance to the hospital and both of the distance coefficients are highly significant in the patient utility function.


As we’ve argued before, p-values should not be the primary result reported. Their interpretation is complex and so often leads to mistakes. Our goal is to understand economic systems and to determine the economic, clinical, or policy relevant effects of interventions or modifiable characteristics. The p-value does provide some useful information but not enough to support the claims made from it.