Method of the month: Permutation tests

Once a month we discuss a particular research method that may be of interest to people working in health economics. We’ll consider widely used key methodologies, as well as more novel approaches. Our reviews are not designed to be comprehensive but provide an introduction to the method, its underlying principles, some applied examples, and where to find out more. If you’d like to write a post for this series, get in touch. This month’s method is permutation tests.


One of the main objections to the use of p-values for statistical inference is that they are often misunderstood. They can be interpreted as the probability a null hypothesis is true, which they are not. Part of the cause of this problem is the black-box approach to statistical software. You can plug in data to Stata, R, or any other package, ask it to run a regression or test with ease, and it will spit out a load of p-values. Many people will just take it on trust that the software is returning the test of the hypothesis of interest and that the method has the correct properties, like type one error rate. But if one had to go through the process to obtain a test-statistic and the relevant distribution to compare it to, perhaps then the p-value wouldn’t be so misunderstood. For trials involving randomisation, permutation tests (or randomisation tests or exact tests) are just such a process.

Permutation tests were first proposed by Ronald Fisher in the early 20th Century. The basic principle is that to test differences between two groups assigned at random we can determine the exact distribution of a test statistic (such as a difference in means) under the null hypothesis by calculating the value of the test statistic for all possible ways of arranging our units into the two groups. The value of the test statistic for the actual assignment can be compared to this distribution to determine the p-value.

The simplest example would be to test a difference in means for a continuous outcome between two groups assigned in a randomised controlled trial. Let’s generate some data (we’ll do this in R) from a simple trial with 50 individuals per arm. In the control arm the data come from a N(0,1) distribution and in the treatment arm they come from a N(0.5,1) distribution:

n <- 100 #number of individuals
D <- sample(c(rep(0,n/2),rep(1,n/2)),n) #treatment assignment
y <- rnorm(n,mean=D*0.5,sd=1) #generate normal outcomes
T.diff <- mean(y[(D*1:n)])-mean(y[(-1*(D-1)*1:n)]) #actual difference in means

Now let’s add a function to randomly re-assign units to treatment and control and calculate the difference in means we would observe under the null of no difference based on our generated data. We will then plot this distribution and add a line showing our where the actual difference in means lies.

#function to generate differences in means
permStat <- function(n,y){
D <- sample(c(rep(0,n/2),rep(1,n/2)),n) #generate new assignment
T.diff <- mean(y[(D*1:n)])-mean(y[(-1*(D-1)*1:n)])
T.dist <- sapply(1:500,function(i)permStat(n,y)) #apply it 500 times
qplot(T.dist)+geom_vline(xintercept=T.diff,col="red") #plot
Exact distribution for test of a difference in means

Our 2-sided p-value here is 0.04, i.e. the proportion of values at least as extreme as our test statistic.

For a more realistic example we can consider a cluster randomised trial with a binary outcome. The reason for choosing this example is that estimating non-linear mixed models is difficult. Calculating test statistics, especially when the number of clusters is relatively small, is even harder. The methods used in most statistics packages have inflated type one errors, unbeknownst to many. So let’s set up the following trial: two-arms with 8 clusters per arm, and 100 patients per cluster, which is representative of trials of, say, hospitals. The data generating mechanism is for patient i in cluster j

y_{ij} = Bernoulli(p_{ij})

p_{ij} = Logit(\alpha_j + x_j'\beta + D_{j}\gamma)

So no individual level covariates, four Bernoulli(0.3) covariates x_j with \beta = [1,1,1,1], and a treatment indicator D_j with treatment effect \gamma=0 (to look at type one error rates). The cluster effect is \alpha_j \sim N(0,\sigma^2_\alpha) and \sigma^2_\alpha is chosen to give an intraclass correlation coefficient of 0.05. We’ll simulate data from this model and then estimate the model above and test the null hypothesis H_0:\gamma=0 in two ways. First, we’ll use the popular R package lme4 and the command glmer, which uses adaptive Gaussian quadrature to estimate the parameters and covariance matrix; the built in p-values are derived from standard Wald t-statistics. Second, we’ll use our permutation tests.

Gail et al. (1996) examine permutation tests for these kinds of models. They propose the following residual-based test (although one can use other tests based on the likelihood): (i) estimate the simple model under the null with no treatment effect and no hierarchical effect, i.e. p_{ij}=Logit(\alpha+x_{ij}'\beta); (ii) for each individual generate their predicted values and residuals r_{ij}; (iii) generate the cluster average residuals \bar{r}_{.j}=N_j^{-1}\sum_{i=1}^{N_j} r_{ij}. Then the test statistic is

U=N^{-1}_{j} \left( \sum_{j=1}^{2N_j}D_{jg}\bar{r}_{.j} - \sum_{j=1}^{2N_j}(D_{j}-1)\bar{r}_{.j} \right) = N^{-1}_{j} \sum_{j=1}^{2N_j}(2D_{j}-1)\bar{r}_{.j}

Under the null and given equal cluster sizes, the residual means are exchangeable. So the exact distribution of U can be obtained by calculating it under all possible randomisation schemes. The p-value is then the quantile of this distribution under which the test statistic falls for the actual randomisation scheme. For larger numbers of clusters, it is not feasible to permute every possible randomisation scheme, so we approximate the distribution of U using 500 randomly generated schemes. The following figure shows the estimated type one error rates using the two different methods (and 200 simulations):

The figure clearly shows an inflated type one error rates for the standard p-values reported by glmer especially for smaller numbers of clusters per arm. By contrast the residual permutation test shows approximately correct type one error rates (given more simulations there should be less noise in these estimates).


Implementation of these tests is straightforward in different software packages. In Stata, one can use the command permute, for which you specify the different groups, number of permutations and command to estimate the treatment effect. In R, there are various packages, like coin, that perform a similar function. For more complex models particular non-linear ones and ones involving adjustment, one has to be careful about how to specify the appropriate test statistic and model under the null hypothesis, which may involve a little programming, but it is relatively straightforward to do so.


These methods have widespread applications for anyone looking to use null hypothesis significance testing. So a complete overview of the literature is not possible. Instead, we highlight a few uses of these methods.

In a previous post in this series we covered synthetic control methods; one of the ways of computing test statistics for this method has been called ‘placebo tests’, which are an exact parallel to the permutation tests discussed here. Krief and others discuss the use of these methods for evaluating health policies. Another example from a regression-based analysis is provided by Dunn and Shapiro. And Jacob, Ludwig, and Miller examine the impact of a lottery for vouchers to move to another area and employ these tests.

Sugar et al derive health states for depression from the SF-12 and use permutation test methods to validate the health states. Barber and Thompson use these tests to examine costs data from an RCT.


Sam Watson’s journal round-up for 11th February 2019

Every Monday our authors provide a round-up of some of the most recently published peer reviewed articles from the field. We don’t cover everything, or even what’s most important – just a few papers that have interested the author. Visit our Resources page for links to more journals or follow the HealthEconBot. If you’d like to write one of our weekly journal round-ups, get in touch.

Contest models highlight inherent inefficiencies of scientific funding competitions. PLoS Biology [PubMed] Published 2nd January 2019

If you work in research you will have no doubt thought to yourself at one point that you spend more time applying to do research than actually doing it. You can spend weeks working on (what you believe to be) a strong proposal only for it to fail against other strong bids. That time could have been spent collecting and analysing data. Indeed, the opportunity cost of writing extensive proposals can be very high. The question arises as to whether there is another method of allocating research funding that reduces this waste and inefficiency. This paper compares the proposal competition to a partial lottery. In this lottery system, proposals are short, and among those that meet some qualifying standard those that are funded are selected at random. This system has the benefit of not taking up too much time but has the cost of reducing the average scientific value of the winning proposals. The authors compare the two approaches using an economic model of contests, which takes into account factors like proposal strength, public benefits, benefits to the scientist like reputation and prestige, and scientific value. Ultimately they conclude that, when the number of awards is smaller than the number of proposals worthy of funding, the proposal competition is inescapably inefficient. It means that researchers have to invest heavily to get a good project funded, and even if it is good enough it may still not get funded. The stiffer the competition the more researchers have to work to win the award. And what little evidence there is suggests that the format of the application makes little difference to the amount of time spent by researchers on writing it. The lottery mechanism only requires the researcher to propose something that is good enough to get into the lottery. Far less time would therefore be devoted to writing it and more time spent on actual science. I’m all for it!

Preventability of early versus late hospital readmissions in a national cohort of general medicine patients. Annals of Internal Medicine [PubMed] Published 5th June 2018

Hospital quality is hard to judge. We’ve discussed on this blog before the pitfalls of using measures such as adjusted mortality differences for this purpose. Just because a hospital has higher than expected mortality does not mean those death could have been prevented with higher quality care. More thorough methods assess errors and preventable harm in care. Case note review studies have suggested as little as 5% of deaths might be preventable in England and Wales. Another paper we have covered previously suggests then that the predictive value of standardised mortality ratios for preventable deaths may be less than 10%.

Another commonly used metric is readmission rates. Poor care can mean patients have to return to the hospital. But again, the question remains as to how preventable these readmissions are. Indeed, there may also be substantial differences between those patients who are readmitted shortly after discharge and those for whom it may take a longer time. This article explores the preventability of early and late readmissions in ten hospitals in the US. It uses case note review and a number of reviewers to evaluate preventability. The headline figures are that 36% of early readmissions are considered preventable compared to 23% of late readmissions. Moreover, it was considered that the early readmissions were most likely to have been preventable at the hospital whereas for late readmissions, an outpatient clinic or the home would have had more impact. All in all, another paper which provides evidence to suggest crude, or even adjusted rates, are not good indicators of hospital quality.

Visualisation in Bayesian workflow. Journal of the Royal Statistical Society: Series A (Statistics in Society) [RePEc] Published 15th January 2019

This article stems from a broader programme of work from these authors on good “Bayesian workflow”. That is to say, if we’re taking a Bayesian approach to analysing data, what steps ought we to be taking to ensure our analyses are as robust and reliable as possible? I’ve been following this work for a while as this type of pragmatic advice is invaluable. I’ve often read empirical papers where the authors have chosen, say, a logistic regression model with covariates x, y, and z and reported the outcomes, but at no point ever justified why this particular model might be any good at all for these data or the research objective. The key steps of the workflow include, first, exploratory data analysis to help set up a model, and second, performing model checks before estimating model parameters. This latter step is important: one can generate data from a model and set of prior distributions, and if the data that this model generates looks nothing like what we would expect the real data to look like, then clearly the model is not very good. Following this, we should check whether our inference algorithm is doing its job, for example, are the MCMC chains converging? We can also conduct posterior predictive model checks. These have had their criticisms in the literature for using the same data to both estimate and check the model which could lead to the model generalising poorly to new data. Indeed in a recent paper of my own, posterior predictive checks showed poor fit of a model to my data and that a more complex alternative was better fitting. But other model fit statistics, which penalise numbers of parameters, led to the alternative conclusions. So the simpler model was preferred on the grounds that the more complex model was overfitting the data. So I would argue posterior predictive model checks are a sensible test to perform but must be interpreted carefully as one step among many. Finally, we can compare models using tools like cross-validation.

This article discusses the use of visualisation to aid in this workflow. They use the running example of building a model to estimate exposure to small particulate matter from air pollution across the world. Plots are produced for each of the steps and show just how bad some models can be and how we can refine our model step by step to arrive at a convincing analysis. I agree wholeheartedly with the authors when they write, “Visualization is probably the most important tool in an applied statistician’s toolbox and is an important complement to quantitative statistical procedures.”



How important is healthcare for population health?

How important is a population’s access to healthcare as a determinant of population health? I have heard the claim that “as little as 10% of a population’s health is linked to access to healthcare”, or some variant of it, in many places. Some examples include the Health Foundation, the AHRQ, the King’s Fund, the WHO, and This claim is appealing: it feels counter-intuitive and it brings to the fore questions of public health and health-related behaviour. But it’s not clear what it means.

I can think of two possible interpretations. One, 10% of the variation in population health outcomes is explained by variation in healthcare access. Or two, access to healthcare leads to a 10% change in population health outcomes compared to no access to healthcare. Both of these claims would be very hard to evaluate empirically. Within many countries, particularly the highest income countries, there is little variation in access to healthcare relative to possible levels of access across the world. Inter-country comparisons would provide a greater range of variation to compare to population outcomes. But even the most sophisticated statistical analysis will struggle to separate out the effects of other economic determinants of health.

It would also be difficult to make sense of any study that purported to estimate the effect of adding or removing healthcare beyond any within-country variation. The labour and capital resource needs of the most sophisticated hospitals are too great for the poorest settings, and it is unlikely that the wealthiest democratic countries could end up with the level of healthcare the world’s poorest face.

But what is the evidence for the claim of 10%? There are a handful of key citations, all of which were summarised at the time in a widely cited article in Health Affairs in 2014. For each of the two ways we could think about the contribution of healthcare above, we would need to look at estimates of the probability of health conditional on different levels of healthcare, Pr(health|healthcare). Each of the references for the 10% figure above in fact provides evidence for the proportion of deaths associated with ‘inadequate’ healthcare, or to put in another way, the probability of having received ‘inadequate’ care given death, Pr(healthcare|health). This is known as transposing the conditional: we have got our conditional probability the wrong way round. Even if we accept mortality rates as an acceptable proxy for population health, the two probabilities are not equal to one another.

Interpretation of this evidence is also complex. Smoking tobacco, for example, would be considered a behavioural determinant of health and deaths caused by it would be attributed to a behavioural cause rather than healthcare. But survival rates for lung cancers have improved dramatically over the last few decades due to improvements in healthcare. While it would be foolish to attribute a death in the past to a lack of access to treatments which had not been invented, contemporary lung cancer deaths in low income settings may well have been prevented by access to better healthcare. Thus using cause-of-death statistics to estimate the contributions of different factors to population health only typically picks up those deaths resulting from medical error or negligence. They are a wholly unreliable guide to the role of healthcare in determining population health.

A study published recently in The Lancet, timed to coincide with a commission on healthcare quality, adopted a different approach. The study aimed to estimate the annual number of deaths worldwide due to a lack of access to high-quality care. To do this they compared the mortality rates of conditions amenable to healthcare intervention around the world with those in the wealthiest nations. Any differences were attributed to either non-utilisation of or lack of access to high-quality care. 15.6 million ‘excess deaths’ were estimated. However, to attribute to these deaths a cause of inadequate healthcare access, one would need to conceive of a counter-factual world in which everyone was treated in the best healthcare systems. This is surely implausible in the extreme. A comparable question might be to ask how many people around the world are dying because their incomes are not as high as those of the top 10% of Americans.

On the normative question, there is little disagreement with the goal of achieving universal health coverage and improving population health. But these dramatic, eye-catching, or counter-intuitive figures do little to support achieving these ends: they can distort policy priorities and create unattainable goals and expectations. Health systems are not built overnight; an incremental approach is needed to ensure sustainability and affordability. Evidence to support this is where great strides are being made both methodologically and empirically, but it is not nearly as exciting as claiming healthcare isn’t very important or that millions of people are dying every year due to poor healthcare access. Healthcare systems are an integral and important part of overall population health; assigning a number to this importance is not.

Picture credit: pixabay