Sam Watson’s journal round-up for 6th May 2019

Every Monday our authors provide a round-up of some of the most recently published peer reviewed articles from the field. We don’t cover everything, or even what’s most important – just a few papers that have interested the author. Visit our Resources page for links to more journals or follow the HealthEconBot. If you’d like to write one of our weekly journal round-ups, get in touch.

Channeling Fisher: randomisation tests and the statistical insignificance of seemingly significant experimental results. Quarterly Journal of Economics Published May 2019

Anyone who pays close attention to the statistics literature may feel that a paradigm shift is underway. While papers cautioning against the use of null hypothesis significance testing (NHST) have been published for decades, a number of articles in recent years have highlighted large numbers of problems in published studies. For example, only 39% of replications of 100 experiments in social psychology were considered successful. Publication in prestigious journals like Science and Nature is no guarantee of replicability either. There is a growing number of voices calling for improvements in study reporting and conduct, changes to the use of p-values, or even their abandonment altogether.

Some of the failures of studies using NHST methods are due to poor experimental design, poorly defined interventions, or “noise-mining”. But even well-designed experiments that are theoretically correctly analysed are not immune from false inferences in the NHST paradigm. This article looks at the reliability of statistical significance claims in 53 experimental studies published in the journals of the American Economic Association.

Statistical significance is typically determined in experimental economic papers using the econometric techniques widely taught to all economics students. In particular, the t-statistic of a regression coefficient is calculated using either homoskedastic or robust standard errors and then compared to a t-distribution with the appropriate degrees of freedom. An alternative method to determine p-values is a permutation or randomisation test, which we have featured in a previous Method of the Month. The permutation test provides the exact distribution of the test statistic under the null hypothesis and is therefore highly reliable. This article compares results from permutation tests the author conducts to the reported p-values in the 53 selected experimental studies. It finds between 13% and 22% fewer statistically significant results than reported in the papers and, in tests of multiple treatment effects, 33% to 49% fewer.

This discrepancy is explained in part by the leverage of certain observations in each study. Results are often sensitive to the removal of single observations. The more of an impact an observation has, the greater its leverage; in balanced experimental designs leverage is uniformly distributed. In regressions with multiple treatments and treatment interactions leverage becomes concentrated and standard errors become volatile. Needless to say, this article presents yet another piece of compelling evidence that NHST is unreliable and strengthens the case for abandoning statistical significance as the primary inferential tool.
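The leverage point can be illustrated with a small R sketch. Everything here is simulated and hypothetical (the sample size, the rare binary covariate `x`, and which units have `x = 1` are assumptions made for illustration, not taken from the paper). It compares the hat values from a balanced two-arm regression with those from a regression that adds a treatment interaction.

```r
set.seed(1)
n <- 100
D <- rep(c(0, 1), each = n / 2)              # balanced treatment assignment
x <- as.numeric(seq_len(n) %in% c(1, 2, 51)) # hypothetical rare covariate: 2 controls, 1 treated
y <- rnorm(n)                                # outcome values do not affect leverage

# Leverage (hat values) in the simple balanced design: identical for every unit
lev_simple <- hatvalues(lm(y ~ D))

# Leverage once a treatment-by-covariate interaction is added:
# the single treated unit with x = 1 determines its own cell mean
lev_inter <- hatvalues(lm(y ~ D * x))

range(lev_simple)
range(lev_inter)
```

In the balanced design every observation has the same leverage, whereas in the interacted model the lone treated unit with `x = 1` has leverage close to one, so its outcome alone pins down the interaction coefficient.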

Effect of a resuscitation strategy targeting peripheral perfusion status vs serum lactate levels on 28-day mortality among patients with septic shock. The ANDROMEDA-SHOCK randomized clinical trial. Journal of the American Medical Association Published 17th February 2019

This article gets a mention in this round-up not for its health or economic content but because it is a very good example of how not to use statistical significance. In previous articles on the blog we’ve discussed the misuse and misinterpretation of p-values, but I generally don’t go as far as advocating their complete abandonment as a recent mass-signed letter in Nature has. What is crucial is that researchers stop making the mistake of thinking that statistical insignificance means no effect. Making this error can lead to pernicious consequences when it comes to patient treatment and the lack of adoption of effective and cost-effective technologies, and it is exactly the error this article makes.

I first saw this ridiculous use of statistical significance when it was Tweeted by David Spiegelhalter. The trial (in JAMA, no less) compares two different methods of managing resuscitation in patients with septic shock. The key result is:

By day 28, 74 patients (34.9%) in the peripheral perfusion group and 92 patients (43.4%) in the lactate group had died (hazard ratio, 0.75 [95% CI, 0.55 to 1.02]; P = .06; risk difference, −8.5% [95% CI, −18.2% to 1.2%]).

And the conclusion?

Among patients with septic shock, a resuscitation strategy targeting normalization of capillary refill time, compared with a strategy targeting serum lactate levels, did not reduce all-cause 28-day mortality.

Which is determined solely on the basis of statistical significance. Certainly it is possible that the result is just chance variation. But the study was conducted because it was believed that there was a difference in survival between these methods, and a 25% reduction in mortality risk is significant indeed. Rather than take an abductive or Bayesian approach, which would see this result as providing some degree of evidence in support of one treatment, the authors abandon any attempt at thinking and just mechanically follow statistical significance logic. This is a good case study for anyone wanting to discuss interpretation of p-values, but more significantly (every pun intended) the reliance on statistical significance may well be jeopardising patient lives.
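As a rough illustration of what a Bayesian reading might look like, the R sketch below converts the reported hazard ratio and confidence interval into an approximate posterior for the log hazard ratio. It assumes a flat prior and a normal approximation to the likelihood; both are simplifying assumptions made here, and this is not an analysis performed by the trial authors.

```r
# Reported results from the trial abstract
hr    <- 0.75   # hazard ratio
ci_lo <- 0.55   # lower 95% confidence limit
ci_hi <- 1.02   # upper 95% confidence limit

# Approximate standard error of the log hazard ratio from the CI width
log_hr <- log(hr)
se     <- (log(ci_hi) - log(ci_lo)) / (2 * qnorm(0.975))

# With a flat prior, the posterior for log(HR) is approximately N(log_hr, se^2).
# Posterior probability that the peripheral perfusion strategy reduces the hazard:
p_benefit <- pnorm(0, mean = log_hr, sd = se)
p_benefit
```

Under these assumptions most of the posterior mass lies below a hazard ratio of one, which is hard to square with a flat conclusion that the strategy "did not reduce" mortality. A sceptical prior would pull this probability down, so the figure should be read as illustrative only.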

Value of information: sensitivity analysis and research design in Bayesian evidence synthesis. Journal of the American Statistical Association Published 30th April 2019.

Three things are necessary to make a decision in the decision theoretical sense. First, a set of possible decisions; second, a set of parameters describing the state of the world; and third, a loss (or utility) function. Given these three things the decision that is chosen is the one that minimises losses (or maximises utility) given the state of the world. Of course, the state of the world may not be known for sure. There can be some uncertainty about the parameters and hence the best course of action, which might lead to losses relative to the decision we would make if we knew everything perfectly. Thus, we can determine the benefits of collecting more information. This is the basis of value of information (VoI) analysis.

We can distinguish between different quantities of interest in VoI analyses. The expected value of perfect information (EVPI) is the difference between the expected loss under the optimal decision made with current information and the expected loss under the decision we would make if we knew all the parameters exactly. The expected value of partial perfect information (EVPPI) is similar to the previous definition, except that it considers only the difference relative to knowing one of the parameters (or a subset of them) exactly. Finally, the expected value of sample information (EVSI) compares the losses under our current decision to those under the decision we would make if we had the information on our parameters from a particular study design. If we know the costs of conducting a given study then we can subtract them from the benefits estimated in the EVSI to get the expected net benefit of sampling.
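The EVPI definition can be made concrete with a toy Monte Carlo calculation in R. The two decisions `A` and `B`, the single parameter `theta`, its distribution, and the loss function are all hypothetical choices made for this sketch; they are not from the paper.

```r
set.seed(1)
n_sim <- 10000

# Hypothetical uncertain parameter: incremental benefit of decision B over A
theta <- rnorm(n_sim, mean = 0.1, sd = 0.5)

# Loss for each decision under each draw of theta (rows = draws)
loss <- cbind(A = 0, B = -theta)

# Expected loss of the best decision given current information
loss_current <- min(colMeans(loss))

# Expected loss if theta were known before deciding:
# choose the best decision draw by draw, then average
loss_perfect <- mean(pmin(loss[, "A"], loss[, "B"]))

# EVPI: expected reduction in loss from resolving all uncertainty
evpi <- loss_current - loss_perfect
evpi
```

The same structure carries over to EVPPI and EVSI, but the inner step then requires an expectation conditional on the partially known parameter or on the simulated study data, which is where the computational difficulty arises.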

Calculating EVPPI and EVSI is no easy feat though, particularly for more complex models. This article proposes a relatively straightforward and computationally feasible way of estimating these quantities for complex evidence synthesis models. For their example they use a model commonly used to estimate overall HIV prevalence. Since not all HIV cases are known or disclosed, one has to combine different sets of data to get to a reliable estimate. For example, it is known how many people attend sexual health clinics and what proportion of those have HIV, so it is also known how many do not attend sexual health clinics, just not how many of those might be HIV positive. There are many epidemiological parameters in this complex model and the aim of the paper is to demonstrate how the principal sources of uncertainty can be determined in terms of EVPPI and EVSI.


Method of the month: Permutation tests

Once a month we discuss a particular research method that may be of interest to people working in health economics. We’ll consider widely used key methodologies, as well as more novel approaches. Our reviews are not designed to be comprehensive but provide an introduction to the method, its underlying principles, some applied examples, and where to find out more. If you’d like to write a post for this series, get in touch. This month’s method is permutation tests.


One of the main objections to the use of p-values for statistical inference is that they are often misunderstood. They are frequently interpreted as the probability that a null hypothesis is true, which they are not. Part of the cause of this problem is the black-box approach to statistical software. You can plug in data to Stata, R, or any other package, ask it to run a regression or test with ease, and it will spit out a load of p-values. Many people will just take it on trust that the software is returning the test of the hypothesis of interest and that the method has the correct properties, like type one error rate. But if one had to go through the process to obtain a test statistic and the relevant distribution to compare it to, perhaps then the p-value wouldn’t be so misunderstood. For trials involving randomisation, permutation tests (or randomisation tests or exact tests) are just such a process.

Permutation tests were first proposed by Ronald Fisher in the early 20th Century. The basic principle is that to test differences between two groups assigned at random we can determine the exact distribution of a test statistic (such as a difference in means) under the null hypothesis by calculating the value of the test statistic for all possible ways of arranging our units into the two groups. The value of the test statistic for the actual assignment can be compared to this distribution to determine the p-value.
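For a very small trial the full enumeration described above is feasible. The following R sketch uses eight made-up outcome values and a made-up assignment (both hypothetical) and evaluates the difference in means for every way of choosing four treated units out of eight.

```r
# Hypothetical outcomes and actual treatment assignment for 8 units
y <- c(1.2, 0.4, 2.1, 1.7, 0.3, 0.9, -0.2, 0.5)
D <- c(1, 0, 1, 1, 0, 1, 0, 0)

# Difference in means given the indices of the treated units
diff_means <- function(treated) mean(y[treated]) - mean(y[-treated])

# Test statistic for the assignment that actually occurred
T_obs <- diff_means(which(D == 1))

# Every possible assignment of 4 units to treatment: one column per assignment
assignments <- combn(length(y), sum(D))

# Exact null distribution of the test statistic
T_exact <- apply(assignments, 2, diff_means)

# Two-sided p-value: share of assignments at least as extreme as the observed one
# (small tolerance guards against floating point ties)
p_value <- mean(abs(T_exact) >= abs(T_obs) - 1e-12)
p_value
```

The number of columns in `assignments` grows very quickly with the sample size, which is why larger examples sample assignments at random rather than enumerating them all.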

The simplest example would be to test a difference in means for a continuous outcome between two groups assigned in a randomised controlled trial. Let’s generate some data (we’ll do this in R) from a simple trial with 50 individuals per arm. In the control arm the data come from a N(0,1) distribution and in the treatment arm they come from a N(0.5,1) distribution:

n <- 100 #number of individuals
D <- sample(c(rep(0,n/2),rep(1,n/2)),n) #treatment assignment
y <- rnorm(n,mean=D*0.5,sd=1) #generate normal outcomes
T.diff <- mean(y[D==1])-mean(y[D==0]) #actual difference in means (treatment minus control)

Now let’s add a function to randomly re-assign units to treatment and control and calculate the difference in means we would observe under the null of no difference based on our generated data. We will then plot this distribution and add a line showing where the actual difference in means lies.

library(ggplot2) #provides qplot and geom_vline
#function to generate a difference in means under a random re-assignment
permStat <- function(n,y){
  D <- sample(c(rep(0,n/2),rep(1,n/2)),n) #generate new assignment
  mean(y[D==1])-mean(y[D==0]) #return difference in means for this assignment
}
T.dist <- sapply(1:500,function(i)permStat(n,y)) #apply it 500 times
qplot(T.dist)+geom_vline(xintercept=T.diff,col="red") #plot

Figure: Exact distribution for test of a difference in means

Our 2-sided p-value here is 0.04, i.e. the proportion of values at least as extreme as our test statistic.
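The p-value calculation itself is a one-liner once the permutation distribution is available. The sketch below regenerates data in the same way as the example so that it runs on its own; the seed is an arbitrary assumption, so the resulting p-value will generally differ from the 0.04 quoted above.

```r
set.seed(42)                                   # arbitrary seed, for reproducibility only
n <- 100                                       # number of individuals
D <- sample(rep(c(0, 1), each = n / 2))        # treatment assignment
y <- rnorm(n, mean = D * 0.5, sd = 1)          # normal outcomes
T.diff <- mean(y[D == 1]) - mean(y[D == 0])    # actual difference in means

# Difference in means under a random re-assignment of the same units
permStat <- function(y, D) {
  D.new <- sample(D)
  mean(y[D.new == 1]) - mean(y[D.new == 0])
}
T.dist <- replicate(500, permStat(y, D))

# Two-sided p-value: proportion of permuted statistics at least as extreme
# (in absolute value) as the one actually observed
p.value <- mean(abs(T.dist) >= abs(T.diff))
p.value
```

With randomly sampled rather than fully enumerated assignments, some authors add one to both the numerator and the denominator so that the estimated p-value can never be exactly zero.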

For a more realistic example we can consider a cluster randomised trial with a binary outcome. The reason for choosing this example is that estimating non-linear mixed models is difficult. Calculating test statistics, especially when the number of clusters is relatively small, is even harder. The methods used in most statistics packages have inflated type one errors, unbeknownst to many. So let’s set up the following trial: two arms with 8 clusters per arm, and 100 patients per cluster, which is representative of trials of, say, hospitals. The data generating mechanism for patient i in cluster j is:

y_{ij} \sim Bernoulli(p_{ij})

logit(p_{ij}) = \alpha_j + x_j'\beta + D_{j}\gamma

So no individual level covariates, four Bernoulli(0.3) covariates x_j with \beta = [1,1,1,1], and a treatment indicator D_j with treatment effect \gamma=0 (to look at type one error rates). The cluster effect is \alpha_j \sim N(0,\sigma^2_\alpha) and \sigma^2_\alpha is chosen to give an intraclass correlation coefficient of 0.05. We’ll simulate data from this model and then estimate the model above and test the null hypothesis H_0:\gamma=0 in two ways. First, we’ll use the popular R package lme4 and the command glmer, which uses adaptive Gaussian quadrature to estimate the parameters and covariance matrix; the built-in p-values are derived from standard Wald z-statistics. Second, we’ll use our permutation tests.

Gail et al. (1996) examine permutation tests for these kinds of models. They propose the following residual-based test (although one can use other tests based on the likelihood): (i) estimate the simple model under the null with no treatment effect and no hierarchical effect, i.e. logit(p_{ij})=\alpha+x_{ij}'\beta; (ii) for each individual generate their predicted values and residuals r_{ij}; (iii) generate the cluster average residuals \bar{r}_{.j}=N_j^{-1}\sum_{i=1}^{N_j} r_{ij}, where N_j is the number of individuals in cluster j. Then, writing J for the number of clusters per arm, the test statistic is

U=J^{-1} \left( \sum_{j=1}^{2J}D_{j}\bar{r}_{.j} - \sum_{j=1}^{2J}(1-D_{j})\bar{r}_{.j} \right) = J^{-1} \sum_{j=1}^{2J}(2D_{j}-1)\bar{r}_{.j}

Under the null and given equal cluster sizes, the residual means are exchangeable. So the exact distribution of U can be obtained by calculating it under all possible randomisation schemes. The p-value is then the quantile of this distribution under which the test statistic falls for the actual randomisation scheme. For larger numbers of clusters, it is not feasible to permute every possible randomisation scheme, so we approximate the distribution of U using 500 randomly generated schemes. The following figure shows the estimated type one error rates using the two different methods (and 200 simulations):
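A single run of the residual permutation test might look like the following R sketch, which uses only base R. Several details are assumptions made for illustration: the seed, the use of raw residuals (observed minus fitted), the latent-scale formula used to turn an intraclass correlation of 0.05 into a cluster variance, and the name `J` for the number of clusters per arm. It is a sketch of the general idea rather than a reproduction of the simulation study.

```r
set.seed(2019)                                # arbitrary seed
J <- 8                                        # clusters per arm
m <- 100                                      # patients per cluster
n_cl <- 2 * J                                 # total number of clusters

# Cluster variance giving ICC = 0.05 on the latent logistic scale (assumption)
icc <- 0.05
sigma2 <- icc * (pi^2 / 3) / (1 - icc)

# Cluster-level treatment, covariates, and random effects
D_cl  <- sample(rep(c(0, 1), each = J))
x_cl  <- matrix(rbinom(n_cl * 4, 1, 0.3), nrow = n_cl, ncol = 4)
alpha <- rnorm(n_cl, 0, sqrt(sigma2))
eta   <- as.vector(alpha + x_cl %*% rep(1, 4)) # gamma = 0, so no treatment term

# Patient-level data
cluster <- rep(seq_len(n_cl), each = m)
y <- rbinom(n_cl * m, 1, plogis(eta[cluster]))
X <- x_cl[cluster, ]

# (i) null model: no treatment effect and no cluster effect
fit0 <- glm(y ~ X, family = binomial)

# (ii) individual residuals and (iii) cluster average residuals
r     <- y - fitted(fit0)
r_bar <- as.vector(tapply(r, cluster, mean))

# Test statistic: difference in mean cluster residuals between arms
U_stat <- function(D) sum((2 * D - 1) * r_bar) / J
U_obs  <- U_stat(D_cl)

# Approximate the null distribution with 500 random re-assignments of clusters
U_perm  <- replicate(500, U_stat(sample(D_cl)))
p_value <- mean(abs(U_perm) >= abs(U_obs))
p_value
```

Repeating this over many simulated datasets and recording how often `p_value` falls below 0.05 gives an estimate of the type one error rate of the test.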

The figure clearly shows inflated type one error rates for the standard p-values reported by glmer, especially for smaller numbers of clusters per arm. By contrast, the residual permutation test shows approximately correct type one error rates (given more simulations there should be less noise in these estimates).


Implementation of these tests is straightforward in different software packages. In Stata, one can use the command permute, for which you specify the different groups, number of permutations and command to estimate the treatment effect. In R, there are various packages, like coin, that perform a similar function. For more complex models, particularly non-linear ones and ones involving adjustment, one has to be careful about how to specify the appropriate test statistic and model under the null hypothesis, which may involve a little programming, but it is relatively straightforward to do so.


These methods have widespread applications for anyone looking to use null hypothesis significance testing, so a complete overview of the literature is not possible. Instead, we highlight a few uses of these methods.

In a previous post in this series we covered synthetic control methods; one of the ways of computing test statistics for this method has been called ‘placebo tests’, which are an exact parallel to the permutation tests discussed here. Krief and others discuss the use of these methods for evaluating health policies. Another example from a regression-based analysis is provided by Dunn and Shapiro. And Jacob, Ludwig, and Miller examine the impact of a lottery for vouchers to move to another area and employ these tests.

Sugar et al derive health states for depression from the SF-12 and use permutation test methods to validate the health states. Barber and Thompson use these tests to examine costs data from an RCT.