Method of the month: constrained randomisation

Once a month we discuss a particular research method that may be of interest to people working in health economics. We’ll consider widely used key methodologies, as well as more novel approaches. Our reviews are not designed to be comprehensive but provide an introduction to the method, its underlying principles, some applied examples, and where to find out more. If you’d like to write a post for this series, get in touch. This month’s method is constrained randomisation.


Randomised experimental studies are one of the best ways of estimating the causal effects of an intervention. They have become more and more widely used in economics; Banerjee and Duflo are often credited with popularising them among economists. When done well, randomly assigning a treatment ensures both observable and unobservable factors are independent of treatment status and likely to be balanced between treatment and control units.

Many of the interventions economists are interested in operate at a ‘cluster’ level, be it a school, hospital, village, or otherwise. The appropriate experimental design is then a cluster randomised controlled trial (cRCT), in which the clusters are randomised to treatment or control and individuals within each cluster are observed either cross-sectionally or longitudinally. But, unless budgets are large, the number of participating clusters can be fairly small. When randomising a relatively small number of clusters, we could by chance end up with quite severe imbalance in key covariates between trial arms. This presents a problem if we suspect a priori that these covariates influence key outcomes.

One solution to the problem of potential imbalance is covariate-based constrained randomisation. The principle here is to conduct a large number of randomisations, assess the balance of covariates in each one using some balance metric, and then to randomly choose one of the most balanced according to this metric. This method preserves the important random treatment assignment while ensuring covariate balance. Stratified randomisation also has a similar goal, but in many cases may not be possible if there are continuous covariates of interest or too few clusters to distribute among many strata.


Conducting covariate constrained randomisation is straightforward and involves the following steps:

  1. Specifying the important baseline covariates on which to balance the clusters. For each cluster j we have L covariates x_{jl}; l=1,...,L.
  2. Characterising each cluster in terms of these covariates, i.e. calculating the x_{jl}.
  3. Enumerating all potential randomisation schemes, or simulating a large number of them. For each one, we measure the balance of the x_{jl} between trial arms.
  4. Selecting a candidate set of randomisation schemes that are sufficiently balanced according to some pre-specified criterion from which we can randomly choose our treatment allocation.

Balance scores

A key ingredient in the above steps is the balance score. This score needs to be some univariate measure of potentially multivariate imbalance between two (or more) groups. A commonly used score is that proposed by Raab and Butcher:

\sum_{l=1}^{L} \omega_l (\bar{x}_{1l}-\bar{x}_{0l})^2

where \bar{x}_{1l} and \bar{x}_{0l} are the mean values of covariate l in the treatment and control groups respectively, and \omega_l is some weight, often chosen as the inverse of the covariate’s variance. Conceptually, the score is then a sum of squared standardised differences in means, so lower values indicate greater balance. Other scores would also work: any statistic that measures the distance between the distributions of two variables could serve, summed over the covariates. This could include the maximum distance:

\max_l |\bar{x}_{1l} - \bar{x}_{0l}|

the Manhattan distance:

\sum_{l=1}^{L} |\bar{x}_{1l}-\bar{x}_{0l}|

or even the Symmetrised Bayesian Kullback-Leibler divergence (I can’t be bothered to type this one out). Grischott has developed a Shiny application to estimate all these distances in a constrained randomisation framework, detailed in this paper.
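As a rough base-R sketch, the three distances above can be computed for simulated data (four clusters per arm, three covariates; the inverse-variance weights are one common choice, not the only one):

```r
# Sketch: computing the three balance metrics above for two simulated
# arms of four clusters each, with L = 3 covariates.
set.seed(1)
x_treat   <- matrix(rnorm(12), ncol = 3)   # 4 clusters x 3 covariates
x_control <- matrix(rnorm(12), ncol = 3)

diff_means <- colMeans(x_treat) - colMeans(x_control)

# Raab-Butcher score: weighted sum of squared differences in means,
# here with inverse-variance weights
w <- 1 / apply(rbind(x_treat, x_control), 2, var)
raab_butcher <- sum(w * diff_means^2)

# Maximum distance across covariates
max_dist <- max(abs(diff_means))

# Manhattan distance
manhattan <- sum(abs(diff_means))
```

In a constrained randomisation, each candidate allocation would get its own score and the most balanced (lowest-scoring) allocations would form the candidate set.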

Things become more complex if there are more than two trial arms. All of the above scores are only able to compare two groups. However, there already exist a number of univariate measures of multivariate balance in the form of MANOVA (multivariate analysis of variance) test statistics. For example, if we have G trial arms and let X_{jg} = \left[ x_{jg1},...,x_{jgL} \right]' then the between group covariance matrix is:

B = \sum_{g=1}^G N_g(\bar{X}_{.g} - \bar{X}_{..})(\bar{X}_{.g} - \bar{X}_{..})'

and the within group covariance matrix is:

W = \sum_{g=1}^G \sum_{j=1}^{N_g} (X_{jg}-\bar{X}_{.g})(X_{jg}-\bar{X}_{.g})'

which we can use in a variety of statistics including Wilks’ Lambda, for example:

\Lambda = \frac{\det(W)}{\det(W+B)}

No trial has previously used covariate constrained randomisation with multiple groups, as far as I am aware, but this is the subject of an ongoing paper investigating these scores – so watch this space!

Once the scores have been calculated for all possible schemes or a very large number of possible schemes, we select from among those which are most balanced. The most balanced are defined according to some quantile of the balance score, say the top 15%.

As a simple simulated example of how this might be coded in R, let’s consider a trial of 8 clusters with two standard-normally distributed covariates. We’ll use the Raab and Butcher score from above:

#simulate the covariates
n <- 8
x1 <- rnorm(n)
x2 <- rnorm(n)
x <- matrix(c(x1,x2),ncol=2)
#enumerate all possible schemes - you'll need the partitions package here
schemes <- partitions::setparts(c(n/2,n/2))
#write a function that will estimate the score
#for each scheme which we can apply over our
#set of schemes
balance_score <- function(scheme, covs){
  treat.idx <- scheme == 2
  control.idx <- scheme == 1
  treat.means <- colMeans(covs[treat.idx, , drop = FALSE])
  control.means <- colMeans(covs[control.idx, , drop = FALSE])
  cov.vars <- apply(covs, 2, var)
  #Raab-Butcher score with inverse-variance weights
  sum((treat.means - control.means)^2 / cov.vars)
}
#apply the function over all schemes
scores <- apply(schemes, 2, function(i) balance_score(i, x))
#find the top 15% most balanced schemes (lowest scores)
scheme.set <- which(scores <= quantile(scores, 0.15))
#choose one at random
scheme.number <- sample(scheme.set, 1)
scheme.chosen <- schemes[, scheme.number]


A commonly used method of analysing cluster trials is to estimate a mixed model, i.e. a hierarchical model with cluster-level random effects. Two key questions are whether to control for the covariates used in the randomisation, and which test to use for treatment effects. Fan Li has two great papers answering these questions for linear models and binomial models. One key conclusion is that appropriate type I error rates are only achieved in models adjusted for the covariates used in the randomisation. For non-linear models, type I error rates can be way off for many estimators, especially with small numbers of clusters (which is often the reason for doing constrained randomisation in the first place), so a careful choice is needed here. If in doubt, I would recommend adjusted permutation tests to ensure appropriate type I error rates. Of course, one could take a Bayesian approach to analysis, although I am not aware of any examination of how these models perform in this setting (another case of “watch this space!”).
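As an illustration only (simulated data, and permuting over all allocations rather than the constrained set as one should in practice), an adjusted cluster-level permutation test might be sketched in base R as:

```r
# Illustration: adjusted permutation test for a cluster trial treatment
# effect. Data are simulated; in a real constrained-randomisation trial
# the permutations should be drawn from the constrained candidate set.
set.seed(3)
n.clusters <- 8; m <- 20                        # 8 clusters of 20 individuals
cluster <- rep(1:n.clusters, each = m)
x.cl <- rnorm(n.clusters)                       # cluster-level covariate
treat <- rep(c(0, 1), each = n.clusters / 2)[cluster]
y <- 0.5 * x.cl[cluster] + rnorm(n.clusters)[cluster] + rnorm(n.clusters * m)

# observed treatment effect, adjusted for the randomisation covariate
obs <- coef(lm(y ~ treat + x.cl[cluster]))["treat"]

# permutation distribution: re-randomise treatment at the cluster level
perm <- replicate(1000, {
  treat.p <- sample(rep(c(0, 1), each = n.clusters / 2))[cluster]
  coef(lm(y ~ treat.p + x.cl[cluster]))["treat.p"]
})
p.value <- mean(abs(perm) >= abs(obs))          # two-sided permutation p-value
```

The key point is that both the observed and permuted models adjust for the covariate used in the randomisation, and that treatment is always permuted at the cluster level, not the individual level.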


Many trials have used this procedure, and listing even a fraction of them would be a daunting task. But I would be remiss not to mention a trial of my own that uses covariate constrained randomisation: it investigates the effect of providing an incentive to small and medium-sized enterprises to adhere to a workplace well-being programme. There are good applied examples in Fan Li’s papers mentioned above. A trial that featured in a journal round-up in February used covariate constrained randomisation to balance a very small number of clusters in a trial of a medicines access programme in Kenya.


Sam Watson’s journal round-up for 6th May 2019

Every Monday our authors provide a round-up of some of the most recently published peer reviewed articles from the field. We don’t cover everything, or even what’s most important – just a few papers that have interested the author. Visit our Resources page for links to more journals or follow the HealthEconBot. If you’d like to write one of our weekly journal round-ups, get in touch.

Channeling Fisher: randomisation tests and the statistical insignificance of seemingly experimental results. Quarterly Journal of Economics Published May 2019

Anyone who pays close attention to the statistics literature may feel that a paradigm shift is underway. While papers cautioning on the use of null hypothesis significance testing (NHST) have been published for decades, a number of articles in recent years have highlighted a large number of problems in published studies. For example, only 39% of replications of 100 experiments in social psychology were considered successful. Publication in prestigious journals like Science and Nature is no guarantee of replicability either. A growing number of voices are calling for improvements in study reporting and conduct, changes to the use of p-values, or even their abandonment altogether.

Some of the failures of studies using NHST methods are due to poor experimental design, poorly defined interventions, or “noise-mining”. But even well-designed experiments that are theoretically correctly analysed are not immune from false inferences in the NHST paradigm. This article looks at the reliability of statistical significance claims in 53 experimental studies published in the journals of the American Economic Association.

Statistical significance is typically determined in experimental economics papers using the econometric techniques widely taught to all economics students. In particular, the t-statistic of a regression coefficient is calculated using either homoskedastic or robust standard errors and compared to a t-distribution with the appropriate degrees of freedom. An alternative way to determine p-values is a permutation or randomisation test, which we have featured in a previous Method of the Month. The permutation test provides the exact distribution of the test statistic and is therefore highly reliable. This article compares permutation tests conducted by the author with the reported p-values in the 53 selected experimental studies. It finds between 13% and 22% fewer statistically significant results than reported in the papers, and in tests of multiple treatment effects, 33% to 49% fewer.

This discrepancy is explained in part by the leverage of certain observations in each study. Results are often sensitive to the removal of single observations. The more of an impact an observation has, the greater its leverage; in balanced experimental designs leverage is uniformly distributed. In regressions with multiple treatments and treatment interactions leverage becomes concentrated and standard errors become volatile. Needless to say, this article presents yet another piece of compelling evidence that NHST is unreliable and strengthens the case for abandoning statistical significance as the primary inferential tool.
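The leverage point is easy to illustrate with `hatvalues` in base R (a toy example with simulated data, not the paper’s analysis):

```r
# Toy illustration of leverage: hat values are uniform in a balanced
# single-treatment design, but concentrate once an unbalanced second
# treatment and its interaction are added.
set.seed(4)
n <- 40
t1 <- rep(c(0, 1), each = n / 2)    # balanced treatment
y1 <- rnorm(n)
lev.balanced <- hatvalues(lm(y1 ~ t1))

t2 <- rbinom(n, 1, 0.2)             # unbalanced second treatment
lev.interact <- hatvalues(lm(y1 ~ t1 * t2))
# leverage is now concentrated on observations in the rare
# interaction cells, making the standard errors more volatile
```

In the balanced design every observation has the same hat value; in the interacted model the few observations in the small treatment cells dominate.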

Effect of a resuscitation strategy targeting peripheral perfusion status vs serum lactate levels on 28-day mortality among patients with septic shock. The ANDROMEDA-SHOCK randomized clinical trial. Journal of the American Medical Association [PubMed] Published 17th February 2019

This article gets a mention in this round-up not for its health or economic content, but because it is a very good example of how not to use statistical significance. In previous articles on the blog we’ve discussed the misuse and misinterpretation of p-values, but I generally don’t go as far as advocating their complete abandonment, as a recent mass-signed letter in Nature has. What is crucial is that researchers stop making the mistake that statistical insignificance means no effect. Making this error can have pernicious consequences for patient treatment and the adoption of effective and cost-effective technologies, and it is exactly the error this article makes.

I first saw this ridiculous use of statistical significance when it was Tweeted by David Spiegelhalter. The trial (in JAMA, no less) compares two different methods of managing resuscitation in patients with septic shock. The key result is:

By day 28, 74 patients (34.9%) in the peripheral perfusion group and 92 patients (43.4%) in the lactate group had died (hazard ratio, 0.75 [95% CI, 0.55 to 1.02]; P = .06; risk difference, −8.5% [95% CI, −18.2% to 1.2%]).

And the conclusion?

Among patients with septic shock, a resuscitation strategy targeting normalization of capillary refill time, compared with a strategy targeting serum lactate levels, did not reduce all-cause 28-day mortality.

Which is determined solely on the basis of statistical significance. Certainly it is possible that the result is just chance variation. But the study was conducted because it was believed that there was a difference in survival between these methods, and a 25% reduction in mortality risk is significant indeed. Rather than take an abductive or Bayesian approach, which would see this result as providing some degree of evidence in support of one treatment, the authors abandon any attempt at thinking and just mechanically follow statistical significance logic. This is a good case study for anyone wanting to discuss interpretation of p-values, but more significantly (every pun intended) the reliance on statistical significance may well be jeopardising patient lives.
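As a quick back-of-the-envelope check of the quoted numbers (arm sizes of 212 are inferred from the reported percentages, i.e. 74/0.349 and 92/0.434; the published interval presumably comes from an adjusted analysis, so the simple unadjusted Wald interval below only approximates it):

```r
# Rough reproduction of the quoted risk difference and its interval.
n1 <- 212; d1 <- 74    # peripheral perfusion arm
n2 <- 212; d2 <- 92    # lactate arm
p1 <- d1 / n1; p2 <- d2 / n2

rd <- p1 - p2                                  # risk difference, about -8.5%
se <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
ci <- rd + c(-1.96, 1.96) * se                 # interval straddles zero
```

The interval only just crosses zero, which is the whole point: a result this close to the threshold is weak evidence of no effect, not evidence that the two strategies are equivalent.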

Value of information: sensitivity analysis and research design in Bayesian evidence synthesis. Journal of the American Statistical Association Published 30th April 2019.

Three things are necessary to make a decision in the decision theoretical sense. First, a set of possible decisions; second, a set of parameters describing the state of the world; and third, a loss (or utility) function. Given these three things the decision that is chosen is the one that minimises losses (or maximises utility) given the state of the world. Of course, the state of the world may not be known for sure. There can be some uncertainty about the parameters and hence the best course of action, which might lead to losses relative to the decision we would make if we knew everything perfectly. Thus, we can determine the benefits of collecting more information. This is the basis of value of information (VoI) analysis.

We can distinguish between different quantities of interest in VoI analyses. The expected value of perfect information (EVPI) is the difference between the expected loss under the optimal decision made with current information and the expected loss under the decision we would make if we knew all the parameters exactly. The expected value of partial perfect information (EVPPI) is defined similarly, except that it considers the value of knowing just one of the parameters exactly. Finally, the expected value of sample information (EVSI) compares the losses under our current decision to those under the decision we would make if we had the information on our parameters from a particular study design. If we know the costs of conducting a given study, we can subtract them from the EVSI to get the expected net benefit of sampling.
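To fix ideas, a toy Monte Carlo EVPI calculation for a two-option decision with one uncertain parameter (an entirely made-up net-benefit model, in arbitrary units) might look like:

```r
# Toy Monte Carlo EVPI: two options, one uncertain parameter.
set.seed(5)
theta <- rnorm(1e5, mean = 0.2, sd = 0.5)  # uncertain incremental benefit
nb1 <- rep(0, length(theta))               # option 1: baseline net benefit
nb2 <- theta - 0.1                         # option 2: benefit minus extra cost

# with current information we choose the option that is best on average...
enb.current <- max(mean(nb1), mean(nb2))
# ...with perfect information we choose the best option in every draw
enb.perfect <- mean(pmax(nb1, nb2))
evpi <- enb.perfect - enb.current
```

EVPPI and EVSI follow the same logic but condition on learning only a subset of parameters or only the data from a proposed study, which is what makes them computationally harder in complex evidence synthesis models.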

Calculating EVPPI and EVSI is no easy feat, though, particularly for more complex models. This article proposes a relatively straightforward and computationally feasible way of estimating these quantities for complex evidence synthesis models. For their example the authors use a model commonly used to estimate overall HIV prevalence. Since not all HIV cases are known or disclosed, one has to combine different sets of data to get a reliable estimate. For example, it is known how many people attend sexual health clinics and what proportion of those have HIV; the number of people who do not attend sexual health clinics is also known, just not how many of them might be HIV positive. There are many epidemiological parameters in this complex model, and the aim of the paper is to demonstrate how the principal sources of uncertainty can be determined in terms of EVPPI and EVSI.


Sam Watson’s journal round-up for 25th February 2019


Democracy does cause growth. Journal of Political Economy [RePEc] Published January 2019

Citizens of a country with a democratic system of government are able to effect change in its government and influence policy. The threat of voting poor performers out of power gives the government an incentive to legislate in ways that benefit the population. However, democracy is certainly no guarantee of good governance, economic growth, or population health, as many events of the last ten years testify. Similarly, non-democracies can also enact policy that benefits the people: a benevolent dictator does not face the same need to satisfy voters and can enact politically challenging but beneficial policies. People often point to China as a key example. So the question remains whether democracy per se has any tangible economic or health benefits.

In a past discussion of an article on democratic reform and child health, I concluded that “Democratic reform is neither a sufficient nor necessary condition for improvements in child mortality.” Nevertheless, democracy may still be beneficial on average, given its in-built safeguards against poor leaders. This paper, which has been doing the rounds for years as a working paper, is another examination of the impact of becoming democratic. Principally the article is focused on economic growth, but health and education outcomes feature (very) briefly. The concern I have with both the article mentioned at the beginning of this paragraph and this newly published article is that they do not consider in great detail why democratisation occurred. As much political science work points out, democratic reform can be demanded when economic conditions are poor due to poor governance. In these endogenous cases, economic growth causes democracy, whereas in other countries democracy comes about in a more exogenous manner. Lumping them all in together may be misleading.

While the authors of this paper provide page after page of different regression specifications, including auto-regressive and instrumental variables models, I remain unconvinced. For example, the instrument relies on ‘waves’ of transitions: a country is more likely to shift politically if its regional neighbours do, as in the Arab Spring. But neither economic nor political conditions in a given country are independent of its neighbours. In somewhat of a rebuttal, Ruiz Pozuelo and colleagues conducted a survey to try to identify and separate out the countries that transitioned to democracy endogenously and exogenously (from economic conditions). Their work suggests that the countries that transitioned exogenously did not experience growth benefits. Taken together, this shows the importance of using theory to guide empirical work, and not the other way round.

Effect of Novartis Access on availability and price of non-communicable disease medicines in Kenya: a cluster-randomised controlled trial. Lancet: Global Health Published February 2019

Access to medicines is one of the key barriers to achieving universal health care. The cost-effectiveness threshold for many low-income countries rules out many potentially beneficial medicines. This is driven in part by the high prices charged by pharmaceutical companies, which often do not discriminate between purchasers with high and low abilities to pay. Novartis launched a scheme, Novartis Access, to provide medicines to low- and middle-income countries at a price of US$1 per treatment per month. This article presents a cluster randomised trial of this scheme in eight counties of Kenya.

The trial provided access in four treatment counties and used four counties as controls. Within each county, individuals with non-communicable diseases, and pharmacies, were selected at random as the units at which outcomes were analysed. Given the small number of clusters, a covariate-constrained randomisation procedure was used to ensure a decent balance of covariates between arms. However, the analysis does not control for the covariates used in the constrained randomisation, which can lead to lower power and incorrect type I error rates. This problem is compounded by the use of statistical significance to decide what was and was not affected by the Novartis Access programme. While practically all the drugs investigated showed improved availability, only the two with p&lt;0.05 are reported as having improved. Given the very small sample of clusters, this is a tricky distinction to make! Significance aside, the programme appears to have had some success in improving access to diabetes and asthma medication, but not quite as much as hoped. Introductory microeconomics would suggest, though, that savings are not all passed on to the consumer.