Method of the month: constrained randomisation

Once a month we discuss a particular research method that may be of interest to people working in health economics. We’ll consider widely used key methodologies, as well as more novel approaches. Our reviews are not designed to be comprehensive but provide an introduction to the method, its underlying principles, some applied examples, and where to find out more. If you’d like to write a post for this series, get in touch. This month’s method is constrained randomisation.


Randomised experimental studies are one of the best ways of estimating the causal effects of an intervention. They have become more and more widely used in economics; Banerjee and Duflo are often credited with popularising them among economists. When done well, randomly assigning a treatment ensures both observable and unobservable factors are independent of treatment status and likely to be balanced between treatment and control units.

Many of the interventions economists are interested in are at a ‘cluster’ level, be it a school, hospital, village, or otherwise. So the appropriate experimental design would be a cluster randomised controlled trial (cRCT), in which the clusters are randomised to treatment or control and individuals within each cluster are observed either cross-sectionally or longitudinally. But, except in cases of large budgets, the number of clusters participating can be fairly small. When randomising a relatively small number of clusters we could by chance end up with a quite severe imbalance in key covariates between trial arms. This presents a problem if we suspect a priori that these covariates have an influence on key outcomes.

One solution to the problem of potential imbalance is covariate-based constrained randomisation. The principle here is to conduct a large number of randomisations, assess the balance of covariates in each one using some balance metric, and then to randomly choose one of the most balanced according to this metric. This method preserves the important random treatment assignment while ensuring covariate balance. Stratified randomisation also has a similar goal, but in many cases may not be possible if there are continuous covariates of interest or too few clusters to distribute among many strata.


Conducting covariate constrained randomisation is straightforward and involves the following steps:

  1. Specifying the important baseline covariates to balance the clusters on. For each cluster j we have L covariates x_{il}; l=1,...L.
  2. Characterising each cluster in terms of these covariates, i.e. creating the x_{il}.
  3. Enumerating all potential randomisation schemes or simulating a large number of them. For each one, we will need to measure the balance of the x_{il} between trial arms.
  4. Selecting a candidate set of randomisation schemes that are sufficiently balanced according to some pre-specified criterion from which we can randomly choose our treatment allocation.

Balance scores

A key ingredient in the above steps is the balance score. This score needs to be some univariate measure of potentially multivariate imbalance between two (or more) groups. A commonly used score is that proposed by Raab and Butcher:

\sum_{l=1}^{L} \omega_l (\bar{x}_{1l}-\bar{x}_{0l})^2

where \bar{x}_{1l} and \bar{x}_{0l} are the mean values of covariate l in the treatment and control groups respectively, and \omega_l is some weight, which is often the inverse standard deviation of the covariate. Conceptually the score is a sum of standardised differences in means, so lower values indicate greater balance. But other scores would also work. Indeed, any statistic that measures the distance between the distributions of two variables would work and could be summed up over the covariates. This could include the maximum distance:

max_l |x_{1l} - x_{0l}|

the Manhattan distance:

\sum_{l=1}^{L} |x_{1l}-x_{0l}|

or even the Symmetrised Bayesian Kullback-Leibler divergence (I can’t be bothered to type this one out). Grischott has developed a Shiny application to estimate all these distances in a constrained randomisation framework, detailed in this paper.

Things become more complex if there are more than two trial arms. All of the above scores are only able to compare two groups. However, there already exist a number of univariate measures of multivariate balance in the form of MANOVA (multivariate analysis of variance) test statistics. For example, if we have G trial arms and let X_{jg} = \left[ x_{jg1},...,x_{jgL} \right]' then the between group covariance matrix is:

B = \sum_{g=1}^G N_g(\bar{X}_{.g} - \bar{X}_{..})(\bar{X}_{.g} - \bar{X}_{..})'

and the within group covariance matrix is:

W = \sum_{g=1}^G \sum_{j=1}^{N_g} (X_{jg}-\bar{X}_{.g})(X_{jg}-\bar{X}_{.g})'

which we can use in a variety of statistics including Wilks’ Lambda, for example:

\Lambda = \frac{det(W)}{det(W+B)}

No trial has previously used covariate constrained randomisation with multiple groups, as far as I am aware, but this is the subject of an ongoing paper investigating these scores – so watch this space!

Once the scores have been calculated for all possible schemes or a very large number of possible schemes, we select from among those which are most balanced. The most balanced are defined according to some quantile of the balance score, say the top 15%.

As a simple simulated example of how this might be coded in R, let’s consider a trial of 8 clusters with two standard-normally distributed covariates. We’ll use the Raab and Butcher score from above:

#simulate the covariates
n <- 8
x1 <- rnorm(n)
x2 <- rnorm(n)
x <- matrix(c(x1,x2),ncol=2)
#enumerate all possible schemes - you'll need the partitions package here
schemes <- partitions::setparts(c(n/2,n/2))
#write a function that will estimate the score
#for each scheme which we can apply over our
#set of schemes
balance_score <- function(scheme,covs){
treat.idx <- I(scheme==2)
control.idx <- I(scheme==1)
treat.means <- apply(covs[treat.idx,],2,mean)
control.means <- apply(covs[control.idx,],2,mean)
cov.sds <- apply(covs,2,sd)
#Raab-butcher score
score <- sum((treat.means - control.means)^2/cov.sds)
#apply the function
scores <- apply(schemes,2,function(i)balance_score(i,x))
#find top 15% of schemes (lowest scores)
scheme.set <- which(scores <= quantile(scores,0.15))
#choose one at random
scheme.number <- sample(scheme.set,1)
scheme.chosen <- schemes[,scheme.number]


A commonly used method of cluster trial analysis is by estimating a mixed-model, i.e. a hierarchical model with cluster-level random effects. Two key questions are whether to control for the covariates used in the randomisation, and which test to use for treatment effects. Fan Li has two great papers answering these questions for linear models and binomial models. One key conclusion is that the appropriate type I error rates are only achieved in models adjusted for the covariates used in the randomisation. For non-linear models type I error rates can be way off for many estimators especially with small numbers of clusters, which is often the reason for doing constrained randomisation in the first place, so a careful choice is needed here. I would recommend adjusted permutation tests if in doubt to ensure the appropriate type I error rates. Of course, one could take a Bayesian approach to analysis, although there is no analysis that I’m aware of, of the performance of these models for these analyses (another case of “watch this space!”).


There are many trials that used this procedure and listing even a fraction would be a daunting task. But I would be remiss for not noting a trial of my own that uses covariate constrained randomisation. It is investigating the effect of providing an incentive to small and medium sized enterprises to adhere to a workplace well-being programme. There are good applications used as examples in Fan Li’s papers mentioned above. A trial that featured in a journal round-up in February used covariate constrained randomisation to balance a very small number of clusters in a trial of a medicines access programme in Kenya.


Method of the month: Permutation tests

Once a month we discuss a particular research method that may be of interest to people working in health economics. We’ll consider widely used key methodologies, as well as more novel approaches. Our reviews are not designed to be comprehensive but provide an introduction to the method, its underlying principles, some applied examples, and where to find out more. If you’d like to write a post for this series, get in touch. This month’s method is permutation tests.


One of the main objections to the use of p-values for statistical inference is that they are often misunderstood. They can be interpreted as the probability a null hypothesis is true, which they are not. Part of the cause of this problem is the black-box approach to statistical software. You can plug in data to Stata, R, or any other package, ask it to run a regression or test with ease, and it will spit out a load of p-values. Many people will just take it on trust that the software is returning the test of the hypothesis of interest and that the method has the correct properties, like type one error rate. But if one had to go through the process to obtain a test-statistic and the relevant distribution to compare it to, perhaps then the p-value wouldn’t be so misunderstood. For trials involving randomisation, permutation tests (or randomisation tests or exact tests) are just such a process.

Permutation tests were first proposed by Ronald Fisher in the early 20th Century. The basic principle is that to test differences between two groups assigned at random we can determine the exact distribution of a test statistic (such as a difference in means) under the null hypothesis by calculating the value of the test statistic for all possible ways of arranging our units into the two groups. The value of the test statistic for the actual assignment can be compared to this distribution to determine the p-value.

The simplest example would be to test a difference in means for a continuous outcome between two groups assigned in a randomised controlled trial. Let’s generate some data (we’ll do this in R) from a simple trial with 50 individuals per arm. In the control arm the data come from a N(0,1) distribution and in the treatment arm they come from a N(0.5,1) distribution:

n <- 100 #number of individuals
D <- sample(c(rep(0,n/2),rep(1,n/2)),n) #treatment assignment
y <- rnorm(n,mean=D*0.5,sd=1) #generate normal outcomes
T.diff <- mean(y[(D*1:n)])-mean(y[(-1*(D-1)*1:n)]) #actual difference in means

Now let’s add a function to randomly re-assign units to treatment and control and calculate the difference in means we would observe under the null of no difference based on our generated data. We will then plot this distribution and add a line showing our where the actual difference in means lies.

#function to generate differences in means
permStat <- function(n,y){
D <- sample(c(rep(0,n/2),rep(1,n/2)),n) #generate new assignment
T.diff <- mean(y[(D*1:n)])-mean(y[(-1*(D-1)*1:n)])
T.dist <- sapply(1:500,function(i)permStat(n,y)) #apply it 500 times
qplot(T.dist)+geom_vline(xintercept=T.diff,col="red") #plot
Exact distribution for test of a difference in means

Our 2-sided p-value here is 0.04, i.e. the proportion of values at least as extreme as our test statistic.

For a more realistic example we can consider a cluster randomised trial with a binary outcome. The reason for choosing this example is that estimating non-linear mixed models is difficult. Calculating test statistics, especially when the number of clusters is relatively small, is even harder. The methods used in most statistics packages have inflated type one errors, unbeknownst to many. So let’s set up the following trial: two-arms with 8 clusters per arm, and 100 patients per cluster, which is representative of trials of, say, hospitals. The data generating mechanism is for patient i in cluster j

y_{ij} = Bernoulli(p_{ij})

p_{ij} = Logit(\alpha_j + x_j'\beta + D_{j}\gamma)

So no individual level covariates, four Bernoulli(0.3) covariates x_j with \beta = [1,1,1,1], and a treatment indicator D_j with treatment effect \gamma=0 (to look at type one error rates). The cluster effect is \alpha_j \sim N(0,\sigma^2_\alpha) and \sigma^2_\alpha is chosen to give an intraclass correlation coefficient of 0.05. We’ll simulate data from this model and then estimate the model above and test the null hypothesis H_0:\gamma=0 in two ways. First, we’ll use the popular R package lme4 and the command glmer, which uses adaptive Gaussian quadrature to estimate the parameters and covariance matrix; the built in p-values are derived from standard Wald t-statistics. Second, we’ll use our permutation tests.

Gail et al. (1996) examine permutation tests for these kinds of models. They propose the following residual-based test (although one can use other tests based on the likelihood): (i) estimate the simple model under the null with no treatment effect and no hierarchical effect, i.e. p_{ij}=Logit(\alpha+x_{ij}'\beta); (ii) for each individual generate their predicted values and residuals r_{ij}; (iii) generate the cluster average residuals \bar{r}_{.j}=N_j^{-1}\sum_{i=1}^{N_j} r_{ij}. Then the test statistic is

U=N^{-1}_{j} \left( \sum_{j=1}^{2N_j}D_{jg}\bar{r}_{.j} - \sum_{j=1}^{2N_j}(D_{j}-1)\bar{r}_{.j} \right) = N^{-1}_{j} \sum_{j=1}^{2N_j}(2D_{j}-1)\bar{r}_{.j}

Under the null and given equal cluster sizes, the residual means are exchangeable. So the exact distribution of U can be obtained by calculating it under all possible randomisation schemes. The p-value is then the quantile of this distribution under which the test statistic falls for the actual randomisation scheme. For larger numbers of clusters, it is not feasible to permute every possible randomisation scheme, so we approximate the distribution of U using 500 randomly generated schemes. The following figure shows the estimated type one error rates using the two different methods (and 200 simulations):

The figure clearly shows an inflated type one error rates for the standard p-values reported by glmer especially for smaller numbers of clusters per arm. By contrast the residual permutation test shows approximately correct type one error rates (given more simulations there should be less noise in these estimates).


Implementation of these tests is straightforward in different software packages. In Stata, one can use the command permute, for which you specify the different groups, number of permutations and command to estimate the treatment effect. In R, there are various packages, like coin, that perform a similar function. For more complex models particular non-linear ones and ones involving adjustment, one has to be careful about how to specify the appropriate test statistic and model under the null hypothesis, which may involve a little programming, but it is relatively straightforward to do so.


These methods have widespread applications for anyone looking to use null hypothesis significance testing. So a complete overview of the literature is not possible. Instead, we highlight a few uses of these methods.

In a previous post in this series we covered synthetic control methods; one of the ways of computing test statistics for this method has been called ‘placebo tests’, which are an exact parallel to the permutation tests discussed here. Krief and others discuss the use of these methods for evaluating health policies. Another example from a regression-based analysis is provided by Dunn and Shapiro. And Jacob, Ludwig, and Miller examine the impact of a lottery for vouchers to move to another area and employ these tests.

Sugar et al derive health states for depression from the SF-12 and use permutation test methods to validate the health states. Barber and Thompson use these tests to examine costs data from an RCT.


Method of the month: Shared parameter models

Once a month we discuss a particular research method that may be of interest to people working in health economics. We’ll consider widely used key methodologies, as well as more novel approaches. Our reviews are not designed to be comprehensive but provide an introduction to the method, its underlying principles, some applied examples, and where to find out more. If you’d like to write a post for this series, get in touch. This month’s method is shared parameter models.


Missing data and data errors are an inevitability rather than a possibility. If these data were missing as a result of a random computer error, then there would be no problem, no bias would result in estimators of statistics from these data. But, this is probably not why they’re missing. People drop out of surveys and trials often because they choose to, if they move away, or worse if they die. The trouble with this is that those factors that influence these decisions and events are typically also those that affect the outcomes of interest in our studies, thus leading to bias. Unfortunately, missing data is often improperly dealt with. For example, a study of randomised controlled trials (RCTs) in the big four medical journals found that 95% had some missing data, and around 85% of those did not deal with it in a suitable way. An instructive article in the BMJ illustrated the potentially massive biases that dropout in RCTs can generate. Similar effects should be expected from dropout in panel studies and other analyses. Now, if the data are missing at random – i.e. the probability of missing data or dropout is independent of the data conditional on observed covariates – then we could base our inferences on just the observed data. But this is often not the case, so what do we do in these circumstances?


If we have a full set of data Y and a set of indicators for whether each observation is missing R, plus some parameters \theta and \phi, then we can factorise their joint distribution, f(Y,R;\theta,\phi) in three ways:

Selection model


Perhaps most familiar to econometricians, this factorisation involves the marginal distribution of the full data and the conditional distribution of missingness given the data. The Heckman selection model is an example of this factorisation. For example, one could specify a probit model for dropout and a normally distributed outcome, and then the full likelihood would involve the product of the two.

Pattern-mixture model


This approach specifies a marginal distribution for the missingness or dropout mechanism and then the distribution of the data differs according to the type of missingness or dropout. The data are a mixture of different patterns, i.e. distributions. This type of model is implied when non-response is not considered missing data per se, and we’re interested in inferences within each sub-population. For example, when estimating quality of life at a given age, the quality of life of those that have died is not of interest, but their dying can bias the estimates.

Shared parameter model


Now, the final way we can model these data posits unobserved variables, \alpha, conditional on which Y and R are independent. These models are most appropriate when the dropout or missingness is attributable to some underlying process changing over time, such as disease progression or household attitudes, or an unobserved variable, such as health status.

At the simplest level, one could consider two separate models with correlated random effects, for example, adding in covariates x and having a linear mixed model and probit selection model for person i at time t

Y_{it} = x_{it}'\theta + \alpha_{1,i} + u_{it}

R_{it} = \Phi(x_{it}'\theta + \alpha_{2,i})

(\alpha_{1,i},\alpha_{2,i}) \sim MVN(0,\Sigma) and u_{it} \sim N(0,\sigma^2)

so that the random effects are multivariate normally distributed.

A more complex and flexible specification for longitudinal settings would permit the random effects to vary over time, differently between models and individuals:

Y_{i}(t) = x_{i}(t)'\theta + z_{1,i} (t)\alpha_i + u_{it}

R_{i}(t) = G(x_{i}'\theta + z_{2,i} (t)\alpha_i)

\alpha_i \sim h(.) and u_{it} \sim N(0,\sigma^2)

As an example, if time were discrete in this model then z_{1,i} could be a series of parameters for each time period z_{1,i} = [\lambda_1,\lambda_2,...,\lambda_T], what are often referred to as ‘factor loadings’ in the structural equation modelling literature. We will run up against identifiability problems with these more complex models. For example, if the random effect was normally distributed i.e. \alpha_i \sim N(0,\sigma^2_\alpha) then we could multiply each factor loading by \rho and then \alpha_i \sim N(0,\sigma^2_\alpha / \rho^2) would give us an equivalent model. So, we would have to put restrictions on the parameters. We can set the variance of the random effect to be one, i.e. \alpha_i \sim N(0,1). We can also set one of the factor loadings to zero, without loss of generality, i.e. z_{1,i} = [0,...,\lambda_T].

The distributional assumptions about the random effects can have potentially large effects on the resulting inferences. It is possible therefore to non-parametrically model these as well – e.g. using a mixture distribution. Ultimately, these models are a useful method to deal with data that are missing not at random, such as informative dropout from panel studies.


Estimation can be tricky with these models given the need to integrate out the random effects. For frequentist inferences, expectation maximisation (EM) is one way of estimating these models, but as far as I’m aware the algorithm would have to be coded for the problem specifically in Stata or R. An alternative is using some kind of quadrature based method. The Stata package stjm fits shared parameter models for longitudinal and survival data, with similar specifications to those above.

Otherwise, Bayesian tools, such as Hamiltonian Monte Carlo, may have more luck dealing with the more complex models. For the simpler correlated random effects specification specified above one can use the stan_mvmer command in the rstanarm package. For more complex models, one would need to code the model in something like Stan.


For a health economics specific discussion of these types of models, one can look to the chapter Latent Factor and Latent Class Models to Accommodate Heterogeneity, Using Structural Equation in the Encyclopedia of Health Economics, although shared parameter models only get a brief mention. However, given that that book is currently on sale for £1,000, it may be beyond the wallet of the average researcher! Some health-related applications may be more helpful. Vonesh et al. (2011) used shared parameter models to look at the effects of diet and blood pressure control on renal disease progression. Wu and others (2011) look at how to model the effects of a ‘concomitant intervention’, which is one applied when a patient’s health status deteriorates and so is confounded with health, using shared parameter models. And, Baghfalaki and colleagues (2017) examine heterogeneous random effect specification for shared parameter models and apply this to HIV data.