Method of the month: constrained randomisation

Once a month we discuss a particular research method that may be of interest to people working in health economics. We’ll consider widely used key methodologies, as well as more novel approaches. Our reviews are not designed to be comprehensive but provide an introduction to the method, its underlying principles, some applied examples, and where to find out more. If you’d like to write a post for this series, get in touch. This month’s method is constrained randomisation.

Principle

Randomised experimental studies are one of the best ways of estimating the causal effects of an intervention. They have become more and more widely used in economics; Banerjee and Duflo are often credited with popularising them among economists. When done well, randomly assigning a treatment ensures both observable and unobservable factors are independent of treatment status and likely to be balanced between treatment and control units.

Many of the interventions economists are interested in are at a ‘cluster’ level, be it a school, hospital, village, or otherwise. So the appropriate experimental design would be a cluster randomised controlled trial (cRCT), in which the clusters are randomised to treatment or control and individuals within each cluster are observed either cross-sectionally or longitudinally. But, except in cases of large budgets, the number of clusters participating can be fairly small. When randomising a relatively small number of clusters we could by chance end up with a quite severe imbalance in key covariates between trial arms. This presents a problem if we suspect a priori that these covariates have an influence on key outcomes.

One solution to the problem of potential imbalance is covariate-based constrained randomisation. The principle here is to conduct a large number of randomisations, assess the balance of covariates in each one using some balance metric, and then to randomly choose one of the most balanced according to this metric. This method preserves the important random treatment assignment while ensuring covariate balance. Stratified randomisation also has a similar goal, but in many cases may not be possible if there are continuous covariates of interest or too few clusters to distribute among many strata.

Implementation

Conducting covariate constrained randomisation is straightforward and involves the following steps:

1. Specifying the important baseline covariates to balance the clusters on. For each cluster $j$ we have $L$ covariates $x_{il}; l=1,...L$.
2. Characterising each cluster in terms of these covariates, i.e. creating the $x_{il}$.
3. Enumerating all potential randomisation schemes or simulating a large number of them. For each one, we will need to measure the balance of the $x_{il}$ between trial arms.
4. Selecting a candidate set of randomisation schemes that are sufficiently balanced according to some pre-specified criterion from which we can randomly choose our treatment allocation.

Balance scores

A key ingredient in the above steps is the balance score. This score needs to be some univariate measure of potentially multivariate imbalance between two (or more) groups. A commonly used score is that proposed by Raab and Butcher: $\sum_{l=1}^{L} \omega_l (\bar{x}_{1l}-\bar{x}_{0l})^2$

where $\bar{x}_{1l}$ and $\bar{x}_{0l}$ are the mean values of covariate $l$ in the treatment and control groups respectively, and $\omega_l$ is some weight, which is often the inverse standard deviation of the covariate. Conceptually the score is a sum of standardised differences in means, so lower values indicate greater balance. But other scores would also work. Indeed, any statistic that measures the distance between the distributions of two variables would work and could be summed up over the covariates. This could include the maximum distance: $max_l |x_{1l} - x_{0l}|$

the Manhattan distance: $\sum_{l=1}^{L} |x_{1l}-x_{0l}|$

or even the Symmetrised Bayesian Kullback-Leibler divergence (I can’t be bothered to type this one out). Grischott has developed a Shiny application to estimate all these distances in a constrained randomisation framework, detailed in this paper.

Things become more complex if there are more than two trial arms. All of the above scores are only able to compare two groups. However, there already exist a number of univariate measures of multivariate balance in the form of MANOVA (multivariate analysis of variance) test statistics. For example, if we have $G$ trial arms and let $X_{jg} = \left[ x_{jg1},...,x_{jgL} \right]'$ then the between group covariance matrix is: $B = \sum_{g=1}^G N_g(\bar{X}_{.g} - \bar{X}_{..})(\bar{X}_{.g} - \bar{X}_{..})'$

and the within group covariance matrix is: $W = \sum_{g=1}^G \sum_{j=1}^{N_g} (X_{jg}-\bar{X}_{.g})(X_{jg}-\bar{X}_{.g})'$

which we can use in a variety of statistics including Wilks’ Lambda, for example: $\Lambda = \frac{det(W)}{det(W+B)}$

No trial has previously used covariate constrained randomisation with multiple groups, as far as I am aware, but this is the subject of an ongoing paper investigating these scores – so watch this space!

Once the scores have been calculated for all possible schemes or a very large number of possible schemes, we select from among those which are most balanced. The most balanced are defined according to some quantile of the balance score, say the top 15%.

As a simple simulated example of how this might be coded in R, let’s consider a trial of 8 clusters with two standard-normally distributed covariates. We’ll use the Raab and Butcher score from above:

#simulate the covariatesn <- 8x1 <- rnorm(n)x2 <- rnorm(n)x <- matrix(c(x1,x2),ncol=2)
#enumerate all possible schemes - you'll need the partitions package hereschemes <- partitions::setparts(c(n/2,n/2))
#write a function that will estimate the score#for each scheme which we can apply over our#set of schemesbalance_score <- function(scheme,covs){   treat.idx <- I(scheme==2)   control.idx <- I(scheme==1)   treat.means <- apply(covs[treat.idx,],2,mean)   control.means <- apply(covs[control.idx,],2,mean)   cov.sds <- apply(covs,2,sd)   #Raab-butcher score   score <- sum((treat.means - control.means)^2/cov.sds)   return(score)}
#apply the functionscores <- apply(schemes,2,function(i)balance_score(i,x))#find top 15% of schemes (lowest scores)scheme.set <- which(scores <= quantile(scores,0.15))#choose one at randomscheme.number <- sample(scheme.set,1)scheme.chosen <- schemes[,scheme.number]

Analyses

A commonly used method of cluster trial analysis is by estimating a mixed-model, i.e. a hierarchical model with cluster-level random effects. Two key questions are whether to control for the covariates used in the randomisation, and which test to use for treatment effects. Fan Li has two great papers answering these questions for linear models and binomial models. One key conclusion is that the appropriate type I error rates are only achieved in models adjusted for the covariates used in the randomisation. For non-linear models type I error rates can be way off for many estimators especially with small numbers of clusters, which is often the reason for doing constrained randomisation in the first place, so a careful choice is needed here. I would recommend adjusted permutation tests if in doubt to ensure the appropriate type I error rates. Of course, one could take a Bayesian approach to analysis, although there is no analysis that I’m aware of, of the performance of these models for these analyses (another case of “watch this space!”).

Application

There are many trials that used this procedure and listing even a fraction would be a daunting task. But I would be remiss for not noting a trial of my own that uses covariate constrained randomisation. It is investigating the effect of providing an incentive to small and medium sized enterprises to adhere to a workplace well-being programme. There are good applications used as examples in Fan Li’s papers mentioned above. A trial that featured in a journal round-up in February used covariate constrained randomisation to balance a very small number of clusters in a trial of a medicines access programme in Kenya.

Credit

Method of the month: Permutation tests

Once a month we discuss a particular research method that may be of interest to people working in health economics. We’ll consider widely used key methodologies, as well as more novel approaches. Our reviews are not designed to be comprehensive but provide an introduction to the method, its underlying principles, some applied examples, and where to find out more. If you’d like to write a post for this series, get in touch. This month’s method is permutation tests.

Principle

One of the main objections to the use of p-values for statistical inference is that they are often misunderstood. They can be interpreted as the probability a null hypothesis is true, which they are not. Part of the cause of this problem is the black-box approach to statistical software. You can plug in data to Stata, R, or any other package, ask it to run a regression or test with ease, and it will spit out a load of p-values. Many people will just take it on trust that the software is returning the test of the hypothesis of interest and that the method has the correct properties, like type one error rate. But if one had to go through the process to obtain a test-statistic and the relevant distribution to compare it to, perhaps then the p-value wouldn’t be so misunderstood. For trials involving randomisation, permutation tests (or randomisation tests or exact tests) are just such a process.

Permutation tests were first proposed by Ronald Fisher in the early 20th Century. The basic principle is that to test differences between two groups assigned at random we can determine the exact distribution of a test statistic (such as a difference in means) under the null hypothesis by calculating the value of the test statistic for all possible ways of arranging our units into the two groups. The value of the test statistic for the actual assignment can be compared to this distribution to determine the p-value.

The simplest example would be to test a difference in means for a continuous outcome between two groups assigned in a randomised controlled trial. Let’s generate some data (we’ll do this in R) from a simple trial with 50 individuals per arm. In the control arm the data come from a $N(0,1)$ distribution and in the treatment arm they come from a $N(0.5,1)$ distribution:

n <- 100 #number of individualsD <- sample(c(rep(0,n/2),rep(1,n/2)),n) #treatment assignmenty <- rnorm(n,mean=D*0.5,sd=1) #generate normal outcomes T.diff <- mean(y[(D*1:n)])-mean(y[(-1*(D-1)*1:n)]) #actual difference in means

Now let’s add a function to randomly re-assign units to treatment and control and calculate the difference in means we would observe under the null of no difference based on our generated data. We will then plot this distribution and add a line showing our where the actual difference in means lies.

#function to generate differences in meanspermStat <- function(n,y){   D <- sample(c(rep(0,n/2),rep(1,n/2)),n) #generate new assignment   T.diff <- mean(y[(D*1:n)])-mean(y[(-1*(D-1)*1:n)])   return(T.diff) } T.dist <- sapply(1:500,function(i)permStat(n,y)) #apply it 500 timesqplot(T.dist)+geom_vline(xintercept=T.diff,col="red") #plot

Our 2-sided p-value here is 0.04, i.e. the proportion of values at least as extreme as our test statistic.

For a more realistic example we can consider a cluster randomised trial with a binary outcome. The reason for choosing this example is that estimating non-linear mixed models is difficult. Calculating test statistics, especially when the number of clusters is relatively small, is even harder. The methods used in most statistics packages have inflated type one errors, unbeknownst to many. So let’s set up the following trial: two-arms with 8 clusters per arm, and 100 patients per cluster, which is representative of trials of, say, hospitals. The data generating mechanism is for patient $i$ in cluster $j$ $y_{ij} = Bernoulli(p_{ij})$ $p_{ij} = Logit(\alpha_j + x_j'\beta + D_{j}\gamma)$

So no individual level covariates, four Bernoulli(0.3) covariates $x_j$ with $\beta = [1,1,1,1]$, and a treatment indicator $D_j$ with treatment effect $\gamma=0$ (to look at type one error rates). The cluster effect is $\alpha_j \sim N(0,\sigma^2_\alpha)$ and $\sigma^2_\alpha$ is chosen to give an intraclass correlation coefficient of 0.05. We’ll simulate data from this model and then estimate the model above and test the null hypothesis $H_0:\gamma=0$ in two ways. First, we’ll use the popular R package lme4 and the command glmer, which uses adaptive Gaussian quadrature to estimate the parameters and covariance matrix; the built in p-values are derived from standard Wald t-statistics. Second, we’ll use our permutation tests.

Gail et al. (1996) examine permutation tests for these kinds of models. They propose the following residual-based test (although one can use other tests based on the likelihood): (i) estimate the simple model under the null with no treatment effect and no hierarchical effect, i.e. $p_{ij}=Logit(\alpha+x_{ij}'\beta)$; (ii) for each individual generate their predicted values and residuals $r_{ij}$; (iii) generate the cluster average residuals $\bar{r}_{.j}=N_j^{-1}\sum_{i=1}^{N_j} r_{ij}$. Then the test statistic is $U=N^{-1}_{j} \left( \sum_{j=1}^{2N_j}D_{jg}\bar{r}_{.j} - \sum_{j=1}^{2N_j}(D_{j}-1)\bar{r}_{.j} \right) = N^{-1}_{j} \sum_{j=1}^{2N_j}(2D_{j}-1)\bar{r}_{.j}$

Under the null and given equal cluster sizes, the residual means are exchangeable. So the exact distribution of $U$ can be obtained by calculating it under all possible randomisation schemes. The p-value is then the quantile of this distribution under which the test statistic falls for the actual randomisation scheme. For larger numbers of clusters, it is not feasible to permute every possible randomisation scheme, so we approximate the distribution of $U$ using 500 randomly generated schemes. The following figure shows the estimated type one error rates using the two different methods (and 200 simulations):

The figure clearly shows an inflated type one error rates for the standard p-values reported by glmer especially for smaller numbers of clusters per arm. By contrast the residual permutation test shows approximately correct type one error rates (given more simulations there should be less noise in these estimates).

Implementation

Implementation of these tests is straightforward in different software packages. In Stata, one can use the command permute, for which you specify the different groups, number of permutations and command to estimate the treatment effect. In R, there are various packages, like coin, that perform a similar function. For more complex models particular non-linear ones and ones involving adjustment, one has to be careful about how to specify the appropriate test statistic and model under the null hypothesis, which may involve a little programming, but it is relatively straightforward to do so.

Applications

These methods have widespread applications for anyone looking to use null hypothesis significance testing. So a complete overview of the literature is not possible. Instead, we highlight a few uses of these methods.

In a previous post in this series we covered synthetic control methods; one of the ways of computing test statistics for this method has been called ‘placebo tests’, which are an exact parallel to the permutation tests discussed here. Krief and others discuss the use of these methods for evaluating health policies. Another example from a regression-based analysis is provided by Dunn and Shapiro. And Jacob, Ludwig, and Miller examine the impact of a lottery for vouchers to move to another area and employ these tests.

Sugar et al derive health states for depression from the SF-12 and use permutation test methods to validate the health states. Barber and Thompson use these tests to examine costs data from an RCT.

Credit

Method of the month: Distributional cost effectiveness analysis

Once a month we discuss a particular research method that may be of interest to people working in health economics. We’ll consider widely used key methodologies, as well as more novel approaches. Our reviews are not designed to be comprehensive but provide an introduction to the method, its underlying principles, some applied examples, and where to find out more. If you’d like to write a post for this series, get in touch. This month’s method is distributional cost effectiveness analysis.

Principles

Variation in population health outcomes, particularly when socially patterned by characteristics such as income and race, are often of concern to policymakers. For example, the fact that people born in the poorest tenth of neighbourhoods in England can expect to live 19 fewer years of healthy life than those living in the richest tenth of neighbourhoods in the country, or the fact that black Americans born today can expect to die 4 years earlier than white Americans, are often considered to be unfair and in need of policy attention. As policymakers look to implement health programmes to tackle such unfair health disparities, they need the tools to enable them to evaluate the likely impacts of alternative programmes available to them in terms of the programmes’ impact on reducing these undesirable health inequalities, as well as their impact on improving population health.

Traditional tools for prospectively evaluating health programmes – that is to say, estimating the likely impacts of health programmes prior to their implementation – are typically based on cost-effectiveness analysis (CEA). CEA selects those programmes that improve the health of the average recipient of the programme the most, taking into consideration the health opportunity costs involved in implementing the programme. When using CEA to select health programmes there is, therefore, a risk that the programmes selected will not necessarily reduce the health disparities of concern to policymakers as these disparities are not part of the evaluation process used when comparing programmes. Indeed, in some cases, the programmes chosen using CEA may even unintentionally exacerbate these health inequalities.

There has been recent methodological work to build upon the standard CEA methods explicitly incorporating concerns for reducing health disparities into them. This equity augmented form of CEA is called distributional cost effectiveness analysis (DCEA). DCEA estimates the impacts of health interventions on different groups within the population and evaluates the health distributions resulting from these interventions in term of both health inequality and population health. Where necessary, DCEA can then be used to guide the trade-off between these different dimensions to pick the most “socially beneficial” programme to implement.

Implementation

The six core steps in implementing a DCEA are outlined below – full details of how DCEA is conducted in practice and applied to evaluate alternative options in a real case study (the NHS Bowel Cancer Screening Programme in England) can be found in a published tutorial.

1. Identify policy-relevant subgroups in the population

The first step in the analysis is to decide which characteristics of the population are of policy concern when thinking about health inequalities. For example, in England, there is a lot of concern about the fact that people born in poor neighbourhoods expect to die earlier than those born in rich neighbourhoods but little concern about the fact that men have shorter life expectancies than women.

2. Construct the baseline distribution of health

The next step is to construct a baseline distribution of health for the population. This baseline distribution describes the health of the population, typically measured in quality-adjusted life expectancy at birth, to show the level of health and health inequality prior to implementing the proposed interventions. This distribution can be standardised (using methods of either direct or indirect standardisation) to remove any variation in health that is not associated with the characteristics of interest. For example, in England, we might standardise the health distribution to remove variation associated with gender but retain variation associated with neighbourhood deprivation. This then gives us a description of the population health distribution with a particular focus on the health disparities we are trying to reduce. An example of how to construct such a ‘social distribution of health’ for England is given in another published article.

3. Estimate post-intervention distributions of health

We next estimate the health impacts of the interventions we are comparing. In producing these estimates we need to take into account differences by each of the equity relevant subgroups identified in the:

• prevalence and incidence of the diseases impacted by the intervention,
• rates of uptake and adherence to the intervention,
• efficacy of the intervention,
• mortality and morbidity, and
• health opportunity costs.

Standardising these health impacts and combining with the baseline distribution of health derived above gives us estimated post-intervention distributions of health for each intervention.

4. Compare post-intervention distributions using the health equity impact plane

Once post-intervention distributions of health have been estimated for each intervention we can compare them both in terms of their level of average health and in terms of their level of health inequality. Whilst calculating average levels of health in the distributions is straightforward, calculating levels of inequality requires some value judgements to be made. There is a wide range of alternative inequality measures that could be employed each of which captures different aspects of inequality. For example, relative inequality measures would conclude that a health distribution where half the population lives for 40 years and the other half lives for 50 years is just as unequal as a health distribution where half the population lives for 80 years and the other half lives for 100 years. An absolute inequality measure would instead conclude that the equivalence is with a population where half the population lives for 80 years and the other half lives for 90 years.

Two commonly used inequality measures are the Atkinson relative inequality measure and the Kolm absolute inequality measure. These both have the additional feature that they can be calibrated using an inequality aversion parameter to vary the level of priority given to those worst off in the distribution. We will see these inequality aversion parameters in action in the next step of the DCEA process.

Having selected a suitable inequality measure we can plot our post interventions distributions on a health equity impact plane. Let us assume we are comparing two interventions A and B, we can plot intervention A at the origin of the plane and plot intervention B relative to A on the plane. Cookson et al (CC BY 4.0)

If intervention B falls in the north-east quadrant of the health equity impact plane we know it both improves health overall and reduces health inequality relative to intervention A and so intervention B should be selected. If, however, intervention B falls in the south-west quadrant of the health equity impact plane we know it both reduces health and increases health inequality relative to intervention A and so intervention A should be selected. If intervention B falls either in the north-west or south-east quadrants of the health equity impact plane there is no obvious answer as to which intervention should be preferred as there is a trade-off to be made between health equity and total health.

5. Evaluate trade-offs between inequality and efficiency using social welfare functions

We use social welfare functions to trade-off between inequality reduction and average health improvement. These social welfare functions are constructed by combining our chosen measure of inequality with the average health in the distribution. This combination of inequality and average health is used to calculate what is known as an equally distributed equivalent (EDE) level of health. The EDE summarises the health distribution being analysed as one number representing the amount of health that each person in a hypothetically perfectly equal health distribution would need to have for us to be indifferent between the actual health distribution analysed and this perfectly equal health distribution. Where our social welfare function is built around an inequality measure with an inequality aversion parameter this EDE level of health will also be a function of the inequality aversion parameter. Where inequality aversion is set to zero there is no concern for inequality and the EDE simply reflects the average health in the distribution replicating results we would see under standard utilitarian CEA. As the inequality aversion level approaches infinity, our focus becomes increasingly on those worse off in the health distribution until at the limit we reflect the Rawlsian idea of focusing entirely on improving the lot of the worst-off in society. Social welfare functions derived from the Atkinson relative inequality measure and the Kolm absolute inequality measure are given below, with the inequality aversion parameters circled. Research carried out with members of the public in England suggests that suitable values for the Atkinson and Kolm inequality aversion parameters are 10.95 and 0.15 respectively.

 Atkinson Relative Social Welfare Function Kolm Absolute Social Welfare Function    When comparing interventions where one intervention does not simply dominate the others on the health equity impact plane we need to use our social welfare functions to calculate EDE levels of health associated with each of the interventions and then select the intervention that produces the highest EDE level of health. In the example depicted in the figure above we can see that pursuing intervention A results in a health distribution which appears less unequal but has a lower average level of health than the health distribution resulting from intervention B. The choice of intervention, in this case, will be determined by the form of social welfare function selected and the level of inequality this social welfare function is parameterised to embody.

6. Conduct sensitivity analysis on forms of social welfare function and extent of inequality aversion

Given that the conclusions drawn from DCEA may be dependent on the social value judgments made around the inequality measure used and the level of inequality aversion embodied in it, we should present results for a range of alternative social welfare functions parameterised at a range of inequality aversion levels. This will allow decision makers to clearly understand how robust conclusions are to alternative social value judgements. Applications

DCEA is of particular use when evaluating large-scale public health programmes that have an explicit goal of tackling health inequality. It has been applied to the NHS bowel cancer screening programme in England and to the rotavirus vaccination programme in Ethiopia.

Some key limitations of DCEA are that: (1) it currently only analyses programmes in terms of their health impacts whilst large public health programmes often have important impacts across a range of sectors beyond health; and (2) it requires a range of data beyond that required by standard CEA which may not be readily available in all contexts.

For low and middle-income settings an alternative augmented CEA methodology called extended cost effectiveness analysis (ECEA) has been developed to combine estimates of health impacts with estimates of impacts on financial risk protection. More information on ECEA can be found here.

There are ongoing efforts to generalise the DCEA methods to be applied to interventions having impacts across multiple sectors. Follow the latest developments on DCEA at the dedicated website based at the Centre for Health Economics, University of York.

Credit