# Poor statistical communication means poor statistics

Statistics is a broad and complex field. For a given research question, any number of statistical approaches could be taken. In an article published last year, researchers asked 61 analysts to use the same dataset to address the question of whether referees were more likely to give dark-skinned players a red card than light-skinned players. They got 61 different responses. Each analysis had its advantages and disadvantages, and I'm sure each analyst would have defended their work. However, as many statisticians and economists may well know, the merit of an approach is not the only factor that matters in its adoption.

There has, for decades, been criticism of the misunderstanding and misuse of null hypothesis significance testing (NHST). P-values have been a common topic on this blog. Despite this, NHST remains the predominant paradigm for most statistical work. If used appropriately this needn't be a problem, but if it were being used appropriately it wouldn't be used nearly as much: p-values can't perform the inferential role many expect of them. It's not difficult to understand why things are this way: most published work uses NHST, we teach students NHST in order to understand the published work, students become researchers who use NHST, and so on. Part of statistical education involves teaching the arbitrary conventions that have gone before, such as the rules that p-values are 'significant' if below 0.05 or that a study is 'adequately powered' if power is above 80%. One of the most pernicious consequences is that these heuristics become a substitute for thinking. The presence of these key figures is expected, and their absence is often met with a request from reviewers and other readers for their inclusion.

I have argued on this blog and elsewhere for a wider use of Bayesian methods (and less NHST) and I try to practice what I preach. For an ongoing randomised trial I am involved with, I adopted a Bayesian approach to design and analysis. Instead of the usual power calculation, I conducted a Bayesian assurance analysis (which Anthony O’Hagan has written some good articles on for those wanting more information). I’ll try to summarise the differences between ‘power’ and ‘assurance’ calculations by attempting to define them, which is actually quite hard!

Power calculation. If we were to repeat a trial infinitely many times, what sample size would we need so that, in x% of trials, the assumed data generating model produces data falling in the α% most extreme quantiles of the distribution of data that would be produced by the same data generating model with one parameter set to exactly zero (or any equivalent hypothesis)? Typically we set x% to 80% (power) and α% to 5% (the statistical significance threshold).
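For concreteness, a conventional calculation of this kind can be run in base R. The effect size of 0.3 standard deviations below is a hypothetical choice for illustration, not a value from any trial discussed here.

```r
# Conventional power calculation: sample size per arm for 80% power to
# detect a hypothetical standardised effect of 0.3 at the 5% (two-sided)
# significance level, using base R's power.t.test.
res <- power.t.test(delta = 0.3, sd = 1, sig.level = 0.05, power = 0.8)
res$n  # approximately 175 per arm
```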

Assurance calculation. For a given data generating model, what sample size do we need so that there is an x% probability that we will be (1 − α)% certain that the parameter is positive (or any equivalent choice)?

The assurance calculation could be reframed in a decision framework: what sample size do we need so that there is an x% probability that we will make the right decision about whether a parameter is positive (or any equivalent decision), given the costs of making the wrong decision?
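The assurance definition can be sketched as a simple Monte Carlo simulation. The sketch below is mine, not the trial's actual analysis, and makes illustrative assumptions: a two-arm trial with a normally distributed outcome of known standard deviation 1, a normal design prior on the treatment effect, and a vague analysis prior so that the posterior for the effect is approximately normal around the observed difference in means.

```r
# Monte Carlo assurance: the probability, averaging over the design prior,
# that the trial will yield the required posterior certainty that the
# treatment effect is positive.
assurance <- function(n_per_arm, prior_mean = 0.3, prior_sd = 0.15,
                      certainty = 0.95, sims = 1000) {
  wins <- 0
  for (s in 1:sims) {
    # draw a 'true' effect from the design prior
    delta <- rnorm(1, prior_mean, prior_sd)
    # simulate one trial's data
    y1 <- rnorm(n_per_arm, delta, 1)
    y0 <- rnorm(n_per_arm, 0, 1)
    # approximate posterior for the effect: normal around the mean difference
    est <- mean(y1) - mean(y0)
    se <- sqrt(2 / n_per_arm)
    # posterior probability that the effect is positive
    p_pos <- 1 - pnorm(0, mean = est, sd = se)
    if (p_pos >= certainty) wins <- wins + 1
  }
  wins / sims
}
```

Increasing `n_per_arm` until `assurance(n_per_arm)` reaches the desired x% gives the sample size. Unlike a power calculation, the uncertainty about the effect is carried through the design prior rather than fixed at a single assumed value.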

Both of these are complex but I would argue it is the assurance calculation that gives us what we want to know most of the time when designing a trial. The assurance analysis also better represents uncertainty since we specify distributions over all the uncertain parameters rather than choose exact values. Despite this though, the funder of the trial mentioned above, who shall remain nameless, insisted on the results of a power calculation in order to be able to determine whether the trial was worth continuing with because that’s “what they’re used to.”

The main culprit for this issue is, I believe, communication. A simpler explanation with better presentation may have been easier to understand and accept. This is not to say the funder wasn't substituting the heuristic '80% or more power = good' for actually thinking about what we could learn from the trial. But until statisticians, economists, and other data-analytic researchers start communicating better, how can we expect others to listen?

Image credit: Geralt

# Method of the month: constrained randomisation

Once a month we discuss a particular research method that may be of interest to people working in health economics. We’ll consider widely used key methodologies, as well as more novel approaches. Our reviews are not designed to be comprehensive but provide an introduction to the method, its underlying principles, some applied examples, and where to find out more. If you’d like to write a post for this series, get in touch. This month’s method is constrained randomisation.

## Principle

Randomised experimental studies are one of the best ways of estimating the causal effects of an intervention. They have become more and more widely used in economics; Banerjee and Duflo are often credited with popularising them among economists. When done well, randomly assigning a treatment ensures both observable and unobservable factors are independent of treatment status and likely to be balanced between treatment and control units.

Many of the interventions economists are interested in are at a ‘cluster’ level, be it a school, hospital, village, or otherwise. So the appropriate experimental design would be a cluster randomised controlled trial (cRCT), in which the clusters are randomised to treatment or control and individuals within each cluster are observed either cross-sectionally or longitudinally. But, except in cases of large budgets, the number of clusters participating can be fairly small. When randomising a relatively small number of clusters we could by chance end up with a quite severe imbalance in key covariates between trial arms. This presents a problem if we suspect a priori that these covariates have an influence on key outcomes.

One solution to the problem of potential imbalance is covariate-based constrained randomisation. The principle here is to conduct a large number of randomisations, assess the balance of covariates in each one using some balance metric, and then to randomly choose one of the most balanced according to this metric. This method preserves the important random treatment assignment while ensuring covariate balance. Stratified randomisation also has a similar goal, but in many cases may not be possible if there are continuous covariates of interest or too few clusters to distribute among many strata.

## Implementation

Conducting covariate constrained randomisation is straightforward and involves the following steps:

1. Specifying the important baseline covariates on which to balance the clusters. For each cluster $j$ we have $L$ covariates $x_{jl}; \ l=1,...,L$.
2. Characterising each cluster in terms of these covariates, i.e. calculating the $x_{jl}$.
3. Enumerating all potential randomisation schemes, or simulating a large number of them. For each one, we will need to measure the balance of the $x_{jl}$ between trial arms.
4. Selecting a candidate set of randomisation schemes that are sufficiently balanced according to some pre-specified criterion from which we can randomly choose our treatment allocation.

### Balance scores

A key ingredient in the above steps is the balance score. This score needs to be some univariate measure of potentially multivariate imbalance between two (or more) groups. A commonly used score is that proposed by Raab and Butcher:

$\sum_{l=1}^{L} \omega_l (\bar{x}_{1l}-\bar{x}_{0l})^2$

where $\bar{x}_{1l}$ and $\bar{x}_{0l}$ are the mean values of covariate $l$ in the treatment and control groups respectively, and $\omega_l$ is some weight, often the inverse standard deviation of the covariate. Conceptually, the score is a weighted sum of squared differences in means, so lower values indicate greater balance. But other scores would also work. Indeed, any statistic that measures the distance between the distributions of two variables could be used and summed over the covariates. This could include the maximum distance:

$\max_l |\bar{x}_{1l} - \bar{x}_{0l}|$

the Manhattan distance:

$\sum_{l=1}^{L} |\bar{x}_{1l}-\bar{x}_{0l}|$

or even the Symmetrised Bayesian Kullback-Leibler divergence (I can’t be bothered to type this one out). Grischott has developed a Shiny application to estimate all these distances in a constrained randomisation framework, detailed in this paper.

Things become more complex if there are more than two trial arms. All of the above scores are only able to compare two groups. However, there already exist a number of univariate measures of multivariate balance in the form of MANOVA (multivariate analysis of variance) test statistics. For example, if we have $G$ trial arms and let $X_{jg} = \left[ x_{jg1},...,x_{jgL} \right]'$ then the between group covariance matrix is:

$B = \sum_{g=1}^G N_g(\bar{X}_{.g} - \bar{X}_{..})(\bar{X}_{.g} - \bar{X}_{..})'$

and the within group covariance matrix is:

$W = \sum_{g=1}^G \sum_{j=1}^{N_g} (X_{jg}-\bar{X}_{.g})(X_{jg}-\bar{X}_{.g})'$

which we can use in a variety of statistics including Wilks’ Lambda, for example:

$\Lambda = \frac{\det(W)}{\det(W+B)}$
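As an illustration (my own sketch, not code from any published trial), Wilks' Lambda can be computed for a candidate multi-arm allocation directly from the definitions of $B$ and $W$ above. Values near 1 indicate balance, so the most balanced schemes are those with the largest $\Lambda$.

```r
# Wilks' Lambda for a candidate allocation: 'covs' is an N x L matrix of
# cluster-level covariates, 'arm' a vector giving each cluster's arm (1..G).
wilks_lambda <- function(covs, arm) {
  grand_mean <- colMeans(covs)
  L <- ncol(covs)
  B <- matrix(0, L, L)  # between-group covariance matrix
  W <- matrix(0, L, L)  # within-group covariance matrix
  for (g in unique(arm)) {
    Xg <- covs[arm == g, , drop = FALSE]
    mg <- colMeans(Xg)
    B <- B + nrow(Xg) * tcrossprod(mg - grand_mean)
    W <- W + crossprod(sweep(Xg, 2, mg))
  }
  det(W) / det(W + B)
}
```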

No trial has previously used covariate constrained randomisation with multiple groups, as far as I am aware, but this is the subject of an ongoing paper investigating these scores – so watch this space!

Once the scores have been calculated for all possible schemes or a very large number of possible schemes, we select from among those which are most balanced. The most balanced are defined according to some quantile of the balance score, say the top 15%.

As a simple simulated example of how this might be coded in R, let’s consider a trial of 8 clusters with two standard-normally distributed covariates. We’ll use the Raab and Butcher score from above:

```r
# simulate the covariates
n <- 8
x1 <- rnorm(n)
x2 <- rnorm(n)
x <- matrix(c(x1, x2), ncol = 2)

# enumerate all possible schemes - you'll need the partitions package here
schemes <- partitions::setparts(c(n/2, n/2))

# write a function that will estimate the score for each scheme,
# which we can apply over our set of schemes
balance_score <- function(scheme, covs){
  treat.idx <- I(scheme == 2)
  control.idx <- I(scheme == 1)
  treat.means <- apply(covs[treat.idx, ], 2, mean)
  control.means <- apply(covs[control.idx, ], 2, mean)
  cov.sds <- apply(covs, 2, sd)
  # Raab-Butcher score
  score <- sum((treat.means - control.means)^2 / cov.sds)
  return(score)
}

# apply the function
scores <- apply(schemes, 2, function(i) balance_score(i, x))
# find top 15% of schemes (lowest scores)
scheme.set <- which(scores <= quantile(scores, 0.15))
# choose one at random
scheme.number <- sample(scheme.set, 1)
scheme.chosen <- schemes[, scheme.number]
```

## Analyses

A common approach to analysing cluster trials is to estimate a mixed model, i.e. a hierarchical model with cluster-level random effects. Two key questions are whether to control for the covariates used in the randomisation, and which test to use for treatment effects. Fan Li has two great papers answering these questions for linear models and binomial models. One key conclusion is that appropriate type I error rates are only achieved in models adjusted for the covariates used in the randomisation. For non-linear models, type I error rates can be way off for many estimators, especially with small numbers of clusters, which is often the reason for doing constrained randomisation in the first place, so a careful choice is needed here. If in doubt, I would recommend adjusted permutation tests to ensure appropriate type I error rates. Of course, one could take a Bayesian approach to the analysis, although I'm not aware of any assessment of how such models perform for these analyses (another case of "watch this space!").
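A minimal sketch of an adjusted permutation test is given below, assuming a continuous cluster-level outcome and a simple linear model (the names are hypothetical). With constrained randomisation, the reference distribution should be built from the constrained set of acceptable allocations rather than from all permutations.

```r
# Adjusted permutation test: 'y' holds cluster-level outcomes, 'treat' the
# realised 0/1 allocation, 'covs' the covariates used in the randomisation,
# and 'perms' a matrix whose columns are allocations (coded 1/2) from the
# constrained set.
perm_test <- function(y, treat, covs, perms) {
  # observed covariate-adjusted treatment effect
  obs <- coef(lm(y ~ treat + covs))["treat"]
  # re-estimate the effect under each allocation in the constrained set
  null_dist <- apply(perms, 2, function(s) {
    p <- as.numeric(s == 2)
    coef(lm(y ~ p + covs))["p"]
  })
  # two-sided p-value: proportion of effects at least as extreme as observed
  mean(abs(null_dist) >= abs(obs))
}
```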

## Application

Many trials have used this procedure, and listing even a fraction would be a daunting task. But I would be remiss not to note a trial of my own that uses covariate constrained randomisation. It is investigating the effect of providing an incentive to small and medium-sized enterprises to adhere to a workplace well-being programme. There are good applications used as examples in Fan Li's papers mentioned above. A trial that featured in a journal round-up in February used covariate constrained randomisation to balance a very small number of clusters in a trial of a medicines access programme in Kenya.


# Chris Sampson’s journal round-up for 19th February 2018

Every Monday our authors provide a round-up of some of the most recently published peer reviewed articles from the field. We don’t cover everything, or even what’s most important – just a few papers that have interested the author. Visit our Resources page for links to more journals or follow the HealthEconBot. If you’d like to write one of our weekly journal round-ups, get in touch.

Value of information methods to design a clinical trial in a small population to optimise a health economic utility function. BMC Medical Research Methodology [PubMed] Published 8th February 2018

Statistical significance – whatever you think of it – and the 'power' of clinical trials to detect change are important factors in clinical decision-making. Trials are designed to be big enough to detect 'statistically significant' differences. But in the context of rare diseases, this can be nigh-on impossible: in theory, the required sample size could exceed the size of the whole population. This paper describes an alternative method for determining sample sizes for trials in this context, couched in a value of information framework. Generally speaking, power calculations ignore the 'value' or 'cost' associated with errors, while a value of information analysis takes this into account and allows accepted error rates to vary accordingly. The starting point for this study is the notion that sample sizes should take into account the size of the population to which the findings will be applicable. As such, sample sizes can be defined on the basis of maximising the expected (societal) utility associated with the conduct of the trial (whether or not the intervention is approved). The authors describe the basis for hypothesis testing within this framework and specify the utility function to be maximised. Honestly, I didn't completely follow the stats notation in this paper, but that's OK – the trial statisticians will get it. A case study application is presented from the context of treating children with severe haemophilia A, which demonstrates the potential to optimise utility according to sample size. The key point is that the power is much smaller than would be required by conventional methods, and the sample size accordingly reduced. The authors also demonstrate the tendency for the optimal trial sample size to increase with the size of the population. This Bayesian approach at least partly undermines the frequentist basis on which 'power' is usually determined. So one issue is whether regulators will accept this as a basis for defining a trial that will determine clinical practice. But regulators are increasingly willing to allow for special cases, and it seems that the context of rare diseases could be a way in for Bayesian trial design of this sort.

EQ-5D-5L: smaller steps but a major step change? Health Economics [PubMed] Published 7th February 2018

This editorial was doing the rounds on Twitter last week. European (and Canadian) health economists love talking about the EQ-5D-5L. The editorial features in the edition of Health Economics that hosts the 5L value set for England, which – 2 years on – has finally satisfied the vagaries of academic publication. The authors provide a summary of what's 'new' with the 5L, and why it matters. But we've probably all figured that out by now anyway. More interestingly, the editorial points out some remaining concerns with the use of the EQ-5D-5L in England (even if it is way better than the EQ-5D-3L and its 25-year-old value set). For example, there is some clustering in the valuations that might reflect bias or problems with the technique and – even if they're accurate – present difficulties for analysts. And there are also uncertain implications for decision-making that could systematically favour or disfavour particular treatments or groups of patients. On this basis, the authors support NICE's decision to 'pause' and await independent review. I tend to disagree, for reasons that I can't fit in this round-up, so come back tomorrow for a follow-up blog post.

Factors influencing health-related quality of life in patients with Type 1 diabetes. Health and Quality of Life Outcomes [PubMed] Published 2nd February 2018

Diabetes and its complications can impact upon almost every aspect of a person’s health. It isn’t clear what aspects of health-related quality of life might be amenable to improvement in people with Type 1 diabetes, or which characteristics should be targeted. This study looks at a cohort of trial participants (n=437) and uses regression analyses to determine which factors explain differences in health-related quality of life at baseline, as measured using the EQ-5D-3L. Age, HbA1c, disease duration and being obese all significantly influenced EQ-VAS values, while self-reported mental illness and unemployment status were negatively associated with EQ-5D index scores. People who were unemployed were more likely to report problems in the mobility, self-care, and pain/discomfort domains. There are some minor misinterpretations in the paper (divining a ‘reduction’ in scores from a cross-section, for example). And the use of standard linear regression models is questionable given the nature of EQ-5D-3L index values. But the findings demonstrate the importance of looking beyond the direct consequences of a disease in order to identify the causes of reduced health-related quality of life. Getting people back to work could be more effective than most health care as a means of improving health-related quality of life.

Financial incentives for chronic disease management: results and limitations of 2 randomized clinical trials with New York Medicaid patients. American Journal of Health Promotion [PubMed] Published 1st February 2018

Chronic diseases require (self-)management, but it isn't always easy to ensure that patients adhere to the medication or lifestyle changes that could improve health outcomes. This study looks at the effectiveness of financial incentives in the context of diabetes and hypertension. The data are drawn from 2 RCTs (n=1879) which, together, considered 3 types of incentive – process-based, outcome-based, or a combination of the two – compared with no financial incentives. Process-based incentives rewarded participants for attending primary care or endocrinologist appointments and filling their prescriptions, up to a maximum of $250. Outcome-based incentives rewarded up to $250 for achieving target reductions in systolic blood pressure or blood glucose levels. The combined arms could receive both rewards up to the same maximum of $250. In short, none of the financial incentives made any real difference. But generally speaking, at 6-month follow-up, the movement was in the right direction, with average blood pressure and blood glucose levels tending to fall in all arms. It's not often that authors include the word 'limitations' in the title of a paper, but it's the limitations that are most interesting here. One key difficulty is that most of the participants had relatively acceptable levels of the target outcomes at baseline, meaning that they may already have been managing their disease well and there may not have been much room for improvement. It would be easy to interpret these findings as showing that – generally speaking – financial incentives aren't effective. But the study is more useful as a way of demonstrating the circumstances in which we can expect financial incentives to be ineffective, supporting better-informed targeting of future programmes.
