# Method of the month: constrained randomisation

Once a month we discuss a particular research method that may be of interest to people working in health economics. We’ll consider widely used key methodologies, as well as more novel approaches. Our reviews are not designed to be comprehensive but provide an introduction to the method, its underlying principles, some applied examples, and where to find out more. If you’d like to write a post for this series, get in touch. This month’s method is constrained randomisation.

## Principle

Randomised experimental studies are one of the best ways of estimating the causal effects of an intervention. They have become more and more widely used in economics; Banerjee and Duflo are often credited with popularising them among economists. When done well, randomly assigning a treatment ensures both observable and unobservable factors are independent of treatment status and likely to be balanced between treatment and control units.

Many of the interventions economists are interested in are at a ‘cluster’ level, be it a school, hospital, village, or otherwise. So the appropriate experimental design would be a cluster randomised controlled trial (cRCT), in which the clusters are randomised to treatment or control and individuals within each cluster are observed either cross-sectionally or longitudinally. But, except in cases of large budgets, the number of clusters participating can be fairly small. When randomising a relatively small number of clusters we could by chance end up with a quite severe imbalance in key covariates between trial arms. This presents a problem if we suspect a priori that these covariates have an influence on key outcomes.

One solution to the problem of potential imbalance is covariate-based constrained randomisation. The principle here is to conduct a large number of randomisations, assess the balance of covariates in each one using some balance metric, and then to randomly choose one of the most balanced according to this metric. This method preserves the important random treatment assignment while ensuring covariate balance. Stratified randomisation also has a similar goal, but in many cases may not be possible if there are continuous covariates of interest or too few clusters to distribute among many strata.

## Implementation

Conducting covariate constrained randomisation is straightforward and involves the following steps:

1. Specifying the important baseline covariates to balance the clusters on. For each cluster $j$ we have $L$ covariates $x_{jl}; l=1,...,L$.
2. Characterising each cluster in terms of these covariates, i.e. creating the $x_{jl}$.
3. Enumerating all potential randomisation schemes or simulating a large number of them. For each one, we will need to measure the balance of the $x_{jl}$ between trial arms.
4. Selecting a candidate set of randomisation schemes that are sufficiently balanced according to some pre-specified criterion from which we can randomly choose our treatment allocation.

### Balance scores

A key ingredient in the above steps is the balance score. This score needs to be some univariate measure of potentially multivariate imbalance between two (or more) groups. A commonly used score is that proposed by Raab and Butcher:

$\sum_{l=1}^{L} \omega_l (\bar{x}_{1l}-\bar{x}_{0l})^2$

where $\bar{x}_{1l}$ and $\bar{x}_{0l}$ are the mean values of covariate $l$ in the treatment and control groups respectively, and $\omega_l$ is some weight, which is often the inverse standard deviation of the covariate. Conceptually the score is a weighted sum of squared differences in means, so lower values indicate greater balance. But other scores would also work. Indeed, any statistic that measures the distance between the distributions of two variables would work and could be summed up over the covariates. This could include the maximum distance:

$\max_l |\bar{x}_{1l} - \bar{x}_{0l}|$

the Manhattan distance:

$\sum_{l=1}^{L} |\bar{x}_{1l}-\bar{x}_{0l}|$

or even the Symmetrised Bayesian Kullback-Leibler divergence (I can’t be bothered to type this one out). Grischott has developed a Shiny application to estimate all these distances in a constrained randomisation framework, detailed in this paper.
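As a rough illustration, the maximum and Manhattan distances above could be coded along the following lines. This is only a sketch: the function names (`mean_diffs`, `max_score`, `manhattan_score`) are made up for this example, arms are labelled 0 and 1 to match the subscripts in the formulas, and the covariates are used unstandardised exactly as the formulas are written.

```r
# Hypothetical helper: difference in covariate means between the two arms.
# `arm` is a vector of 0/1 labels, `covs` a clusters-by-covariates matrix.
mean_diffs <- function(arm, covs) {
  colMeans(covs[arm == 1, , drop = FALSE]) -
    colMeans(covs[arm == 0, , drop = FALSE])
}

# Maximum absolute difference in means across covariates
max_score <- function(arm, covs) max(abs(mean_diffs(arm, covs)))

# Manhattan distance: sum of absolute differences in means
manhattan_score <- function(arm, covs) sum(abs(mean_diffs(arm, covs)))

# Toy data: 4 clusters, 2 covariates (values chosen by hand)
covs <- matrix(c(1, 2, 3, 4,
                 0, 0, 2, 2), ncol = 2)
arm <- c(1, 1, 0, 0)

max_score(arm, covs)        # both covariates differ by 2, so this is 2
manhattan_score(arm, covs)  # 2 + 2 = 4
```

As with the Raab and Butcher score, lower values indicate greater balance. If the covariates are on very different scales, it may be sensible to standardise them before computing either score.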

Things become more complex if there are more than two trial arms. All of the above scores are only able to compare two groups. However, there already exist a number of univariate measures of multivariate balance in the form of MANOVA (multivariate analysis of variance) test statistics. For example, if we have $G$ trial arms and let $X_{jg} = \left[ x_{jg1},...,x_{jgL} \right]'$ then the between group covariance matrix is:

$B = \sum_{g=1}^G N_g(\bar{X}_{.g} - \bar{X}_{..})(\bar{X}_{.g} - \bar{X}_{..})'$

and the within group covariance matrix is:

$W = \sum_{g=1}^G \sum_{j=1}^{N_g} (X_{jg}-\bar{X}_{.g})(X_{jg}-\bar{X}_{.g})'$

which we can use in a variety of statistics including Wilks’ Lambda, for example:

$\Lambda = \frac{\det(W)}{\det(W+B)}$

No trial has previously used covariate constrained randomisation with multiple groups, as far as I am aware, but this is the subject of an ongoing paper investigating these scores – so watch this space!
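To make the definitions of $B$, $W$ and $\Lambda$ concrete, here is a minimal sketch in base R. The function name `wilks_lambda` and the simulated data are assumptions for illustration only, and this is not taken from the ongoing paper mentioned above.

```r
# Hypothetical helper: Wilks' Lambda as a balance score for G trial arms.
# `arm` holds the arm label of each cluster, `covs` is clusters-by-covariates.
wilks_lambda <- function(arm, covs) {
  covs <- as.matrix(covs)
  grand_mean <- colMeans(covs)
  L <- ncol(covs)
  B <- matrix(0, L, L)  # between-group matrix
  W <- matrix(0, L, L)  # within-group matrix
  for (g in unique(arm)) {
    Xg <- covs[arm == g, , drop = FALSE]
    mean_g <- colMeans(Xg)
    d <- mean_g - grand_mean
    B <- B + nrow(Xg) * tcrossprod(d)   # N_g * d d'
    centred <- sweep(Xg, 2, mean_g)     # subtract the arm means
    W <- W + crossprod(centred)
  }
  det(W) / det(W + B)
}

# Simulated example: 9 clusters, 2 covariates, 3 arms
set.seed(1)
covs <- matrix(rnorm(18), ncol = 2)
arm <- rep(1:3, each = 3)
wilks_lambda(arm, covs)
```

Note the direction: when the arm means coincide, $B$ is zero and $\Lambda = 1$, so, unlike the two-group scores above, values closer to one indicate better balance. The calculation also needs enough clusters per arm for $W$ to be non-singular.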

Once the scores have been calculated for all possible schemes or a very large number of possible schemes, we select from among those which are most balanced. The most balanced are defined according to some quantile of the balance score, say the top 15%.

As a simple simulated example of how this might be coded in R, let’s consider a trial of 8 clusters with two standard-normally distributed covariates. We’ll use the Raab and Butcher score from above:

```r
# simulate the covariates
n <- 8
x1 <- rnorm(n)
x2 <- rnorm(n)
x <- matrix(c(x1, x2), ncol = 2)

# enumerate all possible schemes - you'll need the partitions package here
schemes <- partitions::setparts(c(n/2, n/2))

# write a function that will estimate the score for each scheme,
# which we can apply over our set of schemes
balance_score <- function(scheme, covs) {
  treat.idx <- scheme == 2
  control.idx <- scheme == 1
  treat.means <- colMeans(covs[treat.idx, , drop = FALSE])
  control.means <- colMeans(covs[control.idx, , drop = FALSE])
  cov.sds <- apply(covs, 2, sd)
  # Raab-Butcher score
  score <- sum((treat.means - control.means)^2 / cov.sds)
  return(score)
}

# apply the function
scores <- apply(schemes, 2, function(i) balance_score(i, x))
# find top 15% of schemes (lowest scores)
scheme.set <- which(scores <= quantile(scores, 0.15))
# choose one at random
scheme.number <- sample(scheme.set, 1)
scheme.chosen <- schemes[, scheme.number]
```

### Analyses

A commonly used method of cluster trial analysis is to estimate a mixed model, i.e. a hierarchical model with cluster-level random effects. Two key questions are whether to control for the covariates used in the randomisation, and which test to use for treatment effects. Fan Li has two great papers answering these questions for linear models and binomial models. One key conclusion is that the appropriate type I error rates are only achieved in models adjusted for the covariates used in the randomisation. For non-linear models, type I error rates can be way off for many estimators, especially with small numbers of clusters, which is often the reason for doing constrained randomisation in the first place, so a careful choice is needed here. If in doubt, I would recommend adjusted permutation tests to ensure the appropriate type I error rates. Of course, one could take a Bayesian approach to analysis, although I’m not aware of any evaluation of how these models perform in this setting (another case of “watch this space!”).
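To give a flavour of what an adjusted permutation test might look like, here is a minimal cluster-level sketch in base R. It rests on several assumptions that are mine rather than the papers': the outcome is a single simulated summary per cluster, adjustment is done by taking residuals from a linear model that omits treatment, and the reference distribution is built only from the candidate set of sufficiently balanced schemes. All object names are hypothetical.

```r
set.seed(42)
n <- 8
x <- matrix(rnorm(n * 2), ncol = 2)

# All allocations of 4 of the 8 clusters to treatment (70 schemes, base R only)
treated_sets <- combn(n, n / 2)
schemes <- apply(treated_sets, 2,
                 function(idx) as.integer(seq_len(n) %in% idx))

# Raab and Butcher-style score with inverse standard deviation weights
balance_score <- function(arm, covs) {
  d <- colMeans(covs[arm == 1, , drop = FALSE]) -
    colMeans(covs[arm == 0, , drop = FALSE])
  sum(d^2 / apply(covs, 2, sd))
}
scores <- apply(schemes, 2, balance_score, covs = x)
candidate <- schemes[, scores <= quantile(scores, 0.15), drop = FALSE]

# Randomly choose the allocation actually used in the trial
chosen <- candidate[, sample(ncol(candidate), 1)]

# Hypothetical cluster-level outcomes with a treatment effect of 1
y <- 1 * chosen + as.vector(x %*% c(0.5, 0.5)) + rnorm(n, sd = 0.5)

# Adjust for the covariates: residuals from a model WITHOUT treatment
res <- residuals(lm(y ~ x))

# Test statistic: difference in mean residuals between arms
stat <- function(arm) mean(res[arm == 1]) - mean(res[arm == 0])
observed <- stat(chosen)
reference <- apply(candidate, 2, stat)

# Two-sided p-value over the constrained randomisation space
p_value <- mean(abs(reference) >= abs(observed))
p_value
```

One design point worth noting: because the reference distribution only contains the candidate schemes, the p-value can never be smaller than one over the number of candidate schemes. With very few clusters and a tight balance criterion, that floor may sit above conventional significance thresholds.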

## Application

There are many trials that have used this procedure and listing even a fraction would be a daunting task. But I would be remiss not to mention a trial of my own that uses covariate constrained randomisation. It is investigating the effect of providing an incentive to small and medium sized enterprises to adhere to a workplace well-being programme. There are good applications used as examples in Fan Li’s papers mentioned above. A trial that featured in a journal round-up in February used covariate constrained randomisation to balance a very small number of clusters in a trial of a medicines access programme in Kenya.


# Method of the month: Eye-tracking

This month’s method is eye-tracking.

## Principles

Eye-tracking methods can be used to analyse how individuals acquire information and how they make decisions. The method has been extensively used by psychologists in a variety of applications, from identifying cases of dyslexia in children to testing aviation pilots’ awareness. It was made popular by Keith Rayner, but its growing use reflects changes in the availability and affordability of technology. The textbook ‘Eye Tracking: A Comprehensive Guide to Methods and Measures’ provides a great introduction and complements the course offered by Lund University.

Eye-tracking analyses typically depend on the ‘eye-mind hypothesis’ which states “there is no appreciable lag between what is fixated on and what is processed”. In addition to fixing their gaze, individuals make rapid eye movements called ‘saccades’ when they are searching for information or items of interest. There is also research into ‘pupillometry’ which relates pupil size to cognitive burden, where ‘hard’ tasks are hypothesised to cause dilation.

There is a growing interest in using quantitative elicitation methods to understand individuals’ preferences for healthcare goods or services. Quantitative methods, such as the standard gamble, time trade-off, contingent valuation or discrete choice experiments, often employ surveys which are increasingly self-completed and administered online. These valuation methods are often underpinned by economic theories either for utility (random utility theory, expected utility theory) or, in the case of attribute-based approaches, Lancaster’s Theory of consumer demand. If respondents do not answer in line with the supporting theories, the valuations derived from the survey data may be biased. Therefore, various approaches have been employed to understand whether people complete surveys in line with the analysts’ expectations – from restricted models to test for attribute non-attendance to qualitative ‘think-aloud’ interviews. Eye-tracking offers an alternative method of testing these hypotheses.

## Implementation

The research question will dictate the eye-tracking study design. If it’s a reading study or a survey, then accuracy will be key. If the experiment seeks to understand how participants respond to large visual stimuli, then it may be preferable to have comfortable equipment which can be used in a setting the participant is familiar with.

### Equipment

In its most basic form, eye-tracking research has involved researcher-individual observation of a participant’s eyes and manual notes on pupil dilation. However, more sophisticated methods have since developed in line with changes to, and availability of, technology. In the 1950s, magnetic search coils were used to track people’s eye movements, which involved placing two coils on the eye, with one circling the iris on a contact lens. Nowadays, most eye-tracking involves less invasive equipment, commonly with a camera recording data on a computer and complex algorithms to calculate the location of the individual’s gaze.

To track eyes, almost all modern devices record the corneal reflection on a camera positioned towards the individual’s pupil. The corneal reflection is a glint, usually in the iris, which allows the machine to calculate the direction of the gaze using the distance from (1) the camera to the eye and (2) the eye to the screen. From the corneal reflection, the X and Y (horizontal and vertical) coordinates, which provide the location of current focus on the screen, are then recorded. The number of times this is logged a second is referred to as the speed (or ‘frequency’) of the tracker. As the eye moves from one position to another, the magnitude of the movement is measured in visual degrees (θ), rather than millimetres, as studies may involve moving stimuli and so the distance between eye and object could change.
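To see why visual degrees are preferred to millimetres, it can help to compute the angle directly. The sketch below uses the standard geometric relationship $\theta = 2\arctan\left(\frac{s}{2d}\right)$ for an object of size $s$ viewed head-on from distance $d$; the function name `visual_angle` and the example distances are assumptions for illustration.

```r
# Hypothetical helper: visual angle (in degrees) subtended by an on-screen
# distance `size_mm` when viewed head-on from `distance_mm`.
visual_angle <- function(size_mm, distance_mm) {
  2 * atan(size_mm / (2 * distance_mm)) * 180 / pi
}

# A 10 mm movement viewed from 600 mm is just under one degree
visual_angle(10, 600)

# Doubling both the movement and the viewing distance gives the same angle
visual_angle(20, 1200)
```

The same movement in millimetres therefore corresponds to a different angle whenever the viewing distance changes, which is why the angle is the more comparable unit.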

Popular manufacturers include Tobii, SensoMotoric Instruments (SMI) and SR Research. Eye-trackers are usually distinguished by their speed and, as a general rule, a ‘good’ eye-tracker has a high frequency and high-resolution camera. A higher frequency allows a more accurate estimation of the fixation duration, as the start of the fixation is revealed earlier and the end revealed later. There is some consensus that a sampling frequency of 500 Hz is sufficiently powerful to accurately determine fixations and saccades. Another determinant of a ‘good’ eye-tracking device is its ‘latency’, which is the time taken for the computer to make a recording. A substantial volume of processing from the headset to screen to recording is required and, for some devices, there is a measurable delay in this process.

There are three broad categories of modern eye-tracking devices.

#### Head-mounted

Head-mounted eye-trackers, such as smart glasses or helmet cameras, offer participants some freedom but are harder to calibrate and can be cumbersome to wear. These eye-trackers are often used to understand how objects are attended to in a dynamic situation, for example, whilst the participant is engaged in a shopping activity.

#### Remote

Remote eye-trackers let the participant move freely but in front of a screen, with algorithms used to detect non-eye movements. However, the additional calculations to distinguish head and eye movements are a burden on the processing capacity of the computer and generally result in a lower frequency and, as a consequence, decreased precision.

#### Head-supported towers

Head-supported towers involve the use of a forehead and chin-rest. Whilst being contactless, these can be uncomfortable and unnatural for some participants. These devices are also often immobile, due to their heavy processing power, and require stability of the head because of their high frequency. However, head-supported towers are the most accurate and precise equipment available for researchers. For studies where the individual is not required to move and the stimuli are stationary (such as a survey), a head-supported tower eye-tracker provides the best quality data. Head-supported towers also offer the most accurate recording of pupil size.

### Data collection

Data collection will likely occur in a university lab if a head-supported tower tracker is chosen. Head-mounted and remote trackers are generally mobile and can, therefore, travel to people of interest.

Tracking devices can either record both eyes (binocular) or a single eye (monocular). When both eyes are recorded, an average of the horizontal and vertical coordinates from each eye is taken. However, most people have an ‘active’ and a ‘lazy’ eye, and the literature suggests that only the active, dominant eye should be tracked. If a participant performs poorly in the calibration, then the other eye should be tried.

It is crucial that the eye-tracker is calibrated for each individual to ensure the eye-tracker is recording correctly. The calibration procedure involves collecting fixation data from simple points on the screen in order to ascertain the true gaze position of the individual before the experiment begins. The points are often shown as dots or crosses which move around the screen whilst fixation data are collected. A test of the calibration can be conducted by re-running the sequence and comparing the secondary fixations to the tracker’s prediction based on the first calibration data.
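A test of the calibration amounts to measuring how far the recorded gaze falls from each known target. The sketch below does this in screen pixels for five made-up validation points; the coordinates are invented for illustration, and what counts as an acceptable offset depends on the equipment and study, so no threshold is suggested here.

```r
# Hypothetical validation data: target positions and recorded gaze (pixels)
targets <- data.frame(x = c(100, 960, 1820, 100, 1820),
                      y = c(100, 540, 100, 980, 980))
gaze <- data.frame(x = c(103, 960, 1816, 100, 1820),
                   y = c(104, 545, 103, 980, 968))

# Euclidean distance between each target and the recorded gaze position
offset <- sqrt((gaze$x - targets$x)^2 + (gaze$y - targets$y)^2)

mean_offset <- mean(offset)        # average error across the screen
worst_point <- which.max(offset)   # the target with the largest error
```

In this toy example the largest offset is at a corner target, which is the kind of pattern that would prompt a recalibration or a check of the participant's position.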

The calibration should involve points in all corners of the screen to ensure that the tracker is able to record in all areas. In the corners and edges, the corneal reflection can disappear, which therefore invalidates the computer’s calculations as well as resulting in missing data. Similarly, for individuals with visual aids (glasses, contact lenses) or heavy eye-makeup, the far corners can often induce another reflection which may confuse the recording and create anomalous data.

If a respondent is completing a survey, between-page calibration called ‘drift correction’ can also be completed. In this procedure, a small dot is presented in the centre of the screen and the next page appears once the participant has focussed on the spot. If there has been too much movement, the experiment will not progress and the tracker must be recalibrated.

### Data analysis

Saccades are easily identifiable as the eye moves quickly in response to or in search of visual ‘stimuli’ or objects of interest. Saccadic behaviour rarely indicates information processing as the movements are so rapid that the brain is unable to consciously realise everything that is scanned, a process known as ‘saccadic suppression’. Instead, saccades most often represent a search for information. Saccades are distinctly different to ‘micro-saccades’, which are involuntary movements whilst an individual is attempting to fixate, and the involuntary movements which occur when an individual blinks. Blinks are quite easily identifiable from regular saccades as they are immediately followed by a missing pupil image on the camera as the eyelid closes.

What constitutes a fixation varies from study to study and is dependent on the stimulus presented. For example, a familiar picture may be processed quicker than text, and a new diagram may be somewhere in between. Although complex algorithms exist for the identification of fixations in eye-tracking data, most studies define a threshold for a fixation as less than one degree of movement (a measure of distance) for between 50 and 200 milliseconds. Aggregation of the total time spent fixating, including recurrent fixations, is defined as the ‘dwell time’ to a stimulus.
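A simple threshold rule of this kind can be sketched as a dispersion-based detector. The version below is an assumption-laden illustration rather than any manufacturer's algorithm: gaze positions are taken to be already in visual degrees, dispersion is measured as the horizontal range plus the vertical range within a window, and the 100 ms minimum duration is just one value from the 50 to 200 ms band. The function name `detect_fixations` is hypothetical.

```r
# Hypothetical dispersion-threshold fixation detector.
# x, y: gaze position in visual degrees; hz: sampling frequency of the tracker.
detect_fixations <- function(x, y, hz, max_disp = 1, min_dur_ms = 100) {
  dispersion <- function(i, j) diff(range(x[i:j])) + diff(range(y[i:j]))
  min_len <- ceiling(min_dur_ms * hz / 1000)  # samples in the minimum window
  n <- length(x)
  out <- list()
  start <- 1
  while (start + min_len - 1 <= n) {
    end <- start + min_len - 1
    if (dispersion(start, end) <= max_disp) {
      # grow the window while the samples stay within the threshold
      while (end < n && dispersion(start, end + 1) <= max_disp) end <- end + 1
      out[[length(out) + 1]] <- data.frame(
        start = start, end = end,
        duration_ms = (end - start + 1) * 1000 / hz,
        x = mean(x[start:end]), y = mean(y[start:end])
      )
      start <- end + 1
    } else {
      start <- start + 1  # slide the window past the moving sample
    }
  }
  if (length(out) == 0) return(data.frame())
  do.call(rbind, out)
}

# Toy recording at 500 Hz: a fixation, a saccade, then a second fixation
x <- c(rep(5, 100), seq(7, 13, length.out = 10), rep(15, 150))
y <- rep(5, length(x))
fix <- detect_fixations(x, y, hz = 500)
fix$duration_ms       # 200 ms and 300 ms
sum(fix$duration_ms)  # total fixation time of 500 ms
```

Real recordings are noisy, so in practice the thresholds would need tuning to the stimulus and equipment, as the paragraph above suggests.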

Eye-tracking data provide a highly detailed record of all the locations that a user has looked at, so reducing these data to a level that can be easily analysed is challenging. One common approach in the analysis of eye-tracking data involves segmenting coordinates to defined regions or ‘areas of interest’ (AOI). AOI can be defined either prior to the experiment or post-experimentally once eye-movement data have been collected.
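As an illustration of the AOI approach, the sketch below assigns fixations to rectangular regions and sums fixation durations within each to give a dwell time per AOI. The AOI names, pixel coordinates and fixation data are all invented for this example; in a choice experiment the rectangles might correspond to attribute cells.

```r
# Hypothetical AOIs: rectangles in screen pixels
aoi <- data.frame(
  name = c("attribute_cost", "attribute_risk"),
  xmin = c(100, 100), xmax = c(500, 500),
  ymin = c(100, 400), ymax = c(300, 600),
  stringsAsFactors = FALSE
)

# Hypothetical fixations: centre coordinates and durations
fixations <- data.frame(
  x = c(150, 300, 450, 800),
  y = c(200, 500, 250, 700),
  duration_ms = c(180, 220, 90, 300)
)

# Return the name of the AOI containing a point, or NA if none does
assign_aoi <- function(x, y, aoi) {
  hit <- which(x >= aoi$xmin & x <= aoi$xmax &
               y >= aoi$ymin & y <= aoi$ymax)
  if (length(hit) == 0) NA_character_ else aoi$name[hit[1]]
}

fixations$aoi <- mapply(assign_aoi, fixations$x, fixations$y,
                        MoreArgs = list(aoi = aoi))

# Dwell time per AOI: total fixation duration, including repeat visits
dwell <- tapply(fixations$duration_ms, fixations$aoi, sum)
dwell
```

Fixations that fall outside every AOI are returned as `NA` and drop out of the dwell-time totals, so it is worth checking how many there are before interpreting the results.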

Another approach to reducing the data is the generation of a ‘scan path’ describing the overall sequence of movements in terms of both saccades and fixations of a respondent, either imposed on a background image of the stimulus or as a colour-coded heat map.

Pupil size can be more difficult to interpret and analyse. Measurement of the pupil differs by equipment: some use an ellipse whereas others count the number of black pixels on the camera image of the eye. Pupil dilation can be calculated as the difference in pupil size; however, analysing this as a percentage increase can cause inflated estimates when the baseline pupil size is small. Pupil size can also rarely be compared across studies as it is highly affected by equipment set-up and the luminosity of the setting.
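A small numerical example shows how the percentage scale can mislead. The baseline diameters below are made up; the point is only that an identical absolute dilation looks twice as large in relative terms when the baseline is half the size.

```r
# Hypothetical baseline pupil diameters (mm) for two participants
baseline <- c(2.0, 4.0)
task <- baseline + 0.5  # the same absolute dilation in both cases

absolute_change <- task - baseline
percent_change <- 100 * absolute_change / baseline

absolute_change  # 0.5 mm for both
percent_change   # 25% versus 12.5%
```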

### Software

Many eye-trackers come with manufacturer-written software for programming the experiment (such as the EyeLink Experiment Builder). However, access to eye-trackers and related software may be restricted. PsychoPy is open-source software, written in Python, for eye-tracking (and other neuroscience) experiments. Similarly, data can be analysed either in specialist software such as EyeLink’s Data Viewer (which is useful for creating scan paths) or exported to other statistical programmes such as Matlab, Stata or R.

## Applications

Orquin & Loose review how eye-tracking methods have been used in decision-making generally. In health, there are a few examples of survey-based choice experiments which have employed eye-tracking methods to understand more about respondents’ decision-making. Spinks & Mortimer used a remote eye-tracker to identify attribute non-attendance. Krucien et al. combined eye-tracking and choice data to model information processing, and Ryan extended this analysis to focus on presentation biases in a subsequent publication. Vass et al. used eye-tracking to understand how respondents complete choice experiments and whether this differed with the presentation of risk. A forthcoming publication by Rigby et al. will provide an overview of approaches, including eye-tracking, to capturing decision-making in health care choice experiments.


# Sam Watson’s journal round-up for 16th April 2018

Every Monday our authors provide a round-up of some of the most recently published peer reviewed articles from the field. We don’t cover everything, or even what’s most important – just a few papers that have interested the author. Visit our Resources page for links to more journals or follow the HealthEconBot. If you’d like to write one of our weekly journal round-ups, get in touch.

The impact of NHS expenditure on health outcomes in England: alternative approaches to identification in all‐cause and disease specific models of mortality. Health Economics [PubMed] Published 2nd April 2018

Studies looking at the relationship between health care expenditure and patient outcomes have exploded in popularity. A recent systematic review identified 65 studies by 2014 on the topic – and recent experience from these journal round-ups suggests this number has increased significantly since then. The relationship between national spending and health outcomes is important to inform policy and health care budgets, not least through the specification of a cost-effectiveness threshold. Karl Claxton and colleagues released a big study looking at all the programmes of care in the NHS in 2015 purporting to estimate exactly this. I wrote at the time that: (i) these estimates are only truly an opportunity cost if the health service is allocatively efficient, which it isn’t; and (ii) their statistical identification method, in which they used a range of socio-economic variables as instruments for expenditure, was flawed as the instruments were neither strong determinants of expenditure nor (conditionally) independent of population health. I also noted that their tests would be unlikely to be any good to detect this problem. In response to the first, Tony O’Hagan commented to say that they did not assume NHS efficiency, nor even that it was assumed that the NHS is trying to maximise health. This may well have been the case, but I would still, perhaps pedantically, argue then that this is therefore not an opportunity cost. For the question of instrumental variables, an alternative method was proposed by Martyn Andrews and co-authors, using information that feeds into the budget allocation formula as instruments for expenditure. In this new article, Claxton, Lomas, and Martin adopt Andrews’s approach and apply it across four key programmes of care in the NHS to try to derive cost-per-QALY thresholds.
First off, many of my original criticisms I would also apply to this paper, to which I’d also add one: (Statistical significance being used inappropriately complaint alert!!!) The authors use what seems to be some form of stepwise regression by including and excluding regressors on the basis of statistical significance – this is a big no-no and just introduces large biases (see this article for a list of reasons why). Beyond that, the instruments issue – I think – is still a problem, as it’s hard to justify, for example, an input price index (which translates to larger budgets) as an instrument here. It is certainly correlated with higher expenditure – inputs are more expensive in higher price areas after all – but this instrument won’t be correlated with greater inputs for this same reason. Thus, it’s the ‘wrong kind’ of correlation for this study. Needless to say, perhaps I am letting the perfect be the enemy of the good. Is this evidence strong enough to warrant a change in a cost-effectiveness threshold? My inclination would be that it is not, but that is not to deny its relevance to the debate.

Risk thresholds for alcohol consumption: combined analysis of individual-participant data for 599 912 current drinkers in 83 prospective studies. The Lancet Published 14th April 2018

“Moderate drinkers live longer” is the adage of the casual drinker as if to justify a hedonistic pursuit as purely pragmatic. But where does this idea come from? Studies that have compared risk of cardiovascular disease to level of alcohol consumption have shown that disease risk is lower in those that drink moderately compared to those that don’t drink. But correlation does not imply causation – non-drinkers might differ from those that drink. They may be abstinent after experiencing health issues related to alcohol, or be otherwise advised to not drink to protect their health. If we truly believed moderate alcohol consumption was better for your health than no alcohol consumption we’d advise people who don’t drink to drink. Moreover, if this relationship were true then there would be an ‘optimal’ level of consumption where any protective effect were maximised before being outweighed by the adverse effects. This new study pools data from three large consortia, each containing data from multiple studies or centres on individual alcohol consumption, cardiovascular disease (CVD), and all-cause mortality, to look at these outcomes among drinkers, excluding non-drinkers for the aforementioned reasons. Reading the methods section, it’s not wholly clear, if replicability were the standard, what was done. I believe that for each different database a hazard ratio or odds ratio for the risk of CVD or mortality for eight groups of alcohol consumption was estimated; these ratios were then pooled in a random-effects meta-analysis. However, it’s not clear to me why you would need to do this in two steps when you could just estimate a hierarchical model that achieves the same thing while also propagating any uncertainty through all the levels.
Anyway, a polynomial was then fitted through the pooled ratios – again, why not just do this in the main stage and estimate some kind of hierarchical semi-parametric model instead of a three-stage model to get the curve of interest? I don’t know. The key finding is that risk generally increases above around 100g/week alcohol (around 5-6 UK glasses of wine per week), below which it is fairly flat (although whether it is different to non-drinkers we don’t know). However, the picture the article paints is complicated: risk of stroke and heart failure goes up with increased alcohol consumption, but risk of myocardial infarction goes down. This would suggest some kind of competing risk: the mechanism by which alcohol works increases your overall risk of CVD and your proportional risk of non-myocardial infarction CVD given CVD.
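For readers unfamiliar with the pooling step discussed above, here is a minimal sketch of a random-effects meta-analysis of study-specific log hazard ratios. It uses the DerSimonian-Laird estimator of the between-study variance, which is an assumption on my part since the estimator used in the paper is not stated here, and the input numbers are invented.

```r
# Hypothetical study-level estimates: log hazard ratios and their variances
log_hr <- c(0.1, 0.3, 0.5)
v <- c(0.01, 0.01, 0.01)

# Stage 1: fixed-effect (inverse-variance) pooled estimate and Q statistic
w <- 1 / v
fixed <- sum(w * log_hr) / sum(w)
Q <- sum(w * (log_hr - fixed)^2)

# Stage 2: DerSimonian-Laird between-study variance, truncated at zero
k <- length(log_hr)
tau2 <- max(0, (Q - (k - 1)) / (sum(w) - sum(w^2) / sum(w)))

# Random-effects weights, pooled estimate and its standard error
w_re <- 1 / (v + tau2)
pooled <- sum(w_re * log_hr) / sum(w_re)
se_pooled <- sqrt(1 / sum(w_re))

exp(pooled)  # pooled hazard ratio
```

The two-step structure is visible here: the study-level variances `v` are treated as known when pooling, which is exactly the uncertainty a single hierarchical model would propagate instead.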

Family ruptures, stress, and the mental health of the next generation [comment] [reply]. American Economic Review [RePEc] Published April 2018

I’m not sure I will write out the full blurb again about studies of in utero exposure to difficult or stressful conditions and later life outcomes. There are a lot of them and they continue to make the top journals. Admittedly, I continue to cover them in these round-ups – so much so that we could write a literature review on the topic on the basis of the content of this blog. Needless to say, exposure in the womb to stressors likely increases the risk of low birth weight, neonatal and childhood disease, poor educational outcomes, and worse labour market outcomes. So what does this new study (and the comments) contribute? Firstly, it uses a new type of stressor – maternal stress caused by a death in the family, which apparently has a dose-response relationship as stronger ties to the deceased are more stressful – and secondly, it looks at mental health outcomes of the child, which are less common in these sorts of studies. The identification strategy compares the effect of the death on infants who are in the womb to those infants who experience it shortly after birth. Herein lies the interesting discussion raised in the above linked comment and reply papers: in this paper the sample contains all births up to one year post birth and to be in the ‘treatment’ group the death had to have occurred between conception and the expected date of birth, so those babies born preterm were less likely to end up in the control group than those born after the expected date. This spurious correlation could potentially lead to bias. In the authors’ reply, they re-estimate their models by redefining the control group on the basis of expected date of birth rather than actual. They find that their estimates for the effect of their stressor on physical outcomes, like low birth weight, are much smaller in magnitude, and I’m not sure they’re clinically significant.
For mental health outcomes, the estimates are again small in magnitude but remain similar to those in the original paper. But this choice phrase pops up (Statistical significance being used inappropriately complaint alert!!!): “We cannot reject the null hypothesis that the mental health coefficients presented in panel C of Table 3 are statistically the same as the corresponding coefficients in our original paper.” Statistically the same! I can see they’re different! Anyway, given all the other evidence on the topic I don’t need to explain the results in detail – the methods discussion is far more interesting.
