The NHS as policy laboratory

In an ideal world, new policies and interventions would be tested in a randomised fashion before implementation. But, all too often, policies within the health service are decided upon in the absence of decent evidence, serving political rather than public health or economic ends. Consider the recent case of the 7-day NHS, where the emerging evidence suggests the policy is unlikely to produce the benefits expected of it. Researchers cannot expect political decisions to be delayed so that they can conduct the ideal study. Sometimes the researcher has to evaluate a policy or organisational change that will go ahead regardless, or one that cannot be reversed once it is in place. Nevertheless, this can still present a good opportunity for an evaluation that satisfies researchers and policy makers alike: the stepped wedge cluster randomised trial.

The stepped wedge trial design is a variant of the cluster RCT design. The figure below illustrates the different set-ups. What is unique to the stepped wedge design is that, by the end of the study, all of the study sites will have received the intervention: it is the order in which they receive it that is randomised. Hemming et al. (2015) provide a good overview of the stepped wedge trial with examples, while Hemming, Girling, and Lilford (2015) give the statistical rationale and background. More recently, Girling and Hemming (2016) have investigated hybrid designs to optimise statistical efficiency.


Figure. Schematic illustration of the conventional parallel cluster study (with variations) and the stepped wedge study. Hemming et al. (CC BY 4.0)
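
To make the design concrete, here is a minimal sketch (in Python, with purely illustrative cluster and period numbers) of how a stepped wedge schedule might be generated: every cluster ends up in the intervention condition, and only the period at which each cluster crosses over is randomised.

```python
# Minimal sketch of a stepped wedge rollout schedule (illustrative numbers).
# Each row is a cluster, each column a period; 0 = control, 1 = intervention.
import random

n_clusters, n_periods = 5, 6
order = list(range(1, n_clusters + 1))
random.shuffle(order)                      # randomise the order of crossover

print("period:    " + " ".join(str(t) for t in range(1, n_periods + 1)))
for step, cluster in enumerate(order, start=1):
    # the cluster randomised to step k crosses over at period k + 1
    row = [1 if t > step else 0 for t in range(1, n_periods + 1)]
    print(f"cluster {cluster}: " + " ".join(str(x) for x in row))
```

In the first period every cluster is in the control condition and in the final period every cluster is in the intervention condition; the randomisation only determines the order of the steps in between.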

The stepped wedge design presents an attractive proposition and compromise for researchers and policy makers alike. But the feasibility of implementing it depends on the stage at which researchers become involved in designing the roll-out of the intervention. Often researchers are involved after the fact, opportunistically examining an ongoing change in the health system. However, there are a growing number of examples of stepped wedge studies being implemented in the NHS (e.g. here). Researcher involvement with policy and organisational changes in the health system should become opt-out rather than opt-in. Data are readily available and the intervention will already be planned, making such research relatively cheap. The NHS can become a powerful policy laboratory.

Photo credit: As6022014 (CC BY-SA 3.0)

E-cigarettes and the role of science in public health policy

E-cigarettes have become, without doubt, one of the public health issues du jour. Many countries and states have been quick to prohibit them, while others continue to debate the issue. The debate ostensibly revolves around the relative harms of e-cigarettes: Are they dangerous? Will they reduce the harms caused by smoking tobacco? Will children take them up? These are questions which would typically be informed by the available evidence. However, there is a growing schism within the scientific community about what the evidence does in fact say. On the one hand, there is the view that the evidence, taken altogether, overwhelmingly suggests that e-cigarettes are significantly less harmful than cigarettes and would reduce the harms caused by nicotine use. On the other hand, there is a vocal group that doubts the veracity of the available evidence and is critical of e-cigarette availability in general. Indeed, this latter view has been adopted by the biggest journals in medicine, The Lancet, the BMJ, the New England Journal of Medicine, and JAMA, each of which has published either research or editorials along this line.

The evidence around e-cigarettes was recently summarised and reviewed by Public Health England. The conclusion of the review was that e-cigarettes are 95% less harmful than smoking tobacco. So why might these journals take a position that is arguably contrary to the evidence? From a sociological perspective, epistemological conflicts in science are also political conflicts. Actions within the scientific field are directed at acquiring scientific authority, and that authority requires social recognition. However, e-cigarette policy is also a political issue, and so actions in this area are also directed at gaining political capital. If the e-cigarette issue can be delimited as a purely scientific problem, then scientific capital can be translated into political capital. One way of achieving this is to try to establish oneself as the authoritative scientific voice on such matters and to cast doubt on the claims made by others.

We can also view the issue in a broader context. The traditional journal format is under threat from other models of scientific publishing, including blogs, open access publishers, pre-print archives, and post-publication peer review. Much of the debate around e-cigarettes has come from these new sources. Dominant producers in the scientific field must necessarily be conservative, since it is the established structure of the scientific field that grants them their dominant status. But this competition in the scientific field may have wider, pernicious consequences.

Typically, we try to formulate policies that maximise social welfare. But, as Acemoglu and Robinson point out, the policy that maximises social welfare now may not maximise welfare in the long run. Different policies today affect the political equilibrium tomorrow, and thus the policies that are available to policy makers tomorrow. Prohibiting e-cigarettes today might be socially optimal if there were no reliable evidence on their harms or benefits and there were suspicions that they could cause public harm. But it is very difficult politically to reverse prohibition, even if evidence were later to emerge that e-cigarettes are an effective harm reduction product. Thus, even if the journals doubt the evidence around e-cigarettes, the best policy position would arguably be to remain agnostic and await further evidence. But this would not be a position that would grant them socially recognised scientific capital.

Perhaps this e-cigarette debate is reflective of a broader shift in the way in which scientific evidence and those with scientific capital are engaged in public health policy decisions. Different forms of evidence beyond RCTs are being more widely accepted in biomedical research and methods of evidence synthesis are being developed. New forums are also becoming available for their dissemination. This, I would say, can only be a positive thing.

Bayesian evidence synthesis and bootstrapping for trial-based economic evaluations: comfortable bed fellows?

By Mohsen Sadatsafavi and Stirling Bryan

In economic evaluation of health technologies, evidence synthesis is typically about quantification of the evidence in terms of parameters. Bootstrapping is a non-parametric inferential method in trial-based economic evaluations. On the surface the two paradigms seem incompatible. In a recent paper, we show that a simple and intuitive modification of the bootstrap can indeed accommodate parametric evidence synthesis.

When the recruitment phase of a pragmatic randomized controlled trial (RCT) is over, two groups of investigators become busy. The clinical evaluation team is interested in inference about the population value of the primary outcome, typically a measure of relative effect between the treatment groups (e.g. the relative risk [RR] of the clinical outcome of interest). The economic evaluation team is in charge of inference chiefly on the population value of the incremental cost-effectiveness ratio (ICER).

A widely used method of characterizing uncertainty around the ICER in RCT-based cost-effectiveness analyses is the bootstrap. For a typical two-arm RCT, the investigator draws a bootstrap sample of the data and calculates the difference in costs and the difference in effectiveness between the two treatments. Repeating this step many times provides a sample from the joint distribution of the differences in costs and effectiveness, which can be used to calculate the ICER and to represent uncertainty around its value (for example, to calculate credible intervals, or to draw the cost-effectiveness plane and the acceptability curve). As an example, the table below gives results from repeated bootstrap samples of a hypothetical two-arm RCT:

Bootstrap #   Difference in costs ($)   Difference in effectiveness (QALYs)
1             1,670.1                   0.0130
2             1,592.9                   0.0143
…             …                         …
10,000        1,091.0                   0.0133
Average       1,450.2                   0.0151
ICER          1,450.2 / 0.0151 = 96,039.7

In deriving the cost and effectiveness values within each bootstrap loop, many steps might be involved, such as imputation of missing values and adjustment for covariates. This is what makes the bootstrap method so powerful: all such steps are enveloped within the bootstrap, allowing the uncertainty in every inferential step to be accounted for.
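
As a rough illustration (a sketch, not the code from our paper), the core loop might look like the following, where costs_t, costs_c, qalys_t and qalys_c are assumed to be per-patient arrays for the treatment and control arms, and analyse() stands in for whatever imputation and adjustment steps the analysis requires:

```python
# Sketch of the bootstrap for an RCT-based cost-effectiveness analysis.
import numpy as np

rng = np.random.default_rng(2016)

def analyse(costs_t, costs_c, qalys_t, qalys_c):
    """Placeholder for the full within-loop analysis; returns
    (difference in costs, difference in effectiveness)."""
    return costs_t.mean() - costs_c.mean(), qalys_t.mean() - qalys_c.mean()

def bootstrap_ce(costs_t, costs_c, qalys_t, qalys_c, n_boot=10_000):
    d_cost = np.empty(n_boot)
    d_eff = np.empty(n_boot)
    for b in range(n_boot):
        # resample patients with replacement, separately within each arm
        i_t = rng.integers(0, len(costs_t), len(costs_t))
        i_c = rng.integers(0, len(costs_c), len(costs_c))
        d_cost[b], d_eff[b] = analyse(costs_t[i_t], costs_c[i_c],
                                      qalys_t[i_t], qalys_c[i_c])
    # the ICER is the mean cost difference over the mean effect difference,
    # as in the table above; the replicates themselves trace out the joint
    # distribution used for the CE plane and the acceptability curve
    return d_cost, d_eff, d_cost.mean() / d_eff.mean()
```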

The dilemma of external evidence

Imagine that, at the time of such analyses, another ‘external’ trial is published which reports results for the same interventions and treatment protocol, in the same population, with the same clinical outcome measure. Also imagine the external RCT reports the maximum-likelihood estimate and 95% confidence interval of the RR of treatment, and that this estimate is more favorable to the new treatment (versus the standard treatment) than the RR in the current RCT. Of course, this carries some information about the effect of the treatment at the population level. But how can it be incorporated into the inference?

The task in front of the clinical evaluation team is rather straightforward: the RR from the two RCTs can be combined using meta-analytic techniques to provide an estimate for the population RR. But what about the economic evaluation team? We can speculate that, given the observed treatment effect in the external RCT, the population value of the ICER could be more favorable for the new treatment than what the current RCT suggests.
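
For instance, a simple fixed-effect, inverse-variance pooling of the two RRs on the log scale could be sketched as follows (the RR values and confidence limits in the example call are purely illustrative):

```python
# Sketch of fixed-effect, inverse-variance meta-analysis of relative risks.
import math

def pooled_rr(rr, lcl, ucl):
    """rr, lcl, ucl: lists of point estimates and 95% limits, one per trial."""
    log_rr = [math.log(r) for r in rr]
    se = [(math.log(u) - math.log(l)) / (2 * 1.96) for l, u in zip(lcl, ucl)]
    w = [1 / s**2 for s in se]                      # inverse-variance weights
    pooled = sum(wi * y for wi, y in zip(w, log_rr)) / sum(w)
    pooled_se = math.sqrt(1 / sum(w))
    return math.exp(pooled), (math.exp(pooled - 1.96 * pooled_se),
                              math.exp(pooled + 1.96 * pooled_se))

# illustrative numbers: current RCT RR 0.80 (0.60-1.07), external RR 0.65 (0.50-0.85)
print(pooled_rr([0.80, 0.65], [0.60, 0.50], [1.07, 0.85]))
```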

But is there any way to turn the above-mentioned subjective line of reasoning into a formal and objective form of inference? This is what we have addressed in our recent paper. Before we explain our solution, we note that there are already at least two ways of performing this task: (a) to forgo statistical inference and use decision-analytic modeling (which can use the pooled RR as an input parameter), and (b) to resort to parametric Bayesian inference. The former is not really a solution insofar as statistical inference for cost-effectiveness is desired, and the latter is a complete paradigm shift which also imposes a myriad of parametric assumptions (think of the regression equations, error terms, and link functions required to connect cost and effectiveness outcomes to the clinical variable, and the clinical variable to the external evidence).

Can evidence synthesis be carried out using the bootstrap?

Yes! And our proposed solution is rather intuitive: the investigator first parameterises the external evidence using appropriate probability distributions (e.g. a log-normal distribution for the RR, constructed from the reported point estimate and interval bounds). For each bootstrap sample, the investigator calculates, in addition to the cost and effectiveness outcomes, the parameters for which external evidence is available, and uses the constructed probability distribution to weight the bootstrap sample according to its plausibility against the external evidence. The ICER is then the weighted average of the differences in costs over the weighted average of the differences in effectiveness:

Bootstrap #        Difference in costs ($)   Difference in effectiveness (QALYs)   Treatment effect (RR)   Weight according to external evidence
1                  1,670.1                   0.0130                                0.521                   0.058
2                  1,592.9                   0.0143                                0.650                   0.068
…                  …                         …                                     …                       …
10,000             1,091.0                   0.0151                                0.452                   0.025
Weighted average   1,034.2                   0.0161
ICER               1,034.2 / 0.0161 = 64,236.0

A more practical alternative to using the weights directly is to ‘accept’ each bootstrap with a probability proportional to its weight; rejected bootstraps are removed from the analysis. This gives the investigator an idea of the ‘effective’ number of bootstraps, and makes the subsequent calculations independent of the weights.
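
Continuing the earlier sketch (again illustrative rather than the code from the paper), the fragment below constructs the log-normal distribution for the external RR from its point estimate and 95% limits, weights each bootstrap replicate by the plausibility of its own RR, and implements both the weighted-average ICER and the acceptance-sampling variant just described; rr_boot is assumed to be the RR recorded in each bootstrap replicate:

```python
# Sketch of weighting bootstrap replicates against external evidence.
# d_cost, d_eff, rr_boot: arrays from a bootstrap loop that also records
# the treatment effect (RR); rr_ext, lcl_ext, ucl_ext: the external RCT's
# point estimate and 95% confidence limits.
import numpy as np
from scipy.stats import lognorm

rng = np.random.default_rng(2016)

def external_weights(rr_boot, rr_ext, lcl_ext, ucl_ext):
    # log-normal for the external RR, with sigma recovered from the 95% CI
    sigma = (np.log(ucl_ext) - np.log(lcl_ext)) / (2 * 1.96)
    w = lognorm(s=sigma, scale=rr_ext).pdf(rr_boot)
    return w / w.max()                     # scale so the largest weight is 1

def weighted_icer(d_cost, d_eff, w):
    # weighted average of cost differences over weighted average of effects
    return np.average(d_cost, weights=w) / np.average(d_eff, weights=w)

def accepted_icer(d_cost, d_eff, w):
    # acceptance sampling: keep each replicate with probability equal to its
    # scaled weight; the accepted replicates behave like an ordinary bootstrap
    # sample conditional on both the current and the external evidence
    keep = rng.random(len(w)) < w
    print("effective number of bootstraps:", keep.sum())
    return d_cost[keep].mean() / d_eff[keep].mean()
```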

Why does it work?

The theory is provided in the paper, but in a nutshell, a Bayesian interpretation of the bootstrap allows one to see the bootstrap estimate of the difference in costs and difference in effectiveness as their posterior distribution conditional on the current RCT. It can be shown that the weights transform this to the posterior distribution conditional on the current AND external RCT.
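
Schematically, writing theta for the quantities estimated in each bootstrap replicate (the differences in costs and effectiveness, and the RR), D for the current RCT and D_ext for the external RCT, and assuming the external data bear on theta only through the RR, the argument runs:

```latex
% Bootstrap replicates approximate draws from the posterior given the current RCT:
\[ \theta_b \sim p(\theta \mid D). \]
% Weighting (or accepting) each replicate with
\[ w_b \propto p(D_{\mathrm{ext}} \mid \theta_b) = p(D_{\mathrm{ext}} \mid \mathrm{RR}_b) \]
% turns the weighted replicates into draws from
\[ p(\theta \mid D, D_{\mathrm{ext}}) \propto p(D_{\mathrm{ext}} \mid \theta)\, p(\theta \mid D), \]
% which is the posterior conditional on both the current and the external trial.
```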

An appealing feature of the method is its minimal parametric assumptions. Unlike parametric Bayesian methods, the investigator need not make any assumptions about the distribution of the cost and effectiveness outcomes, or about how the clinical outcome affects the cost and effectiveness values. The effect is channeled directly through the experience of patients during the trial, represented by the correlation structure between the clinical outcome, cost, and effectiveness variables at the individual level.

Further developments

There are indeed many gaps to be filled. The method focuses only on parallel-arm RCTs and leaves the problem open for other designs. In addition, rejection sampling can be wasteful, and if there are several parameters the method becomes quite unwieldy. An interesting potential solution is to create auto-correlated Markov chain bootstraps that tend to concentrate on the high-probability areas of the posterior distribution. In general, this sampling paradigm is quite flexible and can be used to incorporate external evidence in other contexts, such as model-based evaluations or evaluations based on observational data.