By Mohsen Sadatsafavi and Stirling Bryan

*In economic evaluation of health technologies, evidence synthesis is typically about quantification of the evidence in terms of parameters. Bootstrapping is a non-parametric inferential method in trial-based economic evaluations. On the surface the two paradigms seem incompatible. In a recent paper, we show that a simple and intuitive modification of the bootstrap can indeed accommodate parametric evidence synthesis.*

When the recruitment phase of a pragmatic randomized controlled trial (RCT) is over, two groups of investigators become busy. The clinical evaluation team is interested in inference about the population value of the primary outcome, typically a measure of relative effect between the treatment groups (e.g. the relative risk [RR] of the clinical outcome of interest). The economic evaluation team is chiefly in charge of inference on the population value of the incremental cost-effectiveness ratio (ICER).

A widely used method of characterizing uncertainty around the ICER in RCT-based cost-effectiveness analyses is the bootstrap. For a typical two-arm RCT, the investigator obtains a bootstrap sample of the data and calculates the difference in costs and the difference in effectiveness between the two treatments. Repeating this step many times provides a sample from the joint distribution of the difference in costs and effectiveness. This sample can be used to calculate the ICER and to represent uncertainty around its value, for example by calculating credible intervals or by drawing the cost-effectiveness plane and acceptability curve. As an example, the table below gives results from repeated bootstrap samples of a hypothetical two-arm RCT:

| Bootstrap # | Difference in costs ($) | Difference in effectiveness (QALYs) |
|---|---|---|
| 1 | $1,670.1 | 0.0130 |
| 2 | $1,592.9 | 0.0143 |
| … | | |
| 10,000 | $1,091.0 | 0.0133 |
| Average | $1,450.2 | 0.0151 |
| ICER ($ per QALY) | 1,450.2/0.0151 = 96,039.7 | |
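The bootstrap procedure described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `bootstrap_icer`, its arguments, and the simulated patient data are all assumptions made for the example. Patients are resampled with replacement within each arm, and each patient's cost and effectiveness stay paired.

```python
import numpy as np

def bootstrap_icer(cost_new, eff_new, cost_std, eff_std, n_boot=10_000, seed=0):
    """Nonparametric bootstrap of incremental costs and effects (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    cost_new, eff_new = np.asarray(cost_new, float), np.asarray(eff_new, float)
    cost_std, eff_std = np.asarray(cost_std, float), np.asarray(eff_std, float)
    d_cost = np.empty(n_boot)
    d_eff = np.empty(n_boot)
    for b in range(n_boot):
        # Resample patient indices with replacement, separately in each arm.
        # Using the same index for cost and effectiveness keeps them paired.
        i = rng.integers(0, len(cost_new), len(cost_new))
        j = rng.integers(0, len(cost_std), len(cost_std))
        d_cost[b] = cost_new[i].mean() - cost_std[j].mean()
        d_eff[b] = eff_new[i].mean() - eff_std[j].mean()
    # ICER as the ratio of the average differences, as in the table.
    icer = d_cost.mean() / d_eff.mean()
    return d_cost, d_eff, icer

# Simulated (not real) trial data, for illustration only.
sim = np.random.default_rng(1)
d_cost, d_eff, icer = bootstrap_icer(
    cost_new=sim.gamma(2.0, 2500.0, 200), eff_new=sim.normal(0.62, 0.10, 200),
    cost_std=sim.gamma(2.0, 2000.0, 200), eff_std=sim.normal(0.60, 0.10, 200),
    n_boot=2_000,
)
```

The returned `d_cost` and `d_eff` arrays are the bootstrap sample of the joint distribution, so they can also be used to draw the cost-effectiveness plane.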

In deriving the costs and effectiveness values within each bootstrap loop, many steps might be involved, such as imputation of missing values and adjusting for covariates. This is what makes the bootstrap method so powerful, as all such steps are enveloped within the bootstrap, allowing for the uncertainty in all inferential steps to be accounted for.
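To illustrate what it means for an analysis step to be enveloped within the bootstrap, here is a small sketch in which a deliberately simple mean imputation is repeated inside every bootstrap replicate. The helpers `impute_then_mean` and `bootstrap_statistic` are hypothetical names, and mean imputation is chosen only for brevity; it stands in for whatever imputation or covariate adjustment the analysis actually uses.

```python
import numpy as np

def impute_then_mean(values):
    """Hypothetical analysis step: replace missing values (NaN) with the mean
    of the observed values in this resample, then return the overall mean."""
    values = np.asarray(values, float)
    filled = np.where(np.isnan(values), np.nanmean(values), values)
    return filled.mean()

def bootstrap_statistic(values, statistic, n_boot=1000, seed=0):
    """Apply the whole analysis pipeline `statistic` to each bootstrap resample."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, float)
    n = len(values)
    # The imputation is re-run on every resample, so its uncertainty
    # is carried through to the bootstrap distribution.
    return np.array([statistic(values[rng.integers(0, n, n)]) for _ in range(n_boot)])
```

Because the imputation runs on each resample rather than once on the original data, variation due to the imputation step shows up in the spread of the bootstrap results.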

**The dilemma of external evidence**

Imagine that, at the time of such analyses, another ‘external’ trial is published which reports results for the same interventions and treatment protocol, in the same population, with the same clinical outcome measure. Also imagine that the external RCT reports the maximum-likelihood estimate and 95% confidence interval of the RR of treatment, and that this RR is more favorable for the new treatment versus the standard treatment than the RR in the current RCT. Of course, this carries some information about the effect of the treatment at the population level. But how can it be incorporated in the inference?

The task in front of the clinical evaluation team is rather straightforward: the RR from the two RCTs can be combined using meta-analytic techniques to provide an estimate for the population RR. But what about the economic evaluation team? We can speculate that, given the observed treatment effect in the external RCT, the population value of the ICER could be more favorable for the new treatment than what the current RCT suggests.

But is there any way to turn this subjective line of reasoning into a formal and objective form of inference? This is what we have addressed in our recent paper. Before we explain our solution, we note that there are already at least two ways of performing this task: (a) to abandon statistical inference and use decision-analytic modeling (which can use the pooled RR as an input parameter), and (b) to resort to parametric Bayesian inference. The former is not really a solution as far as the desire for statistical inference on cost-effectiveness is concerned. The latter is a complete paradigm shift which also imposes a myriad of parametric assumptions (think of the regression equations, error terms, and link functions required to connect cost and effectiveness outcomes to the clinical variable, and the clinical variable to external evidence).

**Can evidence synthesis be carried out using the bootstrap?**

Yes! And our proposed solution is rather intuitive. The investigator first parameterises the external evidence using appropriate probability distributions (e.g. a log-normal distribution for the RR, constructed from the reported point estimate and interval bounds). For each bootstrap sample, the investigator calculates, in addition to the cost and effectiveness outcomes, the parameters for which external evidence is available. The constructed probability distribution is then used to weight the bootstrap sample according to its degree of plausibility against the external evidence. The ICER is the weighted average of the difference in costs over the weighted average of the difference in effectiveness:

| Bootstrap # | Difference in costs ($) | Difference in effectiveness (QALYs) | Treatment effect (RR) | Weight according to external evidence |
|---|---|---|---|---|
| 1 | $1,670.1 | 0.0130 | 0.521 | 0.058 |
| 2 | $1,592.9 | 0.0143 | 0.650 | 0.068 |
| … | | | | |
| 10,000 | $1,091.0 | 0.0151 | 0.452 | 0.025 |
| Weighted average | $1,034.2 | 0.0161 | | |
| ICER ($ per QALY) | 1,034.2/0.0161 = 64,236.0 | | | |
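The weighting step can be sketched as follows. Several choices here are assumptions made for illustration and may differ from the paper: the function names are hypothetical, the log-normal is built by treating the reported point estimate as the median and deriving the log-scale standard deviation from the width of the 95% interval, and the weight is taken as an unnormalised normal density on the log-RR scale.

```python
import math
import numpy as np

def lognormal_from_ci(point, lower, upper, z=1.959964):
    """Log-scale mean and SD implied by a point estimate and a 95% interval
    (assumes the interval is symmetric on the log scale)."""
    mu = math.log(point)
    sigma = (math.log(upper) - math.log(lower)) / (2 * z)
    return mu, sigma

def external_weights(rr_boot, mu, sigma):
    """Unnormalised plausibility of each bootstrap RR against the external evidence."""
    zscore = (np.log(np.asarray(rr_boot, float)) - mu) / sigma
    return np.exp(-0.5 * zscore**2)

def weighted_icer(d_cost, d_eff, weights):
    """Weighted average cost difference over weighted average effectiveness difference."""
    w = np.asarray(weights, float)
    return np.average(d_cost, weights=w) / np.average(d_eff, weights=w)
```

Here `rr_boot` would hold the RR computed within each bootstrap sample, alongside that sample's cost and effectiveness differences. Only the relative size of the weights matters, since any common scaling factor cancels in the weighted averages.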

A more practical alternative to using the weights directly is to ‘accept’ each bootstrap sample with a probability that is proportional to its weight. Rejected bootstrap samples are removed from the analysis. This gives the investigator an idea of the ‘effective’ number of bootstrap samples, and makes the subsequent calculations independent of the weights.
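A minimal sketch of this accept/reject step is shown below. The function name `accept_bootstraps` is hypothetical, and dividing by the largest weight is one assumed way of turning weights into acceptance probabilities; any scaling that keeps the probabilities at or below 1 preserves the proportionality.

```python
import numpy as np

def accept_bootstraps(weights, seed=0):
    """Return a boolean mask of accepted bootstrap samples (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    w = np.asarray(weights, float)
    # Scale so the most plausible bootstrap sample is always accepted.
    p_accept = w / w.max()
    # Uniform draws lie in [0, 1), so probability 1 always accepts
    # and probability 0 never does.
    return rng.random(w.size) < p_accept
```

With the mask `keep = accept_bootstraps(weights)`, the ICER is the plain ratio `d_cost[keep].mean() / d_eff[keep].mean()`, and `keep.sum()` is the number of bootstrap samples that survived.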

**Why does it work?**

The theory is provided in the paper, but in a nutshell, a Bayesian interpretation of the bootstrap allows one to see the bootstrap estimates of the difference in costs and the difference in effectiveness as draws from their posterior distribution conditional on the current RCT. It can be shown that the weights transform this into the posterior distribution conditional on both the current and the external RCT.

An appealing feature of the method is its minimal parametric assumptions. Unlike the parametric Bayesian methods, the investigator need not make any assumption about the distribution of the cost and effectiveness outcomes or about how the clinical outcome affects the cost and effectiveness values. The effect is channeled directly through the experience of patients in the course of the trial, represented through the correlation structure between clinical outcomes, costs, and effectiveness variables at the individual level.

**Further developments**

There are indeed many gaps to be filled. The method focuses only on parallel-arm RCTs and leaves the problem open for other designs. In addition, rejection sampling can be wasteful, and if there are several parameters, the method becomes quite unwieldy. An interesting potential solution is to create auto-correlated Markov chain bootstraps that tend to concentrate on the high-probability areas of the posterior distribution. In general, this sampling paradigm is quite flexible and can be used to incorporate external evidence in other contexts, such as model-based evaluations or evaluations based on observational data.