How to explain cost-effectiveness models for diagnostic tests to a lay audience

Non-health economists (henceforth referred to as ‘lay stakeholders’) are often asked to use the outputs of cost-effectiveness models to inform decisions, but they can find them difficult to understand. Conversely, health economists may have limited experience of explaining cost-effectiveness models to lay stakeholders. How can we do better?

This article shares my experience of explaining cost-effectiveness models of diagnostic tests to lay stakeholders such as researchers in other fields, clinicians, managers, and patients, and suggests some approaches to make models easier to understand. It is the condensed version of my presentation at ISPOR Europe 2018.

Why are cost-effectiveness models of diagnostic tests difficult to understand?

Models designed to compare diagnostic strategies are particularly challenging. In my view, this is for two reasons.

Firstly, there is the sheer number of possible diagnostic strategies that a cost-effectiveness model allows us to compare. Even if we are looking at only a couple of tests, we can use them in various combinations and at many diagnostic thresholds. See, for example, this cost-effectiveness analysis of diagnosis of prostate cancer.

Secondly, diagnostic tests can affect costs and health outcomes in multiple ways. Specifically, diagnostic tests can have a direct effect on people’s health-related quality of life, mortality risk, acquisition costs, as well as the consequences of side effects. Furthermore, diagnostic tests can have an indirect effect via the consequences of the subsequent management decisions. This indirect effect is often the key driver of cost-effectiveness.

As a result, the cost-effectiveness analysis of diagnostic tests can have many strategies, with multiple effects modelled in the short and long-term. This makes the model and the results difficult to understand.

Map out the effect of the test on health outcomes or costs

The first step in developing any cost-effectiveness model is to understand how the new technology, such as a diagnostic test or a drug, can impact the patient and the health care system. Ferrante di Ruffano et al and Kip et al are two studies that can be used as a starting point to understand the possible effects of a test on health outcomes and/or costs.

Ferrante di Ruffano et al conducted a review of the mechanisms by which diagnostic tests can affect health outcomes and provides a list of the possible effects of diagnostic tests.

Kip et al suggests a checklist for the reporting of cost-effectiveness analyses of diagnostic tests and biomarkers. Although this is a checklist for the reporting of a cost-effectiveness analysis that has been previously conducted, it can also be used as a prompt to define the possible effects of a test.

Reach a shared understanding of the clinical pathway

The parallel step is to understand the clinical pathway in which the diagnostic strategies integrate and affect. This consists of conceptualising the elements of the health care service relevant for the decision problem. If you’d like to know more about model conceptualisation, I suggest this excellent paper by Paul Tappenden.

These conceptual models are necessarily simplifications of reality. They need to be as simple as possible, but accurate enough that lay stakeholders recognise it as valid. As Einstein said: “to make the irreducible basic elements as simple and as few as possible, without having to surrender the adequate representation of a single datum of experience.”

Agree which impacts to include in the cost-effectiveness model

What to include and to exclude from the model is, at present, more of an art than a science. For example, Chilcott et al conducted a series of interviews with health economists and found that their approach to model development varied widely.

I find that the best approach is to design the model in consultation with the relevant stakeholders, such as clinicians, patients, health care managers, etc. This ensures that the cost-effectiveness model has face validity to those who will ultimately be their end user and (hopefully) advocates of the results.

Decouple the model diagram from the mathematical model

When we have a reasonable idea of the model that we are going to build, we can draw its diagram. A model diagram not only is a recommended component of the reporting of a cost-effectiveness model but also helps lay stakeholders understand it.

The temptation is often to draw the model diagram as similar as possible to the mathematical model. In cost-effectiveness models of diagnostic tests, the mathematical model tends to be a decision tree. Therefore, we often see a decision tree diagram.

The problem is that decision trees can easily become unwieldy when we have various test combinations and decision nodes. We can try to synthesise a gigantic decision tree into a simpler diagram, but unless you have great graphic designer skills, it might be a futile exercise (see, for example, here).

An alternative approach is to decouple the model diagram from the mathematical model and break down the decision problem into steps. The figure below shows an example of how the model diagram can be decoupled from the mathematical model.

The diagram breaks the problem down into steps that relate to the clinical pathway, and therefore, to the stakeholders. In this example, the diagram follows the questions that clinicians and patients may ask: which test to do first? Given the result of the first test, should a second test be done? If a second test is done, which one?

Simplified model diagram on the cost-effectiveness analysis of magnetic resonance imaging (MRI) and biopsy to diagnose prostate cancer

Relate the results to the model diagram

The next point of contact between the health economists and lay stakeholders is likely to be at the point when the first cost-effectiveness results are available.

The typical chart for the probabilistic results is the cost-effectiveness acceptability curve (CEAC). In my experience, the CEAC is challenging for lay stakeholders. It plots results over a range of cost-effectiveness thresholds, which are not quantities that most people outside cost-effectiveness analysis relate to. Additionally, CEACs showing the results of multiple strategies can have many lines and some discontinuities, which can be difficult to understand by the untrained eye.

An alternative approach is to re-use the model diagram to present the results. The model diagram can show the strategy that is expected to be cost-effective and its probability of cost-effectiveness at the relevant threshold. For example, the probability that the strategies starting with a specific test are cost-effective is X%; and the probability that strategies using the specific test at a specific cut-off are cost-effective is Y%, etc.

Next steps for practice and research

Research about the communication of cost-effectiveness analysis is sparse, and guidance is lacking. Beyond the general advice to speak in plain English and avoiding jargon, there is little advice. Hence, health economists find themselves developing their own approaches and techniques.

In my experience, the key aspects for effective communication are to engage with lay stakeholders from the start of the model development, to explain the intuition behind the model in simplified diagrams, and to find a balance between scientific accuracy and clarity which is appropriate for the audience.

More research and guidance are clearly needed to develop communication methods that are effective and straightforward to use in applied cost-effectiveness analysis. Perhaps this is where patient and public involvement can really make a difference!

Bad reasons not to use the EQ-5D-5L

We’ve seen a few editorials and commentaries popping up about the EQ-5D-5L recently, in Health Economics, PharmacoEconomics, and PharmacoEconomics again. All of these articles have – to varying extents – acknowledged the need for NICE to exercise caution in the adoption of the EQ-5D-5L. I don’t get it. I see no good reason not to use the EQ-5D-5L.

If you’re not familiar with the story of the EQ-5D-5L in England, read any of the linked articles, or see an OHE blog post summarising the tale. The important part of the story is that NICE has effectively recommended the use of the EQ-5D-5L descriptive system (the questionnaire), but not the new EQ-5D-5L value set for England. Of the new editorials and commentaries, Devlin et al are vaguely pro-5L, Round is vaguely anti-5L, and Brazier et al are vaguely on the fence. NICE has manoeuvred itself into a situation where it has to make a binary decision. 5L, or no 5L (which means sticking with the old EQ-5D-3L value set). Yet nobody seems keen to lay down their view on what NICE ought to decide. Maybe there’s a fear of being proven wrong.

So, herewith a list of reasons for exercising caution in the adoption of the EQ-5D-5L, which are either explicitly or implicitly cited by recent commentators, and why they shouldn’t determine NICE’s decision. The EQ-5D-5L value set for England should be recommended without hesitation.

We don’t know if the descriptive system is valid

Round argues that while the 3L has been validated in many populations, the 5L has not. Diabetes, dementia, deafness and depression are presented as cases where the 3L has been validated but the 5L has not. But the same goes for the reverse. There are plenty of situations in which the 3L has been shown to be problematic and the 5L has not. It’s simply a matter of time. This argument should only hold sway if we expect there to be more situations in which the 5L lacks validity, or if those violations are in some way more serious. I see no evidence of that. In fact, we see measurement properties improved with the 5L compared with the 3L. Devlin et al put the argument to bed in highlighting the growing body of evidence demonstrating that the 5L descriptive system is better than the 3L descriptive system in a variety of ways, without any real evidence that there are downsides to the descriptive expansion. And this – the comparison of the 3L and the 5L – is the correct comparison to be making, because the use of the 3L represents current practice. More fundamentally, it’s hard to imagine how the 5L descriptive system could be less valid than the 3L descriptive system. That there are only a limited number of validation studies using the 5L is only a problem if we can hypothesise reasons for the 5L to lack validity where the 3L held it. I can’t think of any. And anyway, NICE is apparently satisfied with the descriptive system; it’s the value set they’re worried about.

We don’t know if the preference elicitation methods are valid for states worse than dead

This argument is made by Brazier et al. The value set for England uses lead time TTO, which is a relatively new (and therefore less-tested) method. The problem is that we don’t know if any methods for valuing states worse than dead are valid because valuing states worse than dead makes no real sense. Save for pulling out a Ouija board, or perhaps holding a gun to someone’s head, we can never find out what is the most valid approach to valuing states worse than dead. And anyway, this argument fails on the same basis as the previous one: where is the evidence to suggest that the MVH approach to valuing states worse than dead (for the EQ-5D-3L) holds more validity than lead time TTO?

We don’t know if the EQ-VT was valid

As discussed by Brazier et al, it looks like there may have been some problems in the administration of the EuroQol valuation protocol (the EQ-VT) for the EQ-5D-5L value set. As a result, some of the data look a bit questionable, including large spikes in the distribution of values at 1.0, 0.5, 0.0, and -1.0. Certainly, this justifies further investigation. But it shouldn’t stall adoption of the 5L value set unless this constitutes a greater concern than the distributional characteristics of the 3L, and that’s not an argument I see anybody making. Perhaps there should have been more piloting of the EQ-VT, but that should (in itself) have no bearing on the decision of whether to use the 3L value set or the 5L value set. If the question is whether we expect the EQ-VT protocol to provide a more accurate estimation of health preferences than the MVH protocol – and it should be – then as far as I can tell there is no real basis for preferring the MVH protocol.

We don’t know if the value set (for England) is valid

Devlin et al state that, with respect to whether differences in the value sets represent improvements, “Until the external validation of the England 5L value set concludes, the jury is still out.” I’m not sure that’s true. I don’t know what the external validation is going to involve, but it’s hard to imagine a punctual piece of work that could demonstrate the ‘betterness’ of the 5L value set compared with the 3L value set. Yes, a validation exercise could tell us whether the value set is replicable. But unless validation of the comparator (i.e. the 3L value set) is also attempted and judged on the same basis, it won’t be at all informative to NICE’s decision. Devlin et al state that there is a governmental requirement to validate the 5L value set for England. But beyond checking the researchers’ sums, it’s difficult to understand what that could even mean. Given that nobody seems to have defined ‘validity’ in this context, this is a very dodgy basis for determining adoption or non-adoption of the 5L.

5L-based evaluations will be different to 3L-based evaluations

Well, yes. Otherwise, what would be the point? Brazier et al present this as a justification for a ‘pause’ for an independent review of the 5L value set. The authors present the potential shift in priority from life-improving treatments to life-extending treatments as a key reason for a pause. But this is clearly a circular argument. Pausing to look at the differences will only bring those (and perhaps new) differences into view (though notably at a slower rate than if the 5L was more widely adopted). And then what? We pause for longer? Round also mentions this point as a justification for further research. This highlights a misunderstanding of what it means for NICE to be consistent. NICE has no responsibility to make decisions in 2018 precisely as it would have in 2008. That would be foolish and ignorant of methodological and contextual developments. What NICE needs to provide is consistency in the present – precisely what is precluded by the current semi-adoption of the EQ-5D-5L.

5L data won’t be comparable to 3L data

Round mentions this. But why does it matter? This is nothing compared to the trickery that goes on in economic modelling. The whole point of modelling is to do the best we can with the data we’ve got. If we have to compare an intervention for which outcomes are measured in 3L values with an intervention for which outcomes are measured in 5L values, then so be it. That is not a problem. It is only a problem if manufacturers strategically use 3L or 5L values according to whichever provides the best results. And you know what facilitates that? A pause, where nobody really knows what is going on and NICE has essentially said that the use of both 3L and 5L descriptive systems is acceptable. If you think mapping from 5L to 3L values is preferable to consistently using the 5L values then, well, I can’t reason with you, because mapping is never anything but a fudge (albeit a useful one).

There are problems with the 3L, so we shouldn’t adopt the 5L

There’s little to say on this point beyond asserting that we mustn’t let perfect be the enemy of the good. Show me what else you’ve got that could be more readily and justifiably introduced to replace the 3L. Round suggests that shifting from the 3L to the 5L is no different to shifting from the 3L to an entirely different measure, such as the SF-6D. That’s wrong. There’s a good reason that NICE should consider the 5L as the natural successor to the 3L. And that’s because it is. This is exactly what it was designed to be: a methodological improvement on the same conceptual footing. The key point here is that the 3L and 5L contain the same domains. They’re trying to capture health-related quality of life in a consistent way; they refer to the same evaluative space. Shifting to the SF-6D (for example) would be a conceptual shift, whereas shifting to the 5L from the 3L is nothing but a methodological shift (with the added benefit of more up-to-date preference data).

To sum up

Round suggests that the pause is because of “an unexpected set of results” arising from the valuation exercise. That may be true in part. But I think it’s more likely the fault of dodgy public sector deals with the likes of Richard Branson and a consequently algorithm-fearing government. I totally agree with Round that, if NICE is considering a new outcome measure, they shouldn’t just be considering the 5L. But given that right now they are only considering the 5L, and that the decision is explicitly whether or not to adopt the 5L, there are no reasons not to do so.

The new value set is only a step change because we spent the last 25 years idling. Should we really just wait for NICE to assess the value set, accept it, and then return to our see-no-evil position for the next 25 years? No! The value set should be continually reviewed and redeveloped as methods improve and societal preferences evolve. The best available value set for England (and Wales) should be regularly considered by NICE as part of a review of the reference case. A special ‘pause’ for the new 5L value set will only serve to reinforce the longevity of compromised value sets in the future.

Yes, the EQ-5D-3L and its associated value set for the UK has been brilliantly useful over the years, but it now has a successor that – as far as we can tell – is better in many ways and at least as good in the rest. As a public body, NICE is conservative by nature. But researchers needn’t be.

Credits

The irrelevance of inference: (almost) 20 years on is it still irrelevant?

The Irrelevance of Inference was a seminal paper published by Karl Claxton in 1999. In it he outlines a stochastic decision making approach to the evaluation of health technologies. A key point that he makes is that we need only to examine the posterior mean incremental net benefit of one technology compared to another to make a decision. Other aspects of the distribution of incremental net benefits are irrelevant – hence the title.

I hated this idea. From a Bayesian perspective estimation and inference is a decision problem. Surely uncertainty matters! But, in the extra-welfarist framework that we generally conduct cost-effectiveness analysis in, it is irrefutable. To see why let’s consider a basic decision making framework.

There are three aspects to a decision problem. Firstly, there is a state of the world, \theta \in \Theta with density \pi(\theta). In this instance it is the net benefits in the population, but could be the state of the economy, or effectiveness of a medical intervention in other contexts, for example. Secondly, there is the possible actions denoted by a \in \mathcal{A}. There might be a discrete set of actions or a continuum of possibilities. Finally, there is the loss function L(a,\theta). The loss function describes the losses or costs associated with making decision a given that \theta is the state of nature. The action that should be taken is the one which minimises expected losses \rho(\theta,a)=E_\theta(L(a,\theta)). Minimising losses can be seen as analogous to maximising utility. We also observe data x=[x_1,...,x_N]' that provide information on the parameter \theta. Our state of knowledge regarding this parameter is then captured by the posterior distribution \pi(\theta|x). Our expected losses should be calculated with respect to this distribution.

Given the data and posterior distribution of incremental net benefits, we need to make a choice about a value (a Bayes estimator), that minimises expected losses. The opportunity loss from making the wrong decision is “the difference in net benefit between the best choice and the choice actually made.” So the decision falls down to deciding whether the incremental net benefits are positive or negative (and hence whether to invest), \mathcal{A}=[a^+,a^-]. The losses are linear if we make the wrong decision:

L(a^+,\theta) = 0 if \theta >0 and L(a^+,\theta) = \theta if \theta <0

L(a^-,\theta) = - \theta if \theta >0 and L(a^+,\theta) = 0 if \theta <0

So we should decide that the incremental net benefits are positive if

E_\theta(L(a^+,\theta)) - E_\theta(L(a^-,\theta)) > 0

which is equivalent to

\int_0^\infty \theta dF^{\pi(\theta|x)}(\theta) - \int_{-\infty}^0 -\theta dF^{\pi(\theta|x)}(\theta) = \int_{-\infty}^\infty \theta dF^{\pi(\theta|x)}(\theta) > 0

which is obviously equivalent to E(\theta|x)>0 – the posterior mean!

If our aim is simply the estimation of net benefits (so \mathcal{A} \subseteq \mathbb{R}), different loss functions lead to different estimators. If we have a squared loss function L(a, \theta)=|\theta-a|^2 then again we should choose the posterior mean. However, other choices of loss function lead to other estimators. The linear loss function, L(a, \theta)=|\theta-a| leads to the posterior median. And a ‘0-1’ loss function: L(a, \theta)=0 if a=\theta and L(a, \theta)=1 if a \neq \theta, gives the posterior mode, which is also the maximum likelihood estimator (MLE) if we have a uniform prior. This latter point does suggest that MLEs will not give the ‘correct’ answer if the net benefit distribution is asymmetric. The loss function is therefore important. But for the purposes of the decision between technologies I see no good reason to reject our initial loss function.

Claxton also noted that equity considerations could be incorporated through ‘adjustments to the measure of outcome’. This could be some kind of weighting scheme. However, this is where I might begin to depart from the claim of the irrelevance of inference. I prefer a social decision maker approach to evaluation in the vein of cost-benefit analysis as discussed by the brilliant Alan Williams. This approach allows for non-market outcomes that extra-welfarism might include but classical welfarism would exclude; their valuations could be arrived at by a political, democratic process or by other means. It also permits inequality aversion and other features that I find are a perhaps more accurate reflection of a political decision making approach. However, one must be aware of all the flaws and failures of this approach, which Williams so neatly describes.

In a social decision maker framework, the decision that should be made is the one that maximises a social welfare function. A utility function expresses social preferences over the distribution of utility in the population, the social welfare function aggregates utility and is usually assumed to be linear (utilitarian). If the utility function is inequality averse then the variance obviously does matter. But, in making this claim I am moving away from the arguments of Claxton’s paper and towards a discussion of the relative merits extra-welfarism and other approaches.

Perhaps the statement that inference was irrelevant was made just to capture our attention. After all the process of updating our knowledge of the net benefits of alternatives from data is inference. But Claxton’s statement refers more to the process of hypothesis testing and p-values (or Bayesian ranges of equivalents), the use of which has no place in decision making. On this point I wholeheartedly agree.