IVF and the evaluation of policies that don’t affect particular persons

Over at the CLAHRC West Midlands blog, Richard Lilford (my boss, I should hasten to add!) writes about the difficulties with the economic evaluation of IVF. The post notes that there are a number of issues that “are not generally considered in the standard canon for health economic assessment” including the problems with measuring benefits, choosing an appropriate discount rate, indirect beneficiaries, and valuing the life of the as yet unborn child. Au contraire! These issues are the very bread and butter of health economics and economic evaluation research. But I would concede that their impact on estimates of cost-effectiveness is not nearly well enough integrated into standard assessments.

We’ve covered the issue of choosing a social discount rate on this blog before with regard to treatments with inter-generational effects. I want instead to consider the last point about how we should, in the most normative of senses, consider the life of the child born as a result of IVF.

It puts me in mind of the work of the late, great Derek Parfit. He could be said to have single-handedly developed the field of ethics about future people. He identified a number of ethical problems that still often don’t have satisfactory answers. Decisions like funding IVF have an impact on the very existence of persons. But these decisions do not affect the well-being or rights of any particular persons, rather, as Parfit terms them, general persons. Few would deny that we have moral obligations not to cause material harm to future generations. Most would reject the narrow view that the only relevant outcomes are those that affect actual, particular persons, the narrow person-centred view. For example, in considering the problem of global warming, we do not reject its consequences on future generations as being irrelevant. But there remains the question about how we morally treat these general, future persons. Parfit calls this the non-identity problem and it applies neatly to the issue of IVF.

To illustrate the problem of IVF consider the choice:

If we choose A: Adam and Barbara will not have children; Charles will not exist.
If we choose B: Adam and Barbara will have a child; Charles will live to 70.

If we ignore evidence that suggests quality of life actually declines after one has children, we will assume that Adam and Barbara having children will in fact raise their quality of life since they are fulfilling their preferences. It would then seem to be clear that the fact of Charles existing and living a healthy life would be better than him not existing at all and the net benefit of Choice B is greater. But then consider the next choice:

If we choose A: Adam and Barbara will not have children; Charles will not exist; Dianne will not exist.
If we choose B: Adam and Barbara will have a child; Charles will live to 70; Dianne will not exist.
If we choose C: Adam and Barbara will have children; Charles will live to 40; Dianne will live to 40.

Now, Choice C would still seem to be preferable to Choice B if all life years have the same quality of life. But we could continue adding children with shorter and shorter life expectancies until we have a large population that lives a very short life, which is certainly not a morally superior position. This is a version of Parfit’s repugnant conclusion, in which general utilitarian principles lead us to prefer a situation with a very large, very low quality of life population to a smaller, better off one. No satisfying solution has yet been proposed. For IVF this might imply increasing the probability of multiple births!
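The arithmetic behind this can be made explicit. What follows is a minimal sketch, assuming a simple total utilitarian sum in which every life year carries the same quality weight, as stipulated above. The function name and the final 'Choice Z' (100 children who each live for one year) are illustrative assumptions and not part of the original example or of Parfit's own formulation.

```python
# Total utilitarian tally of life years, assuming every life year has the
# same quality weight (the simplifying assumption used in the text).

def total_life_years(lifespans, quality_weight=1.0):
    """Sum quality-weighted life years over the people who come to exist."""
    return sum(years * quality_weight for years in lifespans)

# Life spans of the children under each choice from the example above.
choices = {
    "A": [],        # neither Charles nor Dianne exists
    "B": [70],      # Charles lives to 70
    "C": [40, 40],  # Charles and Dianne each live to 40
    # Hypothetical extension (not in the original example): 100 children
    # who each live for a single year.
    "Z": [1] * 100,
}

totals = {name: total_life_years(spans) for name, spans in choices.items()}

# Rank the choices from the largest total to the smallest.
ranking = sorted(totals, key=totals.get, reverse=True)

for name in ranking:
    print(name, totals[name])
```

On this tally the hypothetical Choice Z ranks above C, which in turn ranks above B, even though nobody in Z lives beyond a single year. That is the repugnant conclusion in miniature.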

We can also consider the “opposite” of IVF, contraception. In providing contraception we are superficially choosing Choice A above, which by the same utilitarian reasoning would be a worse situation than one in which those children are born. However, contraception is often used to be able to delay fertility decisions, so the choice actually becomes between a child being born earlier and living a worse life than a child being born later in better circumstances. So for a couple, things would go worse for the general person who is their first child, if things are worse for the particular person who is actually their first child. So it clearly matters how we frame the question as well.

We have a choice about how to weigh up the different situations if we reject the ‘narrow person-centred view’. On a no difference view, the effects on general and particular persons are weighted the same. On a two-tier view, the effects on general persons only matter a fraction of those on particular persons. For IVF this relates to how we weight Charles’s (and Dianne’s) life in an evaluation. But current practice is ambiguous about how we weigh up these lives, and if we have a ‘two-tier view’, how we weight the lives of general persons.
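One way to see how much turns on this choice is to treat the three views as a single weight applied to the benefits of general persons. The sketch below is only an illustration under that assumption: the parameter name `general_weight`, the two-tier weight of 0.5, and the gain to the parents are all hypothetical, and nothing here reflects how any existing evaluation is actually carried out.

```python
# Sketch: the three views expressed as a weight on benefits to general persons.
# All numbers are hypothetical and chosen only for illustration.

def weighted_benefit(particular_benefit, general_benefit, general_weight):
    """Combine benefits to particular and general persons.

    general_weight = 0      -> narrow person-centred view
    0 < general_weight < 1  -> two-tier view
    general_weight = 1      -> no difference view
    """
    if not 0.0 <= general_weight <= 1.0:
        raise ValueError("general_weight must lie between 0 and 1")
    return particular_benefit + general_weight * general_benefit

parents_gain = 2.0  # hypothetical gain to Adam and Barbara (particular persons)
child_years = 70.0  # Charles's life years (a general person at decision time)

views = {"narrow person-centred": 0.0, "two-tier": 0.5, "no difference": 1.0}
results = {
    name: weighted_benefit(parents_gain, child_years, weight)
    for name, weight in views.items()
}

for name, value in results.items():
    print(name, value)
```

With these made-up inputs the estimated benefit of funding IVF ranges from 2 to 72 depending solely on the view adopted, which is why the ambiguity in current practice matters.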

From an economic perspective, we often consider the values we place on benefits resulting from decisions to be determined by societal preferences. Generally, we ignore the fact that for many treatments the actual beneficiaries do not yet exist, which would suggest a ‘no difference view’. For example, when assessing the benefits of providing a treatment for childhood leukaemia, we don’t value the benefits to those particular children who have the disease differently to those general persons who may have the disease in the future. Perhaps we do not consider this since the provision of the treatment does not cause a difference in who will exist in the future. But equally when assessing the effects of interventions that may cause, in a counterfactual sense, changes in fertility decisions and the existence of persons, like social welfare payments or a lifesaving treatment for a woman of childbearing age, we do not think about the effects on the general persons that may be a child of that person or household. This would then suggest a ‘narrow person-centred view’.

There is clearly some inconsistency in how we treat general persons. For IVF evaluations, in particular, many avoid this question altogether and just estimate the cost per successful pregnancy, leaving the weighing up of benefits to later decision makers. While the arguments clearly don’t point to a particular conclusion, my tentative conclusion would be a ‘no difference view’. At any rate, it is an open question. In my rare lectures, I often remark that we spend a lot more time on empirical questions than questions of normative economics. This example shows how this can result in inconsistencies in how we choose to analyse and report our findings.

Bad reasons not to use the EQ-5D-5L

We’ve seen a few editorials and commentaries popping up about the EQ-5D-5L recently, in Health Economics, PharmacoEconomics, and PharmacoEconomics again. All of these articles have – to varying extents – acknowledged the need for NICE to exercise caution in the adoption of the EQ-5D-5L. I don’t get it. I see no good reason not to use the EQ-5D-5L.

If you’re not familiar with the story of the EQ-5D-5L in England, read any of the linked articles, or see an OHE blog post summarising the tale. The important part of the story is that NICE has effectively recommended the use of the EQ-5D-5L descriptive system (the questionnaire), but not the new EQ-5D-5L value set for England. Of the new editorials and commentaries, Devlin et al are vaguely pro-5L, Round is vaguely anti-5L, and Brazier et al are vaguely on the fence. NICE has manoeuvred itself into a situation where it has to make a binary decision. 5L, or no 5L (which means sticking with the old EQ-5D-3L value set). Yet nobody seems keen to lay down their view on what NICE ought to decide. Maybe there’s a fear of being proven wrong.

So, herewith a list of reasons for exercising caution in the adoption of the EQ-5D-5L, which are either explicitly or implicitly cited by recent commentators, and why they shouldn’t determine NICE’s decision. The EQ-5D-5L value set for England should be recommended without hesitation.

We don’t know if the descriptive system is valid

Round argues that while the 3L has been validated in many populations, the 5L has not. Diabetes, dementia, deafness and depression are presented as cases where the 3L has been validated but the 5L has not. But the same goes for the reverse. There are plenty of situations in which the 3L has been shown to be problematic and the 5L has not. It’s simply a matter of time. This argument should only hold sway if we expect there to be more situations in which the 5L lacks validity, or if those violations are in some way more serious. I see no evidence of that. In fact, we see measurement properties improved with the 5L compared with the 3L. Devlin et al put the argument to bed in highlighting the growing body of evidence demonstrating that the 5L descriptive system is better than the 3L descriptive system in a variety of ways, without any real evidence that there are downsides to the descriptive expansion. And this – the comparison of the 3L and the 5L – is the correct comparison to be making, because the use of the 3L represents current practice. More fundamentally, it’s hard to imagine how the 5L descriptive system could be less valid than the 3L descriptive system. That there are only a limited number of validation studies using the 5L is only a problem if we can hypothesise reasons for the 5L to lack validity where the 3L held it. I can’t think of any. And anyway, NICE is apparently satisfied with the descriptive system; it’s the value set they’re worried about.

We don’t know if the preference elicitation methods are valid for states worse than dead

This argument is made by Brazier et al. The value set for England uses lead time TTO, which is a relatively new (and therefore less-tested) method. The problem is that we don’t know if any methods for valuing states worse than dead are valid because valuing states worse than dead makes no real sense. Save for pulling out a Ouija board, or perhaps holding a gun to someone’s head, we can never find out what is the most valid approach to valuing states worse than dead. And anyway, this argument fails on the same basis as the previous one: where is the evidence to suggest that the MVH approach to valuing states worse than dead (for the EQ-5D-3L) holds more validity than lead time TTO?

We don’t know if the EQ-VT was valid

As discussed by Brazier et al, it looks like there may have been some problems in the administration of the EuroQol valuation protocol (the EQ-VT) for the EQ-5D-5L value set. As a result, some of the data look a bit questionable, including large spikes in the distribution of values at 1.0, 0.5, 0.0, and -1.0. Certainly, this justifies further investigation. But it shouldn’t stall adoption of the 5L value set unless this constitutes a greater concern than the distributional characteristics of the 3L, and that’s not an argument I see anybody making. Perhaps there should have been more piloting of the EQ-VT, but that should (in itself) have no bearing on the decision of whether to use the 3L value set or the 5L value set. If the question is whether we expect the EQ-VT protocol to provide a more accurate estimation of health preferences than the MVH protocol – and it should be – then as far as I can tell there is no real basis for preferring the MVH protocol.

We don’t know if the value set (for England) is valid

Devlin et al state that, with respect to whether differences in the value sets represent improvements, “Until the external validation of the England 5L value set concludes, the jury is still out.” I’m not sure that’s true. I don’t know what the external validation is going to involve, but it’s hard to imagine a punctual piece of work that could demonstrate the ‘betterness’ of the 5L value set compared with the 3L value set. Yes, a validation exercise could tell us whether the value set is replicable. But unless validation of the comparator (i.e. the 3L value set) is also attempted and judged on the same basis, it won’t be at all informative to NICE’s decision. Devlin et al state that there is a governmental requirement to validate the 5L value set for England. But beyond checking the researchers’ sums, it’s difficult to understand what that could even mean. Given that nobody seems to have defined ‘validity’ in this context, this is a very dodgy basis for determining adoption or non-adoption of the 5L.

5L-based evaluations will be different to 3L-based evaluations

Well, yes. Otherwise, what would be the point? Brazier et al present this as a justification for a ‘pause’ for an independent review of the 5L value set. The authors present the potential shift in priority from life-improving treatments to life-extending treatments as a key reason for a pause. But this is clearly a circular argument. Pausing to look at the differences will only bring those (and perhaps new) differences into view (though notably at a slower rate than if the 5L were more widely adopted). And then what? We pause for longer? Round also mentions this point as a justification for further research. This highlights a misunderstanding of what it means for NICE to be consistent. NICE has no responsibility to make decisions in 2018 precisely as it would have in 2008. That would be foolish and ignorant of methodological and contextual developments. What NICE needs to provide is consistency in the present – precisely what is precluded by the current semi-adoption of the EQ-5D-5L.

5L data won’t be comparable to 3L data

Round mentions this. But why does it matter? This is nothing compared to the trickery that goes on in economic modelling. The whole point of modelling is to do the best we can with the data we’ve got. If we have to compare an intervention for which outcomes are measured in 3L values with an intervention for which outcomes are measured in 5L values, then so be it. That is not a problem. It is only a problem if manufacturers strategically use 3L or 5L values according to whichever provides the best results. And you know what facilitates that? A pause, where nobody really knows what is going on and NICE has essentially said that the use of both 3L and 5L descriptive systems is acceptable. If you think mapping from 5L to 3L values is preferable to consistently using the 5L values then, well, I can’t reason with you, because mapping is never anything but a fudge (albeit a useful one).

There are problems with the 3L, so we shouldn’t adopt the 5L

There’s little to say on this point beyond asserting that we mustn’t let perfect be the enemy of the good. Show me what else you’ve got that could be more readily and justifiably introduced to replace the 3L. Round suggests that shifting from the 3L to the 5L is no different to shifting from the 3L to an entirely different measure, such as the SF-6D. That’s wrong. There’s a good reason that NICE should consider the 5L as the natural successor to the 3L. And that’s because it is. This is exactly what it was designed to be: a methodological improvement on the same conceptual footing. The key point here is that the 3L and 5L contain the same domains. They’re trying to capture health-related quality of life in a consistent way; they refer to the same evaluative space. Shifting to the SF-6D (for example) would be a conceptual shift, whereas shifting to the 5L from the 3L is nothing but a methodological shift (with the added benefit of more up-to-date preference data).

To sum up

Round suggests that the pause is because of “an unexpected set of results” arising from the valuation exercise. That may be true in part. But I think it’s more likely the fault of dodgy public sector deals with the likes of Richard Branson and a consequently algorithm-fearing government. I totally agree with Round that, if NICE is considering a new outcome measure, they shouldn’t just be considering the 5L. But given that right now they are only considering the 5L, and that the decision is explicitly whether or not to adopt the 5L, there are no reasons not to do so.

The new value set is only a step change because we spent the last 25 years idling. Should we really just wait for NICE to assess the value set, accept it, and then return to our see-no-evil position for the next 25 years? No! The value set should be continually reviewed and redeveloped as methods improve and societal preferences evolve. The best available value set for England (and Wales) should be regularly considered by NICE as part of a review of the reference case. A special ‘pause’ for the new 5L value set will only serve to reinforce the longevity of compromised value sets in the future.

Yes, the EQ-5D-3L and its associated value set for the UK have been brilliantly useful over the years, but the 3L now has a successor that – as far as we can tell – is better in many ways and at least as good in the rest. As a public body, NICE is conservative by nature. But researchers needn’t be.

On the commensurability of efficiency

In this week’s round-up, I highlighted a recent paper in the journal Cambridge Quarterly of Healthcare Ethics. There are some interesting ideas presented regarding the challenge of decision-making at the individual patient level, and in particular a supposed trade-off between achieving efficiency and satisfying health need.

The gist of the argument is that these two ‘values’ are incommensurable in the sense that the comparative value of two choices is ambiguous where the achievement of efficiency and need satisfaction need to be traded. In the journal round-up, I highlighted two criticisms. First, I suggested that efficiency and health need satisfaction are commensurable. Second, I suggested that the paper did not adequately tackle the special nature of microlevel decision-making. The author – Anders Herlitz – was gracious enough to respond to my comments with several tweets.

Here, I’d like to put forth my reasoning on the subject (albeit with an ignorance of the background literature on incommensurability and other matters of ethics).

Consider a machine gun

A machine gun is far more efficient than a pistol, right? Well, maybe. A machine gun can shoot more bullets than a pistol over a sustained period. Likewise, a doctor who can treat 50 patients per day is more efficient than a doctor who can treat 20 patients per day.

However, the premise of this entire discussion, as established by Herlitz, is values. Herlitz introduces efficiency as a value and not as some dispassionate indicator of return on input. When we are considering values – as we necessarily are when we are discussing decision-making and more generally ‘what matters’ – we cannot take the ‘more bullets’ approach to assessing efficiency.

That’s because ‘more bullets’ is not what we mean when we talk about the value of efficiency. The production function is fundamental to our understanding of efficiency as a value. Once values are introduced, it is plain to see that in the context of war (where value is attached to a greater number of deaths) a machine gun may very well be considered more efficient. However, bearing a machine gun is far less efficient than bearing a pistol in a civilian context because we value a situation that results in fewer deaths.

In this analogy, bullets are health care and deaths are (somewhat confusingly, I admit) health improvement. Treating more people is not better because we want to provide more health care, but because we want to improve people’s health (along with some other basket of values).

Efficiency only has value with respect to the outcome in whose terms it is defined, and is therefore always commensurable with that outcome. That is, the production function is an inherent and necessary component of an efficiency to which we attach value.
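To make the point concrete, here is a minimal sketch using the two doctors from earlier. Only the figures of 50 and 20 patients per day come from the example above; the health gain per patient, the dictionary keys and the function names are hypothetical assumptions made purely for illustration.

```python
# Sketch: efficiency is only defined relative to the outcome we value.
# The health-gain figures are hypothetical; only the 50 and 20 patients
# per day come from the example in the text.

doctors = {
    "doctor_50": {"days": 1, "patients": 50, "health_gain_per_patient": 0.01},
    "doctor_20": {"days": 1, "patients": 20, "health_gain_per_patient": 0.05},
}

def efficiency(doctor, outcome):
    """Return units of the chosen outcome produced per day of input."""
    if outcome == "patients_treated":
        produced = doctor["patients"]
    elif outcome == "health_improvement":
        produced = doctor["patients"] * doctor["health_gain_per_patient"]
    else:
        raise ValueError(f"unknown outcome: {outcome}")
    return produced / doctor["days"]

def most_efficient(outcome):
    """Name the doctor who produces the most of the chosen outcome per day."""
    return max(doctors, key=lambda name: efficiency(doctors[name], outcome))

print(most_efficient("patients_treated"))
print(most_efficient("health_improvement"))
```

Under these assumed figures the ranking of the two doctors flips depending on which outcome is plugged in. There is no ranking at all until an outcome has been chosen, which is the sense in which efficiency cannot be traded off against the outcome that defines it.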

I believe that Herlitz’s idea of incommensurability could be a useful one. Different outcomes may well be incommensurable in the way described in the paper. But efficiency has no place in this discussion. The incommensurability Herlitz describes in his paper seems to be a simple conflict between utilitarianism and prioritarianism, though I don’t have the wherewithal to pursue that argument so I’ll leave it there!

Microlevel efficiency trade-offs

Having said all that, I do think there could be a special decision-making challenge regarding efficiency at the microlevel. And that might partly explain Herlitz’s suggestion that efficiency is incommensurable with other outcomes.

There could be an incommensurability between values that can be measured in their achievement at the individual level (e.g. health improvement) and values that aren’t measured with individual-level outcomes (e.g. prioritisation of more severe patients). Those two outcomes are incommensurable in the way Herlitz described, but the simple fact that we tend to think about the former as an efficiency argument and the latter as an equity argument is irrelevant. We could think about both in efficiency terms (for example, treating n patients of severity x is more efficient than treating n-1 patients of severity x, or n patients of severity x-1); we just don’t. The difficulty is that this equity argument is meaningless at the individual level because it relies on information about outcomes outside the microlevel. The real challenge at the microlevel, therefore, is to acknowledge scope for efficiency in all outcomes of value. The incommensurability that matters is between microlevel and higher-level assessments of value.

As an aside, I was surprised that the Rule of Rescue did not get a mention in the paper. This is a perfect example of a situation in which arguments that tend to be made on efficiency grounds are thrown out and another value (the duty to save an immediately endangered life) takes over. One doesn’t need to think very hard about how Rule of Rescue decision-making could be framed as efficient.

In short, efficiency is never incommensurable because it is never an end in itself. If you’re concerned with being more efficient for the sake of being more efficient then you are probably not making very efficient decisions.
