Chris Sampson’s journal round-up for 2nd April 2018

Every Monday our authors provide a round-up of some of the most recently published peer reviewed articles from the field. We don’t cover everything, or even what’s most important – just a few papers that have interested the author. Visit our Resources page for links to more journals or follow the HealthEconBot. If you’d like to write one of our weekly journal round-ups, get in touch.

Quality-adjusted life-years without constant proportionality. Value in Health Published 27th March 2018

The assumption of constant proportional trade-offs (CPTO) is at the heart of everything we do with QALYs. It assumes that the value of a given health state is constant regardless of its duration. This assumption has been repeatedly demonstrated to fail. This study looks for a non-constant alternative, something that hasn’t been done before. The authors consider a quality-adjusted lifespan and four functional forms for the relationship between time and the value of life: constant, discount, logarithmic, and power. These relationships were tested in an online survey of more than 5,000 people, which involved the completion of 30-40 time trade-off pairs based on the EQ-5D-5L. Respondents traded off health states of varying severities and durations. Initially, a saturated model (making no assumptions about functional form) was estimated. This demonstrated that the marginal value of lifespan is decreasing. The authors provide a set of values attached to different health states at different durations. Then, the econometric model is adjusted to suit a power model, with the power estimated for duration expressed in days, weeks, months, or years. The overall power value for time is 0.415, but different expressions of time could introduce bias; time expressed in days (power=0.403) loses value faster than time expressed in years (power=0.654). There are also some anomalies in the data that don’t fit the power function. For example, a single day of moderate problems can be valued as worse than death, whereas 7 days or more is not. Using ‘power QALYs’ could be the future. But the big remaining question is whether decision-makers ought to respond to people’s time preferences in this way.
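The basic idea of a ‘power QALY’ can be sketched numerically. This is a minimal illustration, not the paper’s econometric specification: it assumes the value of an episode is utility multiplied by duration raised to a power, with the powers (0.415, 0.403, 0.654) taken from the estimates reported above.

```python
def linear_qaly(utility, years):
    # Conventional QALY under constant proportional trade-offs:
    # value scales linearly with duration.
    return utility * years

def power_qaly(utility, years, power=0.415):
    # 'Power QALY' sketch: the marginal value of lifespan diminishes.
    # power=0.415 is the overall estimate reported in the study; how utility
    # enters alongside the time term here is a simplifying assumption.
    return utility * (years ** power)

# Ten years in full health is worth far less than ten linear QALYs
# under the power model, reflecting diminishing marginal value of time.
print(linear_qaly(1.0, 10))           # → 10.0
print(round(power_qaly(1.0, 10), 2))  # → 2.6
# Time expressed in days (power=0.403) loses value faster than time
# expressed in years (power=0.654).
```

Note how doubling duration less than doubles value whenever the power is below 1 — the non-constant proportionality that the saturated model revealed.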

A systematic review of studies comparing the measurement properties of the three-level and five-level versions of the EQ-5D. PharmacoEconomics [PubMed] Published 23rd March 2018

The debate about the EQ-5D-5L continues (on Twitter, at least). Conveniently, this paper addresses a concern held by some people – that we don’t understand the implications of using the 5L descriptive system. The authors systematically review papers comparing the measurement properties of the 3L and 5L, written in English or German. The review ended up including 24 studies. The measurement properties considered by the authors were: i) distributional properties, ii) informativity, iii) inconsistencies, iv) responsiveness, and v) test-retest reliability. The last of these involves consideration of index values. Each study was also quality-assessed, with all being considered of good to excellent quality. The studies covered numerous countries and different respondent groups, with sample sizes from the tens to the thousands. For most measurement properties, the findings for the 3L and 5L were very similar. Floor effects were generally below 5% and tended to be slightly reduced for the 5L. In some cases, the 5L was associated with major reductions in the proportion of people responding as 11111 – the well-recognised ceiling effect of the 3L. Just over half of the studies reported on informativity using Shannon’s H’ and Shannon’s J’, with the 5L providing consistently better results. Only three studies looked at responsiveness, with two slightly favouring the 5L and one favouring the 3L. The latter could be explained by the use of the 3L-5L crosswalk, which is inherently less responsive because it is a crosswalk. The overarching message is consistency. Business as usual. This is important because it means that the 3L and 5L descriptive systems provide comparable results (which is the basis for the argument I recently made that they are measuring the same thing). In some respects, this could be disappointing for 5L proponents because it suggests that the 5L descriptive system is not a lot better than the 3L. But it is a little better.
This study demonstrates that there are still uncertainties about the differences between 3L and 5L assessments of health-related quality of life. More comparative studies, of the kind included in this review, should be conducted so that we can better understand the differences in results that are likely to arise now that we have moved (relatively assuredly) towards using the 5L instead of the 3L.
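For readers unfamiliar with the informativity indices used in these comparisons: Shannon’s H’ is the entropy of the distribution of responses across a dimension’s levels, and Shannon’s J’ (evenness) is H’ divided by its maximum. A minimal sketch, using made-up response data rather than anything from the reviewed studies:

```python
from collections import Counter
from math import log2

def shannon_indices(responses, n_levels):
    """Shannon's H' (absolute informativity) and J' (evenness, H'/H'max)
    for the level responses on one EQ-5D dimension."""
    n = len(responses)
    counts = Counter(responses)
    h = -sum((c / n) * log2(c / n) for c in counts.values())
    h_max = log2(n_levels)  # attained when responses spread evenly over levels
    return h, h / h_max

# Hypothetical responses on one dimension (e.g. mobility):
three_l = [1, 1, 1, 1, 2, 2, 3]  # 3L: levels 1-3
five_l = [1, 1, 2, 2, 3, 4, 5]   # 5L: levels 1-5
print(shannon_indices(three_l, 3))
print(shannon_indices(five_l, 5))
# The 5L's extra levels raise the ceiling (H'max = log2(5) vs log2(3)),
# so a well-used 5L can carry more information per dimension.
```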

Preference-based measures to obtain health state utility values for use in economic evaluations with child-based populations: a review and UK-based focus group assessment of patient and parent choices. Quality of Life Research [PubMed] Published 21st March 2018

Calculating QALYs for kids continues to be a challenge. One of the challenges is the choice of which preference-based measure to use. Part of the problem here is that the EuroQol group – on which we rely for measuring adult health preferences – has been a bit slow. There’s the EQ-5D-Y, which has been around for a while, but it wasn’t developed with any serious thought about what kids value, and there still isn’t a value set for the UK. So, if we use anything, we use a variety of measures. In this study, the authors review the use of generic preference-based measures. 45 papers are identified, using 5 different measures: HUI2, HUI3, CHU-9D, EQ-5D-Y, and AQOL-6D. No prizes for guessing that the EQ-5D (adult version) was the most commonly used measure for child-based populations. Unfortunately, the review is a bit of a disappointment. And I’m not just saying that because at least one study on which I’ve worked isn’t cited. The search strategy is likely to miss many (perhaps most) trial-based economic evaluations with children, in which cost-utility analyses don’t usually get a lot of airtime. It’s hard to see how a review of this kind is useful if it isn’t comprehensive. But the goal of the paper isn’t just to summarise the use of measures to date. The focus is on understanding when researchers should use self- or proxy-response, and when a parent-child dyad might be most useful. The literature review can’t do much to answer that question, except to show that the identified studies tended to use parent-proxy respondents. But the study also reports on some focus groups, which are potentially more useful. These were conducted as part of a wider study relating to the design of an RCT. In five focus groups, participants were presented with the EQ-5D-Y and the CHU-9D. It isn’t clear why these two measures were selected. The focus groups included parents and some children over the age of 11.
Unfortunately, no real (qualitative) analysis was conducted, so the findings are limited. Parents expressed concern about a lack of sensitivity. Naturally, they thought that they knew best and should be the respondents. Among the young people reviewing the measures themselves, the EQ-5D-Y was perceived as more straightforward in referring to tangible experiences, whereas the CHU-9D’s severity levels were seen as more representative. Older adolescents tended to prefer the CHU-9D. The young people weren’t as sure of themselves as the adults and, though they expressed concern about their parents not understanding how they feel, were generally neutral about who ought to respond. The older kids wanted to speak for themselves. The paper provides a good overview of the different measures, which could be useful for researchers planning data collection for child health utility measurement. But due to the limitations of the review and the lack of analysis of the focus groups, the paper isn’t able to provide any real guidance.

Credits


Bad reasons not to use the EQ-5D-5L

We’ve seen a few editorials and commentaries popping up about the EQ-5D-5L recently, in Health Economics, PharmacoEconomics, and PharmacoEconomics again. All of these articles have – to varying extents – acknowledged the need for NICE to exercise caution in the adoption of the EQ-5D-5L. I don’t get it. I see no good reason not to use the EQ-5D-5L.

If you’re not familiar with the story of the EQ-5D-5L in England, read any of the linked articles, or see an OHE blog post summarising the tale. The important part of the story is that NICE has effectively recommended the use of the EQ-5D-5L descriptive system (the questionnaire), but not the new EQ-5D-5L value set for England. Of the new editorials and commentaries, Devlin et al are vaguely pro-5L, Round is vaguely anti-5L, and Brazier et al are vaguely on the fence. NICE has manoeuvred itself into a situation where it has to make a binary decision. 5L, or no 5L (which means sticking with the old EQ-5D-3L value set). Yet nobody seems keen to lay down their view on what NICE ought to decide. Maybe there’s a fear of being proven wrong.

So, herewith a list of reasons for exercising caution in the adoption of the EQ-5D-5L, which are either explicitly or implicitly cited by recent commentators, and why they shouldn’t determine NICE’s decision. The EQ-5D-5L value set for England should be recommended without hesitation.

We don’t know if the descriptive system is valid

Round argues that while the 3L has been validated in many populations, the 5L has not. Diabetes, dementia, deafness and depression are presented as cases where the 3L has been validated but the 5L has not. But the same goes for the reverse. There are plenty of situations in which the 3L has been shown to be problematic and the 5L has not. It’s simply a matter of time. This argument should only hold sway if we expect there to be more situations in which the 5L lacks validity, or if those violations are in some way more serious. I see no evidence of that. In fact, we see measurement properties improved with the 5L compared with the 3L. Devlin et al put the argument to bed by highlighting the growing body of evidence demonstrating that the 5L descriptive system is better than the 3L descriptive system in a variety of ways, without any real evidence that there are downsides to the descriptive expansion. And this – the comparison of the 3L and the 5L – is the correct comparison to be making, because the use of the 3L represents current practice. More fundamentally, it’s hard to imagine how the 5L descriptive system could be less valid than the 3L descriptive system. The limited number of validation studies using the 5L is only a problem if we can hypothesise reasons for the 5L to lack validity where the 3L held it. I can’t think of any. And anyway, NICE is apparently satisfied with the descriptive system; it’s the value set they’re worried about.

We don’t know if the preference elicitation methods are valid for states worse than dead

This argument is made by Brazier et al. The value set for England uses lead time TTO, which is a relatively new (and therefore less-tested) method. The problem is that we don’t know if any methods for valuing states worse than dead are valid because valuing states worse than dead makes no real sense. Save for pulling out a Ouija board, or perhaps holding a gun to someone’s head, we can never find out what is the most valid approach to valuing states worse than dead. And anyway, this argument fails on the same basis as the previous one: where is the evidence to suggest that the MVH approach to valuing states worse than dead (for the EQ-5D-3L) holds more validity than lead time TTO?

We don’t know if the EQ-VT was valid

As discussed by Brazier et al, it looks like there may have been some problems in the administration of the EuroQol valuation protocol (the EQ-VT) for the EQ-5D-5L value set. As a result, some of the data look a bit questionable, including large spikes in the distribution of values at 1.0, 0.5, 0.0, and -1.0. Certainly, this justifies further investigation. But it shouldn’t stall adoption of the 5L value set unless this constitutes a greater concern than the distributional characteristics of the 3L, and that’s not an argument I see anybody making. Perhaps there should have been more piloting of the EQ-VT, but that should (in itself) have no bearing on the decision of whether to use the 3L value set or the 5L value set. If the question is whether we expect the EQ-VT protocol to provide a more accurate estimation of health preferences than the MVH protocol – and it should be – then as far as I can tell there is no real basis for preferring the MVH protocol.

We don’t know if the value set (for England) is valid

Devlin et al state that, with respect to whether differences in the value sets represent improvements, “Until the external validation of the England 5L value set concludes, the jury is still out.” I’m not sure that’s true. I don’t know what the external validation is going to involve, but it’s hard to imagine a timely piece of work that could demonstrate the ‘betterness’ of the 5L value set compared with the 3L value set. Yes, a validation exercise could tell us whether the value set is replicable. But unless validation of the comparator (i.e. the 3L value set) is also attempted and judged on the same basis, it won’t be at all informative to NICE’s decision. Devlin et al state that there is a governmental requirement to validate the 5L value set for England. But beyond checking the researchers’ sums, it’s difficult to understand what that could even mean. Given that nobody seems to have defined ‘validity’ in this context, this is a very dodgy basis for determining adoption or non-adoption of the 5L.

5L-based evaluations will be different to 3L-based evaluations

Well, yes. Otherwise, what would be the point? Brazier et al present this as a justification for a ‘pause’ for an independent review of the 5L value set. The authors present the potential shift in priority from life-improving treatments to life-extending treatments as a key reason for a pause. But this is clearly a circular argument. Pausing to look at the differences will only bring those (and perhaps new) differences into view (though notably at a slower rate than if the 5L was more widely adopted). And then what? We pause for longer? Round also mentions this point as a justification for further research. This highlights a misunderstanding of what it means for NICE to be consistent. NICE has no responsibility to make decisions in 2018 precisely as it would have in 2008. That would be foolish and ignorant of methodological and contextual developments. What NICE needs to provide is consistency in the present – precisely what is precluded by the current semi-adoption of the EQ-5D-5L.

5L data won’t be comparable to 3L data

Round mentions this. But why does it matter? This is nothing compared to the trickery that goes on in economic modelling. The whole point of modelling is to do the best we can with the data we’ve got. If we have to compare an intervention for which outcomes are measured in 3L values with an intervention for which outcomes are measured in 5L values, then so be it. That is not a problem. It is only a problem if manufacturers strategically use 3L or 5L values according to whichever provides the best results. And you know what facilitates that? A pause, where nobody really knows what is going on and NICE has essentially said that the use of both 3L and 5L descriptive systems is acceptable. If you think mapping from 5L to 3L values is preferable to consistently using the 5L values then, well, I can’t reason with you, because mapping is never anything but a fudge (albeit a useful one).

There are problems with the 3L, so we shouldn’t adopt the 5L

There’s little to say on this point beyond asserting that we mustn’t let perfect be the enemy of the good. Show me what else you’ve got that could be more readily and justifiably introduced to replace the 3L. Round suggests that shifting from the 3L to the 5L is no different to shifting from the 3L to an entirely different measure, such as the SF-6D. That’s wrong. There’s a good reason that NICE should consider the 5L as the natural successor to the 3L. And that’s because it is. This is exactly what it was designed to be: a methodological improvement on the same conceptual footing. The key point here is that the 3L and 5L contain the same domains. They’re trying to capture health-related quality of life in a consistent way; they refer to the same evaluative space. Shifting to the SF-6D (for example) would be a conceptual shift, whereas shifting to the 5L from the 3L is nothing but a methodological shift (with the added benefit of more up-to-date preference data).

To sum up

Round suggests that the pause is because of “an unexpected set of results” arising from the valuation exercise. That may be true in part. But I think it’s more likely the fault of dodgy public sector deals with the likes of Richard Branson and a consequently algorithm-fearing government. I totally agree with Round that, if NICE is considering a new outcome measure, they shouldn’t just be considering the 5L. But given that right now they are only considering the 5L, and that the decision is explicitly whether or not to adopt the 5L, there are no reasons not to do so.

The new value set is only a step change because we spent the last 25 years idling. Should we really just wait for NICE to assess the value set, accept it, and then return to our see-no-evil position for the next 25 years? No! The value set should be continually reviewed and redeveloped as methods improve and societal preferences evolve. The best available value set for England (and Wales) should be regularly considered by NICE as part of a review of the reference case. A special ‘pause’ for the new 5L value set will only serve to reinforce the longevity of compromised value sets in the future.

Yes, the EQ-5D-3L and its associated value set for the UK have been brilliantly useful over the years, but they now have a successor that – as far as we can tell – is better in many ways and at least as good in the rest. As a public body, NICE is conservative by nature. But researchers needn’t be.


Chris Sampson’s journal round-up for 18th December 2017


Individualized glycemic control for U.S. adults with type 2 diabetes: a cost-effectiveness analysis. Annals of Internal Medicine [PubMed] Published 12th December 2017

The nature of diabetes – that it affects a lot of people and is associated with a wide array of physiological characteristics and health impacts – has given rise to recommendations for the individualisation of care. This paper evaluates individualisation of glycemic control targets. Specifically, the individualised programme allocated people to one of 3 HbA1c targets (<6.5%, <7%, <8%) according to their characteristics, while the comparator was based on a single fixed target (<7%). The researchers used a patient-level simulation model. Risk equations developed by the UKPDS study were used to predict diabetes complications and mortality. The baseline population was derived from the NHANES study for 2011-12 and comprises people who self-reported as having diabetes and who were at least 30 years old at diagnosis (to try and isolate type 2 diabetes). It’s not much of a surprise that the individualised approach dominated uniform intensive control, saving $13,547 on average per patient with a slight improvement in QALY outcomes. But the findings are not all in favour of individualisation. Quality of life improvements from a reduced medication burden were partially counteracted by a slight decrease in life years gained, due to a higher rate of (mortality-increasing) complications. The absolute lifetime risk of myocardial infarction was 1.39% higher with individualisation. A key outstanding question is how much the individualisation process would actually cost to get right. Granted, it probably wouldn’t cost as much as the savings estimated in this study, but the difficulty of ensuring adequate data quality to consistently inform individualisation should not be underestimated.
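The allocation logic being compared can be illustrated with a toy sketch. The three targets (<6.5%, <7%, <8%) and the fixed comparator (<7%) are from the paper, but the assignment rule below is entirely hypothetical – the study’s actual criteria follow clinical guidance on patient characteristics, not these invented cut-offs:

```python
def individualised_hba1c_target(age, years_since_diagnosis, has_major_complication):
    """Allocate a patient to one of the three HbA1c targets (%).
    Thresholds here are illustrative placeholders, not the study's criteria."""
    if age >= 75 or has_major_complication:
        return 8.0  # relaxed control: harms of intensive treatment likely outweigh gains
    if age >= 60 or years_since_diagnosis >= 10:
        return 7.0  # moderate control
    return 6.5      # intensive control for younger, uncomplicated patients

UNIFORM_TARGET = 7.0  # the comparator: one fixed target for everyone

print(individualised_hba1c_target(82, 20, True))   # → 8.0
print(individualised_hba1c_target(45, 3, False))   # → 6.5
```

The point of the comparison is visible even in this sketch: individualisation relaxes control for some patients (saving medication costs and treatment disutility) at the price of slightly higher complication risk for them.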

Microlevel prioritizations and incommensurability. Cambridge Quarterly of Healthcare Ethics [PubMed] Published 7th December 2017

This article concerns the ethical challenges of decision-making at the microlevel. For example, decisions may need to be made about allocating resources between 2 or more identifiable patients, perhaps within a particular clinic or amongst an individual clinician’s patients. The author asserts two relevant values: health need satisfaction and efficiency. Health need satisfaction is defined in terms of severity (regardless of capacity to benefit from available treatments), while efficiency is defined in terms of the maximisation of health benefit (subject to the effectiveness of treatment). The author then argues that these two values are incommensurable in the sense that we can have situations in which health need satisfaction is greater (or less) for a given choice over another, while efficiency could be lower (or higher). Thus, it is not always possible to rank choices given two non-cardinally-comparable values. It might not be clear whether it is better to treat patient A or patient B if the implications of doing so are different in terms of need and efficiency. The author then goes on to suggest some solutions to this apparent problem, starting by highlighting the need for decision makers (in this case clinicians) to recognise different decision paths. The first solution is to generate some guidelines that offer complete ordering of possible choices. These might be based on a process of weighting the different values (e.g. health need satisfaction and efficiency). The other ‘solution’ is to leave the decision to medical practitioners, who can create reasons for choices that may be unique to the case at hand. In this case, certain decision paths should be avoided, such as those that would entail discrimination. I have a lot of problems with this assessment of decision-making at the individual level. 
Mainly, the discussion is undermined by the fact that efficiency and health need satisfaction are entirely commensurable insofar as we care about either of them in relation to prioritisation in health care. We tend to understand both health need satisfaction and opportunity cost (the basis for estimating efficiency) in terms of health outcomes. The essay also fails to clearly identify the uniqueness of the challenge of microlevel decision-making as distinct from the process of creating clinical guidelines. This may call for a follow-up blog post…

EQ-5D: moving from three levels to five. Value in Health Published 6th December 2017

If you work on economic evaluation, the move from using the EQ-5D-3L to the EQ-5D-5L – in terms of the impact on our results – is one of the biggest methodological step changes in recent memory. We all know that the 5L (and its associated value set for England) is better than the 3L. Don’t we? So it is perhaps a bit disappointing that the step to the 5L has been so tentative. This editorial articulates the challenge. NICE makes standards. EuroQol does research. NICE was (relatively) satisfied with the 3L. EuroQol wasn’t. We have a clash between an inherently (perhaps necessarily) conservative institution and an inherently progressive institution. Hopefully, their interaction will put us on a sustainable path that achieves both methodological consistency and scientific rigour. This editorial also provides us with a DOI-citable account of the saga, including the development of the 5L value set for England and NICE’s subsequent memorandum.

Current UK practices on health economics analysis plans (HEAPs): are we using heaps of them? PharmacoEconomics [PubMed] Published 6th December 2017

You could get by for years in economic evaluation without even hearing about ‘health economics analysis plans’ (HEAPs). It probably depends on the policies set by the clinical trials unit (CTU) that you’re working with. The idea is that a HEAP is the health economics equivalent of a statistical analysis plan – a standard operating procedure (SOP) setting out how the trial data will be analysed before the analysis begins. This could aid transparency and consistency, and prevent dodgy practices. In this study, the researchers sought to find out whether HEAPs are actually being used, and their perceived role in clinical trials research. A survey targeted 46 UK CTUs, asking about the role of health economists in each unit and whether they used HEAP SOPs. Of 28 respondents, 11 reported having an embedded health economics team. A third of CTUs reported always having a HEAP in place. Most said they only used HEAPs ‘sometimes’, and publicly funded trials were said to be more likely to use one. The majority of respondents agreed that it was acceptable to produce the HEAP at any point prior to a lockdown of the data. The findings demonstrate inconsistency in who writes HEAPs and who is perceived to be the audience. I agree with the premise that we need HEAPs, though I’m not sure what they should look like, except that statistical analysis plans probably should not be used as a template. It would be good if some of these researchers took things a step further and figured out what ought to go into a HEAP, so that we can consistently employ their recommendations. If you’re on the HEALTHECON-ALL mailing list, you’ll know that they’re already on the case.
