Chris Sampson’s journal round-up for 23rd July 2018

Every Monday our authors provide a round-up of some of the most recently published peer reviewed articles from the field. We don’t cover everything, or even what’s most important – just a few papers that have interested the author. Visit our Resources page for links to more journals or follow the HealthEconBot. If you’d like to write one of our weekly journal round-ups, get in touch.

Quantifying life: understanding the history of quality-adjusted life-years (QALYs). Social Science & Medicine [PubMed] Published 3rd July 2018

We’ve had some fun talking about the history of the QALY here on this blog. The story of how the QALY came to be important in health policy has been obscured. This paper seeks to address that. The research adopts a method called ‘multiple streams analysis’ (MSA) in order to explain how QALYs caught on. The MSA framework identifies three streams – policy, politics, and problems – and considers the ‘policy entrepreneurs’ involved. For this study, archival material was collected from the National Archives, Department of Health files, and the University of York. The researchers also conducted 44 semi-structured interviews with academics and civil servants.

The problem stream highlights shocks to the UK economy in the late 1960s, coupled with growth in health care costs due to innovations and changing expectations. Cost-effectiveness began to be studied and, increasingly, policymaking was meant to be research-based and accountable. By the 80s, the likes of Williams and Maynard were drawing attention to apparent inequities and inefficiencies in the health service. The policy stream gets going in the 40s and 50s when health researchers started measuring quality of life. By the early 60s, the idea of standardising these measures to try to rank health states was on the table. Through the late 60s and early 70s, government economists proliferated and proved themselves useful in health policy. The meeting of Rachel Rosser and Alan Williams in the mid-70s led to the creation of QALYs as we know them, combining quantity and quality of life on a 0-1 scale. Having acknowledged inefficiencies and inequities in the health service, UK politicians and medics were open to new ideas, but remained unconvinced by the QALY. Yet it was a willingness to consider the need for rationing that put the wheels in motion for NICE, and the politics stream – like the problem and policy streams – characterises favourable conditions for the use of the QALY.
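For readers less familiar with the arithmetic behind 'combining quantity and quality of life on a 0-1 scale', here is a minimal sketch. The health profiles and utility values are invented for illustration and are not taken from the paper.

```python
def qalys(profile):
    """Sum duration (years) x utility weight over a health profile."""
    return sum(years * utility for years, utility in profile)

# Hypothetical profiles: (years, utility), where 1 = full health and 0 = dead
with_treatment = [(2.0, 0.8), (3.0, 0.6)]
without_treatment = [(2.0, 0.5), (1.0, 0.4)]

# The QALY gain is the difference between the two profiles
gain = qalys(with_treatment) - qalys(without_treatment)
print(f"QALY gain: {gain:.2f}")
```

The point of the single index is that a treatment that extends life and one that improves it can be compared on the same footing.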

The MSA framework also considers ‘policy entrepreneurs’ who broker the transition from idea to implementation. The authors focus on the role of Alan Williams and of the Economic Advisers’ Office. Williams was key in translating economic ideas into forms that policymakers could understand. Meanwhile, the Economic Advisers’ Office encouraged government economists to engage with academics at HESG and later the QoL Measurement Group (which led to the creation of EuroQol).

The main takeaway from the paper is that good ideas only prevail in the right conditions and with the right people. It’s important to maintain multi-disciplinary and multi-stakeholder networks. In the case of the QALY, the two-way movement of economists between government and academia was crucial.

I don’t completely understand or appreciate the MSA framework, but this paper is an enjoyable read. My only reservation is with the way the authors describe the QALY as being a dominant aspect of health policy in the UK. I don’t think that’s right. It’s dominant within a niche of a niche of a niche – that is, health technology assessment for new pharmaceuticals. An alternative view is that the QALY has in fact languished in a quiet corner of British policymaking, and been completely excluded in some other countries.

Accuracy of patient recall for self‐reported doctor visits: is shorter recall better? Health Economics [PubMed] Published 2nd July 2018

In designing studies, such as clinical trials, I have always recommended that self-reported resource use be collected no less frequently than every 3 months. This is partly based on something I once read somewhere that I can’t remember, but partly also on some logic that the accuracy of people’s recall decays over time. This paper has come to tell me how wrong I’ve been.

The authors start by highlighting that recall can be subject to omission, whereby respondents forget relevant information, or commission, whereby respondents include events that did not occur. A key manifestation of the latter is ‘telescoping’, whereby events are included from outside the recall period. We might expect commission to be more likely in short recalls and omission to be more common for long recalls. But there’s very little research on this regarding health service use.

This study uses data from a large trial in diabetes care in Australia, in which 5,305 participants were randomised to receive either 2-week, 3-month, or 12-month recall for how many times they had seen a doctor. Then, the trial data were matched with Medicare data to identify the true levels of resource use.

Over 92% of participants in the 12-month recall group made an error, compared with 76% in the 3-month group and 46% in the 2-week group. The patterns of errors were different. There was very little under-reporting in the 2-week recall sample, with 3-month giving the most over-reporting and 12-month giving the most under-reporting. 12-month recall was associated with the largest number of days reported in error. However, when the authors account for the longer period being considered, and estimate a relative error, the impact of misreporting is smallest for the 12-month recall and greatest for the 2-week recall. This translates into a smaller overall bias for the longest recall period. The authors also find that older, less educated, unemployed, and low-income patients exhibit higher measurement errors.
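One way to see why a small error in a short recall window can matter more than a larger error in a long one is to scale each error up to a common 12-month horizon. The sketch below is my own reading of the intuition, with made-up respondents; the paper's definition of relative error may well differ.

```python
def annualised_error(reported, actual, recall_days):
    """Scale the raw reporting error (reported - actual) to a 365-day horizon."""
    return (reported - actual) * (365 / recall_days)

# Hypothetical respondents: (reported visits, visits in claims data, recall window in days)
cases = {
    "2-week": (1, 0, 14),      # one visit 'telescoped' into the window
    "12-month": (8, 11, 365),  # three visits forgotten
}

for label, (reported, actual, days) in cases.items():
    raw = reported - actual
    scaled = annualised_error(reported, actual, days)
    print(f"{label}: raw error {raw:+d}, annualised error {scaled:+.1f}")
```

In this toy case the 12-month respondent makes the larger raw error, but the 2-week respondent's single telescoped visit becomes the bigger problem once it is extrapolated to a year.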

Health surveys and comparative studies that estimate resource use over a long period of time should use 12-month recall unless they can find a reason to do otherwise. The authors provide some examples from economic evaluations to demonstrate how selecting shorter recall periods could result in recommending the wrong decisions. It’s worth trying to understand the reasons why people can more accurately recall service use over 12 months. That way, data collection methods could be designed to optimise recall accuracy.

Who should receive treatment? An empirical enquiry into the relationship between societal views and preferences concerning healthcare priority setting. PLoS One [PubMed] Published 27th June 2018

Part of the reason the QALY faces opposition is that it has been used in a way that might not reflect societal preferences for resource allocation. In particular, the idea that ‘a QALY is a QALY is a QALY’ may conflict with notions of desert, severity, or process. We’re starting to see more evidence for groups of people holding different views, which makes it difficult to come up with decision rules to maximise welfare. This study considers some of the perspectives that people adopt, which have been identified in previous research – ‘equal right to healthcare’, ‘limits to healthcare’, and ‘effective and efficient healthcare’ – and looks at how they are distributed in the Netherlands. Using four willingness to trade-off (WTT) exercises, the authors explore the relationship between these views and people’s preferences about resource allocation. Trade-offs are between quality vs quantity of life, health maximisation vs equality, children vs the elderly, and lifestyle-related risk vs adversity. The authors sought to test several hypotheses: i) that ‘equal right’ respondents have a lower WTT; ii) ‘limits to healthcare’ people express a preference for health gains, health maximisation, and treating people with adversity; and iii) ‘effective and efficient’ people support health maximisation, treating children, and treating people with adversity.

A representative online sample of adults in the Netherlands (n=261) was recruited. The first part of the questionnaire collected socio-demographic information. The second part asked questions necessary to allocate people to one of the three perspectives using Likert scales based on a previous study. The third part of the questionnaire consisted of the four reimbursement scenarios. Participants were asked to identify the point (in terms of the relevant quantities) at which they would be indifferent between two options.

The distribution of the viewpoints was 65% ‘equal right’, 23% ‘limits to healthcare’, and 7% ‘effective and efficient’. 6% couldn’t be matched to one of the three viewpoints. In each scenario, people had the option to opt out of trading. 24% of respondents were non-traders for all scenarios and, of these, 78% were of the ‘equal right’ viewpoint. Unfortunately, a lot of people opted out of at least one of the trades, and for a wide variety of reasons. Decisionmakers can’t opt out, so I’m not sure how useful this is.

The authors describe many associations between individual characteristics, viewpoints, and WTT results. Overall, the tested hypotheses were broadly supported. While the findings showed that different groups were more or less willing to trade, the points of indifference for traders within the groups did not vary. So while you can’t please everyone in health care priority setting, this study shows how policies might be designed to satisfy the preferences of people with different perspectives.

Credits

James Lomas’s journal round-up for 21st May 2018

Decision making for healthcare resource allocation: joint v. separate decisions on interacting interventions. Medical Decision Making [PubMed] Published 23rd April 2018

While it may be uncontroversial that including all of the relevant comparators in an economic evaluation is crucial, a careful examination of this statement raises some interesting questions. Which comparators are relevant? For those that are relevant, how crucial is it that they are not excluded? The answer to the first of these questions may seem obvious, that all feasible mutually exclusive interventions should be compared, but this is in fact deceptive. Dakin and Gray highlight inconsistency between guidelines as to what constitutes interventions that are ‘mutually exclusive’ and so try to re-frame the distinction according to whether interventions are ‘incompatible’ – when it is physically impossible to implement both interventions simultaneously – and, if not, whether interventions are ‘interacting’ – where the costs and effects of the simultaneous implementation of A and B do not equal the sum of these parts. What I really like about this paper is that it has a very pragmatic focus. Inspired by policy arrangements, for example single technology appraisals, and the difficulty in capturing all interactions, Dakin and Gray provide a reader-friendly flow diagram to illustrate cases where excluding interacting interventions from a joint evaluation is likely to have a big impact, and furthermore propose a sequencing approach that avoids the major problems in evaluating separately what should be considered jointly. Essentially when we have interacting interventions at different points of the disease pathway, evaluating separately may not be problematic if we start at the end of the pathway and move backwards, similar to the method of backward induction used in sequence problems in game theory. There are additional related questions that I’d like to see these authors turn to next, such as how to include interaction effects between interventions and, in particular, how to evaluate system-wide policies that may interact with a very large number of interventions.

This paper makes a great contribution to answering all of these questions by establishing a framework that clearly distinguishes concepts that had previously been subject to muddied thinking.
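To make the idea of ‘interacting’ interventions concrete, here is a small sketch of how separate and joint decisions can diverge. The threshold, costs, and QALYs are all hypothetical, and this is not Dakin and Gray's own example or their sequencing method; it only illustrates the problem they describe.

```python
THRESHOLD = 20_000  # hypothetical value placed on one QALY (GBP)

# Hypothetical incremental (cost, QALYs) per patient versus doing neither
strategies = {
    "neither": (0.0, 0.00),
    "A only": (5_000.0, 0.40),
    "B only": (4_000.0, 0.34),
    "A and B": (9_000.0, 0.45),  # interaction: less than 0.40 + 0.34
}

def net_benefit(cost, qalys, threshold=THRESHOLD):
    """Net monetary benefit: QALYs valued at the threshold, minus cost."""
    return qalys * threshold - cost

nb = {name: net_benefit(cost, qalys) for name, (cost, qalys) in strategies.items()}

# Separate decisions: each intervention is judged alone against 'neither'
adopt_a = nb["A only"] > nb["neither"]
adopt_b = nb["B only"] > nb["neither"]
separate_choice = {
    (True, True): "A and B",
    (True, False): "A only",
    (False, True): "B only",
    (False, False): "neither",
}[(adopt_a, adopt_b)]

# Joint decision: all four mutually exclusive combinations compared at once
joint_choice = max(nb, key=nb.get)

print("separate appraisals imply:", separate_choice)
print("joint appraisal implies:", joint_choice)
```

With these made-up numbers, each intervention looks worthwhile on its own, so separate appraisals end up funding both, whereas the joint comparison picks A alone because the combined effect falls well short of the sum of the parts.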

When cost-effective interventions are unaffordable: integrating cost-effectiveness and budget impact in priority setting for global health programs. PLoS Medicine [PubMed] Published 2nd October 2017

In my opinion, there are many things that health economists shouldn’t try to include when they conduct cost-effectiveness analysis. Affordability is not one of these. This paper is great, because Bilinski et al shine a light on the worldwide phenomenon of interventions being found to be ‘cost-effective’ but not affordable. A particular quote – that it would be financially impossible to implement all interventions that are found to be ‘very cost-effective’ in many low- and middle-income countries – is quite shocking. Bilinski et al compare and contrast cost-effectiveness analysis and budget impact analysis, and argue that there are four key reasons why something could be ‘cost-effective’ but not affordable: 1) judging cost-effectiveness with reference to an inappropriate cost-effectiveness ‘threshold’, 2) adoption of a societal perspective that includes costs not falling upon the payer’s budget, 3) failing to make explicit consideration of the distribution of costs over time, and 4) the use of an inappropriate discount rate that may not accurately reflect the borrowing and investment opportunities facing the payer. They then argue that, because of this, cost-effectiveness analysis should be presented along with budget impact analysis so that the decision-maker can base a decision on both analyses. I don’t disagree with this as a pragmatic interim solution, but – by highlighting these four reasons for divergence of results with such important economic consequences – I think that there will be further-reaching implications of this paper.

To my mind, the paper by Bilinski et al essentially serves as a call to arms for researchers to try to come up with frameworks and estimates so that the conduct of cost-effectiveness analysis can be improved in order that paradoxical results are no longer produced, decisions are more usefully informed by cost-effectiveness analysis, and the opportunity costs of large budget impacts are properly evaluated – especially in the context of low- and middle-income countries where the foregone health from poor decisions can be so significant.
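The gap between ‘cost-effective’ and ‘affordable’ is easy to reproduce with a back-of-the-envelope calculation. Every figure below is hypothetical and chosen only to show how the two tests can disagree; none comes from Bilinski et al.

```python
def icer(delta_cost, delta_qalys):
    """Incremental cost-effectiveness ratio: extra cost per extra QALY."""
    return delta_cost / delta_qalys

# All figures hypothetical
cost_per_patient = 600.0        # incremental cost per person treated
qalys_per_patient = 0.5         # incremental QALYs per person treated
eligible_patients = 2_000_000   # size of the eligible population
threshold = 3_000.0             # cost-effectiveness threshold per QALY
annual_budget = 400_000_000.0   # payer's total available budget

# Test 1: cost-effectiveness, judged per patient against the threshold
ratio = icer(cost_per_patient, qalys_per_patient)
cost_effective = ratio <= threshold

# Test 2: affordability, judged on total spend against the budget
budget_impact = cost_per_patient * eligible_patients
affordable = budget_impact <= annual_budget

print(f"ICER: {ratio:,.0f} per QALY, cost-effective: {cost_effective}")
print(f"Budget impact: {budget_impact:,.0f}, affordable: {affordable}")
```

Here the ratio sits comfortably below the threshold, yet treating everyone eligible would cost three times the whole budget. The per-patient ratio says nothing about scale, which is why the choice of threshold matters so much.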

Patient cost-sharing, socioeconomic status, and children’s health care utilization. Journal of Health Economics [PubMed] Published 16th April 2018

This paper evaluates a policy using a combination of regression discontinuity design and difference-in-differences methods. Not only does it do that, but it tackles an important policy question using a detailed population-wide dataset (a set of linked datasets, more accurately). As if that weren’t enough, one of the policy reforms was actually implemented as a result of a vote where two politicians ‘accidentally pressed the wrong button’, reducing concerns that the policy may have in some way not been exogenous. Needless to say, I found the method employed in this paper to be a pretty convincing identification strategy. The policy question at hand is about whether demand for GP visits for children in the Swedish county of Scania (Skåne) is affected by cost-sharing. Cost-sharing for GP visits has occurred for different age groups over different periods of time, providing the basis for regression discontinuities around the age threshold and treated and control groups over time. Nilsson and Paul find results suggesting that when health care is free of charge, doctor visits by children increase by 5-10%. In this context, doctor visits happened subject to telephone triage by a nurse and so in this sense it can be argued that all of these visits would be ‘needed’. Further, Nilsson and Paul find that the sensitivity to price is concentrated in low-income households, and is greater among sickly children. The authors contextualise their results very well and, in addition to that context, I can’t deny that it also particularly resonated with me to read this approaching the 70th birthday of the NHS – a system where cost-sharing has never been implemented for GP visits by children. This paper is clearly also highly relevant to that debate that has surfaced again and again in the UK.
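For anyone who wants a reminder of the difference-in-differences logic, a minimal two-group, two-period sketch follows. It covers only the difference-in-differences half of the identification strategy (not the regression discontinuity), and the visit rates are invented rather than taken from Nilsson and Paul.

```python
# Hypothetical mean GP visits per child per year
visits = {
    ("treated", "before"): 2.00,  # age group that later gets free care
    ("treated", "after"): 2.20,
    ("control", "before"): 2.10,  # age group whose fees do not change
    ("control", "after"): 2.14,
}

def diff_in_diff(means):
    """Change in the treated group minus change in the control group."""
    treated_change = means[("treated", "after")] - means[("treated", "before")]
    control_change = means[("control", "after")] - means[("control", "before")]
    return treated_change - control_change

effect = diff_in_diff(visits)
relative_effect = effect / visits[("treated", "before")]

print(f"Estimated effect: {effect:.2f} visits per child per year")
print(f"Relative to baseline: {relative_effect:.0%}")
```

The control group's change stands in for what would have happened to the treated group without the reform, so the estimate is only as good as that parallel-trends assumption.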

Bad reasons not to use the EQ-5D-5L

We’ve seen a few editorials and commentaries popping up about the EQ-5D-5L recently, in Health Economics, PharmacoEconomics, and PharmacoEconomics again. All of these articles have – to varying extents – acknowledged the need for NICE to exercise caution in the adoption of the EQ-5D-5L. I don’t get it. I see no good reason not to use the EQ-5D-5L.

If you’re not familiar with the story of the EQ-5D-5L in England, read any of the linked articles, or see an OHE blog post summarising the tale. The important part of the story is that NICE has effectively recommended the use of the EQ-5D-5L descriptive system (the questionnaire), but not the new EQ-5D-5L value set for England. Of the new editorials and commentaries, Devlin et al are vaguely pro-5L, Round is vaguely anti-5L, and Brazier et al are vaguely on the fence. NICE has manoeuvred itself into a situation where it has to make a binary decision. 5L, or no 5L (which means sticking with the old EQ-5D-3L value set). Yet nobody seems keen to lay down their view on what NICE ought to decide. Maybe there’s a fear of being proven wrong.

So, herewith a list of reasons for exercising caution in the adoption of the EQ-5D-5L, which are either explicitly or implicitly cited by recent commentators, and why they shouldn’t determine NICE’s decision. The EQ-5D-5L value set for England should be recommended without hesitation.

We don’t know if the descriptive system is valid

Round argues that while the 3L has been validated in many populations, the 5L has not. Diabetes, dementia, deafness and depression are presented as cases where the 3L has been validated but the 5L has not. But the same goes for the reverse. There are plenty of situations in which the 3L has been shown to be problematic and the 5L has not. It’s simply a matter of time. This argument should only hold sway if we expect there to be more situations in which the 5L lacks validity, or if those violations are in some way more serious. I see no evidence of that. In fact, we see measurement properties improved with the 5L compared with the 3L. Devlin et al put the argument to bed in highlighting the growing body of evidence demonstrating that the 5L descriptive system is better than the 3L descriptive system in a variety of ways, without any real evidence that there are downsides to the descriptive expansion. And this – the comparison of the 3L and the 5L – is the correct comparison to be making, because the use of the 3L represents current practice. More fundamentally, it’s hard to imagine how the 5L descriptive system could be less valid than the 3L descriptive system. That there are only a limited number of validation studies using the 5L is only a problem if we can hypothesise reasons for the 5L to lack validity where the 3L held it. I can’t think of any. And anyway, NICE is apparently satisfied with the descriptive system; it’s the value set they’re worried about.

We don’t know if the preference elicitation methods are valid for states worse than dead

This argument is made by Brazier et al. The value set for England uses lead time TTO, which is a relatively new (and therefore less-tested) method. The problem is that we don’t know if any methods for valuing states worse than dead are valid because valuing states worse than dead makes no real sense. Save for pulling out a Ouija board, or perhaps holding a gun to someone’s head, we can never find out what is the most valid approach to valuing states worse than dead. And anyway, this argument fails on the same basis as the previous one: where is the evidence to suggest that the MVH approach to valuing states worse than dead (for the EQ-5D-3L) holds more validity than lead time TTO?
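For context, this is roughly how a lead time TTO response turns into a value. I am assuming a design with 10 years of lead time in full health followed by 10 years in the state being valued, which is my understanding of the EQ-VT approach; the function name and defaults are mine and should be checked against the protocol.

```python
def lead_time_tto_value(x_years, lead_time=10.0, state_years=10.0):
    """Value of a health state from a lead time TTO response.

    The respondent is indifferent between x_years in full health and
    lead_time years in full health followed by state_years in the state.
    The default durations are an assumption, not a quoted specification.
    """
    if not 0 <= x_years <= lead_time + state_years:
        raise ValueError("x_years must lie between 0 and lead_time + state_years")
    return (x_years - lead_time) / state_years

# Trading away half of the lead time implies a value of -0.5
print(lead_time_tto_value(5.0))
# Trading away all of the lead time gives the lowest possible value
print(lead_time_tto_value(0.0))
```

Under these assumptions the lowest value anyone can express is -1, a floor set by the length of the lead time rather than by anything about preferences.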

We don’t know if the EQ-VT was valid

As discussed by Brazier et al, it looks like there may have been some problems in the administration of the EuroQol valuation protocol (the EQ-VT) for the EQ-5D-5L value set. As a result, some of the data look a bit questionable, including large spikes in the distribution of values at 1.0, 0.5, 0.0, and -1.0. Certainly, this justifies further investigation. But it shouldn’t stall adoption of the 5L value set unless this constitutes a greater concern than the distributional characteristics of the 3L, and that’s not an argument I see anybody making. Perhaps there should have been more piloting of the EQ-VT, but that should (in itself) have no bearing on the decision of whether to use the 3L value set or the 5L value set. If the question is whether we expect the EQ-VT protocol to provide a more accurate estimation of health preferences than the MVH protocol – and it should be – then as far as I can tell there is no real basis for preferring the MVH protocol.

We don’t know if the value set (for England) is valid

Devlin et al state that, with respect to whether differences in the value sets represent improvements, “Until the external validation of the England 5L value set concludes, the jury is still out.” I’m not sure that’s true. I don’t know what the external validation is going to involve, but it’s hard to imagine a discrete piece of work that could demonstrate the ‘betterness’ of the 5L value set compared with the 3L value set. Yes, a validation exercise could tell us whether the value set is replicable. But unless validation of the comparator (i.e. the 3L value set) is also attempted and judged on the same basis, it won’t be at all informative to NICE’s decision. Devlin et al state that there is a governmental requirement to validate the 5L value set for England. But beyond checking the researchers’ sums, it’s difficult to understand what that could even mean. Given that nobody seems to have defined ‘validity’ in this context, this is a very dodgy basis for determining adoption or non-adoption of the 5L.

5L-based evaluations will be different to 3L-based evaluations

Well, yes. Otherwise, what would be the point? Brazier et al present this as a justification for a ‘pause’ for an independent review of the 5L value set. The authors present the potential shift in priority from life-improving treatments to life-extending treatments as a key reason for a pause. But this is clearly a circular argument. Pausing to look at the differences will only bring those (and perhaps new) differences into view (though notably at a slower rate than if the 5L was more widely adopted). And then what? We pause for longer? Round also mentions this point as a justification for further research. This highlights a misunderstanding of what it means for NICE to be consistent. NICE has no responsibility to make decisions in 2018 precisely as it would have in 2008. That would be foolish and ignorant of methodological and contextual developments. What NICE needs to provide is consistency in the present – precisely what is precluded by the current semi-adoption of the EQ-5D-5L.

5L data won’t be comparable to 3L data

Round mentions this. But why does it matter? This is nothing compared to the trickery that goes on in economic modelling. The whole point of modelling is to do the best we can with the data we’ve got. If we have to compare an intervention for which outcomes are measured in 3L values with an intervention for which outcomes are measured in 5L values, then so be it. That is not a problem. It is only a problem if manufacturers strategically use 3L or 5L values according to whichever provides the best results. And you know what facilitates that? A pause, where nobody really knows what is going on and NICE has essentially said that the use of both 3L and 5L descriptive systems is acceptable. If you think mapping from 5L to 3L values is preferable to consistently using the 5L values then, well, I can’t reason with you, because mapping is never anything but a fudge (albeit a useful one).

There are problems with the 3L, so we shouldn’t adopt the 5L

There’s little to say on this point beyond asserting that we mustn’t let perfect be the enemy of the good. Show me what else you’ve got that could be more readily and justifiably introduced to replace the 3L. Round suggests that shifting from the 3L to the 5L is no different to shifting from the 3L to an entirely different measure, such as the SF-6D. That’s wrong. There’s a good reason that NICE should consider the 5L as the natural successor to the 3L. And that’s because it is. This is exactly what it was designed to be: a methodological improvement on the same conceptual footing. The key point here is that the 3L and 5L contain the same domains. They’re trying to capture health-related quality of life in a consistent way; they refer to the same evaluative space. Shifting to the SF-6D (for example) would be a conceptual shift, whereas shifting to the 5L from the 3L is nothing but a methodological shift (with the added benefit of more up-to-date preference data).

To sum up

Round suggests that the pause is because of “an unexpected set of results” arising from the valuation exercise. That may be true in part. But I think it’s more likely the fault of dodgy public sector deals with the likes of Richard Branson and a consequently algorithm-fearing government. I totally agree with Round that, if NICE is considering a new outcome measure, they shouldn’t just be considering the 5L. But given that right now they are only considering the 5L, and that the decision is explicitly whether or not to adopt the 5L, there are no reasons not to do so.

The new value set is only a step change because we spent the last 25 years idling. Should we really just wait for NICE to assess the value set, accept it, and then return to our see-no-evil position for the next 25 years? No! The value set should be continually reviewed and redeveloped as methods improve and societal preferences evolve. The best available value set for England (and Wales) should be regularly considered by NICE as part of a review of the reference case. A special ‘pause’ for the new 5L value set will only serve to reinforce the longevity of compromised value sets in the future.

Yes, the EQ-5D-3L and its associated value set for the UK have been brilliantly useful over the years, but the 3L now has a successor that – as far as we can tell – is better in many ways and at least as good in the rest. As a public body, NICE is conservative by nature. But researchers needn’t be.
