Chris Sampson’s journal round-up for 13th January 2020

Every Monday our authors provide a round-up of some of the most recently published peer reviewed articles from the field. We don’t cover everything, or even what’s most important – just a few papers that have interested the author. Visit our Resources page for links to more journals or follow the HealthEconBot. If you’d like to write one of our weekly journal round-ups, get in touch.

A vision ‘bolt-on’ increases the responsiveness of EQ-5D: preliminary evidence from a study of cataract surgery. The European Journal of Health Economics [PubMed] Published 4th January 2020

The EQ-5D is insensitive to differences in how well people can see, despite this seeming to be an important aspect of health. In contexts where the impact of visual impairment may be important, we could potentially use a ‘bolt-on’ item that asks about a person’s vision. I’m working on the development of a vision bolt-on at the moment. But ours won’t be the first. A previously-developed bolt-on has undergone some testing and has been shown to be sensitive to differences between people with different levels of visual function. However, there is little or no evidence to support its responsiveness to changes in visual function, which might arise from treatment.

For this study, 63 individuals were recruited prior to receiving cataract surgery in Singapore. Participants completed the EQ-5D-3L and EQ-5D-5L, both with and without a vision bolt-on, which matched the wording of other EQ-5D dimensions. Additionally, the SF-6D, HUI3, and VF-12 were completed along with a LogMAR assessment of visual acuity. The authors sought to compare the responsiveness of the EQ-5D with a vision bolt-on against that of the standard EQ-5D and the other measures, so all measures were completed before and after cataract surgery. Preference weights can be generated for the EQ-5D-3L with a vision bolt-on, but not for the EQ-5D-5L, so the authors looked at rescaled sum scores to compare across all measures. Responsiveness was measured using indicators such as the standardised effect size and the standardised response mean.
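For reference, those two responsiveness statistics are easy to compute from paired before/after scores. A minimal sketch with made-up utility values (the numbers below are illustrative, not the study's data):

```python
import numpy as np

def responsiveness(before, after):
    """Standardised effect size (SES) and standardised response mean (SRM)
    for paired before/after scores on the same patients."""
    before = np.asarray(before, dtype=float)
    after = np.asarray(after, dtype=float)
    change = after - before
    ses = change.mean() / before.std(ddof=1)  # mean change / SD at baseline
    srm = change.mean() / change.std(ddof=1)  # mean change / SD of change
    return ses, srm

# Hypothetical index scores for five patients (not from the paper)
before = [0.60, 0.72, 0.55, 0.80, 0.65]
after = [0.75, 0.78, 0.70, 0.85, 0.80]
ses, srm = responsiveness(before, after)
```

The two statistics differ only in the denominator: SES scales the mean change by baseline variability, SRM by the variability of the change itself.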

Visual acuity improved dramatically from before to after surgery, for almost everybody. The authors found that the vision bolt-on does seem to respond far more strongly to this change than the EQ-5D without the bolt-on. For instance, the mean change in the EQ-5D-3L index score was 0.018 without the vision bolt-on, and 0.031 with it. The HUI3 came out with a mean change of 0.105 and showed the highest responsiveness across all analyses.

Does this mean that we should all be using a vision bolt-on, or perhaps the HUI3? Not exactly. Something I see a lot in papers of this sort – including in this one – is the framing of “superior responsiveness” as an indication that the measure is doing a better job. That isn’t true if the measure is responding to things to which we don’t want it to respond. As the authors point out, the HUI3 has quite different foundations to the EQ-5D. We also don’t want a situation where analysts can pick and choose measures according to whichever is most responsive to the thing to which they want it to be most responsive. In EuroQol parlance, what goes into the descriptive system is very important.

The causal effect of social activities on cognition: evidence from 20 European countries. Social Science & Medicine Published 9th January 2020

Plenty of studies have shown that cognitive abilities are correlated with social engagement, but few have attempted to demonstrate causality in a large sample. The challenge, of course, is that people who engage in more social activities are likely to have greater cognitive abilities for other reasons, and people’s decision to engage in social activities might depend on their cognitive abilities. This study tackles the question of causality using a novel (to me, at least) methodology.

The analysis uses data from five waves of SHARE (the Survey of Health, Ageing and Retirement in Europe). Survey respondents are asked about whether they engage in a variety of social activities, such as voluntary work, training, sports, or community-related organisations. From this, the authors generate an indicator for people participating in zero, one, or two or more of these activities. The survey also uses a set of tests to measure people’s cognitive abilities in terms of immediate recall capacity, delayed recall capacity, fluency, and numeracy. The authors look at each of these four outcomes, with 231,407 observations for the first three and 124,381 for numeracy (for which the questions were missing from some waves). Confirming previous findings, a strong positive correlation is found between engagement in social activities and each of the cognition indicators.

The empirical strategy, which I had never heard of, is partial identification. This is a non-parametric method that identifies bounds for the average treatment effect; it is ‘partial’ because it doesn’t identify a point estimate. Fewer assumptions mean wider and less informative bounds. The authors start with a model with no assumptions, for which the lower bound for the treatment effect goes below zero. They then incrementally add assumptions. These include i) a monotone treatment response, assuming that social participation does not reduce cognitive abilities on average; ii) monotone treatment selection, assuming that people who choose to be socially active tend to have higher cognitive capacities; and iii) a monotone instrumental variable assumption that body mass index is negatively associated with cognitive abilities. The authors argue that their methodology is not likely to be undermined by unobservables, as previous studies might be.
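For intuition, the no-assumptions starting point can be sketched as follows. With a bounded outcome (say, a cognition score rescaled to [0, 1]) and a binary ‘socially active’ indicator, the worst-case (Manski) bounds on the average treatment effect come from filling in each group’s unobserved counterfactual with the best and worst values it could possibly take. This is an illustration of the general technique, not the authors’ code:

```python
import numpy as np

def no_assumption_bounds(y, d, y_min=0.0, y_max=1.0):
    """Manski worst-case bounds on the average treatment effect of a
    binary treatment d on a bounded outcome y, with no assumptions
    about how people select into treatment."""
    y = np.asarray(y, dtype=float)
    d = np.asarray(d, dtype=bool)
    p = d.mean()                          # share treated
    ey1, ey0 = y[d].mean(), y[~d].mean()  # observed means by group
    # Fill in the unobserved counterfactual means with the worst /
    # best possible values to get the extreme treatment effects.
    lower = (ey1 * p + y_min * (1 - p)) - (y_max * p + ey0 * (1 - p))
    upper = (ey1 * p + y_max * (1 - p)) - (y_min * p + ey0 * (1 - p))
    return lower, upper
```

The width of these bounds is always y_max − y_min, which is why assumptions like monotone treatment response (which pushes the lower bound up to at least zero) carry so much identifying power.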

The various models show that engaging in social activities has a positive impact on all four of the cognitive indicators. The assumption of monotone treatment response had the highest identifying power. For all models that included this, the 95% confidence intervals in the estimates showed a statistically significant positive impact of social activities on cognition. What is perhaps most interesting about this approach is the huge amount of uncertainty in the estimates. Social activities might have a huge effect on cognition or they might have a tiny effect. A basic OLS-type model, assuming exogenous selection, provides very narrow confidence intervals, whereas the confidence intervals on the partial identification models are almost as wide as the bounds themselves.

One shortcoming of this study for me is that it doesn’t seek to identify the causal channels that have been proposed in previous literature (e.g. loneliness, physical activity, self-care). So it’s difficult to paint a clear picture of what’s going on. But then, maybe that’s the point.

Do research groups align on an intervention’s value? Concordance of cost-effectiveness findings between the Institute for Clinical and Economic Review and other health system stakeholders. Applied Health Economics and Health Policy [PubMed] Published 10th January 2020

Aside from having the most inconvenient name imaginable, ICER has been a welcome addition to the US health policy scene, appraising health technologies in order to provide guidance on coverage. ICER has become influential, with some pharmacy benefit managers using their assessments as a basis for denying coverage for low value medicines. ICER identify technologies as falling into one of three categories – high, low, or intermediate long-term value – according to whether the ICER (grr) falls below, above, or between the threshold range of $50,000-$175,000 per QALY. ICER conduct their own evaluations, but so do plenty of other people. This study sought to find out whether other analyses in the literature agree with ICER’s categorisations.
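The categorisation rule is simple enough to state in a few lines. A sketch, assuming a strict reading of the threshold range and ignoring dominance and other edge cases:

```python
def icer_value_category(cost_per_qaly,
                        low_threshold=50_000, high_threshold=175_000):
    """Map an incremental cost-per-QALY ratio onto ICER's three
    long-term value categories, per the $50,000-$175,000 range."""
    if cost_per_qaly < low_threshold:
        return "high value"
    if cost_per_qaly > high_threshold:
        return "low value"
    return "intermediate value"
```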

The authors consider 18 assessments by ICER, including 76 interventions, between 2015 and 2017. For each of these, the authors searched the literature for other comparative studies. Specifically, they went looking for cost-effectiveness analyses that employed the same perspectives and outcomes. Unfortunately, they were only able to identify studies for six disease areas and 14 interventions (of the 76), across 25 studies. It isn’t clear whether this is because there is a lack of literature out there – which would be an interesting finding in itself – or because their search strategy or selection criteria weren’t up to scratch. Of the 14 interventions compared, 10 get a more favourable assessment in the published studies than in their corresponding ICER evaluations, with most being categorised as intermediate value instead of low value. The authors go on to conduct one case study, comparing an ICER evaluation in the context of migraine with a published study by some of the authors of this paper. There were methodological differences. In some respects, it seems as if ICER did a more thorough job, while in other respects the published study seemed to use more defensible assumptions.

I agree with the authors that these kinds of comparisons are important. Not least, we need to be sure that ICER’s approach to appraisal is valid. The findings of this study suggest that maybe ICER should be looking at multiple studies and combining all available data in a more meaningful way. But the authors excluded too many studies. Some imperfect comparisons would have been more useful than exclusion – 14 of 76 is kind of pitiful and probably not representative. And I’m not sure why the authors set out to identify studies that are ‘more favourable’, rather than just different. That perspective seems to reveal an assumption that ICER are unduly harsh in their assessments.


Are QALYs #ableist?

As many of us who have had to review submitted journal articles, thesis defenses, grant applications, white papers, and even published literature know, providing feedback on something that is poorly conceived is much harder than providing feedback on something well done.

This is going to be hard.

Who is ValueOurHealth?

The video above comes from the website of “ValueOurHealth.org”; I would tell you more about them, but there is no “About Us” menu item on the website. However, the website indicates that they are a group of patient organizations concerned about:

“The use of flawed, discriminatory value assessments [that] could threaten access to care for patients with chronic illnesses and people with disabilities.”

In particular, they take issue with value assessments that

“place a value on the life of a human based on their health status and assume every patient will respond the same way to treatments.”

QALYs, according to these concerned patient groups, assign a value to human beings. People with lower values (like Jessica, in the video above), then, will be denied coverage because their life is “valued less than someone in perfect health” which means “less value is also placed on treating” them. (Many will be quick to notice that health states and QALYs are used interchangeably here. I try to explain why below.)

It’s not as though this is one well-intended rogue group that simply misunderstands the concept of a QALY, needs someone to send them a polite email, after which we can all move on. Other groups that have asserted that QALYs unfairly discriminate against the aged and disabled include AimedAlliance, Alliance for Patient Access, Institute for Patient Access, Alliance for Aging Research, and Global Liver Institute. There are likely many more patient groups out there that abhor QALYs (and definite articles/determiners, it seems) and are justifiably concerned about patient access to therapy. But these are all the ones I could find through a quick search from my perch in Canada.

Why do they hate QALYs?

One can infer pretty quickly that ValueOurHealth and their illustrative message is largely motivated by another very active organization, the “Partnership to Improve Patient Care” (PIPC). The video, and the arguments about “assigning QALYs” to people, seem to stem from a white paper produced by the PIPC, which in turn cites a very nicely written paper by Franco Sassi (of Imperial College London), that explains QALY and DALY calculations for researchers and policymakers.

The PIPC white paper, in fact, uses the very same calculation provided by Prof. Sassi to illustrate the impact of preventing a case of tuberculosis. However, unlike Prof. Sassi’s illustrative example, the PIPC fails to quantify the QALYs gained by the intervention. Instead they simply focus on the QALYs an individual who has tuberculosis for 6 months will experience. (0.36, versus 0.50, for those keeping score). After some further discussion about problems with measuring health states, the PIPC white paper then skips ahead to ethical problems with QALYs central to their position, citing a Value in Health paper by Erik Nord and colleagues. One of the key problems with the QALY according to the PIPC and argued in the Nord paper goes as follows:

“Valuing health gains in terms of QALYs means that life-years gained in full health—through, for instance, prevention of fatal accidents in people in normal health—are counted as more valuable than life-years gained by those who are chronically ill or disabled—for instance, by averting fatal episodes in people with asthma, heart disease, or mental illness.”

It seems the PIPC assume that the lower number of QALYs experienced by those who are sick equates to the value payers place on their lives. Even more interestingly, Prof. Nord’s analysis says nothing about costs. While those who are older have fewer QALYs to potentially gain, they also incur fewer costs. This is why, contrary to the assertion about preventing accidents in healthy people, preventive measures may offer a similar value to treatments when both QALYs and costs are considered.
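The arithmetic behind the tuberculosis example above is worth spelling out: a QALY is just a health-state utility weight multiplied by the time spent in that state. The 0.72 weight below is simply what the quoted 0.36 figure implies for half a year; treat it as illustrative.

```python
def qalys(utility_weight, years):
    """QALYs accrued: health-state utility weight x time in that state."""
    return utility_weight * years

# Six months (0.5 years) with tuberculosis, versus the same period in
# full health. The 0.72 weight is implied by the figures quoted above.
with_tb = qalys(0.72, 0.5)        # 0.36 QALYs
full_health = qalys(1.00, 0.5)    # 0.50 QALYs
gain = full_health - with_tb      # 0.14 QALYs gained by prevention
```

Note that this says nothing about what the intervention costs, which is precisely the piece the PIPC's framing leaves out.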

It is also why an ICER review showed that alemtuzumab is good value in individuals requiring second-line treatment for relapse-remitting multiple sclerosis (1.34 QALYs can be gained compared to the next best alternative, at a lower cost than comparators), while a policy of annual mammography screening of similarly aged (i.e., >40) healthy women is of poor economic value (0.036 QALYs can be gained compared to no screening, at an additional cost of $5,500 for every woman). Mammography provides better value in older individuals. It is not unlike fracture prevention and a myriad of other interventions in healthy, asymptomatic people in this regard. Quite contrary to the assertion of these misinformed groups, many interventions represent increasingly better value in frail, disabled, and older patients, because the same relative risk reduction yields larger absolute gains when baseline risk is high.
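The comparison rests on straightforward arithmetic. Using the figures quoted above (a sketch, not ICER's model):

```python
def icer_ratio(delta_cost, delta_qalys):
    """Incremental cost-effectiveness ratio: extra cost per extra QALY."""
    return delta_cost / delta_qalys

# Annual mammography vs no screening, per the figures quoted above:
mammography = icer_ratio(5_500, 0.036)  # roughly $150,000 per QALY

# Alemtuzumab gains 1.34 QALYs at *lower* cost than comparators, so it
# dominates: no ratio is needed to call it good value.
```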

None of this is to say that QALYs (and incremental cost-effectiveness ratios) do not have problems. And the PIPC, at the very least, should be commended for trying to advance alternative metrics, something that very few critics have offered. Instead, the PIPC and like-minded organizations are likely trapped in a filter bubble. They know there are problems with QALYs, and they see expensive and rare disease treatments being valued harshly. Ergo, blame the QALY. (Note to PIPC: it is because the drugs are expensive, relative to other life-saving things, not because of your concerns about the QALY.) They then see that others feel the same way, which means their concerns are likely justified. A critique of QALYs issued by the Pioneer Institute identifies many of these same arguments. One Twitterer, a disabled Massachusetts lawyer “alive because of Medicaid”, has offered further instruction for the QALY-naive.

What to do about it?

As a friend recently told me, not everyone is concerned with the QALY. Some don’t like what they see as a rationing approach promoted by the Institute for Clinical and Economic Review (ICER) assessments. Some hate the QALY. Some hate both. Last year, Joshua T. Cohen, Dan Ollendorf, and Peter Neumann published their own blog entry on the effervescing criticism of ICER, even allowing the PIPC head to have a say about QALYs. They then tried to set the record straight with these thoughts:

While we applaud the call for novel measures and to work with patient and disability advocates to understand attributes important to them, there are three problems with PIPC’s position.

First, simply coming up with that list of key attributes does not address how society should allocate finite resources, or how to price a drug given individual or group preferences.

Second, the diminished weight QALYs assign to life with disability does not represent discrimination. Instead, diminished weight represents recognition that treatments mitigating disability confer value by restoring quality of life to levels typical among most of the population.

Finally, all value measures that inform allocation of finite resources trade off benefits important to some patients against benefits potentially important to others. PIPC itself notes that life years not weighted for disability (e.g., the equal value life-year gained, or evLYG, introduced by ICER for sensitivity analysis purposes) do not award value for improved quality of life. Indeed, any measure that does not “discriminate” against patients with disability cannot award treatments credit for improving their quality of life. Failing to award that credit would adversely affect this population by ruling out spending on such improvements.

Certainly a lot more can be said here.

But for now, I am more curious what others have to say…

Chris Sampson’s journal round-up for 11th September 2017


Core items for a standardized resource use measure (ISRUM): expert Delphi consensus survey. Value in Health Published 1st September 2017

Trial-based collection of resource use data, for the purpose of economic evaluation, is wild. Lots of studies use bespoke questionnaires. Some use off-the-shelf measures, but many of these are altered to suit the context. Validity rarely gets a mention. Some of you may already be aware of this research; I’m sure I’m not the only one here who participated. The aim of the study is to establish a core set of resource use items that should be included in all studies to aid comparability, consistency and validity. The researchers identified a long list of 60 candidate items for inclusion, through a review of 59 resource use instruments. An NHS and personal social services perspective was adopted, and any similar items were merged. The list was used to construct a Delphi survey. Members of the HESG mailing list – as well as 111 other identified experts – were invited to complete the survey, for which there were two rounds. The first round asked participants to rate the importance of including each item in the core set, using a scale from 1 (not important) to 9 (very important). Participants were then asked to select their ‘top 10’. Items survived round 1 if more than 50% of respondents scored them at least 7, and no more than 15% scored them less than 3, either overall or within two or more participant subgroups. In round 2, participants were presented with the results of round 1 and asked to re-rate the 34 remaining items. There was a sample of 45 usable responses in round 1 and 42 in round 2. Comments could also be provided, which were subsequently subject to content analysis. After all was said and done, a meeting was held for final item selection based on the findings, to which some survey participants were invited but only one attended (sorry I couldn’t make it).
The final 10 items were: i) hospital admissions, ii) length of stay, iii) outpatient appointments, iv) A&E visits, v) A&E admissions, vi) number of appointments in the community, vii) type of appointments in the community, viii) number of home visits, ix) type of home visits and x) name of medication. The measure isn’t ready to use just yet. There is still research to be conducted to identify the ideal wording for each item. But it looks promising. Hopefully, this work will trigger a whole stream of research to develop bolt-ons in specific contexts for a modular system of resource use measurement. I also think that this work should form the basis of alignment between costing and resource use measurement. Resource use is often collected in a way that is very difficult to ‘map’ onto costs or prices. I’m sure the good folk at the PSSRU are paying attention to this work, and I hope they might help us all out by estimating unit costs for each of the core items (as well as any bolt-ons, once they’re developed). There’s some interesting discussion in the paper about the parallels between this work and the development of core outcome sets. Maybe analysis of resource use can be as interesting as the analysis of quality of life outcomes.
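The round-1 survival rule described above can be sketched as a simple filter. The ratings below are hypothetical, and this is the overall-sample version of the rule (the study also applied it within participant subgroups):

```python
def survives_round_1(ratings):
    """An item survives round 1 if more than 50% of respondents rate it
    at least 7 (important) AND no more than 15% rate it below 3."""
    n = len(ratings)
    share_high = sum(r >= 7 for r in ratings) / n
    share_low = sum(r < 3 for r in ratings) / n
    return share_high > 0.50 and share_low <= 0.15

# Hypothetical ratings from ten respondents on the 1 (not important)
# to 9 (very important) scale -- not the study's data.
keep = survives_round_1([8, 9, 7, 7, 2, 8, 9, 7, 6, 8])  # survives
drop = survives_round_1([5, 6, 5, 4, 6, 5, 5, 6, 4, 5])  # eliminated
```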

A call for open-source cost-effectiveness analysis. Annals of Internal Medicine [PubMed] Published 29th August 2017

Yes, this paper is behind a paywall. Yes, it is worth pointing out this irony over and over again until we all start practising what we preach. We’re all guilty; we all need to keep on keeping on at each other. Now, on to the content. The authors argue in favour of making cost-effectiveness analysis (and model-based economic evaluation in particular) open to scrutiny. The key argument is that there is value in transparency, and analogies are drawn with clinical trial reporting and epidemiological studies. This potential additional value is thought to derive from i) easy updating of models with new data and ii) less duplication of efforts. The main challenges are thought to be the need for new infrastructure – technical and regulatory – and preservation of intellectual property. Recently, I discussed similar issues in a call for a model registry. I’m clearly in favour of cost-effectiveness analyses being ‘open source’. My only gripe is that the authors aren’t the first to suggest this, and should have done some homework before publishing this call. Nevertheless, it is good to see this issue being raised in a journal such as Annals of Internal Medicine, which could be an indication that the tide is turning.

Differential item functioning in quality of life measurement: an analysis using anchoring vignettes. Social Science & Medicine [PubMed] [RePEc] Published 26th August 2017

Differential item functioning (DIF) occurs when different groups of people have different interpretations of response categories. For example, in response to an EQ-5D questionnaire, the way that two groups of people understand ‘slight problems in walking about’ might not be the same. If that were the case, the groups wouldn’t be truly comparable. That’s a big problem for resource allocation decisions, which rely on trade-offs between different groups of people. This study uses anchoring vignettes to test for DIF, whereby respondents are asked to rate their own health alongside some health descriptions for hypothetical individuals. The researchers conducted 2 online surveys, which together recruited a representative sample of 4,300 Australians. Respondents completed the EQ-5D-5L, some vignettes, some other health outcome measures and a bunch of sociodemographic questions. The analysis uses an ordered probit model to predict responses to the EQ-5D dimensions, with the vignettes used to identify the model’s thresholds. This is estimated for each dimension of the EQ-5D-5L, in the hope that the model can produce coefficients that facilitate ‘correction’ for DIF. But this isn’t a guaranteed approach to identifying the effect of DIF. Two important assumptions are inherent; first, that individuals rate the hypothetical vignette states on the same latent scale as they rate their own health (AKA response consistency) and, second, that everyone values the vignettes on an equivalent latent scale (AKA vignette equivalence). Only if these assumptions hold can anchoring vignettes be used to adjust for DIF and make different groups comparable. The researchers dedicate a lot of effort to testing these assumptions. To test response consistency, separate (condition-specific) measures are used to assess each domain of the EQ-5D. The findings suggest that responses are consistent. 
Vignette equivalence is assessed by the significance of individual characteristics in determining vignette values. In this study, the vignette equivalence assumption didn’t hold, which prevents the authors from making generalisable conclusions. However, the researchers looked at whether the assumptions were satisfied in particular age groups. For 55-65 year olds (n=914), they did, for all dimensions except anxiety/depression. That might be because older people are better at understanding health problems, having had more experience of them. So the authors can tell us about DIF in this older group. Having corrected for DIF, the mean health state value in this group increases from 0.729 to 0.806. Various characteristics explain the heterogeneous response behaviour. After correcting for DIF, the difference in EQ-5D index values between high and low education groups increased from 0.049 to 0.095. The difference between employed and unemployed respondents increased from 0.077 to 0.256. In some cases, the rankings changed. The difference between those divorced or widowed and those never married increased from -0.028 to 0.060. The findings hint at a trade-off between giving personalised vignettes to facilitate response consistency and generalisable vignettes to facilitate vignette equivalence. It may be that DIF can only be assessed within particular groups (such as the older sample in this study). But then, if that’s the case, what hope is there for correcting DIF in high-level resource allocation decisions? Clearly, DIF in the EQ-5D could be a big problem. Accounting for it could flip resource allocation decisions. But this study shows that there isn’t an easy answer.
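To make the DIF idea concrete, here is a toy simulation (my own illustration, not the paper's model): two groups with identical latent health report different EQ-5D-style categories because they place the category cut-points differently. Anchoring vignettes are, in effect, an attempt to estimate those group-specific cut-points so that responses can be put on a common scale.

```python
import numpy as np

rng = np.random.default_rng(0)

def respond(latent_health, cutpoints):
    """Map latent health onto response categories 1-5 via cut-points."""
    return np.digitize(latent_health, cutpoints) + 1

latent = rng.normal(0.0, 1.0, 10_000)  # same true health in both groups
group_a = respond(latent, [-1.5, -0.5, 0.5, 1.5])
group_b = respond(latent, [-1.0, 0.0, 1.0, 2.0])  # 'stricter' readings

# Identical health, different reported distributions: that is DIF.
```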

How to design the cost-effectiveness appraisal process of new healthcare technologies to maximise population health: a conceptual framework. Health Economics [PubMed] Published 22nd August 2017

The starting point for this paper is that, when it comes to reimbursement decisions, the more time and money spent on the appraisal process, the more precise the cost-effectiveness estimates are likely to be. So the question is, how much should be committed to the appraisal process in the way of resources? The authors set up a framework in which to consider a variety of alternatively defined appraisal processes, how these might maximise population health and which factors are key drivers in this. The appraisal process is conceptualised as a diagnostic tool to identify which technologies are cost-effective (true positives) and which aren’t (true negatives). The framework builds on the fact that manufacturers can present a claimed ICER that makes their technology more attractive, but that the true ICER can never be known with certainty. As a diagnostic test, there are four possible outcomes: true positive, false positive, true negative, or false negative. Each outcome is associated with an expected payoff in terms of population health and producer surplus. Payoffs depend on the accuracy of the appraisal process (sensitivity and specificity), incremental net benefit per patient, disease incidence, time of relevance for an approval, the cost of the process and the price of the technology. The accuracy of the process can be affected by altering the time and resources dedicated to it or by adjusting the definition of cost-effectiveness in terms of the acceptable level of uncertainty around the ICER. So, what determines an optimal level of accuracy in the appraisal process, assuming that producers’ price setting is exogenous? Generally, the process should have greater sensitivity (at the expense of specificity) when there is more to gain: when a greater proportion of technologies are cost-effective or when the population or time of relevance is greater. There is no fixed optimum for all situations. 
If we relax the assumption of exogenous pricing decisions, and allow pricing to be partly determined by the appraisal process, we can see that a more accurate process incentivises cost-effective price setting. The authors also consider the possibility of there being multiple stages of appraisal, with appeals, re-submissions and price agreements. The take-home message is that the appraisal process should be re-defined over time and with respect to the range of technologies being assessed, or even an individualised process for each technology in each setting. At least, it seems clear that technologies with exceptional characteristics (with respect to their potential impact on population health), should be given a bespoke appraisal. NICE is already onto these ideas – they recently introduced a fast track process for technologies with a claimed ICER below £10,000 and now give extra attention to technologies with major budget impact.
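The diagnostic-test framing boils down to an expected-value calculation. A bare-bones sketch (my notation, not the authors'): weight each of the four outcomes by its probability, then subtract the cost of running the appraisal.

```python
def expected_payoff(sensitivity, specificity, p_cost_effective,
                    payoff_tp, payoff_fn, payoff_fp, payoff_tn,
                    appraisal_cost):
    """Expected population-health payoff of an appraisal process
    treated as a diagnostic test for cost-effectiveness."""
    p = p_cost_effective
    ev = (p * sensitivity * payoff_tp                # approve a cost-effective tech
          + p * (1 - sensitivity) * payoff_fn        # reject a cost-effective tech
          + (1 - p) * (1 - specificity) * payoff_fp  # approve a non-cost-effective tech
          + (1 - p) * specificity * payoff_tn)       # reject a non-cost-effective tech
    return ev - appraisal_cost

# Toy numbers: a perfectly accurate process, half of technologies truly
# cost-effective, a correct approval worth 10 units of health, a false
# approval costing 5, and an appraisal that costs 1 unit to run.
value = expected_payoff(1.0, 1.0, 0.5, 10.0, 0.0, -5.0, 0.0, 1.0)
```

Raising sensitivity at the expense of specificity shifts weight between the false-negative and false-positive terms, which is why the optimal trade-off depends on how much there is to gain.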
