# Chris Sampson’s journal round-up for 11th September 2017

Every Monday our authors provide a round-up of some of the most recently published peer reviewed articles from the field. We don’t cover everything, or even what’s most important – just a few papers that have interested the author. Visit our Resources page for links to more journals or follow the HealthEconBot. If you’d like to write one of our weekly journal round-ups, get in touch.

Core items for a standardized resource use measure (ISRUM): expert Delphi consensus survey. Value in Health Published 1st September 2017

Trial-based collection of resource use data, for the purpose of economic evaluation, is wild. Lots of studies use bespoke questionnaires. Some use off-the-shelf measures, but many of these are altered to suit the context. Validity rarely gets a mention. Some of you may already be aware of this research; I’m sure I’m not the only one here who participated. The aim of the study is to establish a core set of resource use items that should be included in all studies to aid comparability, consistency and validity. The researchers identified a long list of 60 candidate items for inclusion, through a review of 59 resource use instruments. An NHS and personal social services perspective was adopted, and any similar items were merged. This list was built into a Delphi survey. Members of the HESG mailing list – as well as 111 other identified experts – were invited to complete the survey, for which there were two rounds. The first round asked participants to rate the importance of including each item in the core set, using a scale from 1 (not important) to 9 (very important). Participants were then asked to select their ‘top 10’. Items survived round 1 if more than 50% of respondents scored them at least 7 and no more than 15% scored them below 3, either overall or within two or more participant subgroups. In round 2, participants were presented with the results of round 1 and asked to re-rate the 34 remaining items. There was a sample of 45 usable responses in round 1 and 42 in round 2. Comments could also be provided, which were subsequently subject to content analysis. After all was said and done, a meeting was held for final item selection based on the findings, to which some survey participants were invited but only one attended (sorry I couldn’t make it). 
The final 10 items were: i) hospital admissions, ii) length of stay, iii) outpatient appointments, iv) A&E visits, v) A&E admissions, vi) number of appointments in the community, vii) type of appointments in the community, viii) number of home visits, ix) type of home visits and x) name of medication. The measure isn’t ready to use just yet. There is still research to be conducted to identify the ideal wording for each item. But it looks promising. Hopefully, this work will trigger a whole stream of research to develop bolt-ons in specific contexts for a modular system of resource use measurement. I also think that this work should form the basis of alignment between costing and resource use measurement. Resource use is often collected in a way that is very difficult to ‘map’ onto costs or prices. I’m sure the good folk at the PSSRU are paying attention to this work, and I hope they might help us all out by estimating unit costs for each of the core items (as well as any bolt-ons, once they’re developed). There’s some interesting discussion in the paper about the parallels between this work and the development of core outcome sets. Maybe analysis of resource use can be as interesting as the analysis of quality of life outcomes.
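For readers who like to see a rule written down, here is a minimal sketch of the round 1 retention criterion as I have described it above. The function names, the subgroup labels and the ratings are all made up for illustration; this is not the authors' code and the paper may apply the rule with details that I have glossed over.

```python
def meets_threshold(ratings, high=7, high_share=0.5, low=3, low_share=0.15):
    """Check one set of 1-9 importance ratings against the consensus rule.

    Passes if more than `high_share` of ratings are at least `high`
    and no more than `low_share` of ratings are below `low`.
    """
    if not ratings:
        return False
    n = len(ratings)
    share_high = sum(r >= high for r in ratings) / n
    share_low = sum(r < low for r in ratings) / n
    return share_high > high_share and share_low <= low_share


def item_survives(overall, by_subgroup):
    """An item survives if the rule holds overall or in two or more subgroups."""
    if meets_threshold(overall):
        return True
    return sum(meets_threshold(r) for r in by_subgroup.values()) >= 2


# Hypothetical ratings for a single candidate item.
subgroups = {
    "academic": [9, 8, 7, 7],
    "industry": [8, 7, 9],
    "other": [2, 5, 6],
}
everyone = [r for ratings in subgroups.values() for r in ratings]
print(item_survives(everyone, subgroups))
```

In this invented example the item passes on the overall ratings alone (seven of ten ratings are 7 or more, and one of ten is below 3), so the subgroup check is never needed.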

A call for open-source cost-effectiveness analysis. Annals of Internal Medicine [PubMed] Published 29th August 2017

Yes, this paper is behind a paywall. Yes, it is worth pointing out this irony over and over again until we all start practising what we preach. We’re all guilty; we all need to keep on keeping on at each other. Now, on to the content. The authors argue in favour of making cost-effectiveness analysis (and model-based economic evaluation in particular) open to scrutiny. The key argument is that there is value in transparency, and analogies are drawn with clinical trial reporting and epidemiological studies. This potential additional value is thought to derive from i) easy updating of models with new data and ii) less duplication of efforts. The main challenges are thought to be the need for new infrastructure – technical and regulatory – and preservation of intellectual property. Recently, I discussed similar issues in a call for a model registry. I’m clearly in favour of cost-effectiveness analyses being ‘open source’. My only gripe is that the authors aren’t the first to suggest this, and should have done some homework before publishing this call. Nevertheless, it is good to see this issue being raised in a journal such as Annals of Internal Medicine, which could be an indication that the tide is turning.

Differential item functioning in quality of life measurement: an analysis using anchoring vignettes. Social Science & Medicine [PubMed] [RePEc] Published 26th August 2017

Differential item functioning (DIF) occurs when different groups of people have different interpretations of response categories. For example, in response to an EQ-5D questionnaire, the way that two groups of people understand ‘slight problems in walking about’ might not be the same. If that were the case, the groups wouldn’t be truly comparable. That’s a big problem for resource allocation decisions, which rely on trade-offs between different groups of people. This study uses anchoring vignettes to test for DIF, whereby respondents are asked to rate their own health alongside some health descriptions for hypothetical individuals. The researchers conducted 2 online surveys, which together recruited a representative sample of 4,300 Australians. Respondents completed the EQ-5D-5L, some vignettes, some other health outcome measures and a bunch of sociodemographic questions. The analysis uses an ordered probit model to predict responses to the EQ-5D dimensions, with the vignettes used to identify the model’s thresholds. This is estimated for each dimension of the EQ-5D-5L, in the hope that the model can produce coefficients that facilitate ‘correction’ for DIF. But this isn’t a guaranteed approach to identifying the effect of DIF. Two important assumptions are inherent; first, that individuals rate the hypothetical vignette states on the same latent scale as they rate their own health (AKA response consistency) and, second, that everyone values the vignettes on an equivalent latent scale (AKA vignette equivalence). Only if these assumptions hold can anchoring vignettes be used to adjust for DIF and make different groups comparable. The researchers dedicate a lot of effort to testing these assumptions. To test response consistency, separate (condition-specific) measures are used to assess each domain of the EQ-5D. The findings suggest that responses are consistent. 
Vignette equivalence is assessed by the significance of individual characteristics in determining vignette values. In this study, the vignette equivalence assumption didn’t hold, which prevents the authors from making generalisable conclusions. However, the researchers looked at whether the assumptions were satisfied in particular age groups. For 55-65 year olds (n=914), they did, for all dimensions except anxiety/depression. That might be because older people are better at understanding health problems, having had more experience of them. So the authors can tell us about DIF in this older group. Having corrected for DIF, the mean health state value in this group increases from 0.729 to 0.806. Various characteristics explain the heterogeneous response behaviour. After correcting for DIF, the difference in EQ-5D index values between high and low education groups increased from 0.049 to 0.095. The difference between employed and unemployed respondents increased from 0.077 to 0.256. In some cases, the rankings changed. The difference between those divorced or widowed and those never married increased from -0.028 to 0.060. The findings hint at a trade-off between giving personalised vignettes to facilitate response consistency and generalisable vignettes to facilitate vignette equivalence. It may be that DIF can only be assessed within particular groups (such as the older sample in this study). But then, if that’s the case, what hope is there for correcting DIF in high-level resource allocation decisions? Clearly, DIF in the EQ-5D could be a big problem. Accounting for it could flip resource allocation decisions. But this study shows that there isn’t an easy answer.
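To make the two assumptions concrete, here is a sketch of the standard anchoring vignette (hierarchical ordered probit) set-up. The notation is mine and this is the textbook formulation rather than necessarily the exact specification estimated in the paper. For respondent $i$ with characteristics $X_i$, latent own health $H_i^{*}$, and latent perceived health $V_{ij}^{*}$ for vignette $j$, with $K$ response levels on one EQ-5D-5L dimension ($K=5$):

$$
\begin{aligned}
H_i^{*} &= X_i\beta + \varepsilon_i, & y_i &= k \iff \tau_i^{k-1} \le H_i^{*} < \tau_i^{k},\\
V_{ij}^{*} &= \alpha_j + u_{ij}, & v_{ij} &= k \iff \tau_i^{k-1} \le V_{ij}^{*} < \tau_i^{k},\\
\tau_i^{1} &= X_i\gamma^{1}, & \tau_i^{k} &= \tau_i^{k-1} + \exp\left(X_i\gamma^{k}\right) \quad (k = 2, \dots, K-1),
\end{aligned}
$$

with $\tau_i^{0} = -\infty$ and $\tau_i^{K} = +\infty$. Response consistency is the assumption that the same thresholds $\tau_i^{k}$ appear in both the own-health and the vignette equations. Vignette equivalence is the assumption that $V_{ij}^{*}$ depends on the vignette ($\alpha_j$) but not on $X_i$. DIF is then captured by the $\gamma^{k}$: if they are non-zero, people with different characteristics cut the latent scale in different places.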

How to design the cost-effectiveness appraisal process of new healthcare technologies to maximise population health: a conceptual framework. Health Economics [PubMed] Published 22nd August 2017

The starting point for this paper is that, when it comes to reimbursement decisions, the more time and money spent on the appraisal process, the more precise the cost-effectiveness estimates are likely to be. So the question is, how much should be committed to the appraisal process in the way of resources? The authors set up a framework in which to consider a variety of alternatively defined appraisal processes, how these might maximise population health and which factors are key drivers in this. The appraisal process is conceptualised as a diagnostic tool to identify which technologies are cost-effective (true positives) and which aren’t (true negatives). The framework builds on the fact that manufacturers can present a claimed ICER that makes their technology more attractive, but that the true ICER can never be known with certainty. As a diagnostic test, there are four possible outcomes: true positive, false positive, true negative, or false negative. Each outcome is associated with an expected payoff in terms of population health and producer surplus. Payoffs depend on the accuracy of the appraisal process (sensitivity and specificity), incremental net benefit per patient, disease incidence, time of relevance for an approval, the cost of the process and the price of the technology. The accuracy of the process can be affected by altering the time and resources dedicated to it or by adjusting the definition of cost-effectiveness in terms of the acceptable level of uncertainty around the ICER. So, what determines an optimal level of accuracy in the appraisal process, assuming that producers’ price setting is exogenous? Generally, the process should have greater sensitivity (at the expense of specificity) when there is more to gain: when a greater proportion of technologies are cost-effective or when the population or time of relevance is greater. There is no fixed optimum for all situations. 
If we relax the assumption of exogenous pricing decisions, and allow pricing to be partly determined by the appraisal process, we can see that a more accurate process incentivises cost-effective price setting. The authors also consider the possibility of there being multiple stages of appraisal, with appeals, re-submissions and price agreements. The take-home message is that the appraisal process should be re-defined over time and with respect to the range of technologies being assessed, or even individualised for each technology in each setting. At the very least, it seems clear that technologies with exceptional characteristics (with respect to their potential impact on population health) should be given a bespoke appraisal. NICE is already onto these ideas – they recently introduced a fast track process for technologies with a claimed ICER below £10,000 and now give extra attention to technologies with major budget impact.
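The diagnostic-test analogy can be sketched in a few lines. This is my own stripped-down version, not the authors' model: it ignores producer surplus and pricing, assumes rejected technologies yield zero payoff, and expresses everything (including the cost of the process) in units of health. All names and numbers are invented.

```python
def expected_health_payoff(share_cost_effective, sensitivity, specificity,
                           inb_if_cost_effective, inb_if_not,
                           patients_per_year, years_relevant, process_cost):
    """Expected population health from appraising one technology.

    Approving a truly cost-effective technology (true positive) gains
    `inb_if_cost_effective` per patient; approving one that is not (false
    positive) gains `inb_if_not`, which should be negative. Rejections are
    assumed to yield zero, which is a simplification.
    """
    patients = patients_per_year * years_relevant
    p_true_positive = share_cost_effective * sensitivity
    p_false_positive = (1 - share_cost_effective) * (1 - specificity)
    health = patients * (p_true_positive * inb_if_cost_effective
                         + p_false_positive * inb_if_not)
    return health - process_cost


# Two hypothetical processes with the same cost: one sensitive, one specific.
common = dict(inb_if_cost_effective=1.0, inb_if_not=-1.0,
              patients_per_year=1000, years_relevant=5, process_cost=50)
for share in (0.2, 0.8):
    sensitive = expected_health_payoff(share, 0.95, 0.70, **common)
    specific = expected_health_payoff(share, 0.70, 0.95, **common)
    print(share, round(sensitive, 1), round(specific, 1))
```

With these made-up inputs, the more sensitive process does better when most technologies are cost-effective and the more specific process does better when few are, which is the direction of the result described above.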


# Hawking is right, Jeremy Hunt does egregiously cherry pick the evidence

I’m beginning to think Jeremy Hunt doesn’t actually care what the evidence says on the weekend effect. Last week, renowned physicist Stephen Hawking criticized Hunt for ‘cherry picking’ evidence with regard to the ‘weekend effect’: that patients admitted at the weekend are observed to be more likely than their counterparts admitted on a weekday to die. Hunt responded by doubling down on his claims:

Some people have questioned Hawking’s credentials to speak on the topic beyond being a user of the NHS. But it has taken a respected public figure to speak out to elicit a response from the Secretary of State for Health, and that should be welcomed. It remains the case though that a multitude of experts do continue to be ignored. Even the oft-quoted Freemantle paper is partially ignored where it notes of the ‘excess’ weekend deaths, “to assume that [these deaths] are avoidable would be rash and misleading.”

We produced a simple tool to demonstrate how weekend effect studies might estimate an increased risk of mortality associated with weekend admissions even in the case of no difference in care quality. However, the causal model underlying these arguments is not always obvious. So here it is:

A simple model of the effect of the weekend on patient health outcomes. The dashed line represents unobserved effects

So what do we know about the weekend effect?

1. The weekend effect exists. A multitude of studies have observed that patients admitted at the weekend are more likely to die than those admitted on a weekday. This amounts to having shown that $E(Y|W,S) \neq E(Y|W',S)$. As our causal model demonstrates, being admitted is correlated with health and, importantly, the day of the week. So this is not the same as saying that the risk of adverse clinical outcomes differs by day of the week once you take into account propensity for admission: we can’t say $E(Y|W) \neq E(Y|W')$. Nor does this evidence imply that care quality differs at the weekend, $E(Q|W) \neq E(Q|W')$. In fact, the evidence only implies differences in care quality if the propensity to be admitted is independent of (unobserved) health status, i.e. $Pr(S|U,X) = Pr(S|X)$ (or if health outcomes are uncorrelated with health status, which is definitely not the case!).
2. Admissions are different at the weekend. Fewer patients are admitted at the weekend and those that are admitted are on average more severely unwell. Evidence suggests that the better patient severity is controlled for, the smaller the estimated weekend effect. Weekend effect estimates also diminish in models that account for the selection mechanism.
3. There is some evidence that care quality may be worse at the weekend (at least in the United States), so $E(Q|W) \neq E(Q|W')$, although this has not been established in the UK (we’re currently investigating it!).
4. Staffing levels, particularly specialist to patient ratios, are different at the weekend, $E(X|W) \neq E(X|W')$.
5. There is little evidence to suggest how staffing levels and care quality are related. While the relationship seems evident prima facie, its extent is not well understood; for example, we might expect diminishing returns to increased staffing levels.
6. There is a reasonable amount of evidence on the impact of care quality (preventable errors and adverse events) on patient health outcomes.
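To illustrate the first two points, here is a small simulation sketch. It is not the tool mentioned earlier, and every number in it is invented. Unobserved severity plays the role of $U$, admission is $S$, weekend is $W$ and death is $Y$. The probability of death depends on severity only, so there is no difference in care quality by construction; the only thing that changes at the weekend is that the bar for admission is higher.

```python
import math
import random


def simulate_weekend_effect(n=200_000, seed=1):
    """Simulate mortality by day of arrival with identical care on all days."""
    rng = random.Random(seed)
    # tallies[weekend] = [people, deaths, admitted, deaths among admitted]
    tallies = {True: [0, 0, 0, 0], False: [0, 0, 0, 0]}
    for _ in range(n):
        weekend = rng.random() < 2 / 7      # W
        severity = rng.gauss(0.0, 1.0)      # U, unobserved health status
        # Selection into admission (S): a higher bar at the weekend (assumed).
        admitted = severity > (1.0 if weekend else 0.5)
        # Outcome (Y) depends on severity only, never on the day of the week.
        died = rng.random() < 1 / (1 + math.exp(-(severity - 3.0)))
        t = tallies[weekend]
        t[0] += 1
        t[1] += died
        t[2] += admitted
        t[3] += admitted and died
    return {
        ("weekend" if is_weekend else "weekday"): {
            "mortality_all": t[1] / t[0],
            "mortality_admitted": t[3] / t[2],
        }
        for is_weekend, t in tallies.items()
    }


if __name__ == "__main__":
    for day, rates in simulate_weekend_effect().items():
        print(day, rates)
```

Under these assumptions, mortality among admitted patients comes out higher at the weekend even though mortality in the whole population is about the same on every day: a 'weekend effect' in $E(Y|W,S)$ with no difference in $E(Y|W)$ or in care quality.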

But what are we actually interested in from a policy perspective? Do we actually care that it is the weekend per se? I would say no, we care that there is potentially a lapse in care quality. So, it’s a two-part question: (i) how does care quality (and hence avoidable patient harm) differ at the weekend $E(Q|W) - E(Q|W') = ?$; and (ii) what effect does this have on patient outcomes $E(Y|Q)=?$. The first question tells us to what extent policy may effect change and the second gives us a way of valuing that change, and yet the vast majority of studies in the area address neither. Despite there being a number of publicly funded research projects looking at these questions right now, it’s the studies that are not useful for policy that keep being quoted by those with the power to make change.

Hawking is right, Jeremy Hunt has egregiously cherry picked and misrepresented the evidence, as has been pointed out again and again and again and again and … One begins to wonder if there isn’t some motive other than ensuring long run efficiency and equity in the health service.


# Paul Mitchell’s journal round-up for 17th July 2017

Every Monday our authors provide a round-up of some of the most recently published peer reviewed articles from the field. We don’t cover everything, or even what’s most important – just a few papers that have interested the author. Visit our Resources page for links to more journals or follow the HealthEconBot. If you’d like to write one of our weekly journal round-ups, get in touch.

What goes wrong with the allocation of domestic and international resources for HIV? Health Economics [PubMed] Published 7th July 2017

Investment in foreign aid is coming under considerable scrutiny as a number of leading western economies re-evaluate their role in the world and their obligations to countries with developing economies. Therefore, it is important for those who believe in the benefits of such investments to show that they are being made efficiently. This paper looks at how funding for HIV is distributed both domestically and internationally across countries, using multivariate regression analysis with instruments to control for reverse causality between financing and HIV prevalence, and between domestic and international financing. The author is also concerned about countries free riding on international aid and estimates how countries ought to be allocating national resources to HIV, using quantile regression to estimate which countries have fiscal space for expanding their current domestic spending. The results of the study show that domestic expenditure relative to GDP per capita is almost unit elastic, whereas it is inelastic with regard to HIV prevalence. Government effectiveness (as defined by the World Bank indices) has a statistically significant effect on domestic expenditure, although it is nonlinear, with gains more likely when moving up from a lower level of government effectiveness. International expenditure is inversely related to GDP per capita and HIV prevalence, and positively related to government effectiveness, although the regression models for international expenditure had poor explanatory power. Countries with higher GDP per capita tended to dedicate more money towards HIV; however, the author reckons there is \$3bn of fiscal space in countries such as South Africa and Nigeria to contribute more to HIV, freeing up international aid for other countries such as Cameroon, Ghana, Thailand, Pakistan and Colombia. 
The author is concerned that countries with higher GDP should be able to allocate more to HIV, but feels there are improvements to be made in how international aid is distributed too. Although there is plenty of food for thought in this paper, I was left wondering how this analysis can help to bring about a better allocation of resources. The normative model of what funding for HIV ought to be takes the view that this is countries’ sole objective in allocating resources, which is clearly contestable (the author even casts doubt as to whether this is true for international funding of HIV). Perhaps the other demands faced by national governments (e.g. funding for other diseases, education etc.) can be better reflected in future research in this area.

Can pay-for-performance to primary care providers stimulate appropriate use of antibiotics? Health Economics [PubMed] [RePEc] Published 7th July 2017

Antibiotic resistance is one of the largest challenges facing global health this century. This study from Sweden looks at whether pay for performance (P4P) can influence the prescribing practices of GPs when treating children with respiratory tract infections. P4P was introduced on a staggered basis across a number of regions in Sweden to incentivise primary care to use narrow-spectrum penicillin as a first-line treatment, as it is said to have a smaller impact on resistance. Taking advantage of data from the Swedish Prescribed Drug Register for 2006 to 2013, the authors conducted a difference-in-differences regression analysis to estimate the effect P4P had on the share of the incentivised antibiotic. They find a positive and statistically significant main effect of P4P on prescribing of 1.1 percentage points. Of interest, the P4P scheme analysed here was linked not directly to GPs’ salaries but to the health care centre. Although there are a number of limitations with the study that the authors clearly highlight in the discussion, it is a good example of how to make the most of routinely available data. It also highlights that although the share of the incentivised antibiotic went up, overall national antibiotic use did not fall in line with a national policy aimed at reducing it during the same period. Even though Sweden is reported to be one of the lower users of antibiotics in Europe, the study highlights the need to think carefully about how targets are achieved and where incentives might help to meet them.
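For anyone unfamiliar with the method, the logic of difference-in-differences can be sketched with a simple two-group, two-period comparison. This is a deliberate simplification: the study uses a regression with staggered introduction across regions, and the shares below are invented, not taken from the paper.

```python
from statistics import mean


def diff_in_diff(treated_pre, treated_post, control_pre, control_post):
    """Change in the treated group minus change in the control group."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


# Hypothetical shares (%) of the incentivised antibiotic, by region.
p4p_before, p4p_after = [60.0, 62.0], [63.0, 65.0]
no_p4p_before, no_p4p_after = [58.0, 60.0], [59.5, 61.5]

effect = diff_in_diff(p4p_before, p4p_after, no_p4p_before, no_p4p_after)
print(f"Estimated effect: {effect:.1f} percentage points")
```

The control regions stand in for what would have happened to the P4P regions without the scheme, which is only reasonable if the two groups would otherwise have followed parallel trends.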

Econometric modelling of multiple self-reports of health states: the switch from EQ-5D-3L to EQ-5D-5L in evaluating drug therapies for rheumatoid arthritis. Journal of Health Economics Published 4th July 2017

The EQ-5D is the most frequently used health state descriptive system for the generation of utility values for quality-adjusted life years (QALYs) in economic evaluation. To improve sensitivity and reduce floor and ceiling effects, the EuroQol team developed a five level version (5L) compared to the previous three level (3L) version. This study adds to recent evidence in this area of the unforeseen consequences of making this change to the descriptive system and also the valuation system used for the 5L. Using data from the National Data Bank for Rheumatic Diseases, where both 3L and 5L versions were completed simultaneously alongside other clinical measures, the authors construct a mapping between both versions of EQ-5D, informed by the response levels and the valuation systems that have been developed in the UK for the measures. They also test their mapping estimates on a previous economic evaluation for rheumatoid arthritis treatments. The descriptive results show that although there is a high correlation between both versions, and the 5L version achieves its aim of greater sensitivity, there is a systematic difference in utility scores generated using both versions, with an average 87% of the score of the 3L recorded compared to the 5L. Not only are there differences highlighted between value sets for the 3L and 5L but also the responses to dimensions across measures, where the mobility and pain dimensions do not align as one would expect. The new mapping developed in this paper highlights some of the issues with previous mapping methods used in practice, including the assumption of independence of dimension levels from one another that was used while the new valuation for the 5L was being developed. 
Although the case study they use to demonstrate the effect of using the different approaches in practice did not result in a different cost-effectiveness result, the study does manage to highlight that the assumption of 3L and 5L versions being substitutes for one another, both in terms of descriptive systems and value sets, does not hold. Although the authors are keen to highlight the benefits of their new mapping that produces a smooth distribution from actual to predicted 5L, decision makers will need to be clear about what descriptive system they now want for the generation of QALYs, given the discrepancies between 3L and 5L versions of EQ-5D, so that consistent results are obtained from economic evaluations.
