Paul Mitchell’s journal round-up for 25th December 2017

Every Monday our authors provide a round-up of some of the most recently published peer reviewed articles from the field. We don’t cover everything, or even what’s most important – just a few papers that have interested the author. Visit our Resources page for links to more journals or follow the HealthEconBot. If you’d like to write one of our weekly journal round-ups, get in touch.

Consensus-based cross-European recommendations for the identification, measurement and valuation of costs in health economic evaluations: a European Delphi study. European Journal of Health Economics [PubMedPublished 19th December 2017

The primary aim of this study was to develop guidelines for costing in economic evaluation studies conducted across more than one European country. The starting point of the societal perspective as the benchmark for costing was not entirely obvious from the abstract, where this broadest approach to costing is not recommended uniformly across all European countries. Recommendations following this starting point looked at the identification, measurement and valuation of resource use, discount rate and discounting of future costs. A three-step Delphi study was used to gain consensus on what should be included in an economic evaluation from a societal perspective, based initially on findings from a review of costing methodologies adopted across European country-specific guidelines. Consensus required at least two thirds (67%) agreement across those participating in the Delphi study at all 3 stages. Where no agreement was reached after the three stages, a panel of four of the co-authors made a final decision on what should be recommended. In total, 26 of the 110 invited to participate completed at least one Delphi round, with all Delphi rounds having at least 16 participants. It remains unclear to me if 16 for a Delphi round is sufficient to reach a European wide consensus on costing methodologies. There were a number of key areas where no consensus was reached (e.g. including costs unrelated to the intervention, measurement of resource use and absenteeism, and valuation of opportunity costs of patient time and informal care), so the four-strong author panel had a leading role on some of the main recommendations. Notwithstanding the limitations associated with the reference perspective taken and sample for the Delphi study and panel, the paper provides a useful illustration of the different approaches to costing across European countries. It also provides a good coverage of costing issues that need to be explained in detail in economic evaluations to allow for clear understanding of methods used and the underpinning rationale for those decisions where a choice is required on the costing methodology applied.

A (five-)level playing field for mental health conditions?: exploratory analysis of EQ-5D-5L derived utility values. Quality of Life Research [PubMedPublished 16th December 2017

The UK health economics community has been reeling from the decision made earlier this year by UK guidelines developer, the National Institute for Health and Care Excellence (NICE), who recommended to not adopt the new population values developed for the EQ-5D-5L version when calculating QALYs and instead rely on a crosswalk of the values developed over 20 years ago for the 3 level EQ-5D version. This paper provides a timely comparison of how these two value sets perform for the EQ-5D-5L descriptive system in patient groups with mental health conditions, groups often thought to be disadvantaged by the physical health functioning focus of the EQ-5D descriptive system. Using baseline data from three trials, the authors find that the new utility values produce a higher mean EQ-5D score of 0.08 compared to the old crosswalk values, with a 0.225 difference for those reporting extreme problems with the anxiety/depression dimension on EQ-5D. Although, the authors of this study highlight using these new values would increase cost per QALY results in this sample using scenario analysis, when improvements are in the depression/anxiety category only, such improvements are relatively better than across the whole EQ-5D-5L descriptive system due to the relative additional value placed on the anxiety/depression dimension in the new values. This paper makes for interesting reading and one that NICE should take into consideration when reviewing their decision on this issue next year. Although I would disagree with the authors when they state that this study would be a primary reason for revising the NICE cost-effectiveness threshold (more compelling arguments for this elsewhere in my view), it does clearly highlight the influence of the choice of descriptive system and the values used in the outcomes produced for economic analysis such as QALYs, even when the two descriptive systems in question (EQ-5D-3L and EQ-5D-5L) are roughly the same.

What characteristics of nursing homes are most valued by customers? A discrete choice experiment with residents and family members. Value in Health Published 1st December 2017

Our final paper for review in 2017 looks at the characteristics that are of most importance to individuals and their family members when it comes to nursing home provision. The authors conducted a valuation exercise using a discrete choice experiment (DCE) to calculate the relative importance of the attributes contained on the Consumer Choice Index-Six Dimension (CCI-6D), a measure developed to assess the quality of nursing home care across 3 levels on six domains: 1. level of time care staff spent with residents; 2. homeliness of shared spaces; 3. homeliness of room setup; 4. access to outside and garden; 5. frequency of meaningful activities; and 6. flexibility with care routines. Those who lived in a nursing home for at least a year with low levels of cognitive impairment completed the DCE themselves, whereas family members were asked to proxy for their close relative with more severe cognitive impairment. 126 residents and 416 family member proxies completed the DCE comparisons of nursing homes with different qualities in these six areas. The results of the DCE show differences in preferences across the two groups. Although similar importance is placed on some dimensions across both groups (i.e. “homeliness of room set up” ranked highly, whereas “frequency of meaningful activities” ranked lower), residents value access to outside and garden four times as much as the family proxies do (second most important dimension for residents, lowest for family proxies), family members value level of time care staff spent with residents twice as much as residents themselves (most important attribute for family proxies, third most important for residents). Although residents in both groups may have important differences in characteristics that might explain some of this difference, it is probably a good time of year to remember family preferences may be inconsistent with individuals within them, so make sure to take account of this variation when preparing those Christmas dinners.

Happy holidays all.


Alastair Canaway’s journal round-up for 18th September 2017

Every Monday our authors provide a round-up of some of the most recently published peer reviewed articles from the field. We don’t cover everything, or even what’s most important – just a few papers that have interested the author. Visit our Resources page for links to more journals or follow the HealthEconBot. If you’d like to write one of our weekly journal round-ups, get in touch.

Selection of key health domains from PROMIS® for a generic preference-based scoring system. Quality of Life Research [PubMedPublished 19th August 2017

The US Panel on Cost-Effectiveness recommends the use of QALYs. It doesn’t, however, instruct (unlike the UK) as to what measure should be used. This leaves the door ajar for both new and established measures. This paper sets about developing a new preference-based measure from the Patient-Reported Outcomes Measurement System (PROMIS). PROMIS is a US National Institutes of Health funded suite of person-centred measures of physical, mental, and social health. Across all the PROMIS measures there exist over 70 domains of health relevant to adult health. For all its promise, the PROMIS system does not produce a summary score amenable to the calculation of QALYs, nor for general descriptive purposes such as measuring HRQL over time. This study aimed to reduce the 70 items down to a number suitable for valuation. To do this, Delphi methods were used. The Delphi approach is something that seems to be increasing in popularity in the health economics world. For those unfamiliar, it essentially involves obtaining the opinions of experts independently and iteratively conducting rounds of questioning to reach a consensus (over two or more rounds). In this case nine health outcomes experts were recruited, they were presented with ‘all 37 domains’ (no mention is made of how they got from 70 to 37!) and asked to remove any domains that were not appropriate for inclusion in a general health utility measure or were redundant due to another PROMIS domain. If more than seven experts agreed, then the domain was removed. Responses were combined and presented until consensus was reached. This left 10 domains. They then used a community sample of 50 participants to test for independence of domains using a pairwise independence evaluation test. They were given the option of removing a domain they felt was not important to overall HRQL and asked to rate the importance of remaining domains using a VAS. These findings were used by the research team to whittle down from nine domains to seven. The final domains were: Cognitive function- abilities; Depression; Fatigue; Pain Interference; Physical Function; Ability to participate in social roles and activities; and Sleep disturbance. Many of these are common to existing measures but I did rather like the inclusion of cognitive function and fatigue – something that is missing in many, and to me appear important. The next step is valuation. Upon valuation, this is a promising candidate for use in economic evaluation – particularly in the US where the PROMIS measurement suite is already established.

Predictive validation and the re-analysis of cost-effectiveness: do we dare to tread? PharmacoEconomics [PubMedPublished 22nd August 2017

PharmacoEconomics treated us to a provocative editorial regarding predictive validation and re-analysis of cost-effectiveness models – a call to arms of sorts. For those (like me) who are not modelling experts, predictive validation (aka 4th order validation) refers to the comparison of model outputs with data that are collected after the initial analysis of the model. So essentially you’re comparing what you modelled would happen with what actually happened. The literature suggests that predictive validation is widely ignored. The importance of predictive validity is highlighted with a case study where predictive-validity was examined three years after the end of a trial – upon reanalysis the model was poor. This was then revised, which led to a much better fit of the prospective data. Predictive validation can, therefore, be used to identify sources of inaccuracies in models. If predictive validity was examined more commonly, improvements in model quality more generally are possible. Furthermore, it might be possible to identify specific contexts where poor predictive validity is prevalent and thus require further research. The authors highlight the field of advanced cancers as a particularly relevant context where uncertainty around survival curves is prevalent. By actively scheduling further data collection and updating the survival curves we can reduce the uncertainty surrounding the value of high-cost drugs. Predictive validation can also inform other aspects of the modelling process, such as the best choice of time point from which to extrapolate, or credible rates of change in predicted hazards. The authors suggest using expected value of information analysis to identify technologies with the largest costs of uncertainty to prioritise where predictive validity could be assessed. NICE and other reimbursement bodies require continued data collection for ‘some’ new technologies, the processes are therefore in place for future studies to be designed and implemented in a way to capture such data which allows later re-analysis. Assessing predictive validity seems eminently sensible, there are however barriers. Money is the obvious issue, extended prospective data collection and re-analysis of models requires resources. It does, however, have the potential to save money and improve health in the long run. The authors note how in a recent study they demonstrated that a drug for osteoporosis that had been recommended by Australia’s Pharmaceutical Benefits Advisory Committee was not actually cost-effective when further data were examined. There is clearly value to be achieved in predictive validation and re-analysis – it’s hard to disagree with the authors and we should probably be campaigning for longer term follow-ups, re-analysis and increased acknowledgement of the desirability of predictive validity.

How should cost-of-illness studies be interpreted? The Lancet Psychiatry [PubMed] Published 7th September 2017

It’s a good question – cost of illness studies are commonplace, but are they useful from a health economics perspective? A comment piece in The Lancet Psychiatry examines this issue using the case study of self-harm and suicide. It focuses on a recent publication by Tsiachristas et al, which examines the hospital resource use and care costs for all presentations of self-harm in a UK hospital. Each episode of self-harm cost £809, and when extrapolated to the UK cost £162 million. Over 30% of these costs were psychological assessments which despite being recommended by NICE only 75% of self-harming patients received. If all self-harming patients received assessments as recommended by NICE then another £51 million would be added to the bill. The author raises the question of how much use is this information for health economists. Nearly all cost of illness studies end up concluding that i) they cost a lot, and ii) money could be saved by reducing or ameliorating the underlying factors that cause the illness. Is this helpful? Well, not particularly, by focusing only on one illness there is no consideration of the opportunity cost: if you spend money preventing one condition then that money will be displacing resources elsewhere, likewise, resources spent reducing one illness will likely be balanced by increased spending on another illness. The author highlights this with a thought experiment: “imagine a world where a cost of illness study has been done for every possible diseases and that the total cost of illness was aggregated. The counterfactual from such an exercise is a world where nobody gets sick and everybody dies suddenly at some pre-determined age”. Another issue is that more often than not, cost of illness studies identify that more, not less should be spent on a problem, in the self-harm example it was that an extra £51 million should be spent on psychological assessments. Similarly, it highlights the extra cost of psychological assessments, rather than the glaring issue that 25% who attend hospital for self-harm are not getting the required psychological assessments. This very much links into the final point that cost of illness studies neglect the benefits being achieved. Now all the negatives are out the way, there are at least a couple of positives I can think of off the top of my head i) identification of key cost drivers, and ii) information for use in economic models. The take home message is that although there is some use to cost of illness studies, from a health economics perspective we (as a field) would be better off spending our time steering clear.


Widespread misuse of statistical significance in health economics

Despite widespread cautionary messages, p-values and claims of statistical significance are continuously misused. One of the most common errors is to mistake statistical significance for economic, clinical, or political significance. This error may manifest itself by authors interpreting only ‘statistically significant’ results as important, or even neglecting to examine the magnitude of estimated coefficients. For example, we’ve written previously about a claim of how statistically insignificant results are ‘meaningless’. Another common error is to ‘transpose the conditional’, that is to interpret the p-value as the posterior probability of a null hypothesis. For example, in an exchange on Twitter recently, David Colquhoun, whose discussions of p-values we’ve also previously covered, made the statement:

However, the p-value does not provide probability/evidence of a null hypothesis (that an effect ‘exists’). P-values are correlated with the posterior probability of the null hypothesis in a way that depends on statistical power, choice of significance level, and prior probability of the null. But observing a significant p-value only means that the data were unlikely to be produced by a particular model, not that the alternative hypothesis is true. Indeed, the null hypothesis may be a poor explanation for the observed data, but that does not mean it is a better explanation than the alternative. This is the essence of Lindley’s paradox.

So what can we say about p-values? The six principles of the ASA’s statement on p-values are:

  1. P-values can indicate how incompatible the data are with a specified statistical model.
  2. P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
  3. Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.
  4. Proper inference requires full reporting and transparency.
  5. A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.
  6. By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.


In 1996, Deirdre McClosky and Stephen Ziliak surveyed economics papers published in the American Economic Review in the 1980s for p-value misuse. Overall, 70% did not distinguish statistical from economic significance and 96% misused a test statistic in some way. Things hadn’t improved when they repeated the study ten years later. Unfortunately, these problems are not exclusive to the AER. A quick survey of a top health economics journal, Health Economics, finds similar misuse as we discuss below. This journal is not singled out for any particular reason beyond that it’s one of the key journals in the field covered by this blog, and frequently features in our journal round-ups. Similarly, no comment is made on the quality of the studies or authors beyond the claims and use of statistical significance. Nevertheless, where there are p-values, there are problems. For such a pivotal statistic, one that careers can be made or broken on, we should at least get it right!

Nine studies were published in the May 2017 issue of Health Economics. The list below shows some examples of p-value errors in the text of the articles. The most common issue was using the p-value to interpret whether an effect exists or not, or using it as the (only) evidence to support or reject a particular hypothesis. As described above, the statistical significance of a coefficient does not imply the existence of an effect. Some of the statements claimed below to be erroneous may be contentious as, in the broader context of the paper, they may make sense. For example, claiming that a statistically significant estimate is evidence of an effect may be correct where the broader totality of the evidence suggests that any observed data would be incompatible with a particular model. However, this is generally not the way the p‘s are used.

Examples of p-value (mis-)statements

Even the CMI has no statistically significant effect on the facilitation ratio. Thus, the diversity and complexity of treated patients do not play a role for the subsidy level of hospitals.

the coefficient for the baserate is statistically significant for PFP hospitals in the FE model, indicating that a higher price level is associated with a lower level of subsidies.

Using the GLM we achieved nine significant effects, including, among others, Parkinson’s disease and osteoporosis. In all components we found more significant effects compared with the GLM approach. The number of significant effects decreases from component 2 (44 significant effects) to component 4 (29 significant effects). Although the GLM lead to significant results for intestinal diverticulosis, none of the component showed equivalent results. This might give a hint that taking the component based heterogeneity into account, intestinal diverticulosis does not significantly affect costs in multimorbidity patients. Besides this, certain coefficients are significant in only one component.

[It is unclear what ‘significant’ and ‘not significant’ refer to or how they are calculated but appear to refer to t>1.96. Not clear if corrections for multiple comparisons.]

There is evidence of upcoding as the coefficient of spreadp_posis statistically significant.

Neither [variable for upcoding] is statistically significant. The incentive for upcoding is, according to these results, independent of the statutory nature of hospitals.

The checkup significantly raises the willingness to pay any positive amount, although it does not significantly affect the amount reported by those willing to pay some positive amount.

[The significance is with reference to statistical significance].

Similarly, among the intervention group, there were lower probabilities of unhappiness or depression (−0.14, p = 0.045), being constantly under strain (0.098, p = 0.013), and anxiety or depression (−0.10, p = 0.016). There was no difference between the intervention group and control group 1 (eligible non-recipients) in terms of the change in the likelihood of hearing problems (p = 0.64), experiencing elevate blood pressure (p = 0.58), and the number of cigarettes smoked (p = 0.26).

The ∆CEs are also statistically significant in some educational categories. At T + 1, the only significant ∆CE is observed for cancer survivors with a university degree for whom the cancer effect on the probability of working is 2.5 percentage points higher than the overall effect. At T + 3, the only significant ∆CE is observed for those with no high school diploma; it is 2.2 percentage points lower than the overall cancer effect on the probability of working at T + 3.

And, just for balance, here is a couple from this year’s winner of the Arrow prize at iHEA, which gets bonus points for the phrase ‘marginally significant’, which can be used both to confirm and refute a hypothesis depending on the inclination of the author:

Our estimated net effect of waiting times for high-income patients (i.e., adding the waiting time coefficient and the interaction of waiting times and high income) is positive, but only marginally significant (p-value 0.055).

We find that patients care about distance to the hospital and both of the distance coefficients are highly significant in the patient utility function.


As we’ve argued before, p-values should not be the primary result reported. Their interpretation is complex and so often leads to mistakes. Our goal is to understand economic systems and to determine the economic, clinical, or policy relevant effects of interventions or modifiable characteristics. The p-value does provide some useful information but not enough to support the claims made from it.