# Chris Sampson’s journal round-up for 23rd September 2019

Every Monday our authors provide a round-up of some of the most recently published peer reviewed articles from the field. We don’t cover everything, or even what’s most important – just a few papers that have interested the author. Visit our Resources page for links to more journals or follow the HealthEconBot. If you’d like to write one of our weekly journal round-ups, get in touch.

Can you repeat that? Exploring the definition of a successful model replication in health economics. PharmacoEconomics [PubMed] Published 18th September 2019

People talk a lot about replication and its role in demonstrating the validity and reliability of analyses. But what does a successful replication in the context of cost-effectiveness modelling actually mean? Does it mean coming up with precisely the same estimates of incremental costs and effects? Does it mean coming up with a model that recommends the same decision? The authors of this study sought to bring us closer to an operational definition of replication success.

There is potentially much to learn from other disciplines that have a more established history of replication. The authors reviewed literature on the definition of ‘successful replication’ across all disciplines, and used their findings to construct a variety of candidate definitions for use in the context of cost-effectiveness modelling in health. Ten definitions of a successful replication were pulled out of the cross-disciplinary review, which could be grouped into ‘data driven’ replications and ‘experimental’ replications – the former relating to the replication of analyses and the latter relating to the replication of specific observed effects. The ten definitions were from economics, biostatistics, cognitive science, psychology, and experimental philosophy. The definitions varied greatly, with many involving subjective judgments about the proximity of findings. A few studies were found that reported on replications of cost-effectiveness models and which provided some judgment on the level of success. Again, these were inconsistent and subjective.

Quite reasonably, the authors judge that the lack of a fixed definition of successful replication in any scientific field is not just an oversight. The threshold for ‘success’ depends on the context of the replication and on how the evidence will be used. This paper provides six possible definitions of replication success for use in cost-effectiveness modelling, ranging from an identical replication of the results, through partial success in replicating specific pathways within a given margin of error, to simply replicating the same implied decision.

Ultimately, ‘data driven’ replications are a solution to a problem that shouldn’t exist, namely, poor reporting. This paper mostly convinced me that overall ‘success’ isn’t a useful thing to judge in the context of replicating decision models. Replication of certain aspects of a model is useful to evaluate. Whether the replication implied the same decision is a key thing to consider. Beyond this, it is probably worth considering partial success in replicating specific parts of a model.

Differential associations between interpersonal variables and quality-of-life in a sample of college students. Quality of Life Research [PubMed] Published 18th September 2019

There is growing interest in the well-being of students and the distinct challenges involved in achieving good mental health and addressing high levels of demand for services in this group. Students go through many changes that might influence their mental health, prominent among these is the change to their social situation.

This study set out to identify the role of key interpersonal variables on students’ quality of life. The study recruited 1,456 undergraduate students from four universities in the US. The WHOQOL measure was used for quality of life and a barrage of measures were used to collect information on loneliness, social connectedness, social support, emotional intelligence, intimacy, empathic concern, and more. Three sets of analyses of increasing sophistication were conducted, from zero-order correlations between each measure and the WHOQOL, to a network analysis using a Gaussian Graphical Model to identify both direct and indirect relationships while accounting for shared variance.

In all analyses, loneliness stuck out as the strongest driver of quality of life. Social support, social connectedness, emotional intelligence, intimacy with one’s romantic partner, and empathic concern were also significantly associated with quality of life. But the impact of loneliness was greatest, with other interpersonal variables influencing quality of life through their impact on loneliness.

This is a well-researched and reported study. The findings are informative to student support and other services that seek to improve the well-being of students. There is reason to believe that such services should recognise the importance of interpersonal determinants of well-being and in particular address loneliness. But it’s important to remember that this study is only as good as the measures it uses. If you don’t think WHOQOL is adequately measuring student well-being, or you don’t think the UCLA Loneliness Scale tells us what we need to know, you might not want these findings to influence practice. And, of course, the findings may not be generalisable, as the extent to which different interpersonal variables affect quality of life is very likely dependent on the level of service provision, which varies greatly between different universities, let alone countries.

Affordability and non-perfectionism in moral action. Ethical Theory and Moral Practice [PhilPapers] Published 14th September 2019

The ‘cost-effective but unaffordable’ challenge has been bubbling for a while now, at least since sofosbuvir came on the scene. This study explores whether “we can’t afford it” is a justifiable position to take. The punchline is that, no, affordability is not a sound ethical basis on which to support or reject the provision of a health technology. I was extremely sceptical when I first read the claim. If we can’t afford it, it’s impossible, and how can there by a moral imperative in an impossibility? But the authors proceeded to convince me otherwise.

The authors don’t go into great detail on this point, but it all hinges on divisibility. The reason that a drug like sofosbuvir might be considered unaffordable is that loads of people would be eligible to receive it. If sofosbuvir was only provided to a subset of this population, it could be affordable. On this basis, the authors propose the ‘principle of non-perfectionism’. This states that not being able to do all the good we can do (e.g. provide everyone who needs it with sofosbuvir) is not a reason for not doing some of the good we can do. Thus, if we cannot support provision of a technology to everyone who could benefit from it, it does not follow (ethically) to provide it to nobody, but rather to provide it to some people. The basis for selecting people is not of consequence to this argument but could be based on a lottery, for example.

Building on this, the authors explain to us why this is wrong, with the notion of ‘numerical discrimination’. They argue that it is not OK to prioritise one group over another simply because we can meet the needs of everyone within that group as opposed to only some members of the other group. This is exactly what’s happening when we are presented with notions of (un)affordability. If the population of people who could benefit from sofosbuvir was much smaller, there wouldn’t be an issue. But the simple fact that the group is large does not make it morally permissible to deny cost-effective treatment to any individual member within that group. You can’t discriminate against somebody because they are from a large population.

I think there are some tenuous definitions in the paper and some questionable analogies. Nevertheless, the authors succeeded in convincing me that total cost has no moral weight. It is irrelevant to moral reasoning. We should not refuse any health technology to an entire population on the grounds that it is ‘unaffordable’. The authors frame it as a ‘mistake in moral mathematics’. For this argument to apply in the HTA context, it relies wholly on the divisibility of health technologies. To some extent, NICE and their counterparts are in the business of defining models of provision, which might result in limited use criteria to get around the affordability issue. Though these issues are often handled by payers such as NHS England.

The authors of this paper don’t consider the implications for cost-effectiveness thresholds, but this is where my thoughts turned. Does the principle of non-perfectionism undermine the morality of differentiating cost-effectiveness thresholds according to budget impact? I think it probably does. Reducing the threshold because the budget impact is great will result in discrimination (‘numerical discrimination’) against individuals simply because they are part of a large population that could benefit from treatment. This seems to be the direction in which we’re moving. Maybe the efficiency cart is before the ethical horse.

Credits

## Why did I do it?

I have evaluated lots of services and been involved in trials where I have asked people to collect EQ-5D data. During this time several people have complained to me about having to collect EQ-5D data so I thought I would have a ‘taste of my own medicine’. I measured my health-related quality of life (HRQoL) using EQ-5D-3L, EQ-5D-VAS, and EQ-5D-5L, every day for a year (N=1). I had the EQ-5D on a spreadsheet on my smartphone and prompted myself to do it at 9 p.m. every night. I set a target of never being more than three days late in doing it, which I missed twice through the year. I also recorded health-related notes for some days, for instance, 21st January said “tired, dropped a keytar on toe (very 1980s injury)”.

By doing this I wanted to illuminate issues around anchoring, ceiling effects and ideas of health and wellness. With a big increase in wearable tech and smartphone health apps this type of big data collection might become a lot more commonplace. I have not kept a diary since I was about 13 so it was an interesting way of keeping track on what was happening, with a focus on health. Starting the year I knew I had one big life event coming up: a new baby due in early March. I am generally quite healthy, a bit overweight, don’t get enough sleep. I have been called a hypochondriac by people before, typically complaining of headaches, colds and sore throats around six months of the year. I usually go running once or twice a week.

From the start I was very conscious that I felt I shouldn’t grumble too much, that EQ-5D was mainly used to measure functional health in people with disease, not in well people (and ceiling effects were a feature of the EQ-5D). I immediately felt a ‘freedom’ of the greater sensitivity of the EQ-5D-5L when compared to the 3L so I could score myself as having slight problems with the 5L, but not that they were bad enough to be ‘some problems’ on the 3L.

There were days when I felt a bit achey or tired because I had been for a run, but unless I had an actual injury I did not score myself as having problems with pain or mobility because of this; generally if I feel achey from running I think of that as a good thing as having pushed myself hard, ‘no pain no gain’. I also started doing yoga this year which made me feel great but also a bit achey sometimes. But in general I noticed that one of the main problems I had was fatigue which is not explicitly covered in the EQ-5D but was reflected sometimes as being slightly impaired on usual activities. I also thought that usual activities could be impaired if you are working and travelling a lot, as you don’t get to do any of the things you enjoy doing like hobbies or spending time with family, but this is more of a capability question whereas the EQ-5D is more functional.

## How did my HRQoL compare?

I matched up my levels on the individual domains to EQ-5D-3L and 5L index scores based on UK preference scores. The final 5L value set may still change; I used the most recent published scores. I also matched my levels to a personal 5L value set which I did using this survey which uses discrete choice experiments and involves comparing a set of pairs of EQ-5D-5L health states. I found doing this fascinating and it made me think about how mutually exclusive the EQ-5D dimensions are, and whether some health states are actually implausible: for instance, is it possible to be in extreme pain but not have any impairment on usual activities?

Surprisingly, my average EQ-5D-3L index score (0.982) was higher than the population averages for my age group (for England age 35-44 it is 0.888 based on Szende et al 2014); I expected them to be lower. In fact my average index scores were higher than the average for 18-24 year olds (0.922). I thought that measuring EQ-5D more often and having more granularity would lead to lower average scores but it actually led to high average scores.

My average score from the personal 5L value set was slightly higher than the England population value set (0.983 vs 0.975). Digging into the data, the main differences were that I thought that usual activities were slightly more important, and pain slightly less important, than the general population. The 5L (England tariff) correlated more closely with the VAS than the 3L (r2 =0.746 vs. r2 =0.586) but the 5L (personal tariff) correlated most closely with the VAS (r2 =0.792). So based on my N=1 sample, this suggests that the 5L is a better predictor of overall health than the 3L, and that the personal value set has validity in predicting VAS scores.

## Reflection

I definitely regretted doing the EQ-5D every day and was glad when the year was over! I would have preferred to have done it every week but I think that would have missed a lot of subtleties in how I felt from day to day. On reflection the way I was approaching it was that the end of each day I would try to recall if I was stressed, or if anything hurt, and adjust the level on the relevant dimension. But I wonder if I was prompted at any moment during the day as to whether I was stressed, had some mobility issues, or pain, would I say I did? It makes me think about Kahneman and Riis’s ‘remembering brain’ and ‘experiencing brain’. Was my EQ-5D profile a slave to my ‘remembering brain’ rather than my ‘experiencing brain’?

One thing when my score was low for a few days was when I had a really painful abscess on my tooth. At the time I felt like the pain was unbearable so had a high pain score, but looking back I wonder if it was that bad, but I didn’t want to retrospectively change my score. Strangely, I had the flu twice in this year which gave me some health decrements, which I don’t think has ever happened to me before (I don’t think it was just ‘man flu’!).

I knew that I was going to have a baby this year but I didn’t know that I would spend 18 days in hospital, despite not being ill myself. This has led me to think a lot more about ‘caregiver effects‘ – the impact of close relatives being ill; it is unnerving spending night after night in hospital, in this case because my wife was very ill after giving birth, and then when my baby son was two months old, he got very ill (both are doing a lot better now). Being in hospital with a sick relative is a strange feeling, stressful and boring at the same time. I spent a long time staring out of the window or scrolling through Twitter. When my baby son was really ill he would not sleep and did not want to be put down, so my arms were aching after holding him all night. I was lucky that I had understanding managers in work and I was not significantly financially disadvantaged by caring for sick relatives. And glad of the NHS and not getting a huge bill when family members are discharged from hospital.

### Health, wellbeing & exercise

Doing this made me think more about the difference between health and wellbeing; there might be days where I was really happy but it wasn’t reflected in my EQ-5D index score. I noticed that doing exercise always led to a higher VAS score – maybe subconsciously I was thinking exercise was increasing my ‘health stock‘. I probably used the VAS score more like an overall wellbeing score rather than just health which is not correct – but I wonder if other people do this as well, and that is why there are less pronounced ceiling effects with the VAS score.

### Could trials measure EQ-5D every day?

One advantage of EQ-5D and QALYs over other health outcomes is that they should be measured over a schedule and use the area under the curve. Completing an EQ5D every day has shown me that health does vary every day, but I still think it might be impractical for trial participants to complete an EQ-5D questionnaire every day. Perhaps EQ-5D data could be combined with a simple daily VAS score, possibly out of ten rather than 100 for simplicity.

Joint worst day: 6th and 7th October: EQ-5D-3L index 0.264, EQ-5D-5L index 0.724; personal EQ-5D-5L index 0.824; VAS score 60 – ‘abscess on tooth, couldn’t sleep, face swollen’.

Joint best day: 27th January, 7th September, 11th September, 18th November, 4th December, 30th December: EQ-5D-3L index 1.00;  both EQ-5D-5L index scores 1.00; VAS score 95 – notes include ‘lovely day with family’, ‘went for a run’, ‘holiday’, ‘met up with friends’.

# Sam Watson’s journal round-up for 10th September 2018

Every Monday our authors provide a round-up of some of the most recently published peer reviewed articles from the field. We don’t cover everything, or even what’s most important – just a few papers that have interested the author. Visit our Resources page for links to more journals or follow the HealthEconBot. If you’d like to write one of our weekly journal round-ups, get in touch.

Probabilistic sensitivity analysis in cost-effectiveness models: determining model convergence in cohort models. PharmacoEconomics [PubMed] Published 27th July 2018

Probabilistic sensitivity analysis (PSA) is rightfully a required component of economic evaluations. Deterministic sensitivity analyses are generally biased; averaging the outputs of a model based on a choice of values from a complex joint distribution is not likely to be a good reflection of the true model mean. PSA involves repeatedly sampling parameters from their respective distributions and analysing the resulting model outputs. But how many times should you do this? Most times, an arbitrary number is selected that seems “big enough”, say 1,000 or 10,000. But these simulations themselves exhibit variance; so-called Monte Carlo error. This paper discusses making the choice of the number of simulations more formal by assessing the “convergence” of simulation output.

In the same way as sample sizes are chosen for trials, the number of simulations should provide an adequate level of precision, anything more wastes resources without improving inferences. For example, if the statistic of interest is the net monetary benefit, then we would want the confidence interval (CI) to exclude zero as this should be a sufficient level of certainty for an investment decision. The paper, therefore, proposed conducting a number of simulations, examining the CI for when it is ‘narrow enough’, and conducting further simulations if it is not. However, I see a problem with this proposal: the variance of a statistic from a sequence of simulations itself has variance. The stopping points at which we might check CI are themselves arbitrary: additional simulations can increase the width of the CI as well as reduce them. Consider the following set of simulations from a simple ratio of random variables $ICER = gamma(1,0.01)/normal(0.01,0.01)$:The “stopping rule” therefore proposed doesn’t necessarily indicate “convergence” as a few more simulations could lead to a wider, as well as narrower, CI. The heuristic approach is undoubtedly an improvement on the current way things are usually done, but I think there is scope here for a more rigorous method of assessing convergence in PSA.

Mortality due to low-quality health systems in the universal health coverage era: a systematic analysis of amenable deaths in 137 countries. The Lancet [PubMed] Published 5th September 2018

Richard Horton, the oracular editor-in-chief of the Lancet, tweeted last week:

There is certainly an argument that academic journals are good forums to make advocacy arguments. Who better to interpret the analyses presented in these journals than the authors and audiences themselves? But, without a strict editorial bulkhead between analysis and opinion, we run the risk that the articles and their content are influenced or dictated by the political whims of editors rather than scientific merit. Unfortunately, I think this article is evidence of that.

No-one debates that improving health care quality will improve patient outcomes and experience. It is in the very definition of ‘quality’. This paper aims to estimate the numbers of deaths each year due to ‘poor quality’ in low- and middle-income countries (LMICs). The trouble with this is two-fold: given the number of unknown quantities required to get a handle on this figure, the definition of quality notwithstanding, the uncertainty around this figure should be incredibly high (see below); and, attributing these deaths in a causal way to a nebulous definition of ‘quality’ is tenuous at best. The approach of the article is, in essence, to assume that the differences in fatality rates of treatable conditions between LMICs and the best performing health systems on Earth, among people who attend health services, are entirely caused by ‘poor quality’. This definition of quality would therefore seem to encompass low resourcing, poor supply of human resources, a lack of access to medicines, as well as everything else that’s different in health systems. Then, to get to this figure, the authors have multiple sources of uncertainty including:

• Using a range of proxies for health care utilisation;
• Using global burden of disease epidemiology estimates, which have associated uncertainty;
• A number of data slicing decisions, such as truncating case fatality rates;
• Estimating utilisation rates based on a predictive model;
• Estimating the case-fatality rate for non-users of health services based on other estimated statistics.

Despite this, the authors claim to estimate a 95% uncertainty interval with a width of only 300,000 people, with a mean estimate of 5.0 million, due to ‘poor quality’. This seems highly implausible, and yet it is claimed to be a causal effect of an undefined ‘poor quality’. The timing of this article coincides with the Lancet Commission on care quality in LMICs and, one suspects, had it not been for the advocacy angle on care quality, it would not have been published in this journal.

Embedding as a pitfall for survey‐based welfare indicators: evidence from an experiment. Journal of the Royal Statistical Society: Series A Published 4th September 2018

Health economists will be well aware of the various measures used to evaluate welfare and well-being. Surveys are typically used that are comprised of questions relating to a number of different dimensions. These could include emotional and social well-being or physical functioning. Similar types of surveys are also used to collect population preferences over states of the world or policy options, for example, Kahneman and Knetsch conducted a survey of WTP for different environmental policies. These surveys can exhibit what is called an ’embedding effect’, which Kahneman and Knetsch described as when the value of a good varies “depending on whether the good is assessed on its own or embedded as part of a more inclusive package.” That is to say that the way people value single dimensional attributes or qualities can be distorted when they’re embedded as part of a multi-dimensional choice. This article reports the results of an experiment involving students who were asked to weight the relative importance of different dimensions of the Better Life Index, including jobs, housing, and income. The randomised treatment was whether they rated ‘jobs’ as a single category, or were presented with individual dimensions, such as the unemployment rate and job security. The experiment shows strong evidence of embedding – the overall weighting substantially differed by treatment. This, the authors conclude, means that the Better Life Index fails to accurately capture preferences and is subject to manipulation should a researcher be so inclined – if you want evidence to say your policy is the most important, just change the way the dimensions are presented.

Credits