Placebos for all, or why the p-value should have no place in healthcare decision making

David Colquhoun, professor of pharmacology at UCL, has a new essay over at Aeon opining about the problems with p-values. A short while back, we also discussed p-value problems, and Colquhoun arrives at the same conclusions as we did about the need to abandon ideas of ‘statistical significance’. Despite mentioning Bayes’ theorem, Colquhoun’s essay is firmly based in the frequentist statistical paradigm. He frames his discussion around the frequency of false positive and negative findings in repeated trials:

An example should make the idea more concrete. Imagine testing 1,000 different drugs, one at a time, to sort out which works and which doesn’t. You’d be lucky if 10 per cent of them were effective, so let’s proceed by assuming a prevalence or prior probability of 10 per cent. Say we observe a ‘just significant’ result, for example, a P = 0.047 in a single test, and declare that this is evidence that we have made a discovery. That claim will be wrong, not in 5 per cent of cases, as is commonly believed, but in 76 per cent of cases. That is disastrously high. Just as in screening tests, the reason for this large number of mistakes is that the number of false positives in the tests where there is no real effect outweighs the number of true positives that arise from the cases in which there is a real effect.

The argument is focused on the idea that the aim of medical research is to determine whether or not an effect exists, and that this determination is made through repeated testing. This idea, fostered by null hypothesis significance testing, supports the notion of the researcher as discoverer: by finding p<0.05, the researcher has ‘discovered’ an effect.
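The screening arithmetic behind this sort of example can be made explicit. The sketch below uses the 10% prevalence from the quote plus an assumed 80% power (my assumption, not stated in the quote); this standard ‘p < 0.05’ version of the calculation gives a false discovery rate of 36%. Colquhoun’s more alarming 76% comes from conditioning on a p-value of almost exactly 0.047, which is far weaker evidence than a result anywhere below 0.05.

```python
# Screening arithmetic for declared 'discoveries' across many trials.
# Assumptions (the power figure is mine, not from the quoted essay):
n_tests = 1000
prior = 0.10   # 10% of drugs genuinely work
alpha = 0.05   # significance threshold
power = 0.80   # probability of detecting a real effect

real_effects = n_tests * prior        # 100 drugs with a real effect
true_nulls = n_tests * (1 - prior)    # 900 drugs with no effect

true_positives = real_effects * power   # 80 correct 'discoveries'
false_positives = true_nulls * alpha    # 45 spurious 'discoveries'

# Proportion of declared discoveries that are wrong
false_discovery_rate = false_positives / (true_positives + false_positives)
print(round(false_discovery_rate, 2))  # 0.36
```

Even this more forgiving version of the calculation puts the error rate at seven times the 5% that is “commonly believed”.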

Perhaps I am being unfair to Colquhoun. His essay is a well-written argument about how the p-value fails on its own terms. Nevertheless, in 1,000 clinical drug trials, as the quote above considers, why would we expect any of the drugs not to have an effect? If a drug has reached the stage of clinical trials, then it has been designed on the basis of biological and pharmacological theory and has likely been tested in animals and in early phase clinical trials. All of this suggests that the drug has some kind of physiological mechanism and should demonstrate an effect. Even placebos exhibit a physiological response – otherwise, why would we need placebo controls in trials?

The evidence suggests that a placebo, on average, reduces the relative risk of a given outcome by around 5%. One would therefore need an exceptionally large sample size to find a statistically significant effect of a placebo versus no treatment with 80% power. If we were examining in-hospital mortality as our outcome, with a baseline risk of, say, 3%, then we would need approximately 210,000 patients at a 5% significance level. But let’s say we did do this trial and found p<0.05. Given this ‘finding’, should hospitals provide sugar pills at the door? Even if patients are aware that a drug is a placebo, it can still have an effect. The cost of producing and distributing sugar pills is likely to be relatively tiny. In a hospital with 50,000 annual admissions, our sugar pill could avert 75 deaths per year, so even if it cost £1,000,000 per year to provide, it would result in a cost per death averted of approximately £13,000 – highly cost-effective in all likelihood.
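The back-of-the-envelope arithmetic is simple enough to spell out (all figures are the hypothetical ones assumed in the paragraph above):

```python
# Hypothetical cost-effectiveness of the hospital sugar pill,
# using the figures assumed in the text above.
admissions = 50_000             # annual hospital admissions
baseline_mortality = 0.03       # 3% in-hospital mortality
relative_risk_reduction = 0.05  # assumed average placebo effect

deaths_at_baseline = admissions * baseline_mortality           # 1,500 per year
deaths_averted = deaths_at_baseline * relative_risk_reduction  # 75 per year

annual_cost = 1_000_000  # hypothetical provision cost (GBP)
cost_per_death_averted = annual_cost / deaths_averted
print(round(cost_per_death_averted))  # ~13,333
```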

I think most people would unequivocally reject provision of sugar pills in hospitals for the same reason many people reject other sham treatments like homeopathy. It is potentially cost-effective, but it is perhaps unethical or inequitable for a health service to invest in treatments whose effects are not medically founded. But as we delve deeper into the question of the sugar pill, it is clear that at no point does the p-value matter to our reasoning. Whether or not we conducted our placebo mega-trial, it is the magnitude of the effect, uncertainty, prior evidence and understanding of the treatment, cost-effectiveness, and ethics and equity that we must consider.

Colquhoun acknowledges that the scientific process relies on inductive enquiry and that p-values cannot satisfy this. Even setting aside the problems of false positives and false negatives that he discusses, p-values have no useful role to play in making the substantive decisions that a healthcare service has to take. Nevertheless, health technology appraisal bodies, such as NICE in England and Wales, often use p<0.05 as a heuristic to filter out ‘ineffective’ treatments. But, as the literature warning against the use and misuse of null hypothesis significance testing grows, the tide may yet turn.


Chris Sampson’s journal round-up for 17th October 2016

Every Monday our authors provide a round-up of some of the most recently published peer reviewed articles from the field. We don’t cover everything, or even what’s most important – just a few papers that have interested the author. Visit our Resources page for links to more journals or follow the HealthEconBot. If you’d like to write one of our weekly journal round-ups, get in touch.

Estimating health-state utility for economic models in clinical studies: an ISPOR Good Research Practices Task Force report. Value in Health [PubMed] Published 3rd October 2016

When it comes to model-based cost-per-QALY analyses, researchers normally just use utility values from a single clinical study. So we best be sure that these studies are collecting the right data. This ISPOR Task Force report presents guidelines for the collection and reporting of utility values in the context of clinical studies, with a view to making them as useful as possible to the modelling process. The recommendations are quite general and would apply to most aspects of clinical studies: do some early planning; make sure the values are relevant to the population being modelled; bear HTA agencies’ expectations in mind. It bothers me though that the basis for the recommendations is not very concrete (the word “may” appears more than 100 times). The audience for this report isn’t so much people building models, or people conducting clinical trials. Rather, it’s people who are conducting some modelling within a clinical study (or vice versa). I’m in that position, so why don’t the guidelines strike me as useful? They expect a lot of time to be dedicated to the development of the model structure and aims before the clinical study gets underway. So modelling work would be conducted alongside the full duration of a clinical study. In my experience, that isn’t how things usually work. And even if that does happen, practical limitations to data collection will thwart the satisfaction of the vast majority of the recommendations. In short, I think the Task Force’s position puts the cart on top of the horse. Models require data and, yes, models can be used to inform data collection. But seldom can proposed modelling work be the principal basis for determining data collection in a clinical study. I think that may be a good thing and that a more incremental approach (review – model – collect data – repeat) is more fruitful. Having said all that, and having read the paper, I do think it’s useful. 
It isn’t useful as a set of recommendations that we might expect from an ISPOR Task Force, but rather as a list of things to think about if you’re somebody involved in the collection of health state utility data. If you’re one of those people then it’s well worth a read.

Reliability, validity, and feasibility of direct elicitation of children’s preferences for health states: a systematic review. Medical Decision Making [PubMed] Published 30th September 2016

Set aside for the moment the question of whose preferences we ought to use in valuing health improvements. There are undoubtedly situations in which it would be interesting and useful to know patients’ preferences. What if those patients are children? This study presents the findings from a systematic review of attempts at direct elicitation of preferences from children, focusing on psychometric properties and with the hope of identifying the best approach. To be included in the review, studies needed to report validity, reliability and/or feasibility. 26 studies were included, with most of them using time trade-off (n=14) or standard gamble (n=11). 7 studies reported validity and the findings suggested good construct validity with condition-specific but not generic measures. 4 studies reported reliability and TTO came off better than visual analogue scales. 9 studies reported on feasibility in terms of completion rates and generally found it to be high. The authors also extracted information about the use of preference elicitation in different age groups and found that studies making such comparisons suggested that it may not be appropriate for younger children. Generally speaking, it seems that standard gamble and time trade-off are acceptably valid, reliable and feasible. It’s important to note that there was a lot of potential for bias in the included studies, and that a number of them seemed somewhat lacking in their reporting. And there’s a definite risk of publication and reporting bias lurking here. I think a key issue that the study can’t really enlighten us on is the question of age. There might not be all that much difference between a 17-year-old and a 27-year-old, but there’s going to be a big difference between a 17-year-old and a 7-year-old. Future research needs to investigate the notion of an age threshold for valid preference elicitation.
I’d like to see a more thorough quantitative analysis of findings from direct preference elicitation studies in children. But what we really need is a big new study in which children (both patients and general public) are asked to complete various direct preference elicitation tasks at multiple time points. Because right now, there just isn’t enough evidence.

Economic evaluation of integrated new technologies for health and social care: suggestions for policy makers, users and evaluators. Social Science & Medicine [PubMed] Published 24th September 2016

There are many debates that take place at the nexus of health care and social care, whether they be about funding, costs or outcome measurement. This study focusses on a specific example of health and social care integration – assisted living technologies (ALTs) – and tries to come up with a new and more appropriate method of economic evaluation. In this context, outcomes might matter ‘beyond health’. I should like this paper. It tries to propose an approach that might satisfy the suggestions I made in a recent essay. Why, then, am I not convinced? The authors outline their proposal as consisting of 3 steps: i) identify attributes relevant to the intervention, ii) value these in monetary terms and iii) value the health benefit. In essence, the plan is to estimate QALYs for the health bit and then a monetary valuation for the other bits, with the ‘other bits’ specified in advance of the evaluation. That’s very easily said and not at all easily done. And the paper makes no argument that this is actually what we ought to be doing. Capabilities work their way in as attributes, but little consideration is given to the normative differences between this and other approaches (what I have termed ‘consequents’). The focus on ALTs is odd. The authors fill a lot of space arguing (unconvincingly) that it is a special case, before stating that their approach should be generalisable. The main problem can be summarised by a sentence that appears in the introduction: “the approach is highly flexible because the use of a consistent numeraire (either monetary or health) means that programmes can be compared even if the underlying attributes differ”. Maybe they can, but they shouldn’t. Or at least that’s what a lot of people think, which is precisely why we use QALYs. An ‘anything goes’ approach means that any intervention could easily be demonstrated to be more cost-effective than another if we just pick the right attributes.
I’m glad to see researchers trying to tackle these problems, and this could be the start of something important, but I was disappointed that this paper couldn’t offer anything concrete.


Sam Watson’s journal round-up for 10th October 2016


This week’s journal round-up is a special edition featuring a series of papers on health econometrics published in this month’s issue of the Journal of the Royal Statistical Society: Series A.

Healthcare facility choice and user fee abolition: regression discontinuity in a multinomial choice setting. JRSS: A [RePEc] Published October 2016

Charges for access to healthcare – user fees – present a potential barrier to patients accessing medical services. User fees were touted in the 1980s as a way to provide revenue for healthcare services in low- and middle-income countries, improve quality, and reduce overuse of limited services. However, a growing evidence base suggests that user fees do not achieve these ends and instead reduce uptake of preventative and curative services. This article seeks to provide new evidence on the topic using a regression discontinuity (RD) design, while also exploring the use of RD in a multinomial choice setting. Based on South African data, the discontinuity of interest is that children under the age of six are eligible for free public healthcare whereas older children must pay a fee; user fees for the under-sixes were abolished following the end of apartheid in 1994. The results provide evidence that the removal of user fees led more patients to use public healthcare facilities rather than costly private care or care at home. The authors describe how their non-parametric model performs better, in terms of out-of-sample predictive performance, than the parametric model. When the non-parametric model is applied to examine treatment effects across income quantiles, the treatment effect is found to be concentrated among poorer families, and to be principally due to them switching between home care and public healthcare. This analysis supports an already substantial literature on user fees – one that has previously been criticised for a lack of methodological rigour – so this paper makes a welcome addition.

Do market incentives for hospitals affect health and service utilization?: evidence from prospective pay system–diagnosis-related groups tariffs in Italian regions. JRSS: A [RePEc] Published October 2016

The effect of pro-market reforms in the healthcare sector on hospital quality is a contentious and oft-discussed topic, not least because of the difficulties in measuring quality. We critically discussed a recent, prominent paper that analysed competitive reforms in the English NHS, for example. This article examines the effect of increased competition in Italy on health service utilisation: in the mid-1990s the Italian national health service moved from a system of national tariffs to region-specific tariffs, in order for regions to better incentivise local health objectives and reflect production costs. For example, the tariffs for a vaginal delivery ranged from €697 to €1,750 in 2003. These differences between regions and over time provide the variation needed to analyse the effects of the reforms. The treatment is defined as a binary variable at each time point for whether the region had switched from national to local tariffs, although one might suggest that this disposes of some interesting variation in how the policy was enacted. The headline finding is that the reforms had little or no effect on health but did reduce utilisation of healthcare services, which the authors interpret as a reduction in over-utilisation and hence an improvement in efficiency. However, I am still pondering how this might work: presumably the marginal benefit of treating patients who do not require particular services is reduced, although the marginal cost of treating those patients is likely also to be lower, as they are healthier. The between-region differences in tariffs may well shed some light on this.

Short- and long-run estimates of the local effects of retirement on health. JRSS: A [RePEc] Published October 2016

The proportion of the population that is retired is growing, and governments have responded by increasing the retirement age in order to ensure the financial sustainability of pension schemes. But retirement may have other consequences, not least for health. If retirement worsens health, then delaying the retirement age may improve population health; if retirement is good for you, the opposite may occur. Retirement grants people a new lease of free time, which they may fill with health-promoting activities, or the loss of activity and social relations may adversely affect one’s health and quality of life. In addition, people who are less healthy may be more likely to retire. Taken together, estimating the effects of retirement on health presents an interesting statistical challenge with important implications for policy. This article uses the causal inference method du jour, regression discontinuity design, and the data come from that workhorse of British economic studies, the British Household Panel Survey. The discontinuity is obviously the retirement age; to deal with the potential reverse causality, eligibility for the state pension is used as an instrument. Overall, the results suggest that the short-term impact on health is minimal, although retirement does increase the risk of a person becoming sedentary, which in the long run may precipitate health problems.
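As a toy illustration of this kind of design – not the authors’ actual model or data – a fuzzy regression discontinuity can be sketched as follows: pension eligibility at the cutoff shifts the probability of being retired, and the ratio of the jump in health to the jump in the retirement rate at the cutoff recovers the local effect of retirement on health. All variable names, parameter values and the invented effect of −0.5 below are assumptions for illustration only.

```python
import random

random.seed(7)

# Toy fuzzy regression discontinuity, invented for illustration. Pension
# eligibility at age 65 (the instrument) raises the probability of being
# retired from 20% to 70%; retirement itself shifts a continuous health
# score by a 'true' effect of -0.5.
CUTOFF = 65.0
TRUE_EFFECT = -0.5

def simulate_person():
    age = random.uniform(60, 70)
    eligible = age >= CUTOFF
    retired = 1.0 if random.random() < (0.7 if eligible else 0.2) else 0.0
    # Health declines smoothly with age; retirement adds TRUE_EFFECT
    health = 10 - 0.1 * age + TRUE_EFFECT * retired + random.gauss(0, 0.5)
    return age, retired, health

data = [simulate_person() for _ in range(100_000)]

def fit_at_cutoff(points):
    """OLS of y on age centred at the cutoff; returns the fit at the cutoff."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    return (sy - slope * sx) / n

below = [(a - CUTOFF, r, h) for a, r, h in data if a < CUTOFF]
above = [(a - CUTOFF, r, h) for a, r, h in data if a >= CUTOFF]

# Discontinuities at the cutoff in health and in the retirement rate
jump_health = (fit_at_cutoff([(x, h) for x, _, h in above])
               - fit_at_cutoff([(x, h) for x, _, h in below]))
jump_retired = (fit_at_cutoff([(x, r) for x, r, _ in above])
                - fit_at_cutoff([(x, r) for x, r, _ in below]))

# Wald / fuzzy-RD estimate of the local effect of retirement on health
estimate = jump_health / jump_retired
print(round(estimate, 2))  # close to TRUE_EFFECT
```

Fitting a separate linear trend on each side of the cutoff is what stops the smooth age-related decline in health from being mistaken for an effect of retirement.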


Other articles on health econometrics in this special issue:

The association between asymmetric information, hospital competition and quality of healthcare: evidence from Italy.

This paper finds evidence that increased between-hospital competition does not lead to improved outcomes, as patients were choosing hospitals on the basis of information from their social networks. We featured this paper in a previous round-up.

A quasi-Monte-Carlo comparison of parametric and semiparametric regression methods for heavy-tailed and non-normal data: an application to healthcare costs.

This article considers the problem of modelling non-normally distributed healthcare cost data. Linear models with square-root transformations and generalised linear models with square-root link functions are found to perform best.
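A minimal sketch of the first of these approaches, on synthetic data with invented parameters (not the article’s code or data): if the square root of cost is linear in a covariate, ordinary least squares on the transformed outcome recovers the coefficients, even though the raw costs are heavily right-skewed. Note that retransforming predictions back to the cost scale then needs care (e.g. a smearing-type correction), since the mean of the square is not the square of the mean.

```python
import random

random.seed(42)

# Synthetic right-skewed cost data where sqrt(cost) is, by construction,
# linear in a single covariate x. All parameter values are invented.
A, B = 20.0, 3.0  # 'true' intercept and slope on the square-root scale

xs = [random.uniform(0, 10) for _ in range(20_000)]
costs = [(A + B * x + random.gauss(0, 2)) ** 2 for x in xs]  # skewed costs

# OLS of sqrt(cost) on x, in closed form
ys = [c ** 0.5 for c in costs]
n = len(xs)
sx, sy = sum(xs), sum(ys)
sxx = sum(x * x for x in xs)
sxy = sum(x * y for x, y in zip(xs, ys))
b_hat = (n * sxy - sx * sy) / (n * sxx - sx * sx)
a_hat = (sy - b_hat * sx) / n
print(round(a_hat, 1), round(b_hat, 2))  # close to A and B
```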

Phantoms never die: living with unreliable population data.

Not strictly health econometrics – more demographics – this article explores how to make inferences about population mortality rates and trends when population data are unreliable due to fluctuations in birth patterns. For researchers using macro health outcomes data, such corrections may prove useful.