
Placebos for all, or why the p-value should have no place in healthcare decision making

David Colquhoun, professor of pharmacology at UCL, has a new essay over at Aeon opining about the problems with p-values. A short while back, we also discussed p-value problems, and Colquhoun arrives at the same conclusions as we did about the need to abandon ideas of ‘statistical significance’. Despite mentioning Bayes’ theorem, Colquhoun’s essay is firmly based in the frequentist statistical paradigm. He frames his discussion around the frequency of false positive and false negative findings in repeated trials:

An example should make the idea more concrete. Imagine testing 1,000 different drugs, one at a time, to sort out which works and which doesn’t. You’d be lucky if 10 per cent of them were effective, so let’s proceed by assuming a prevalence or prior probability of 10 per cent. Say we observe a ‘just significant’ result, for example, a P = 0.047 in a single test, and declare that this is evidence that we have made a discovery. That claim will be wrong, not in 5 per cent of cases, as is commonly believed, but in 76 per cent of cases. That is disastrously high. Just as in screening tests, the reason for this large number of mistakes is that the number of false positives in the tests where there is no real effect outweighs the number of true positives that arise from the cases in which there is a real effect.
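
Where does a number like 76 per cent come from? A rough way to check it is to simulate the scenario directly. The sketch below is a minimal Monte Carlo version of the argument; the design details (a two-sample t test with 16 patients per group and a true effect of one standard deviation for drugs that work, giving roughly 80% power) are assumptions borrowed from the kind of setup Colquhoun describes in his Royal Society paper, not something stated in the excerpt itself.

```python
# Minimal Monte Carlo sketch of the quoted false-discovery-rate argument.
# Assumed design (not stated in the excerpt): each trial is a two-sample
# t test with 16 patients per group, and an effective drug shifts the mean
# by one standard deviation, giving roughly 80% power at the 5% level.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_trials = 200_000        # simulated drug trials
prior = 0.10              # 10% of drugs genuinely work
n_per_group = 16

works = rng.random(n_trials) < prior
control = rng.normal(0.0, 1.0, size=(n_trials, n_per_group))
treated = rng.normal(np.where(works, 1.0, 0.0)[:, None], 1.0,
                     size=(n_trials, n_per_group))
p = stats.ttest_ind(treated, control, axis=1).pvalue

# Declare a 'discovery' whenever p < 0.05
sig = p < 0.05
print("false discovery rate, p < 0.05:", np.mean(~works[sig]))

# Condition instead on a 'just significant' result near p = 0.047
just = (p > 0.045) & (p < 0.05)
print("false discovery rate, p near 0.047:", np.mean(~works[just]))
```

With these assumptions the first estimate comes out at around a third, and the second, which conditions on a result only just below the threshold, is in the region of the 76 per cent quoted above.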

The argument is focused on the idea that the aim of medical research is to determine whether an effect exists or not, and that this determination is made through repeated testing. This idea, embedded in null hypothesis significance testing, supports the notion of the researcher as discoverer. By finding p<0.05 the researcher has ‘discovered’ an effect.

Perhaps I am being unfair on Colquhoun. His essay is a well-written argument about how the p-value fails on its own terms. Nevertheless, in 1,000 clinical drug trials, as the quote above considers, why would we expect any of the drugs not to have an effect? If a drug has reached the stage of clinical trials, then it has been designed on the basis of biological and pharmacological theory, and it has likely been tested in animals and in early-phase clinical trials. All of these things would suggest that the drug has some kind of physiological mechanism and should demonstrate an effect. Even placebos exhibit a physiological response; otherwise, why would we need placebo controls in trials?

The evidence suggests that a placebo, on average, reduces the relative risk of a given outcome by around 5%. One would therefore need an exceptionally large sample size to find a statistically significant effect of a placebo versus no treatment with 80% power. If we were examining in-hospital mortality as our outcome, with a baseline risk of, say, 3%, then we would need approximately 210,000 patients at a 5% significance level. But let’s say we did do this trial and found p<0.05. Given this ‘finding’, should hospitals provide sugar pills at the door? Even if patients are aware that a drug is a placebo, it can still have an effect, and the cost of producing and distributing sugar pills is likely to be relatively tiny. In a hospital with 50,000 annual admissions and a 3% baseline risk, we would expect around 1,500 deaths per year, so our sugar pill could avert roughly 75 of them. Even if it cost £1,000,000 per year to provide, that would amount to a cost per death averted of approximately £13,000 – highly cost-effective in all likelihood.
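
As a quick sanity check on those numbers, here is a back-of-the-envelope sketch. The sample-size part uses the standard normal-approximation formula for comparing two proportions, which is only an assumption about how the ~210,000 figure might have been arrived at (different approximations, and per-arm versus total counts, give somewhat different answers); the cost calculation simply reproduces the arithmetic in the paragraph above.

```python
# Back-of-the-envelope check of the figures above. The normal-approximation
# sample-size formula is an assumption, not necessarily how the ~210,000
# figure in the text was derived.
from scipy.stats import norm

baseline_risk = 0.03                       # in-hospital mortality, no treatment
treated_risk = baseline_risk * (1 - 0.05)  # 5% relative risk reduction -> 2.85%

alpha, power = 0.05, 0.80
z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
var_sum = baseline_risk * (1 - baseline_risk) + treated_risk * (1 - treated_risk)
n_per_arm = z**2 * var_sum / (baseline_risk - treated_risk) ** 2
print(f"patients per arm: {n_per_arm:,.0f}")   # on the order of 200,000 per arm

# Cost per death averted in a hospital with 50,000 annual admissions
deaths_per_year = 50_000 * baseline_risk       # about 1,500 deaths per year
deaths_averted = deaths_per_year * 0.05        # about 75 deaths averted
annual_cost = 1_000_000                        # £ per year, as in the example
print(f"cost per death averted: £{annual_cost / deaths_averted:,.0f}")  # ~£13,333
```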

I think most people would unequivocally reject provision of sugar pills in hospitals for the same reason many people reject other sham treatments like homeopathy. It is potentially cost-effective, but it is perhaps unethical or inequitable for a health service to invest in treatments whose effects are not medically founded. But as we delve deeper into the question of the sugar pill, it is clear that at no point does the p-value matter to our reasoning. Whether or not we conducted our placebo mega-trial, it is the magnitude of the effect, uncertainty, prior evidence and understanding of the treatment, cost-effectiveness, and ethics and equity that we must consider.

Colquhoun acknowledges that the scientific process relies on inductive enquiry and that p-values cannot satisfy this. Even if there were not the significant problems of false positives and negatives that he discusses, p-values have no useful role to play in making the substantive decisions the healthcare service has to take. Nevertheless, health technology appraisal bodies, such as NICE in England and Wales, often use p<0.05 as a heuristic to filter out ‘ineffective’ treatments. But, as the literature warning against the use and misuse of null hypothesis significance testing grows, the tide may yet turn.

Credits

By Sam Watson

Health economics, statistics, and health services research at the University of Warwick. Also likes rock climbing and making noise on the guitar.


Comments

eindt
6 years ago

re placebo: placebo interventions may bring about changes in how trial participants report symptoms without bringing about real changes in their health.

The Cochrane placebo review you cited concluded: “There was no evidence that placebo interventions in general have clinically important effects. A possible moderate effect on subjective continuous outcomes, especially pain, could not be clearly distinguished from bias.”

The danger of research findings being distorted by these sorts of biases is another reason for caution.

David Colquhoun
Member
6 years ago

Well, the reason for not going fully Bayesian is the usual one: we don’t know what prior to use, so you run the risk of getting different answers from every Bayesian that you ask. That’s why I restricted myself to finding a minimum value for the false positive rate. Of course, we’d really like to be able to specify the risk of a false positive every time we do a test. Sadly, I don’t think that’s possible in most cases.

David Colquhoun
Reply to  Sam Watson
6 years ago

Well, the “model” used for my simulated t tests is the one that satisfies the assumptions made by the t test. Of course, those assumptions may not be true in particular cases. Presumably that would make the situation still worse than I describe.

David Colquhoun
Member
6 years ago

Thanks for the discussion. The Aeon piece was intended to give the gist of the argument to a wide audience. The details of my arguments are in the paper on which the Aeon piece is based: http://rsos.royalsocietypublishing.org/content/1/3/140216

But I do not agree that “the p-value should have no place in healthcare decision-making”, which is in your title. On the contrary, I say:

“If you do a significance test, just state the p-value and give the effect size and confidence intervals. But be aware that 95% intervals may be misleadingly narrow, and they tell you nothing whatsoever about the false discovery rate. Confidence intervals are just a better way of presenting the same information that you get from a p-value”

I think that the best that can be done is to change the words that are used to describe P values.

“Rather than the present nonsensical descriptions:

P > 0.05 not significant
P < 0.05 significant

we could use descriptions along these lines:

P > 0.05 very weak evidence
P = 0.05 weak evidence: worth another look
P = 0.01 moderate evidence for a real effect
P = 0.001 strong evidence for real effect”

Of course, adopting this would mean missing more real effects. In practice the choice will depend on the relative costs (both in money and in reputation) of wrongly claiming a discovery, and of missing a real effect.

Chris Sampson
Reply to  David Colquhoun
6 years ago

But it does not follow that the p-values, or confidence intervals, or any characterisation of the likelihood of the existence or non-existence of an effect, should influence decision making. Can we have words that describe p-values in terms of their implication for a decision? For example, whether or not to provide A rather than B? What would they be?
P > 0.05 not relevant
P < 0.05 not relevant

Chris Sampson
Admin
6 years ago

I tend to agree. All comes back to the irrelevance of inference, I suppose: https://ideas.repec.org/a/eee/jhecon/v18y1999i3p341-364.html
