David Colquhoun, professor of pharmacology at UCL, has a new essay over at Aeon on the problems with p-values. A short while back, we also discussed p-value problems, and Colquhoun arrives at the same conclusion as we did: that ideas of ‘statistical significance’ should be abandoned. Despite mentioning Bayes’ theorem, Colquhoun’s essay is firmly based in the frequentist statistical paradigm. He frames his discussion around the frequency of false positive and negative findings in repeated trials:
An example should make the idea more concrete. Imagine testing 1,000 different drugs, one at a time, to sort out which works and which doesn’t. You’d be lucky if 10 per cent of them were effective, so let’s proceed by assuming a prevalence or prior probability of 10 per cent. Say we observe a ‘just significant’ result, for example, a P = 0.047 in a single test, and declare that this is evidence that we have made a discovery. That claim will be wrong, not in 5 per cent of cases, as is commonly believed, but in 76 per cent of cases. That is disastrously high. Just as in screening tests, the reason for this large number of mistakes is that the number of false positives in the tests where there is no real effect outweighs the number of true positives that arise from the cases in which there is a real effect.
The argument rests on the idea that the aim of medical research is to determine whether or not an effect exists, and that this determination is made through repeated testing. This idea, instilled by null hypothesis significance testing, supports the notion of the researcher as discoverer: by finding p<0.05, the researcher has ‘discovered’ an effect.
Perhaps I am being unfair to Colquhoun. His essay is a well-written argument about how the p-value fails on its own terms. Nevertheless, in 1,000 clinical drug trials, as the quote above considers, why would we expect any of the drugs to have no effect? If a drug has reached the stage of clinical trials, then it has been designed on the basis of biological and pharmacological theory, and it has likely been tested in animals and in early phase clinical trials. All of these things suggest that the drug has some kind of physiological mechanism and should demonstrate an effect. Even placebos produce a physiological response; otherwise, why would we need placebo controls in trials?
The evidence suggests that a placebo, on average, reduces the risk of a given outcome by around 5% in relative terms. One would therefore need an exceptionally large sample size to find a statistically significant effect of a placebo versus no treatment with 80% power. If we were examining in-hospital mortality as our outcome with a baseline risk of, say, 3%, then we would need approximately 210,000 patients at a 5% significance level. But let’s say we did do this trial and found p<0.05. Given this ‘finding’, should hospitals provide sugar pills at the door? Even if patients are aware that a drug is a placebo, it can still have an effect. The cost of producing and distributing sugar pills is likely to be relatively tiny. In a hospital with 50,000 annual admissions, a 3% baseline risk implies 1,500 deaths per year, so a 5% relative reduction would avert 75 of them. Even if it cost £1,000,000 per year to provide our sugar pill, the cost per death averted would be approximately £13,000, which is highly cost-effective in all likelihood.
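The arithmetic in this example can be checked with a rough sketch. It assumes a two-arm trial with equal allocation, a two-sided test, and the standard normal-approximation formula for comparing two proportions; none of these choices is stated above, and the function names are purely illustrative. Under those assumptions the required sample comes out at roughly 200,000 patients per arm, the same order of magnitude as the figure quoted; the exact number depends on the formula and corrections used.

```python
from math import ceil
from statistics import NormalDist


def n_per_arm(p_control, relative_reduction, alpha=0.05, power=0.80):
    """Approximate patients per arm needed to detect a relative risk
    reduction (assumed: two-sided test, unpooled normal approximation)."""
    p_treated = p_control * (1 - relative_reduction)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value of the test
    z_beta = NormalDist().inv_cdf(power)           # quantile for the target power
    variance = p_control * (1 - p_control) + p_treated * (1 - p_treated)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p_control - p_treated) ** 2)


def cost_per_death_averted(admissions, baseline_risk, relative_reduction, annual_cost):
    """Return (deaths averted per year, cost per death averted)."""
    deaths_averted = admissions * baseline_risk * relative_reduction
    return deaths_averted, annual_cost / deaths_averted


# Figures from the text: 3% baseline mortality, 5% relative reduction,
# 50,000 admissions a year, and a £1,000,000 annual programme cost.
n = n_per_arm(0.03, 0.05)
averted, cost = cost_per_death_averted(50_000, 0.03, 0.05, 1_000_000)
print(f"Patients per arm: {n:,}")
print(f"Deaths averted per year: {averted:.0f}")
print(f"Cost per death averted: £{cost:,.0f}")
```

Note that the p-value enters only the first function. The cost-effectiveness calculation depends on the size of the effect and the cost, not on whether a significance threshold was crossed.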
I think most people would unequivocally reject the provision of sugar pills in hospitals, for the same reason many people reject other sham treatments like homeopathy. The sugar pill is potentially cost-effective, but it is perhaps unethical or inequitable for a health service to invest in treatments whose effects are not medically founded. But as we delve deeper into the question of the sugar pill, it is clear that at no point does the p-value matter to our reasoning. Whether or not we conducted our placebo mega-trial, it is the magnitude of the effect, uncertainty, prior evidence and understanding of the treatment, cost-effectiveness, and ethics and equity that we must consider.
Colquhoun acknowledges that the scientific process relies on inductive enquiry and that p-values cannot meet this need. Even without the serious problems of false positives and negatives that he discusses, p-values have no useful role to play in the substantive decisions a health service has to make. Nevertheless, health technology appraisal bodies, such as NICE in England and Wales, often use p<0.05 as a heuristic to filter out ‘ineffective’ treatments. But as the literature warning against the use and misuse of null hypothesis significance testing grows, the tide may yet turn.