David Colquhoun, professor of pharmacology at UCL, has a new essay over at Aeon opining about the problems with p-values. A short while back, we also discussed p-value problems, and Colquhoun arrives at the same conclusion as we did about the need to abandon ideas of ‘statistical significance’. Despite mentioning Bayes’ theorem, Colquhoun’s essay is firmly based in the frequentist statistical paradigm. He frames his discussion around the frequency of false positive and negative findings in repeated trials:
An example should make the idea more concrete. Imagine testing 1,000 different drugs, one at a time, to sort out which works and which doesn’t. You’d be lucky if 10 per cent of them were effective, so let’s proceed by assuming a prevalence or prior probability of 10 per cent. Say we observe a ‘just significant’ result, for example, a P = 0.047 in a single test, and declare that this is evidence that we have made a discovery. That claim will be wrong, not in 5 per cent of cases, as is commonly believed, but in 76 per cent of cases. That is disastrously high. Just as in screening tests, the reason for this large number of mistakes is that the number of false positives in the tests where there is no real effect outweighs the number of true positives that arise from the cases in which there is a real effect.
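To see where a number like that comes from, here is a minimal sketch of the screening-test arithmetic (the 80% power is my assumption; it is not stated in the quote). Counting every result below p < 0.05 gives a false discovery rate of roughly 36%; Colquhoun’s 76% figure is higher because it conditions on results that are only ‘just significant’, around P = 0.047, rather than on everything below the threshold.

```python
# Screening-test arithmetic for the false discovery rate (FDR).
# Assumption on my part: 80% power. Colquhoun's 76% figure additionally
# conditions on p-values only just below 0.05, which pushes it higher.

n_trials = 1000        # drugs tested
prevalence = 0.10      # proportion with a real effect
alpha = 0.05           # significance threshold
power = 0.80           # probability of detecting a real effect

real_effects = n_trials * prevalence     # 100 drugs with a real effect
no_effect = n_trials - real_effects      # 900 drugs with no effect

true_positives = power * real_effects    # 80 correctly declared 'significant'
false_positives = alpha * no_effect      # 45 declared 'significant' by chance

fdr = false_positives / (false_positives + true_positives)
print(f"False discovery rate among all p < {alpha} results: {fdr:.0%}")  # 36%
```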
The argument is focused on the idea that the aim of medical research is to determine whether an effect exists or not, and that this determination is made through repeated testing. This idea, embodied in null hypothesis significance testing, supports the notion of the researcher as discoverer. By finding p<0.05, the researcher has ‘discovered’ an effect.
Perhaps I am being unfair on Colquhoun. His essay is a well-written argument about how the p-value fails on its own terms. Nevertheless, in 1,000 clinical drug trials, as the quote above considers, why would we expect any of the drugs not to have an effect? If a drug has reached the stage of clinical trials, then it has been designed on the basis of biological and pharmacological theory, and it has likely been tested in animals and in early-phase clinical trials. All of this suggests that the drug has some kind of physiological mechanism and should demonstrate an effect. Even placebos exhibit a physiological response; otherwise, why would we need placebo controls in trials?
The evidence suggests that a placebo, on average, reduces the relative risk of a given outcome by around 5%. One would therefore need an exceptionally large sample size to find a statistically significant effect of a placebo versus no treatment with 80% power. If we were examining in-hospital mortality as our outcome, with a baseline risk of, say, 3%, then we would need approximately 210,000 patients at a 5% significance level. But let’s say we did run this trial and found p<0.05. Given this ‘finding’, should hospitals provide sugar pills at the door? Even if patients are aware that a drug is a placebo, it can still have an effect. The cost of producing and distributing sugar pills is likely to be relatively tiny. In a hospital with 50,000 annual admissions, 75 deaths per year could be averted by our sugar pill, so even if it cost £1,000,000 per year to provide, the cost per death averted would be approximately £13,000 – highly cost-effective in all likelihood.
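As a rough check on this arithmetic, here is a sketch of a standard two-proportion sample size formula alongside the cost-per-death-averted sums. It assumes a two-sided 5% significance level and 80% power; the exact assumptions behind the ~210,000 figure above are not spelled out, so treat the sample size as an order-of-magnitude illustration rather than a reproduction.

```python
from statistics import NormalDist

# Assumed inputs: 3% baseline in-hospital mortality, a 5% relative risk
# reduction from the placebo, a two-sided 5% significance level, 80% power.
p_control = 0.03
p_placebo = p_control * 0.95            # 5% relative risk reduction
alpha, power = 0.05, 0.80

z_a = NormalDist().inv_cdf(1 - alpha / 2)
z_b = NormalDist().inv_cdf(power)
variance = p_control * (1 - p_control) + p_placebo * (1 - p_placebo)
n_per_arm = (z_a + z_b) ** 2 * variance / (p_control - p_placebo) ** 2
print(f"Patients per arm: {n_per_arm:,.0f}")          # a six-figure number

# Cost per death averted in a hospital with 50,000 annual admissions
admissions = 50_000
deaths_averted = admissions * p_control * 0.05        # = 75 per year
annual_cost = 1_000_000                               # assumed cost in £
print(f"Deaths averted per year: {deaths_averted:.0f}")
print(f"Cost per death averted: £{annual_cost / deaths_averted:,.0f}")  # ~£13,333
```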
I think most people would unequivocally reject provision of sugar pills in hospitals for the same reason many people reject other sham treatments like homeopathy. It is potentially cost-effective, but it is perhaps unethical or inequitable for a health service to invest in treatments whose effects are not medically founded. But as we delve deeper into the question of the sugar pill, it is clear that at no point does the p-value matter to our reasoning. Whether or not we conducted our placebo mega-trial, it is the magnitude of the effect, uncertainty, prior evidence and understanding of the treatment, cost-effectiveness, and ethics and equity that we must consider.
Colquhoun acknowledges that the scientific process relies on inductive enquiry and that p-values cannot satisfy this. Even if there were not the significant problems of false positives and negatives that he discusses, p-values have no useful role to play in making the substantive decisions the healthcare service has to take. Nevertheless, health technology appraisal bodies, such as NICE in England and Wales, often use p<0.05 as a heuristic to filter out ‘ineffective’ treatments. But, as the literature warning against the use and misuse of null hypothesis significance testing grows, the tide may yet turn.
Re placebo: placebo interventions may bring about changes in how trial participants report symptoms without bringing about real changes in their health.
The Cochrane placebo review you cited concluded: “There was no evidence that placebo interventions in general have clinically important effects. A possible moderate effect on subjective continuous outcomes, especially pain, could not be clearly distinguished from bias.”
The danger of research findings being distorted by these sorts of biases is another reason for caution.
I may have to defer to your better knowledge of placebos here, but I was under the impression that the placebo effect refers to physiological changes brought about by a placebo (not directly, of course). The Cochrane review concludes that the placebo doesn’t have “clinically important effects”. But my point was that 5%, when applied across an entire patient population, is a potentially meaningful and cost-effective intervention. Indeed, many interventions at the hospital level (e.g. changes in doctor-to-patient ratios, electronic medical records) are probably going to exhibit effects of this magnitude or smaller.
Well, the reason for not going fully Bayesian is the usual one. We don’t know what prior to use, so you run the risk of getting different answers from every Bayesian that you ask. That’s why I restricted myself to finding a minimum value for the false positive rate. Of course we’d really like to be able to specify the risk of a false positive every time we do a test. Sadly, I don’t think that’s possible in most cases.
This objection has a long history and I think has been widely addressed elsewhere. My impression is that you might adhere to Fisher’s principle of using prior information only in study design and not in analysis. But I think there is perhaps an ethical question there about *not* using prior information when it can improve the efficiency of your estimates. This is especially true when, as you illustrate, conclusions based on p-values and CIs run a high risk of being in error.
Examining the sensitivity of results to the choice of prior and model is certainly required. I could equally object that you could get a different answer from every frequentist you ask, as we don’t know what model to use, which data to include and exclude, which assumptions to use, etc. A good analysis addresses these questions of sensitivity in whichever paradigm. Further, I think in many cases there would be a consensus about various choices of weakly informative priors. For example, in a meta-analysis of the effects of a drug targeting cardiovascular disease mortality, we can probably all agree that it is very unlikely the drug will treble or more the risk of mortality or equally halve it or more, which might suggest a N(0, 0.5) prior for the log relative risk or whatever.
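As a concrete illustration of that last point (assuming the 0.5 is a standard deviation on the log relative risk scale, which is my reading rather than anything stated explicitly), the prior probabilities implied for a trebling or halving of risk can be checked directly:

```python
from math import log
from statistics import NormalDist

# Weakly informative prior on the log relative risk: N(0, 0.5^2).
# Assumption: the 0.5 is a standard deviation, not a variance.
prior = NormalDist(mu=0.0, sigma=0.5)

p_treble_or_more = 1 - prior.cdf(log(3))   # P(relative risk >= 3)
p_halve_or_more = prior.cdf(log(0.5))      # P(relative risk <= 0.5)

print(f"Prior P(RR >= 3):   {p_treble_or_more:.3f}")   # about 0.014
print(f"Prior P(RR <= 0.5): {p_halve_or_more:.3f}")    # about 0.083
```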
Well, the “model” used for my simulated t tests is that which satisfies the assumptions made by the t test. Of course those assumptions may not be true in particular cases. Presumably that would make the situation still worse than I describe.
Thanks for the discussion. The Aeon piece was intended to give the gist of the argument to a wide audience. The details of my arguments are in the paper on which the Aeon piece is based: http://rsos.royalsocietypublishing.org/content/1/3/140216
But I do not agree that “the p-value should have no place in healthcare decision-making”, which is in your title. On the contrary, I say:
“If you do a significance test, just state the p-value and give the effect size and confidence intervals. But be aware that 95% intervals may be misleadingly narrow, and they tell you nothing whatsoever about the false discovery rate. Confidence intervals are just a better way of presenting the same information that you get from a p-value”
I think that the best that can be done is to change the words that are used to describe P values.
“Rather than the present nonsensical descriptions:
P > 0.05 not significant
P < 0.05 significant
it would be better to use descriptions along these lines:
P > 0.05 very weak evidence
P = 0.05 weak evidence: worth another look
P = 0.01 moderate evidence for a real effect
P = 0.001 strong evidence for real effect”
Of course adopting this would mean missing more real effects. In practice the choice will depend on the relative costs (both in money and in reputation) of wrongly claiming a discovery, and of missing a real effect.
But it does not follow that p-values, or confidence intervals, or any characterisation of the likelihood of the existence or non-existence of an effect, should influence decision making. Can we have words that describe p-values in terms of their implications for a decision? For example, whether or not to provide A rather than B? What would they be?
P > 0.05 not relevant
P < 0.05 not relevant
Perhaps my conclusion requires arguments that were left unsaid. I think the trouble is in approaching this as a problem of testing to see whether there are real effects or not. For a decision to be made we need to know the probability that the effect size takes a certain value; confidence intervals and p-values cannot tell us this. There’s no probability that the ‘true effect’ is in a confidence interval: it’s either in there or it’s not. We could say that, across 10,000 repeated trials, a certain proportion of the resulting intervals would contain the ‘true effect’, but that’s of no use to decision making. Even then, the interpretation of the confidence interval is with respect to the model and the choices made during the analysis. And then we would need to take into account what we know from outside of our trial. We might downgrade our assessment of ‘strong evidence’ if the trial produced a result contrary to all previous knowledge.
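To make the repeated-sampling point concrete, here is a small simulation (purely illustrative, with an arbitrary true effect and sample size): about 95% of the intervals from 10,000 simulated trials cover the true effect, but that long-run property belongs to the procedure, not to the single interval a decision maker actually has in hand.

```python
import random
from statistics import NormalDist

# Illustrative simulation: 10,000 trials estimating a true mean effect of
# 0.5, each reporting a 95% confidence interval. The true effect, standard
# deviation and sample size are arbitrary choices for the example.
random.seed(1)
true_effect, sd, n = 0.5, 1.0, 100
z = NormalDist().inv_cdf(0.975)
se = sd / n ** 0.5

covered = 0
for _ in range(10_000):
    estimate = random.gauss(true_effect, se)   # sampling distribution of the mean
    lower, upper = estimate - z * se, estimate + z * se
    covered += lower <= true_effect <= upper

print(f"Intervals containing the true effect: {covered / 10_000:.1%}")  # ~95%
# The 95% is a long-run property of the interval-generating procedure; it is
# not the probability that any single, already-computed interval contains
# the true effect.
```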
The 5% relative risk reduction I reported for the placebo comes from a Cochrane review, and its confidence interval contained zero. But I don’t think anyone would say that we have little evidence that a placebo effect exists.
I would not argue that p-values or confidence intervals should not be reported. I would be foolish to suggest only point estimates be reported. Perhaps the trouble is with the frequentist interpretation overall. If the only way we can make sense of results and make decisions is by synthesizing the results with our current knowledge, why not just go Bayesian?
I tend to agree. It all comes back to the irrelevance of inference, I suppose: https://ideas.repec.org/a/eee/jhecon/v18y1999i3p341-364.html
Although I would say that risk or inequality attitudes matter, which ENB (expected net benefit) assumes are either neutral or not required, depending on how you look at it. In that case, variance and uncertainty do matter. I just don’t think p-values or CIs provide the right measure of uncertainty.