P-values do not indicate whether a scientific finding is true. Statistical significance does not equal economic or clinical significance. And p-values are often presented for tests that have no bearing on the questions being posed. So what’s the point?

Empirical economics papers often present parameter estimates alongside a number of asterisks (e.g. -0.01***). The asterisks indicate the p-value (more asterisks mean a smaller p-value) and hence statistical significance, but they are frequently interpreted as indicating a ‘finding’ or the importance of the result. In many fields a ‘negative finding’, i.e. a result that is not statistically significant, will struggle to get published in a journal, leading to a problem of publication bias. Some journals have indicated their willingness to publish negative findings; nevertheless, careers are still built on statistically significant findings even if many of these findings are not reproducible. All of this persists despite a decades-long literature decrying the misuse and misinterpretation of the p-value. Indeed, the situation recently prompted the American Statistical Association (ASA) to issue a position statement on p-values. So is there any point in publishing p-values at all? Let’s first consider what they’re not useful for, echoing a number of points in the ASA’s statement.

**P-values do not indicate a ‘true’ result**

The p-value does not equal the probability that a null hypothesis is true, yet this remains a common misconception. This point was the subject of the widely cited paper by John Ioannidis, ‘Why most published research findings are false.’ While in general there is a positive correlation between the p-value and the probability of the null hypothesis given the data, it is certainly not enough to make any claims regarding the ‘truth’ of a finding. Indeed, rejection of the null hypothesis doesn’t even mean the alternative is necessarily preferred; many other hypotheses may better explain the data. Furthermore, under certain prior distributions for the hypotheses being tested, the same data that leads to rejection of the null hypothesis in a frequentist framework can give a high posterior probability in favour of the null hypothesis. This is Lindley’s paradox.
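To make Lindley’s paradox concrete, here is a minimal sketch in Python. The counts are illustrative and not taken from any study discussed here, and the set-up is an assumption: a point null of p = 0.5 for a binomial proportion, a uniform prior on p under the alternative, and equal prior weight on the two hypotheses. The function names are hypothetical.

```python
import math

def log_binom_pmf(k, n, p):
    """Log of the binomial probability of k successes in n trials."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + k * math.log(p) + (n - k) * math.log(1 - p)

def lindley_example(k, n, prior_null=0.5):
    """Return (two-sided p-value, posterior probability of the null)."""
    # Frequentist side: normal approximation to the binomial test of p = 0.5.
    z = (k - n * 0.5) / math.sqrt(n * 0.25)
    p_value = math.erfc(abs(z) / math.sqrt(2))

    # Bayesian side: likelihood of the data under the point null p = 0.5 ...
    like_null = math.exp(log_binom_pmf(k, n, 0.5))
    # ... and under the alternative with a uniform prior on p, for which the
    # marginal likelihood of any k is exactly 1 / (n + 1).
    like_alt = 1.0 / (n + 1)

    post_null = (prior_null * like_null) / (
        prior_null * like_null + (1 - prior_null) * like_alt
    )
    return p_value, post_null

# Illustrative counts only (assumed, not from the article).
p_value, post_null = lindley_example(k=49_581, n=98_451)
print(f"two-sided p-value: {p_value:.4f}")
print(f"posterior P(null | data): {post_null:.3f}")
```

With these inputs the p-value falls below 0.05 while the posterior probability of the null stays above 0.9. The result depends heavily on the diffuse prior chosen for the alternative, which is the whole point of the paradox.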

**Statistical significance does not equal economic or clinical significance**

This point is widely discussed and yet remains a problem in many areas. In their damning diatribe on statistical significance, Stephen Ziliak and Deirdre McCloskey surveyed all the articles published in the American Economic Review in the 1990s and found that 80% conflated statistical significance with economic significance. A trivial difference can be made statistically significant with a large enough sample size. For huge sample sizes, such as 15 million admissions in the Hospital Episode Statistics, the p-value is almost meaningless. The exception, of course, is if the null hypothesis is true, but the null hypothesis is almost never exactly true.
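A rough sketch of how sample size alone drives the p-value: the effect below is held fixed at a hypothetical one hundredth of a standard deviation, and only the number of observations changes. It assumes a simple two-sample z-test with a known common standard deviation of 1; the function name and the sample sizes are illustrative, with the largest chosen to give 15 million observations in total.

```python
import math

def two_sample_p_value(diff_in_sd, n_per_group):
    """Two-sided p-value for a difference in means between two equal groups.

    Assumes a known common standard deviation of 1, so the standard error
    of the difference is sqrt(2 / n_per_group).
    """
    z = diff_in_sd / math.sqrt(2.0 / n_per_group)
    return math.erfc(abs(z) / math.sqrt(2))

tiny_effect = 0.01  # hypothetical: one hundredth of a standard deviation

# Same effect size, three sample sizes (the last is 15 million in total).
results = {n: two_sample_p_value(tiny_effect, n)
           for n in (1_000, 100_000, 7_500_000)}

for n, p in results.items():
    print(f"n per group = {n:>9,}: p = {p:.3g}")
```

The effect is equally trivial in every row; what changes is only how many asterisks it would earn.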

**P-values are often testing pointless hypotheses**

As Cohen (1990) describes it:

> A little thought reveals a fact widely understood among statisticians: The null hypothesis, taken literally (and that’s the only way you can take it in formal hypothesis testing), is always false in the real world… If it is false, even to a tiny degree, it must be the case that a large enough sample will produce a significant result and lead to its rejection. So if the null hypothesis is always false, what’s the big deal about rejecting it?

Often p-values are presented for the hypothesis test that the coefficient of interest is exactly zero. But this test often has absolutely no bearing on the research question. If we are trying to estimate the price elasticity of demand for a product, why provide the results of a test of how compatible the data is with a model in which the elasticity is exactly zero?
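As a sketch of the alternative, the snippet below reports an interval for the elasticity and, if a test is wanted at all, tests against a value that bears on the question (here unit elasticity, -1) rather than zero. The estimate and standard error are made-up numbers, the normal (Wald) approximation is an assumption, and `wald_summary` is a hypothetical helper.

```python
import math

def wald_summary(estimate, std_error, null_value=0.0):
    """Return (95% confidence interval, two-sided p-value against null_value).

    Uses the normal approximation, which is an assumption of this sketch.
    """
    half_width = 1.96 * std_error
    interval = (estimate - half_width, estimate + half_width)
    z = (estimate - null_value) / std_error
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return interval, p_value

# Hypothetical estimate of a price elasticity of demand and its standard error.
elasticity, se = -0.30, 0.05

# The conventional test: is the elasticity exactly zero?
interval, p_vs_zero = wald_summary(elasticity, se, null_value=0.0)
# A test that relates to a substantive question: is demand unit elastic?
_, p_vs_unit = wald_summary(elasticity, se, null_value=-1.0)

print(f"95% interval: ({interval[0]:.3f}, {interval[1]:.3f})")
print(f"p-value against 0:  {p_vs_zero:.2g}")
print(f"p-value against -1: {p_vs_unit:.2g}")
```

The interval carries the information a reader needs about magnitude; the p-value against zero adds little once it is reported.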

**A p-value does not lend support to a particular hypothesis in isolation**

No researcher should really be surprised by the results of their study. We decide to conduct a study on the basis of past evidence, theory, and other sources of knowledge which typically give us an indication of what to expect. If the result goes against all prior expectations, then its statistical significance does not provide any good reason to discount prior knowledge and theory. Indeed, the Duhem-Quine thesis states that it is impossible to test a hypothesis in isolation. Any study requires a large number of assumptions and auxiliary hypotheses. Even in a laboratory setting we are assuming the equipment works correctly and is measuring what we think it is measuring. No result can be interpreted in isolation from the context in which the study was conducted.

–

Some authors have suggested abandoning the p-value altogether, and indeed some journals do not permit p-values at all. But this is too strong a position. The p-value does tell us *something*. It tells us whether the data we’ve observed is compatible with a particular model. But it is just one piece of information among many that lead to decent scientific inferences. Also required are the robustness of the model and how it stands up to changes in background assumptions, the prior knowledge that went into building it, and the economic or clinical interpretation of the results. The American Economic Review does not publish asterisks alongside empirical estimates; other journals should follow suit. While I don’t think p-values should be abandoned, the phrase ‘statistical significance’ can probably be consigned to the dustbin.

Image credit: Repapetilto (CC BY-SA 3.0)
