P-values do not indicate whether a scientific finding is true. Statistical significance does not equal economic or clinical significance. And p-values are often presented for tests that have no bearing on the questions being posed. So what’s the point?
Empirical economics papers often present parameter estimates alongside a number of asterisks (e.g. -0.01***). The asterisks indicate the p-value (more asterisks mean a smaller p-value) and hence statistical significance, but they are frequently interpreted as indicating a ‘finding’ or the importance of the result. In many fields a ‘negative finding’, i.e. a result that is not statistically significant, will struggle to get published in a journal, leading to a problem of publication bias. Some journals have indicated their willingness to publish negative findings; nevertheless, careers are still built on statistically significant findings even if many of these findings are not reproducible. All of this persists despite a decades-long literature decrying the misuse and misinterpretation of the p-value. Indeed, the situation recently prompted the American Statistical Association (ASA) to issue a position statement on p-values. So is there any point in publishing p-values at all? Let’s first consider what they’re not useful for, echoing a number of points in the ASA’s statement.
P-values do not indicate a ‘true’ result
A common misconception is that the p-value equals the probability that the null hypothesis is true. It does not. This point was the subject of the widely cited paper by John Ioannidis, ‘Why most published research findings are false.’ While in general there is a positive correlation between the p-value and the probability of the null hypothesis given the data, it is certainly not enough to support any claim regarding the ‘truth’ of a finding. Indeed, rejection of the null hypothesis doesn’t even mean the alternative is necessarily preferred; many other hypotheses may better explain the data. Furthermore, under certain prior distributions for the hypotheses being tested, the same data that lead to rejection of the null hypothesis in a frequentist framework can give a high posterior probability in favour of the null hypothesis. This is Lindley’s paradox.
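Lindley’s paradox is easy to reproduce numerically. The sketch below is a toy illustration rather than anything taken from the papers cited: it assumes a normal mean with known standard deviation, a point null at zero, a normal prior on the mean under the alternative, and equal prior odds. The function name `lindley` and the values of `sigma`, `tau` and `prior_h0` are all assumptions chosen for illustration.

```python
import math


def normal_pdf(x, var):
    """Density at x of a normal distribution with mean 0 and variance var."""
    return math.exp(-x * x / (2 * var)) / math.sqrt(2 * math.pi * var)


def lindley(z, n, sigma=1.0, tau=1.0, prior_h0=0.5):
    """Return (two-sided p-value, posterior probability of H0).

    H0: mean == 0. H1: mean ~ Normal(0, tau^2). These priors are
    illustrative assumptions, not a recommendation.
    """
    se2 = sigma ** 2 / n                 # variance of the sample mean
    xbar = z * math.sqrt(se2)            # sample mean implied by the z-statistic
    p_value = math.erfc(abs(z) / math.sqrt(2))
    m0 = normal_pdf(xbar, se2)           # marginal likelihood under H0
    m1 = normal_pdf(xbar, tau ** 2 + se2)  # marginal likelihood under H1
    post_h0 = prior_h0 * m0 / (prior_h0 * m0 + (1 - prior_h0) * m1)
    return p_value, post_h0


# Hold the z-statistic fixed and let the sample size grow.
for n in (10, 1_000, 1_000_000):
    p, post = lindley(z=2.5, n=n)
    print(f"n={n:>9,}  p-value={p:.4f}  P(H0 | data)={post:.3f}")
```

With the z-statistic held at 2.5 the p-value stays at roughly 0.012 whatever the sample size, yet under these assumed priors the posterior probability of the null rises above 0.9 once n reaches a million.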
Statistical significance does not equal economic or clinical significance
This point is widely discussed and yet remains a problem in many areas. In their damning diatribe on statistical significance, Stephen Ziliak and Deirdre McCloskey surveyed all the articles published in the American Economic Review in the 1990s and found that 80% conflated statistical significance with economic significance. A trivial difference can be made statistically significant with a large enough sample size. For huge sample sizes, such as the 15 million admissions in the Hospital Episode Statistics, the p-value is almost meaningless. The exception, of course, is when the null hypothesis is exactly true, but that is almost never the case.
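To see how sample size alone drives significance, consider a minimal sketch using a two-sided z-test. The effect size (one thousandth of a standard deviation) is a made-up value chosen to be trivially small, and `two_sided_p_value` is a hypothetical helper; only the 15 million figure comes from the example above.

```python
import math


def two_sided_p_value(effect, sd, n):
    """Two-sided z-test p-value for an observed mean of `effect`, H0: mean == 0."""
    z = effect / (sd / math.sqrt(n))
    return math.erfc(abs(z) / math.sqrt(2))


effect, sd = 0.001, 1.0  # assumed: a difference of 0.001 standard deviations
for n in (1_000, 1_000_000, 15_000_000):
    print(f"n={n:>10,}  p-value={two_sided_p_value(effect, sd, n):.6f}")
```

The observed difference never changes, but the p-value falls from close to 1 at a thousand observations to well below 0.001 at 15 million.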
P-values are often testing pointless hypotheses
As Cohen (1990) describes it:
A little thought reveals a fact widely understood among statisticians: The null hypothesis, taken literally (and that’s the only way you can take it in formal hypothesis testing), is always false in the real world….If it is false, even to a tiny degree, it must be the case that a large enough sample will produce a significant result and lead to its rejection. So if the null hypothesis is always false, what’s the big deal about rejecting it?
Often p-values are presented for the hypothesis test that the coefficient of interest is exactly zero. But this test often has absolutely no bearing on the research question. If we are trying to estimate the price elasticity of demand for a product, why provide the results of a test of how compatible the data are with a model in which the elasticity is exactly zero?
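As a rough sketch of the alternative, the same Wald-type calculation can be pointed at a null value that actually matters for the question, or replaced with an interval estimate. The elasticity estimate, its standard error and the choice of unit elasticity (-1) as the comparison value are all hypothetical, as are the helper names.

```python
import math


def wald_p_value(estimate, std_error, null_value=0.0):
    """Two-sided p-value for H0: parameter == null_value (normal approximation)."""
    z = (estimate - null_value) / std_error
    return math.erfc(abs(z) / math.sqrt(2))


def confidence_interval(estimate, std_error, z_crit=1.96):
    """Approximate 95% confidence interval under a normal approximation."""
    return estimate - z_crit * std_error, estimate + z_crit * std_error


elasticity, se = -0.8, 0.1  # hypothetical estimate and standard error

print("H0: elasticity = 0  ->", wald_p_value(elasticity, se, 0.0))
print("H0: elasticity = -1 ->", wald_p_value(elasticity, se, -1.0))
print("95% interval        ->", confidence_interval(elasticity, se))
```

With these made-up numbers the test against zero gives a vanishingly small p-value that says nothing of interest, whereas the test against unit elasticity gives a p-value of about 0.046, and the interval shows the range of elasticities compatible with the data.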
A p-value does not lend support to a particular hypothesis in isolation
No researcher should really be surprised by the results of their study. We decide to conduct a study on the basis of past evidence, theory, and other sources of knowledge which typically give us an indication of what to expect. If the result goes against all prior expectations, then its statistical significance does not provide any good reason to discount prior knowledge and theory. Indeed, the Duhem-Quine thesis states that it is impossible to test a hypothesis in isolation. Any study requires a large number of assumptions and auxiliary hypotheses. Even in a laboratory setting we are assuming the equipment works correctly and is measuring what we think it is measuring. No result can be interpreted in isolation from the context in which the study was conducted.
Some authors have suggested abandoning the p-value altogether, and indeed some journals do not permit p-values at all. But this is too strong a position. The p-value does tell us something. It tells us whether the data we’ve observed are compatible with a particular model. But it is just one piece of information among many that lead to decent scientific inferences. Also required are the robustness of the model and how it stands up to changes in background assumptions, the prior knowledge that went into building it, and the economic or clinical interpretation of the results. The American Economic Review does not publish asterisks alongside empirical estimates; other journals should follow suit. While I don’t think p-values should be abandoned, the phrase ‘statistical significance’ can probably be consigned to the dustbin.