Should we just abandon p-values altogether?

P-values do not indicate whether a scientific finding is true. Statistical significance does not equal economic or clinical significance. And p-values are often presented for tests that have no bearing on the questions being posed. So what’s the point?

Empirical economics papers often present parameter estimates alongside a number of asterisks (e.g. -0.01***). The asterisks indicate the p-value (more asterisks mean a smaller p-value) and hence statistical significance, but they are frequently interpreted as indicating a ‘finding’ or the importance of the result. In many fields a ‘negative finding’, i.e. a non-statistically significant result, will struggle to get published in a journal, leading to a problem of publication bias. Some journals have indicated their willingness to publish negative findings; nevertheless, careers are still built on statistically significant findings, even though many of these findings are not reproducible. All of this despite a decades-long literature decrying the misuse and misinterpretation of the p-value. Indeed, the situation recently prompted the American Statistical Association (ASA) to issue a position statement on p-values. So is there any point in publishing p-values at all? Let’s first consider what they’re not useful for, echoing a number of points in the ASA’s statement.

P-values do not indicate a ‘true’ result

The p-value does not equal the probability that the null hypothesis is true, yet this remains a common misconception. This point was the subject of the widely cited paper by John Ioannidis, ‘Why most published research findings are false’. While in general there is a positive correlation between the p-value and the probability of the null hypothesis given the data, it is certainly not enough to make any claims regarding the ‘truth’ of a finding. Indeed, rejection of the null hypothesis doesn’t even mean the alternative is necessarily preferred; many other hypotheses may better explain the data. Furthermore, under certain prior distributions for the hypotheses being tested, the same data that lead to rejection of the null hypothesis in a frequentist framework can give a high posterior probability in favour of the null hypothesis – this is Lindley’s paradox.
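Lindley’s paradox is easy to demonstrate numerically. Here is a minimal sketch, assuming a test on a normal mean with known variance, equal prior odds on the two hypotheses, and a standard normal prior on the mean under the alternative (all of these choices are illustrative, not taken from any particular study):

```python
import numpy as np
from scipy.stats import norm

# Illustrative setup: test H0: theta = 0 against H1: theta != 0 for a
# normal mean with known sigma = 1, using a large sample.
n = 100_000
z = 2.0                      # observed z-statistic
xbar = z / np.sqrt(n)        # sample mean consistent with that z

# Frequentist: two-sided p-value -- 'significant' at the 5% level.
p_value = 2 * norm.sf(abs(z))

# Bayesian: equal prior odds, and theta ~ N(0, 1) under H1 (an assumption).
# Marginal density of xbar under each hypothesis:
f0 = norm.pdf(xbar, loc=0, scale=np.sqrt(1 / n))      # under H0
f1 = norm.pdf(xbar, loc=0, scale=np.sqrt(1 + 1 / n))  # under H1
posterior_h0 = f0 / (f0 + f1)

print(f"p-value = {p_value:.3f}")            # ~0.046: reject H0 at 5%
print(f"P(H0 | data) = {posterior_h0:.3f}")  # ~0.98: data favour H0
```

With these (illustrative) priors, the same data that are ‘significant’ at the 5% level leave the posterior probability of the null hypothesis at around 0.98.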

Statistical significance does not equal economic or clinical significance

This point is widely discussed and yet remains a problem in many areas. In their damning diatribe on statistical significance, Stephen Ziliak and Deirdre McCloskey surveyed all the articles published in the American Economic Review in the 1990s and found that 80% conflated statistical significance with economic significance. A trivial difference can be made statistically significant with a large enough sample size. For huge sample sizes, such as the 15 million admissions in the Hospital Episode Statistics, the p-value is almost meaningless. The exception, of course, is if the null hypothesis is true, but the null hypothesis is hardly ever exactly true.
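To see how sample size swamps effect size, consider a minimal sketch of a two-sample z-test in which the difference in means is a clinically negligible 0.005 standard deviations; the group size is loosely modelled on the scale of HES, and all the numbers are hypothetical:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical example: a clinically trivial difference in means of
# 0.005 standard deviations between two very large groups.
effect = 0.005           # difference in means, in SD units
n_per_group = 7_500_000  # roughly the scale of HES admissions

# Two-sample z-test assuming unit variances in both groups.
se = np.sqrt(2 / n_per_group)
z = effect / se
p_value = 2 * norm.sf(abs(z))
print(f"z = {z:.1f}, p = {p_value:.2e}")  # z ~ 9.7, p ~ 4e-22
```

A difference of half a percent of a standard deviation would rarely matter clinically or economically, yet at this scale it is overwhelmingly ‘significant’.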

P-values are often testing pointless hypotheses

As Cohen (1990) describes it:

A little thought reveals a fact widely understood among statisticians: The null hypothesis, taken literally (and that’s the only way you can take it in formal hypothesis testing), is always false in the real world….If it is false, even to a tiny degree, it must be the case that a large enough sample will produce a significant result and lead to its rejection. So if the null hypothesis is always false, what’s the big deal about rejecting it?

P-values are often presented for the hypothesis test that the coefficient of interest is exactly zero. But this test frequently has no bearing on the research question. If we are trying to estimate the price elasticity of demand for a product, why provide the results of a test that examines the probability that the data were produced by a model in which the elasticity is exactly zero?

A p-value does not lend support to a particular hypothesis in isolation

No researcher should really be surprised by the results of their study. We decide to conduct a study on the basis of past evidence, theory, and other sources of knowledge, which typically give us an indication of what to expect. If a result goes against all prior expectations, then its statistical significance does not provide any good reason to discount that prior knowledge and theory. Indeed, the Duhem-Quine thesis states that it is impossible to test a hypothesis in isolation: any study requires a large number of assumptions and auxiliary hypotheses. Even in a laboratory setting we assume the equipment works correctly and is measuring what we think it is measuring. No result can be interpreted in isolation from the context in which the study was conducted.

Some authors have suggested abandoning the p-value altogether, and indeed some journals do not permit p-values at all. But this is too strong a position. The p-value does tell us something: it tells us whether the data we’ve observed are compatible with a particular model. It is just one piece of information among the many that lead to decent scientific inferences. Also required are the robustness of the model and how it stands up to changes in background assumptions, the prior knowledge that went into building it, and the economic or clinical interpretation of the results. The American Economic Review does not publish asterisks alongside empirical estimates; other journals should follow suit. While I don’t think p-values should be abandoned, the phrase ‘statistical significance’ can probably be consigned to the dustbin.

Image credit: Repapetilto (CC BY-SA 3.0)

Chris Sampson’s journal round-up for 25th July 2016

Every Monday our authors provide a round-up of some of the most recently published peer-reviewed articles from the field. We don’t cover everything, or even what’s most important – just a few papers that have interested the author. Visit our Resources page for links to more journals or follow the HealthEconBot. If you’d like to write one of our weekly journal round-ups, get in touch.

The income-health relationship ‘beyond the mean’: new evidence from biomarkers. Health Economics [PubMed] Published 15th July 2016

Going ‘beyond the mean’ is becoming a big deal in health economics, as we get better data and develop new tools for analysis. In economic evaluation we’re finding our feet in the age of personalised medicine. As this new study shows, analogous changes are taking place in the econometrics literature. We all know that income correlates with measures of health, but we know a lot less about the nature of this correlation. If we want to target policy in the most cost-effective way, simply asserting that higher income (on average) improves health is not that useful. This study uses a new econometric technique known as the recentered influence function (RIF) to look at the income-health relationship ‘beyond the mean’. It considers blood-based biomarkers with known disease associations as indicators of health, specifically: cholesterol, HbA1c, fibrinogen and ferritin. Even for someone with limited willingness to engage with econometrics (e.g. me) the methods are surprisingly elegant and intuitive. In short, the analysis divides people (in terms of each biomarker) into quantiles. So, for example, we can look at the people with high HbA1c (related to diabetes) and see whether the relationship with income is different from that for people with low HbA1c. The study finds that the income-health relationship is non-linear across the health distribution, demonstrating the merit of the RIF approach. Generally, the income gradients were higher at the top quintiles. This suggests that income may be more important in tipping a person over the edge – in terms of clinical cut-offs – than in affecting the health of people who are closer to the average. The analysis for cholesterol showed that looking only at the mean (i.e. income increases cholesterol) might hide a positive relationship for most of the distribution but a negative relationship at the top end. This could translate into very different policy implications. The study carried out further decomposition analyses to look at gender differences, which support further differentiation in policy. This kind of analysis will become increasingly important in policy development and evaluation. We might start to see public interventions being exposed as useless for most people, and perhaps actively harmful for some, even if they look good on average.
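For the curious, the mechanics of a RIF regression are simple enough to sketch in a few lines: compute the RIF of the outcome for a chosen quantile, then regress it on covariates by OLS. The following is a minimal illustration with simulated data; the variable names and data-generating process are hypothetical and not taken from the paper:

```python
import numpy as np
from scipy.stats import gaussian_kde
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Simulated data standing in for a biomarker (y) and log income (x).
n = 5_000
x = rng.normal(0, 1, n)                # hypothetical log income
y = 1 + 0.3 * x + rng.gumbel(0, 1, n)  # hypothetical biomarker

def rif_quantile(y, tau):
    """Recentered influence function of y for the tau-th quantile."""
    q = np.quantile(y, tau)
    f_q = gaussian_kde(y)(q)[0]        # kernel density estimate at q
    return q + (tau - (y <= q)) / f_q

# OLS of the RIF on covariates gives the unconditional quantile
# partial effect of x at each quantile tau.
X = sm.add_constant(x)
for tau in (0.1, 0.5, 0.9):
    fit = sm.OLS(rif_quantile(y, tau), X).fit()
    print(f"tau = {tau}: effect of x = {fit.params[1]:.3f}")
```

The coefficient at each tau is interpreted as the effect of the covariate on that quantile of the outcome’s unconditional distribution, which is what allows statements about the top and bottom of the health distribution rather than just the mean.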

Using patient-reported outcomes for economic evaluation: getting the timing right. Value in Health Published 15th July 2016

The estimation of QALYs involves an ‘area under the curve’ approach to outcome measurement. How accurately the estimate represents the ‘true’ number of QALYs (if there is such a thing) depends both on where the dots (i.e. data collection points) are and how we connect them. This study looks at the importance of these methodological decisions. Most of us (I think) would use linear interpolation between time points, but the authors also consider an alternative assumption that the health state utility value applies to the whole of the preceding period. The study looks at data for total knee arthroplasty, with SF-12 data collected at 6 weeks, 3 and 6 months, and then annually up to 5 years after the operation. The authors evaluated the use of alternative single postoperative SF-6D scores compared with using all of the data, under both linear and immediate interpolation. This gave 12 alternative scenarios. Collecting only at 3 months and using linear interpolation gave a surprisingly similar profile to the ‘true’ number of QALYs, coming out only about 5% too high. Collecting only at 6 weeks would underestimate the QALY gain by 41%, while 6 months and 12 months would be 18% too high and 8% too low, respectively. It’s easy to see that the more data you can collect, the more accurate your results will be. This study shows how important it can be to collect health state data at the most appropriate time. 3 months seems to be the figure for total knee arthroplasty, but it will likely differ for other interventions.
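The two interpolation assumptions are easy to compare directly. Here is a minimal sketch using the paper’s measurement times; the utility values themselves are made up for illustration:

```python
import numpy as np

# Hypothetical SF-6D-like utility scores after an operation, measured at
# baseline, 6 weeks, 3 and 6 months, and annually to 5 years (in years).
t = np.array([0, 6/52, 0.25, 0.5, 1, 2, 3, 4, 5])
u = np.array([0.55, 0.60, 0.70, 0.72, 0.74, 0.75, 0.75, 0.74, 0.73])

# Linear interpolation: trapezoidal area under the curve.
qaly_linear = np.sum(0.5 * (u[1:] + u[:-1]) * np.diff(t))

# 'Immediate' interpolation: each score applies to the whole of the
# preceding period, giving a step function rather than a line.
qaly_step = np.sum(u[1:] * np.diff(t))

print(f"linear: {qaly_linear:.3f} QALYs, step: {qaly_step:.3f} QALYs")
```

Dropping measurement points changes both areas, and how much depends on where the curve is steepest, which is why the timing of collection matters so much.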

Should the NHS abolish the purchaser-provider split? BMJ [PubMed] Published 12th July 2016

The NHS in England (notably not Scotland or Wales) operates with what’s known as the ‘internal market’, which separates the NHS’s functions as purchaser of health care and as provider of health care. In this BMJ ‘Head to Head’, Alan Maynard argues that it ought to be abolished, while Michael Dixon (a GP) defends its maintenance. Maynard argues that the internal market has been an expensive experiment, and that the results of the experiment have not been well recorded. The Care Quality Commission and Monitor – organisations supporting the internal market – cost around £300 million to run in 2014/15. Dixon argues that the purchaser-provider split offered “refreshingly new accountability” to local commissioners with front-line experience rather than to the Department of Health. However, Dixon seems to be defending an idealised version of commissioning rather than what is actually observed in practice. Neither party’s argument is particularly compelling, because neither draws on any strong empirical findings. That’s because convincing evidence doesn’t exist either way.

The impact of women’s health clinic closures on preventive care. American Economic Journal: Applied Economics [RePEc] Published July 2016

More than the UK, the US has a problem with anti-abortion campaigns wielding political influence to the extent that they affect the availability of health services for women. This study is interested in cancer screenings and routine check-ups, which aren’t politically contentious. The authors obtain data that include clinic locations and survey responses from the Behavioral Risk Factor Surveillance System. The analysis relates to Texas and Wisconsin, states that implemented major funding cuts to family planning services and women’s health centres between 2007 and 2012. 25% of clinics in Texas closed during this period. As centres close and women are required to travel further, we’d expect use of services to decline. There might also be knock-on effects in terms of waiting times and prices at the remaining centres. The analyses focus on the effect of distance to the nearest facility on use of preventive services, controlling for demographics and fixed effects relating to location and time. The principal finding is that an increase in distance to a woman’s nearest facility is likely to reduce use of preventive care, namely Pap tests and clinical breast exams. A 100-mile increase in the distance to the nearest centre was associated with a 7.4 percentage point drop in the propensity to receive a breast exam in the past year, and an 8.7 percentage point drop for Pap tests. Furthermore, the analysis shows that the impact is greater for individuals with lower educational attainment, particularly in the case of mammography. These findings demonstrate the threat to women’s health posed by political posturing.
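The core specification is essentially a linear probability model of service use on distance, with location and time fixed effects. A toy sketch of this kind of specification with simulated data follows; all variable names, magnitudes, and the data-generating process are hypothetical, and the paper’s actual model is richer:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)

# Simulated stand-in for the survey data: whether a woman had a Pap test
# in the past year, and the distance to her nearest clinic (miles).
n = 10_000
df = pd.DataFrame({
    "distance": rng.uniform(0, 120, n),
    "county": rng.integers(0, 50, n),
    "year": rng.integers(2007, 2013, n),
})
# Hypothetical true effect: use of the service declines with distance.
df["pap_test"] = (rng.uniform(0, 1, n)
                  < 0.75 - 0.0008 * df["distance"]).astype(int)

# Linear probability model with location and time fixed effects.
fit = smf.ols("pap_test ~ distance + C(county) + C(year)", data=df).fit()
print(fit.params["distance"] * 100)  # effect of a 100-mile increase
```

The coefficient on distance, scaled to a 100-mile change, is the quantity comparable to the 7.4 and 8.7 percentage point estimates reported in the paper.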

Photo credit: Antony Theobald (CC BY-NC-ND 2.0)

Free to choose?: A comment on Gaynor, Propper, and Seiler (2016)

Free to choose? Reform, choice, and consideration sets in the English National Health Service. M Gaynor, C Propper, and S Seiler. 2016. American Economic Review [RePEc] Forthcoming

The enhancement of patient choice of healthcare provider is a popular target for reform across many European countries, including the UK. In 2006, the government in the UK mandated that patients had to be given a choice of at least five providers when being referred for treatment. Prior to this time the decision lay principally with the referring clinician. The impact of this reform was previously examined in two papers: Gaynor, Moreno-Serra, and Propper (2012) and Cooper et al. (2011). The latter of these attracted some criticism, particularly after it was used in support of the controversial Health and Social Care Act (2012). One key aspect of this criticism revolved around the use of mortality from acute myocardial infarction (AMI) as a quality indicator, despite AMI being an emergency condition for which patients have no choice about their treatment hospital. The former of the two papers expands the analysis to consider other outcomes, such as all-cause death.

In this new paper, examining the same 2006 reform, the authors this time examine coronary artery bypass graft surgery (CABG). CABG is an elective procedure and thus permits patient choice. The analysis considers where patients chose to go and on what basis, the effect of choice on patient mortality, and the effect of competition on hospital market share. The authors develop a novel method to analyse consideration sets to compare choices made prior to and after the reform. One of the key findings is that patients respond to signals of quality – in this case hospital mortality rates – and that this improved the sorting of patients into hospitals with lower mortality rates. However, here the distinction between a quality signal and actual quality is blurred.

It stands to reason that a patient would prefer a hospital with lower apparent mortality rates. But mortality rates, whether adjusted or unadjusted, have been shown to be poorly correlated with preventable mortality in the NHS. The mortality rates used in this paper are the estimated (OLS) coefficients from a model of in-hospital death regressed on dummy variables for each hospital, thus estimating the crude mortality rate. To address the potential mismatch between mortality rates and the causal effect of a hospital on patient mortality, Gaynor, Propper, and Seiler also use an instrumental variable (IV) estimator for the hospital dummies, with patient distance to each hospital as the instrument. This follows the method of Gowrisankaran and Town (1999). Gaynor, Propper, and Seiler state that a Hausman test does not reject the hypothesis that the OLS and IV coefficients are the same, and so they use the OLS crude mortality rate estimates in the primary analysis. Nevertheless, they repeat the analysis and show that patient hospital choice is also associated with the IV-estimated mortality rate. But the question still remains as to whether these estimates can be relied upon to demonstrate that the reforms improved mortality risk in the CABG cohort.
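To make the two estimators concrete, here is a stripped-down sketch of the mechanics with simulated data. The hospital counts, the choice rule, and all parameters are hypothetical, and the paper’s actual models include risk adjustment and a full choice model:

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated stand-in: n patients, J hospitals, distance to each hospital.
n, J = 5_000, 10
dist = rng.uniform(1, 50, size=(n, J))  # miles to each hospital
true_rate = rng.uniform(0.01, 0.05, J)  # underlying hospital mortality

# Patients tend to choose a nearby hospital (simplified choice rule).
choice = np.argmin(dist + rng.normal(0, 5, (n, J)), axis=1)
D = np.eye(J)[choice]                   # hospital dummy matrix
death = (rng.uniform(0, 1, n) < true_rate[choice]).astype(float)

# OLS of death on hospital dummies (no constant): the coefficients are
# simply each hospital's crude mortality rate.
ols_rates = np.linalg.lstsq(D, death, rcond=None)[0]

# 2SLS, instrumenting the dummies with distances: the first stage
# predicts hospital choice from distance, the second stage regresses
# death on the fitted dummies.
Z = np.column_stack([np.ones(n), dist])
D_hat = Z @ np.linalg.lstsq(Z, D, rcond=None)[0]
iv_rates = np.linalg.lstsq(D_hat, death, rcond=None)[0]

print(np.corrcoef(ols_rates, iv_rates)[0, 1])  # how similar are they?
```

In this simulation there is no patient-level selection on risk, so the two sets of estimates should agree; the substantive question in the paper is whether they agree in the real data for the right reasons.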

Gowrisankaran and Town showed there was little correlation in their study between GLS and IV estimates of hospital quality (see the figure). Hogan et al. (2015) showed that the association between standardised hospital mortality ratios (SMRs) and the proportion of preventable deaths was very weak. And Girling et al. (2012) estimated that if 6% of hospital deaths are preventable, then the predictive value of the SMR can be no greater than 9%, although they suggest that this could rise to 30% if 15% of deaths were preventable. So it seems perhaps surprising that Gaynor, Propper, and Seiler find no evidence of a difference between their OLS and IV estimators. Now, for CABG, the proportion of preventable deaths may be very high: Guru et al. (2008) estimated it to be as high as 32%. But they also found there to be no correlation between preventable deaths and mortality rates across hospitals. Taken together, this might suggest a flaw in the analysis of Gaynor, Propper, and Seiler.

Figure: scatterplot of the GLS and IV estimates of hospital quality from separate years regressions (Gowrisankaran and Town, 1999). (c) Elsevier Science B.V.

When choosing between healthcare providers, patients are provided with information about quality. This normally comes in the form of SMRs, as we have previously discussed. Gaynor, Propper, and Seiler demonstrate that patients respond to this information. But, as we have argued, these signals are poor indicators of actual quality. Thus the consequences of patients sorting into hospitals, in terms of actual deaths avoided, are difficult to ascertain. A Hausman test suggests that the OLS and IV results are similar in this study, and there is an association between patient choice and the IV-estimated quality variable. But many arguments may run counter to these findings: the Hausman test could have low power, the IV estimator may be biased by the large number of moment restrictions, the instruments may not be conditionally independent of the hospitals, the common support between hospitals may not include the highest-risk patients, and so forth. This paper successfully demonstrates how patients respond to information in making their choice between hospitals, but whether the reforms reduced mortality remains, in my opinion, unanswered.

Photo credit: Ramdlon (CC0)