Widespread misuse of statistical significance in health economics

Despite widespread cautionary messages, p-values and claims of statistical significance are continually misused. One of the most common errors is to mistake statistical significance for economic, clinical, or political significance. This error may manifest as authors interpreting only ‘statistically significant’ results as important, or even neglecting to examine the magnitude of estimated coefficients at all. For example, we’ve written previously about a claim that statistically insignificant results are ‘meaningless’. Another common error is to ‘transpose the conditional’, that is, to interpret the p-value as the posterior probability of the null hypothesis. For example, in a recent exchange on Twitter, David Colquhoun, whose discussions of p-values we’ve also previously covered, made the statement:

However, the p-value does not give the probability that the null hypothesis is true, nor is it direct evidence about whether an effect ‘exists’. P-values are related to the posterior probability of the null hypothesis in a way that depends on statistical power, the choice of significance level, and the prior probability of the null. Observing a significant p-value means only that the data were unlikely to be produced by a particular model, not that the alternative hypothesis is true. Indeed, the null hypothesis may be a poor explanation for the observed data, yet still a better one than the alternative. This is the essence of Lindley’s paradox.
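The dependence on power and priors can be made concrete with a little arithmetic. The sketch below is purely illustrative (the significance level, power, and prior probabilities are assumptions, not estimates from any study): it computes the probability that a ‘significant’ result is in fact a false positive.

```python
def false_positive_risk(alpha, power, prior_null):
    """Probability that the null is true given a 'significant' result.

    Of all tests reaching significance, a fraction alpha * prior_null
    are false positives and power * (1 - prior_null) are true
    positives; Bayes' theorem gives the share that are false.
    All inputs here are illustrative assumptions.
    """
    false_pos = alpha * prior_null
    true_pos = power * (1 - prior_null)
    return false_pos / (false_pos + true_pos)

# With a 50:50 prior, 80% power and alpha = 0.05, a significant
# result is a false positive about 6% of the time...
print(false_positive_risk(0.05, 0.80, 0.5))   # ~0.059

# ...but if true effects are rare (null true 90% of the time), the
# same p < 0.05 result is a false positive 36% of the time.
print(false_positive_risk(0.05, 0.80, 0.9))   # 0.36
```

The same significance threshold thus corresponds to very different posterior probabilities depending on quantities the p-value itself says nothing about, which is why it cannot be read as the probability that the null hypothesis is true.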

So what can we say about p-values? The six principles of the ASA’s statement on p-values are:

  1. P-values can indicate how incompatible the data are with a specified statistical model.
  2. P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
  3. Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.
  4. Proper inference requires full reporting and transparency.
  5. A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.
  6. By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.
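Principle 5, that a p-value does not measure the size or importance of an effect, is easy to demonstrate with a hypothetical two-sided z-test (the effect sizes and sample sizes below are made up for illustration): a trivially small effect reaches ‘significance’ purely through a large sample, while a substantively large effect in a small sample does not.

```python
import math

def two_sided_p(effect, sd, n):
    """Two-sided p-value for a z-test of a mean against zero.

    Assumes a known standard deviation, so the normal CDF
    (built from math.erf) is used rather than the t distribution.
    """
    z = abs(effect) / (sd / math.sqrt(n))
    return 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))

# A negligible effect (0.01 sd) with a huge sample: highly 'significant'.
print(two_sided_p(0.01, 1.0, 1_000_000))  # p far below 0.001

# A large effect (0.5 sd) with n = 10: not 'significant' at the 5% level.
print(two_sided_p(0.5, 1.0, 10))          # p around 0.11
```

Reading only the stars in a regression table would rank the first, economically negligible, effect above the second.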


In 1996, Deirdre McCloskey and Stephen Ziliak surveyed economics papers published in the American Economic Review in the 1980s for p-value misuse. Overall, 70% did not distinguish statistical from economic significance and 96% misused a test statistic in some way. Things had not improved when they repeated the study ten years later. Unfortunately, these problems are not exclusive to the AER. A quick survey of a top health economics journal, Health Economics, reveals similar misuse, as we discuss below. This journal is not singled out for any particular reason beyond being one of the key journals in the field covered by this blog, and it frequently features in our journal round-ups. Similarly, no comment is made on the quality of the studies or authors beyond their claims about and use of statistical significance. Nevertheless, where there are p-values, there are problems. For such a pivotal statistic, one on which careers can be made or broken, we should at least get it right!

Nine studies were published in the May 2017 issue of Health Economics. The list below shows some examples of p-value errors in the text of the articles. The most common issue was using the p-value to decide whether an effect exists, or using it as the (only) evidence to support or reject a particular hypothesis. As described above, the statistical significance of a coefficient does not imply the existence of an effect. Some of the statements identified below as erroneous may be contentious since, in the broader context of a paper, they may make sense. For example, claiming that a statistically significant estimate is evidence of an effect may be justified where the totality of the evidence suggests that the observed data would be incompatible with a particular model. However, this is generally not how the p-values are used.

Examples of p-value (mis-)statements

Even the CMI has no statistically significant effect on the facilitation ratio. Thus, the diversity and complexity of treated patients do not play a role for the subsidy level of hospitals.

the coefficient for the baserate is statistically significant for PFP hospitals in the FE model, indicating that a higher price level is associated with a lower level of subsidies.

Using the GLM we achieved nine significant effects, including, among others, Parkinson’s disease and osteoporosis. In all components we found more significant effects compared with the GLM approach. The number of significant effects decreases from component 2 (44 significant effects) to component 4 (29 significant effects). Although the GLM lead to significant results for intestinal diverticulosis, none of the component showed equivalent results. This might give a hint that taking the component based heterogeneity into account, intestinal diverticulosis does not significantly affect costs in multimorbidity patients. Besides this, certain coefficients are significant in only one component.

[It is unclear what ‘significant’ and ‘not significant’ refer to or how they are calculated, but they appear to refer to t > 1.96. It is not clear whether corrections for multiple comparisons were made.]

There is evidence of upcoding as the coefficient of spreadp_pos is statistically significant.

Neither [variable for upcoding] is statistically significant. The incentive for upcoding is, according to these results, independent of the statutory nature of hospitals.

The checkup significantly raises the willingness to pay any positive amount, although it does not significantly affect the amount reported by those willing to pay some positive amount.

[The significance is with reference to statistical significance].

Similarly, among the intervention group, there were lower probabilities of unhappiness or depression (−0.14, p = 0.045), being constantly under strain (0.098, p = 0.013), and anxiety or depression (−0.10, p = 0.016). There was no difference between the intervention group and control group 1 (eligible non-recipients) in terms of the change in the likelihood of hearing problems (p = 0.64), experiencing elevate blood pressure (p = 0.58), and the number of cigarettes smoked (p = 0.26).

The ∆CEs are also statistically significant in some educational categories. At T + 1, the only significant ∆CE is observed for cancer survivors with a university degree for whom the cancer effect on the probability of working is 2.5 percentage points higher than the overall effect. At T + 3, the only significant ∆CE is observed for those with no high school diploma; it is 2.2 percentage points lower than the overall cancer effect on the probability of working at T + 3.

And, just for balance, here are a couple from this year’s winner of the Arrow prize at iHEA, which gets bonus points for the phrase ‘marginally significant’, a phrase that can be used either to confirm or to refute a hypothesis depending on the inclination of the author:

Our estimated net effect of waiting times for high-income patients (i.e., adding the waiting time coefficient and the interaction of waiting times and high income) is positive, but only marginally significant (p-value 0.055).

We find that patients care about distance to the hospital and both of the distance coefficients are highly significant in the patient utility function.


As we’ve argued before, p-values should not be the primary result reported. Their interpretation is complex and so often leads to mistakes. Our goal is to understand economic systems and to determine the economic, clinical, or policy relevant effects of interventions or modifiable characteristics. The p-value does provide some useful information but not enough to support the claims made from it.


#HEJC for 24/10/2014

The next #HEJC discussion will take place Friday 24th October, at 1pm London time on Twitter. To see what this means for your time zone visit Time.is or join the Facebook event. For more information about the Health Economics Journal Club and how to take part, click here.

The paper for discussion is a working paper published by Glasgow Caledonian University’s Yunus Centre. The authors are Neil McHugh and colleagues. The title of the paper is:

Extending life for people with a terminal illness: a moral right or an expensive death? Exploring societal perspectives

Following the meeting, a transcript of the Twitter discussion can be downloaded here.

Links to the article

Direct: http://www.gcu.ac.uk/media/gcalwebv2/ycsbh/yunuscentre/Extending%20Life%20for%20People%20with%20a%20Terminal%20Illness.pdf

RePEc: https://ideas.repec.org/p/yun/hewpse/201403.html

Summary of the paper

A lot of research effort has been spent on whether health economists’ most ingrained normative assumption should hold: is a QALY of equal value regardless of to whom it accrues? In the UK, the National Institute for Health and Care Excellence has given weighting to ‘special cases’, namely life-extending drugs for patients near the end of their life (mainly for cancer). However, existing empirical research on whether societal values support such a weighting has given conflicting results.

McHugh et al., in their new working paper, present the first major mixed-methods study of societal perspectives on QALY-weighting. The authors use Q methodology – which involves the ranking of opinion statements according to agreement – to elicit societal perspectives on the relative value of life extension for people with terminal illness. Opinion statements were collected from four sources:

  • newspaper articles
  • a NICE public consultation
  • 16 interviews with key informants
  • 3 focus groups with the general public

The Q sort was conducted with people from academia, the pharmaceutical industry, charities, patient groups, religious groups, clinicians, people with experience of terminal illness, and a sample of the general public. The authors’ final sample included 61 Q sorts, and factor analysis identified three distinguishable perspectives, which can be summarised as:

  1. A population perspective (value for money, no special cases)
  2. An individual perspective (value of life, not cost)
  3. A mixed perspective

Factor 1 individuals are unlikely to support any QALY-weighting, maintaining a utilitarian-type health-maximising perspective. Factor 2 respondents reject the denial of life-extending treatments and assert that patients and their families should decide whether or not they wish to receive the treatment, regardless of cost. This group appears to disagree with cost-effectiveness analysis altogether. Factor 3 represents a more nuanced view, asserting that value is broader than health gain alone. However, factor 3 was associated with a focus on quality of life, and so support for expensive life-extending treatment would depend on this. It is unclear whether QALY-weighting would adequately capture such a view.

Discussion points

  • Is the question of QALY-weighting a normative one or a positive one?
  • Are the three factors likely to be robust across ethical dilemmas other than terminal illness?
  • To what extent are the opinions associated with the 3 factors likely to be robust to further deliberation?
  • Are factor 2 respondents simply wrong?
  • Should QALY-weighting be based on democratic processes?
  • Is it of concern that current policy appears to reflect the views of health economists better than other groups?
  • Where do you stand?

Can’t join in with the Twitter discussion? Add your thoughts on the paper in the comments below.

A(nother) new #HEJC format

In recent months, attendance at the journal club has declined to zero, despite continued interest in the build-up. Here at AHE blog towers we don’t have the resources to promote #HEJC any more. Promotion is unlikely to make a difference anyway; the world of health economics is a relatively small one, and it will always be difficult to satisfy enough people’s preferences regarding topics and timings. However, we still think a health economics journal club serves a purpose, and we know that people are interested. So, we are introducing a new format, one that depends more on individuals (you) than on large numbers attending on a regular basis.

Here’s how it will work:

  1. On the first day of each month a list of recently published working papers and discussion papers will be posted. This will be accompanied by a call for discussants.
  2. Readers of the blog can volunteer to discuss a paper by completing the accompanying form. A discussant will be expected to write a short summary of the paper and a short discussion. The discussant will decide when the Twitter discussion will take place.
  3. The discussant’s submission will be posted on the blog one week in advance of the Twitter discussion.
  4. The Twitter discussion will take place in the same way as previously.

For more details, visit the updated #HEJC page.