Poor statistical communication means poor statistics

Statistics is a broad and complex field. For a given research question, any number of statistical approaches could be taken. In an article published last year, researchers asked 29 teams, comprising 61 analysts in all, to use the same dataset to address the question of whether referees were more likely to give dark-skinned players a red card than light-skinned players. The teams took a wide variety of analytical approaches and did not all reach the same conclusion. Each analysis had its advantages and disadvantages and I’m sure each analyst would have defended their work. However, as many statisticians and economists may well know, the merit of an approach is not the only factor that matters in its adoption.

There has, for decades, been criticism of the misunderstanding and misuse of null hypothesis significance testing (NHST). P-values have been a common topic on this blog. Despite this, NHST remains the predominant paradigm for most statistical work. If used appropriately this needn’t be a problem, but if it were being used appropriately it wouldn’t be used nearly as much: p-values can’t perform the inferential role many expect of them. It’s not difficult to understand why things are this way: most published work uses NHST, we teach students NHST in order to understand the published work, students become researchers who use NHST, and so on. Part of statistical education involves teaching the arbitrary conventions that have gone before, such as calling a p-value ‘significant’ if it is below 0.05 or a study ‘adequately powered’ if power is above 80%. One of the most pernicious consequences of this is that these heuristics become a substitute for thinking. The presence of these key figures is expected and their absence is often marked by a request from reviewers and other readers for their inclusion.

I have argued on this blog and elsewhere for a wider use of Bayesian methods (and less NHST) and I try to practice what I preach. For an ongoing randomised trial I am involved with, I adopted a Bayesian approach to design and analysis. Instead of the usual power calculation, I conducted a Bayesian assurance analysis (which Anthony O’Hagan has written some good articles on for those wanting more information). I’ll try to summarise the differences between ‘power’ and ‘assurance’ calculations by attempting to define them, which is actually quite hard!

Power calculation. If we were to repeat a trial infinitely many times, what sample size would we need so that in x% of trials the assumed data generating model produces data that fall in the α% most extreme quantiles of the distribution of data that would be produced by the same data generating model but with one parameter set to exactly zero (or any equivalent hypothesis)? Typically we set x% to be 80% (power) and α% to be 5% (statistical significance threshold).
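As a rough illustration of the power calculation, the sketch below assumes a two-arm trial comparing means with a known common standard deviation and a two-sided z-test. The effect size `delta`, the standard deviation `sigma`, and the function names are hypothetical choices for the example, not details of the trial discussed in this post.

```python
import math
from statistics import NormalDist


def power_two_arm(n_per_arm, delta, sigma, alpha=0.05):
    """Approximate power of a two-sided z-test for a difference in means.

    Assumes equal allocation and a known common standard deviation.
    """
    std_normal = NormalDist()
    se = sigma * math.sqrt(2.0 / n_per_arm)       # SE of the difference in means
    z_crit = std_normal.inv_cdf(1 - alpha / 2)    # two-sided critical value
    shift = delta / se                            # standardised assumed effect
    # Probability the test statistic lands in either rejection tail
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


def sample_size_two_arm(delta, sigma, power=0.80, alpha=0.05):
    """Smallest per-arm sample size reaching the target power."""
    n = 2
    while power_two_arm(n, delta, sigma, alpha) < power:
        n += 1
    return n
```

With these assumed inputs, `sample_size_two_arm(0.5, 1.0)` returns 63 participants per arm. Note that the effect size is fixed at a single value, which is exactly the feature the assurance approach relaxes.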

Assurance calculation. For a given data generating model, what sample size do we need so that there is an x% probability that we will be (100-α)% certain that the parameter is positive (or any equivalent choice)?

The assurance calculation could be reframed in a decision framework as: what sample size do we need so that there is an x% probability we will make the right decision about whether a parameter is positive (or any equivalent decision), given the costs of making the wrong decision?
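The assurance calculation can be sketched by simulation. The example below assumes a hypothetical two-arm trial with a known standard deviation, a normal design prior on the treatment effect, and a flat analysis prior, so that the posterior for the effect is normal and centred on the observed difference. All names and numerical values are assumptions for illustration only; a fuller analysis would place distributions over the other uncertain parameters too.

```python
import math
import random
from statistics import NormalDist


def assurance(n_per_arm, prior_mean, prior_sd, sigma,
              certainty=0.95, n_sims=20000, seed=1):
    """Monte Carlo estimate of the probability that the trial ends with
    a posterior probability of a positive effect of at least `certainty`.
    """
    rng = random.Random(seed)
    std_normal = NormalDist()
    se = sigma * math.sqrt(2.0 / n_per_arm)  # SE of the difference in means
    successes = 0
    for _ in range(n_sims):
        # Draw a plausible true effect from the (assumed) design prior
        true_effect = rng.gauss(prior_mean, prior_sd)
        # Simulate the difference in means the trial would observe
        observed = rng.gauss(true_effect, se)
        # Flat analysis prior: effect | data ~ Normal(observed, se^2)
        prob_positive = std_normal.cdf(observed / se)
        if prob_positive >= certainty:
            successes += 1
    return successes / n_sims
```

A call such as `assurance(63, prior_mean=0.5, prior_sd=0.2, sigma=1.0)` averages over uncertainty in the effect size rather than conditioning on one assumed value, which is the main design difference from the power calculation.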

Both of these are complex, but I would argue it is the assurance calculation that gives us what we want to know most of the time when designing a trial. The assurance analysis also better represents uncertainty since we specify distributions over all the uncertain parameters rather than choose exact values. Despite this, the funder of the trial mentioned above, who shall remain nameless, insisted on the results of a power calculation to determine whether the trial was worth continuing with because that’s “what they’re used to.”

The main culprit for this issue is, I believe, communication. A simpler explanation with better presentation may have been easier to understand and accept. This is not to say that the funder wasn’t substituting the heuristic ‘80% or more power = good’ for actually thinking about what we could learn from the trial; I believe they were. But until statisticians, economists, and other data analytic researchers start communicating better, how can we expect others to listen?

Image credit: Geralt

Sam Watson’s journal round-up for 11th February 2019

Every Monday our authors provide a round-up of some of the most recently published peer reviewed articles from the field. We don’t cover everything, or even what’s most important, just a few papers that have interested the author.

Contest models highlight inherent inefficiencies of scientific funding competitions. PLoS Biology [PubMed] Published 2nd January 2019

If you work in research you will have no doubt thought to yourself at one point that you spend more time applying to do research than actually doing it. You can spend weeks working on (what you believe to be) a strong proposal only for it to fail against other strong bids. That time could have been spent collecting and analysing data. Indeed, the opportunity cost of writing extensive proposals can be very high. The question arises as to whether there is another method of allocating research funding that reduces this waste and inefficiency. This paper compares the proposal competition to a partial lottery. In this lottery system, proposals are short, and among those that meet some qualifying standard those that are funded are selected at random. This system has the benefit of not taking up too much time but has the cost of reducing the average scientific value of the winning proposals. The authors compare the two approaches using an economic model of contests, which takes into account factors like proposal strength, public benefits, benefits to the scientist like reputation and prestige, and scientific value. Ultimately they conclude that, when the number of awards is smaller than the number of proposals worthy of funding, the proposal competition is inescapably inefficient. It means that researchers have to invest heavily to get a good project funded, and even if it is good enough it may still not get funded. The stiffer the competition the more researchers have to work to win the award. And what little evidence there is suggests that the format of the application makes little difference to the amount of time spent by researchers on writing it. The lottery mechanism only requires the researcher to propose something that is good enough to get into the lottery. Far less time would therefore be devoted to writing it and more time spent on actual science. I’m all for it!

Preventability of early versus late hospital readmissions in a national cohort of general medicine patients. Annals of Internal Medicine [PubMed] Published 5th June 2018

Hospital quality is hard to judge. We’ve discussed on this blog before the pitfalls of using measures such as adjusted mortality differences for this purpose. Just because a hospital has higher than expected mortality does not mean those deaths could have been prevented with higher quality care. More thorough methods assess errors and preventable harm in care. Case note review studies have suggested as little as 5% of deaths might be preventable in England and Wales. Another paper we have covered previously suggests that, as a result, the predictive value of standardised mortality ratios for preventable deaths may be less than 10%.

Another commonly used metric is readmission rates. Poor care can mean patients have to return to the hospital. But again, the question remains as to how preventable these readmissions are. Indeed, there may also be substantial differences between those patients who are readmitted shortly after discharge and those for whom it may take a longer time. This article explores the preventability of early and late readmissions in ten hospitals in the US. It uses case note review and a number of reviewers to evaluate preventability. The headline figures are that 36% of early readmissions are considered preventable compared to 23% of late readmissions. Moreover, it was considered that the early readmissions were most likely to have been preventable at the hospital whereas for late readmissions, an outpatient clinic or the home would have had more impact. All in all, another paper which provides evidence to suggest crude, or even adjusted, rates are not good indicators of hospital quality.

Visualisation in Bayesian workflow. Journal of the Royal Statistical Society: Series A (Statistics in Society) [RePEc] Published 15th January 2019

This article stems from a broader programme of work from these authors on good “Bayesian workflow”. That is to say, if we’re taking a Bayesian approach to analysing data, what steps ought we to be taking to ensure our analyses are as robust and reliable as possible? I’ve been following this work for a while as this type of pragmatic advice is invaluable. I’ve often read empirical papers where the authors have chosen, say, a logistic regression model with covariates x, y, and z and reported the outcomes, but at no point ever justified why this particular model might be any good at all for these data or the research objective. The key steps of the workflow include, first, exploratory data analysis to help set up a model, and second, performing model checks before estimating model parameters. This latter step is important: one can generate data from a model and set of prior distributions, and if the data that this model generates look nothing like what we would expect the real data to look like, then clearly the model is not very good. Following this, we should check whether our inference algorithm is doing its job: for example, are the MCMC chains converging? We can also conduct posterior predictive model checks. These have had their criticisms in the literature for using the same data to both estimate and check the model, which could lead to the model generalising poorly to new data. Indeed, in a recent paper of my own, posterior predictive checks showed poor fit of a model to my data and that a more complex alternative was better fitting. But other model fit statistics, which penalise the number of parameters, led to the opposite conclusion. So the simpler model was preferred on the grounds that the more complex model was overfitting the data. I would therefore argue that posterior predictive model checks are a sensible test to perform but must be interpreted carefully as one step among many. Finally, we can compare models using tools like cross-validation.
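The idea of checking a model by generating data from its priors, before any parameters are estimated, can be sketched in a few lines. The example below assumes a logistic regression with a single standardised covariate and independent normal priors on the intercept and slope. The model, the priors, and the 1% and 99% thresholds are hypothetical choices for illustration and are not taken from the article.

```python
import math
import random


def inv_logit(v):
    """Numerically stable logistic function."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)


def prior_predictive_extreme_share(prior_sd, n_draws=2000,
                                   n_patients=50, seed=7):
    """Share of simulated patients whose implied event probability is
    below 1% or above 99% under Normal(0, prior_sd^2) priors.
    """
    rng = random.Random(seed)
    extreme = 0
    total = 0
    for _ in range(n_draws):
        # One draw of the model parameters from the priors
        intercept = rng.gauss(0.0, prior_sd)
        slope = rng.gauss(0.0, prior_sd)
        for _ in range(n_patients):
            x = rng.gauss(0.0, 1.0)  # standardised covariate (assumed)
            p = inv_logit(intercept + slope * x)
            if p < 0.01 or p > 0.99:
                extreme += 1
            total += 1
    return extreme / total
```

Comparing `prior_predictive_extreme_share(10.0)` with `prior_predictive_extreme_share(1.0)` shows that a seemingly vague prior with standard deviation 10 implies that most patients are near-certain to have, or not have, the event, which rarely matches what we expect of real data. A histogram of the simulated probabilities would make the same point visually.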

This article discusses the use of visualisation to aid in this workflow. The authors use the running example of building a model to estimate exposure to small particulate matter from air pollution across the world. Plots are produced for each of the steps and show just how bad some models can be and how we can refine our model step by step to arrive at a convincing analysis. I agree wholeheartedly with the authors when they write, “Visualization is probably the most important tool in an applied statistician’s toolbox and is an important complement to quantitative statistical procedures.”