Visualising PROMs data

The patient reported outcome measures (PROMs) data set is a large database of before-and-after health-related quality of life (HRQoL) measures for a large number of patients undergoing one of four procedures: hip replacement, knee replacement, varicose vein surgery, and surgery for groin hernia. The outcome measures are the EQ-5D index and visual analogue scale (and a disease-specific measure for three of the interventions). The data also identify the provider of the operation. Being publicly available, these data allow us to look at a range of different questions: what’s the average effect of the surgery on HRQoL? What are the differences between providers in gains to HRQoL or in patient casemix? Great!

The first thing we should always do with new data is to look at it. This might be in an exploratory way, to determine the questions to ask of the data, or in an analytical way, to get an idea of the relationships between variables. Plotting the data communicates more about what’s going on than any table of statistics alone. However, the plots on the NHS Digital website might be accused of being a little uninspired: they collapse much of the variation into simple charts that conceal a lot of what’s going on.

So let’s consider other ways of visualising these data. For all these plots, a walk-through of the code is at the end of this post.

Now, I’m not a regular user of PROMs data, so what I think are the interesting features of the data may not reflect what the data are generally used for. For this post, the features I find interesting are:

  • The joint distribution of pre- and post-op scores
  • The marginal distributions of pre- and post-op scores
  • The relationship between pre- and post-op scores over time

We will pool all the data from six years’ worth of PROMs data. This gives us over 200,000 observations. A scatter plot with this information is useless as the density of the points will be very high. A useful alternative is hexagonal binning, which is like a two-dimensional histogram. Hexagonal tiles, which usefully tessellate and are more interesting to look at than squares, can be shaded or coloured with respect to the number of observations in each bin across the support of the joint distribution of pre- and post-op scores (which is [-0.5,1]x[-0.5,1]). We can add the marginal distributions to the axes and then add smoothed trend lines for each year. Since the data are constrained between -0.5 and 1, the mean may not be a very good summary statistic, so we’ll plot a smoothed median trend line for each year. Finally, we’ll add a line on the diagonal. Patients above this line have improved and patients below it deteriorated.
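Before getting to the real plot, here is a minimal sketch of two of the ideas above: reading the diagonal as improved versus deteriorated, and using a median rather than a mean trend for a bounded score. It runs on simulated scores, not PROMs data; the distribution parameters and the bin width are arbitrary assumptions, and the binned medians are only a crude stand-in for the smoothed median line used in the actual plot.

```r
# Simulated stand-in for pooled pre- and post-op scores (NOT real PROMs data).
set.seed(1)
n    <- 5000
pre  <- pmin(pmax(rnorm(n, mean = 0.35, sd = 0.3), -0.5), 1)        # clamp to [-0.5, 1]
post <- pmin(pmax(pre + rnorm(n, mean = 0.4, sd = 0.3), -0.5), 1)   # clamp to [-0.5, 1]

# Position relative to the diagonal post == pre.
share_improved     <- mean(post > pre)   # above the line
share_unchanged    <- mean(post == pre)  # on the line (e.g. both at the ceiling of 1)
share_deteriorated <- mean(post < pre)   # below the line

# Crude trend: median and mean post-op score within equal-width bins of pre-op score.
bins          <- cut(pre, breaks = seq(-0.5, 1, by = 0.25), include.lowest = TRUE)
median_by_bin <- tapply(post, bins, median)
mean_by_bin   <- tapply(post, bins, mean)
```

Comparing median_by_bin with mean_by_bin shows how the two summaries can differ once scores pile up at the ceiling of 1.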

Hip replacement results

There’s a lot going on in the graph, but I think it reveals a number of key points about the data that we wouldn’t have seen from the standard plots on the website:

  • There appear to be four clusters of patients:
    • Those who were in close to full health prior to the operation and were in ‘perfect’ health (score = 1) after;
    • Those who were in close to full health pre-op and who didn’t really improve post-op;
    • Those who were in poor health (score close to zero) and made a full recovery;
    • Those who were in poor health and who made a partial recovery.
  • The median change is an improvement in health.
  • The median change improves modestly from year to year for a given pre-op score.
  • There are ceiling effects for the EQ-5D.

None of this is news to those who study these data. But this way of presenting the data certainly tells more of a story than the current plots on the website.

R code

We’re going to consider hip replacement, but the code is easily modified for the other outcomes. Firstly we will take the pre- and post-op score and their difference and pool them into one data frame.

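# Each block reads one year's record-level file, builds the pre-op score,
# the post-op score and their difference, and tags the rows with the year.
# NOTE: the pre-op column names (Pre.Op.Q.EQ5D.Index, Q1_EQ5D_INDEX) are
# assumed by analogy with the post-op names; check them against the CSV headers.

# df 14/15
df <- read.csv("C:/docs/proms/Record Level Hip Replacement 1415.csv")

df$pre  <- df$Pre.Op.Q.EQ5D.Index
df$post <- df$Post.Op.Q.EQ5D.Index
df$diff <- df$post - df$pre
df$year <- "2014/15"

df1415 <- df[, c('Provider.Code', 'pre', 'post', 'diff', 'year')]

# df 13/14
df <- read.csv("C:/docs/proms/Record Level Hip Replacement 1314.csv")

df$pre  <- df$Pre.Op.Q.EQ5D.Index
df$post <- df$Post.Op.Q.EQ5D.Index
df$diff <- df$post - df$pre
df$year <- "2013/14"

df1314 <- df[, c('Provider.Code', 'pre', 'post', 'diff', 'year')]

# df 12/13
df <- read.csv("C:/docs/proms/Record Level Hip Replacement 1213.csv")

df$pre  <- df$Pre.Op.Q.EQ5D.Index
df$post <- df$Post.Op.Q.EQ5D.Index
df$diff <- df$post - df$pre
df$year <- "2012/13"

df1213 <- df[, c('Provider.Code', 'pre', 'post', 'diff', 'year')]

# df 11/12 (older files use the Q1_/Q2_ naming)
df <- read.csv("C:/docs/proms/Hip Replacement 1112.csv")

df$pre  <- df$Q1_EQ5D_INDEX
df$post <- df$Q2_EQ5D_INDEX
df$diff <- df$post - df$pre
df$year <- "2011/12"

df1112 <- df[, c('Provider.Code', 'pre', 'post', 'diff', 'year')]

# df 10/11
df <- read.csv("C:/docs/proms/Record Level Hip Replacement 1011.csv")

df$pre  <- df$Q1_EQ5D_INDEX
df$post <- df$Q2_EQ5D_INDEX
df$diff <- df$post - df$pre
df$year <- "2010/11"

df1011 <- df[, c('Provider.Code', 'pre', 'post', 'diff', 'year')]

# Pool the years into one data frame; year becomes a factor for the plot.
df <- rbind(df1415, df1314, df1213, df1112, df1011)
df$year <- factor(df$year)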
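The per-year blocks above differ only in the file, the column names and the year, so they could be folded into a small helper. The sketch below is one way to do that, not the code used for the plot: prep_year is a hypothetical function name, the pre-op column names are assumed by analogy with the post-op ones, and the two toy data frames (with made-up values) stand in for the CSV files so that the example runs on its own.

```r
# Hypothetical helper: standardise one year's data frame to common column names.
prep_year <- function(d, pre_col, post_col, year) {
  out <- data.frame(
    Provider.Code = d[["Provider.Code"]],
    pre  = d[[pre_col]],
    post = d[[post_col]],
    stringsAsFactors = FALSE
  )
  out$diff <- out$post - out$pre   # change in EQ-5D index
  out$year <- year                 # label used to colour the trend lines
  out
}

# Toy inputs standing in for two of the CSV files (made-up values, assumed column names).
raw_new <- data.frame(Provider.Code = c("A", "B"),
                      Pre.Op.Q.EQ5D.Index  = c(0.2, 0.5),
                      Post.Op.Q.EQ5D.Index = c(0.8, 0.5))
raw_old <- data.frame(Provider.Code = "C",
                      Q1_EQ5D_INDEX = 0.1,
                      Q2_EQ5D_INDEX = 1)

# Pool the standardised years into one data frame.
pooled <- rbind(
  prep_year(raw_new, "Pre.Op.Q.EQ5D.Index", "Post.Op.Q.EQ5D.Index", "2014/15"),
  prep_year(raw_old, "Q1_EQ5D_INDEX", "Q2_EQ5D_INDEX", "2011/12")
)
```

With the real files, each toy data frame would be replaced by the result of read.csv() on the corresponding path.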




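Now, for the plot. We will need the packages ggplot2, ggExtra, and extrafont. The last of these is only used to change the plot fonts; it is not essential, but it is aesthetically pleasing. Note that hexagonal binning in ggplot2 relies on the hexbin package, and the "rqss" quantile smoother relies on quantreg, so both need to be installed as well.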

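library(ggplot2)
library(ggExtra)
library(extrafont)

loadfonts(device = "win")

# The ggplot() call and the hexbin and diagonal layers are reconstructed from
# the description in the text; the number of bins is an assumption.
p <- ggplot(df, aes(x = pre, y = post)) +
  geom_hex(bins = 30) +
  geom_abline(intercept = 0, slope = 1) +
  geom_quantile(aes(color = year), method = "rqss", lambda = 2, quantiles = 0.5, size = 1) +
  scale_fill_gradient2(name = "Count (000s)", low = "light grey", midpoint = 15000,
                       mid = "blue", high = "red") +
  labs(x = "Pre-op EQ-5D index score", y = "Post-op EQ-5D index score") +
  theme(legend.position = "bottom", text = element_text(family = "Gill Sans MT"))

ggMarginal(p, type = "histogram")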

Sam Watson’s journal round-up for 6th March 2017

Every Monday our authors provide a round-up of some of the most recently published peer reviewed articles from the field. We don’t cover everything, or even what’s most important – just a few papers that have interested the author. Visit our Resources page for links to more journals or follow the HealthEconBot. If you’d like to write one of our weekly journal round-ups, get in touch.

It’s good to be first: order bias in reading and citing NBER working papers. The Review of Economics and Statistics [RePEc] Published 23rd February 2017

Each week one of the authors at this blog chooses three or four recently published studies to summarise and briefly discuss. Making this choice from the many thousands of articles published every week can be difficult. I browse those journals that publish in my area and search recently published economics papers on PubMed and Econlit for titles that pique my interest. But this strategy is not without its own flaws, as this study aptly demonstrates. When making a choice among many alternatives, people aren’t typically presented with a set of choices, but rather with a list. This arises in healthcare as well. In an effort to promote competition, at least in the UK, patients are presented with a list of possible providers and some basic information about those providers. We recently covered a paper that explored this expansion of choice ‘sets’ and investigated its effects on quality. We have previously criticised the use of such lists. People often skim these lists, relying on simple heuristics to make choices. This article shows that for the weekly email of new papers published by the National Bureau of Economic Research (NBER), being listed first leads to an increase of approximately 30% in downloads and citations, despite the essentially random ordering of the list. This is certainly not the first study to illustrate the biases in human decision making, but it shows that this journal round-up may not be a fair reflection of the literature, and that providing more information about healthcare providers may not have the impact on quality that might be hypothesised.

Economic conditions, illicit drug use, and substance use disorders in the United States. Journal of Health Economics [PubMed] Published March 2017

We have featured a large number of papers about the relationship between macroeconomic conditions and health and health-related behaviours on this blog. It is certainly one of the health economic issues du jour and one we have discussed in detail. Generally speaking, when looking at an aggregate level, such as countries or states, all-cause mortality appears to be pro-cyclical: it declines in economic downturns. By contrast, an examination at the individual or household level suggests that unemployment and reduced income are generally bad for health. It is certainly possible to reconcile these two effects, as any discussion of Simpson’s paradox will reveal. This study takes the aggregate approach, looking at US state-level unemployment rates and their relationship with drug use. It’s relevant to the discussion around economic conditions and health; the US has seen soaring rates of opiate-related deaths recently, although whether this is linked to the prevailing economic conditions remains to be seen. Unfortunately, this paper predicates a lot of its discussion about whether there is an effect on whether there was statistical significance, a gripe we’ve contended with previously. And there are no corrections for multiple comparisons, despite the well over 100 hypothesis tests that are conducted. That aside, the authors conclude that the evidence suggests that use of ecstasy and heroin is procyclical with respect to unemployment (i.e. it increases with greater unemployment) and that LSD, crack cocaine, and cocaine use is counter-cyclical. The results appear robust to the model specifications they compare, but I find it hard to reconcile some of the findings with the prior information about how people actually consume drugs. Many drugs are substitutes and/or complements for one another. For example, many heroin users began using opiates through abuse of prescription drugs such as oxycodone but made the switch as heroin is generally much cheaper. Alcohol and marijuana have been shown to be substitutes for one another. All of this suggests a lack of independence between the different outcomes considered. People may also lose their job because of drug use. Taken all together, I remain a little sceptical of the conclusions from the study, but it is nevertheless an interesting and timely piece of research.
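To make the multiple comparisons gripe concrete, here is a small sketch on simulated p-values. It does not use anything from the paper: the 100 tests are independent draws under a true null, and the seed and the 5% threshold are arbitrary assumptions. It shows how many "significant" results turn up by chance alone, and what two standard corrections available in base R (Holm and Benjamini-Hochberg) do to them.

```r
set.seed(42)
m     <- 100        # number of hypothesis tests (illustrative)
alpha <- 0.05

# Under a true null hypothesis, p-values are uniform on [0, 1].
p_raw <- runif(m)

# Unadjusted: count of nominally "significant" results, all false positives here.
n_sig_raw <- sum(p_raw < alpha)

# Holm controls the family-wise error rate; BH controls the false discovery rate.
p_holm <- p.adjust(p_raw, method = "holm")
p_bh   <- p.adjust(p_raw, method = "BH")
n_sig_holm <- sum(p_holm < alpha)
n_sig_bh   <- sum(p_bh < alpha)

# Chance of at least one false positive across m independent tests at level alpha.
fwer <- 1 - (1 - alpha)^m
```

With 100 independent tests at the 5% level, fwer comes out above 0.99, so at least one spurious finding is all but guaranteed without a correction.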

Child-to-adult neurodevelopmental and mental health trajectories after early life deprivation: the young adult follow-up of the longitudinal English and Romanian Adoptees study. The Lancet [PubMed] Published 22nd February 2017

Does early life deprivation lead to later life mental health issues? A question that is difficult to answer with observational data. Children from deprived backgrounds may be predisposed to mental health issues, perhaps through familial inheritance. To attempt to discern whether deprivation in early life is a cause of mental health issues, this paper uses data derived from a cohort of Romanian children who spent time in one of the terribly deprived institutions of Ceaușescu’s Romania and who were later adopted by British families. These institutions were characterised by poor hygiene, inadequate food, and lack of social or educational stimulation. A cohort of British adoptees was used for comparison. For children who spent more than six months in one of the deprived institutions, there was a large increase in cognitive and social problems in later life compared with either British adoptees or those who spent less than six months in an institution. The evidence is convincing, with differences being displayed across multiple dimensions of mental health, and a clear causal mechanism by which deprivation acts. However, for this and many other studies that I write about on this blog, a disclaimer might be needed when there is significant (pun intended) abuse and misuse of p-values. Ziliak and McCloskey’s damning diatribe on p-values, The Cult of Statistical Significance, presents examples of lists of p-values being given completely out of context, with no reference to the model or hypothesis test they are derived from, and with the implication that they represent whether an effect exists or not. This study does just that. I’ll leave you with this extract from the abstract:

Cognitive impairment in the group who spent more than 6 months in an institution remitted from markedly higher rates at ages 6 years (p=0·0001) and 11 years (p=0·0016) compared with UK controls, to normal rates at young adulthood (p=0·76). By contrast, self-rated emotional symptoms showed a late onset pattern with minimal differences versus UK controls at ages 11 years (p=0·0449) and 15 years (p=0·17), and then marked increases by young adulthood (p=0·0005), with similar effects seen for parent ratings. The high deprivation group also had a higher proportion of people with low educational achievement (p=0·0195), unemployment (p=0·0124), and mental health service use (p=0·0120, p=0·0032, and p=0·0003 for use when aged <11 years, 11–14 years, and 15–23 years, respectively) than the UK control group.


Brent Gibbons’s journal round-up for 30th January 2017

For this week’s round-up, I selected three papers from December’s issue of Health Services Research. I didn’t intend to limit my selections to one issue of one journal, but as I narrowed down my selections from several journals, these three papers stood out.

Treatment effect estimation using nonlinear two-stage instrumental variable estimators: another cautionary note. Health Services Research [PubMed] Published December 2016

This paper by Chapman and Brooks evaluates the properties of a non-linear instrumental variables (IV) estimator called two-stage residual inclusion, or 2SRI. 2SRI has more recently been suggested as a consistent estimator of treatment effects under conditions of selection bias and where the dependent variable of the 2nd-stage equation is either binary or otherwise non-linear in its distribution. Terza, Bradford, and Dismuke (2007) and Terza, Basu, and Rathouz (2008) furthermore claimed that 2SRI can produce unbiased estimates not just of local average treatment effects (LATE) but of average treatment effects (ATE). However, Chapman and Brooks question why 2SRI, which is analogous to two-stage least squares (2SLS) when both the first and second stage equations are linear, should not require similar assumptions to 2SLS when generalizing beyond LATE to ATE. Backing up a step, when estimating treatment effects using observational data, one worry when trying to establish a causal effect is bias due to treatment choice. Where patient characteristics related to treatment choice are unobservable and one or more instruments are available, linear IV estimation (i.e. 2SLS) produces unbiased and consistent estimates of treatment effects for “marginal patients” or compliers. These are the patients whose treatment choices were influenced by the instrument, and the average treatment effect among them is termed the LATE. But if there is heterogeneity in treatment effects, a case needs to be made that treatment effect heterogeneity is not related to treatment choice in order to generalize to ATE. Moving to non-linear IV estimation, Chapman and Brooks are skeptical that this case for generalizing LATE to ATE no longer needs to be made with 2SRI. 2SRI, for those not familiar, uses the residual from stage 1 of a two-stage estimator as a variable in the 2nd-stage equation, which uses a non-linear estimator for a binary outcome (e.g. probit) or another non-linear estimator (e.g. Poisson).
The authors produce a simulation that tests the 2SRI properties over varying conditions of uniqueness of the marginal patient population and the strength of the instrument. The uniqueness of the marginal population is defined as the extent of the difference in treatment effects for the marginal population as compared to the general population. For each scenario tested, the bias between the estimated LATE and the true LATE and ATE is calculated. The findings support the authors’ suspicions that 2SRI is subject to biased results when uniqueness is high. In fact, the 2SRI results were only practically unbiased when uniqueness was low, but were biased for both ATE and LATE when uniqueness was high. Having very strong instruments did help reduce bias. In contrast, 2SLS was always practically unbiased for LATE for different scenarios and the authors use these results to caution researchers on using “new” estimation methods without thoroughly understanding their properties. In this case, old 2SLS still outperformed 2SRI even when dependent variables were non-linear in nature.
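For readers who have not seen 2SRI written down, the mechanics described above can be sketched in a few lines of base R. This is an illustration on simulated data, not the simulation design from Chapman and Brooks: the data-generating process, the coefficient values and the choice of probit for both stages are my own assumptions, and the second-stage standard errors reported by glm() ignore the fact that the residual is itself estimated.

```r
set.seed(123)
n <- 2000
z <- rnorm(n)   # instrument: shifts treatment, no direct effect on the outcome
u <- rnorm(n)   # unobserved confounder: affects both treatment and outcome

# Binary treatment chosen partly on the unobservable u (selection bias).
d <- as.integer(0.8 * z + u + rnorm(n) > 0)
# Binary outcome depends on treatment and on u.
y <- as.integer(0.5 * d + u + rnorm(n) > 0)

# Stage 1: model treatment on the instrument and keep the raw residual.
stage1 <- glm(d ~ z, family = binomial(link = "probit"))
r_hat  <- d - fitted(stage1)

# Stage 2: non-linear outcome model with treatment AND the stage 1 residual included.
stage2 <- glm(y ~ d + r_hat, family = binomial(link = "probit"))

# Naive probit that ignores selection, for comparison.
naive <- glm(y ~ d, family = binomial(link = "probit"))

coef_2sri  <- coef(stage2)
coef_naive <- coef(naive)
```

The point of the paper is precisely that the coefficient on d in stage2 should not be read as an ATE without further assumptions, so this sketch shows the procedure rather than endorsing it.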

Testing the replicability of a successful care management program: results from a randomized trial and likely explanations for why impacts did not replicate. Health Services Research [PubMed] Published December 2016

As is widely known, how to rein in U.S. healthcare costs has been a source of much hand-wringing. One promising strategy has been to promote better management of care in particular for persons with chronic illnesses. This includes coordinating care between multiple providers, encouraging patient adherence to care recommendations, and promoting preventative care. The hope was that by managing care for patients with more complex needs, higher cost services such as emergency visits and hospitalizations could be avoided. CMS, the Centers for Medicare and Medicaid Services, funded a demonstration of a number of care management programs to study what models might be successful in improving quality and reducing costs. One program implemented by Health Quality Partners (HQP) for Medicare Fee-For-Service patients was successful in reducing hospitalizations (by 34 percent) and expenditures (by 22 percent) for a select group of patients who were identified as high-risk. The demonstration occurred from 2002 – 2010 and this paper reports results for a second phase of the demonstration where HQP was given additional funding to continue treating only high-risk patients in the years 2010 – 2014. High-risk patients were identified as having a diagnosis of congestive heart failure (CHF), chronic obstructive pulmonary disease (COPD), coronary artery disease (CAD), or diabetes and had a hospitalization in the year prior to enrollment. In essence, phase II of the demonstration for HQP served as a replication of the original demonstration. The HQP care management program was delivered by nurse coordinators who regularly talked with patients and provided coordinated care between primary care physicians and specialists, as well as other services such as medication guidance. All positive results from phase I vanished in phase II and the authors test several hypotheses for why results did not replicate. 
They find that treatment group patients had similar hospitalization rates between phase I and II, but that control group patients had substantially lower phase II hospitalization rates. Outcome differences between phase I and phase II were risk-adjusted, as phase II had an older population with higher severity of illness. The authors also used propensity score re-weighting to further control for differences in phase I and phase II populations. The Affordable Care Act did promote similar care management services through patient-centered medical homes and accountable care organizations, which likely contributed to the usual care of control group patients improving. The authors also note that the effectiveness of care management may be sensitive to the complexity of the target population’s needs. For example, the phase II population was more homebound and was therefore unable to participate in group classes. The big lesson in this paper, though, is that demonstration results may not replicate for different populations or even different time periods.

A machine learning framework for plan payment risk adjustment. Health Services Research [PubMed] Published December 2016

Since my company has been subsumed under IBM Watson Health, I have been trying to wrap my head around this big data revolution and the potential of technological advances such as artificial intelligence or machine learning. While machine learning has infiltrated other disciplines, it is really just starting to influence health economics, so watch out! This paper by Sherri Rose is a nice introduction to a range of machine learning techniques that she applies to the formulation of plan payment risk adjustments. In insurance systems where patients can choose from a range of insurance plans, there is the problem of adverse selection, where some plans may attract an abundance of high risk patients. To control for this, plans (e.g. in the Affordable Care Act marketplaces) with high percentages of high risk consumers get compensated based on a formula that predicts spending based on population characteristics, including diagnoses. Rose says that these formulas are still based on a 1970s framework of linear regression and may benefit from machine learning algorithms. Given that plan payment risk adjustments are essentially predictions, this does seem like a good application. In addition to testing goodness of fit of machine learning algorithms, Rose is interested in whether such techniques can reduce the number of variable inputs. Without going into any detail, insurers have found ways to “game” the system and fewer variable inputs would restrict this activity. Rose introduces a number of concepts in the paper (at least they were new to me) such as ensemble machine learning, discrete learning frameworks and super learning frameworks. She uses a large private insurance claims dataset and breaks the dataset into what she calls 10 “folds” which allows her to run 5 prediction models, each with its own cross-validation dataset. Aside from one parametric regression model, she uses several penalized regression models, neural net, single-tree, and random forest models.
She describes machine learning as aiming to smooth over data in a similar manner to parametric regression but with fewer assumptions and allowing for more flexibility. To reduce the number of variables in models, she applies techniques that limit variables to, for example, just the 10 most influential. She concludes that applying machine learning to plan payment risk adjustment models can increase efficiencies and her results suggest that it is possible to get similar results even with a limited number of variables. It is curious that the parametric model performed as well as or better than many of the different machine learning algorithms. I’ll take that to mean we can continue using our trusted regression methods for at least a few more years.
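Since “folds” were new to me, here is a minimal sketch of 10-fold cross-validation in base R. It is not Rose’s analysis: the data are simulated, the two candidate models are plain linear regressions with made-up predictors, and cv_mse is a hypothetical helper name. Each fold is held out in turn, the model is fitted on the other nine, and the prediction error on the held-out fold is averaged. Picking the candidate with the lowest cross-validated error is, as I understand it, the idea behind a discrete learner.

```r
set.seed(2017)
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
spend <- 1 + 2 * x1 - x2 + rnorm(n)   # simulated "spending" outcome
dat   <- data.frame(spend, x1, x2)

# Randomly assign each observation to one of k folds of equal size.
k    <- 10
fold <- sample(rep(seq_len(k), length.out = n))

# Hypothetical helper: cross-validated mean squared error for a linear model.
cv_mse <- function(formula, data, fold) {
  outcome <- all.vars(formula)[1]
  errs <- sapply(sort(unique(fold)), function(f) {
    fit  <- lm(formula, data = data[fold != f, ])          # train on the other folds
    pred <- predict(fit, newdata = data[fold == f, ])      # predict the held-out fold
    mean((data[fold == f, outcome] - pred)^2)
  })
  mean(errs)
}

# Compare a full model with one using fewer variable inputs.
mse_full  <- cv_mse(spend ~ x1 + x2, dat, fold)
mse_small <- cv_mse(spend ~ x1, dat, fold)
best <- if (mse_full <= mse_small) "full" else "small"
```

The same loop works for any model with a predict() method, which is what makes it a convenient yardstick for comparing regression with the fancier algorithms.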