Method of the month: Synthetic control

Once a month we discuss a particular research method that may be of interest to people working in health economics. We’ll consider widely used key methodologies, as well as more novel approaches. Our reviews are not designed to be comprehensive but provide an introduction to the method, its underlying principles, some applied examples, and where to find out more. If you’d like to write a post for this series, get in touch. This month’s method is synthetic control.


Health researchers are often interested in estimating the effect of a policy or change at the aggregate level. This might include a change in admissions policy at a particular hospital, or a new public health policy applied to a state or city. A common approach to inference in these settings is the difference-in-differences (DiD) method: pre- and post-intervention outcomes in a treated unit are compared with outcomes in the same periods for a control unit. The aim is to estimate a counterfactual outcome for the treated unit in the post-intervention period. To do this, DiD assumes that the trend over time in the outcome is the same for the treated and control units.

It is often the case in practice that we have multiple possible control units and multiple time periods of data. To predict the post-intervention counterfactual outcomes, we can note that there are three sources of information: i) the outcomes in the treated unit prior to the intervention, ii) the behaviour of other time series that are predictive of the outcome in the treated unit, including outcomes in similar but untreated units and exogenous predictors, and iii) prior knowledge of the effect of the intervention. The last of these only really comes into play in Bayesian versions of the method. With longitudinal data we could just throw all of this into a regression model and estimate the parameters, but this generally doesn't allow unobserved confounders to vary over time. The synthetic control method does.


Abadie, Diamond, and Hainmueller motivate the synthetic control method using the following model:

y_{it} = \delta_t + \theta_t Z_i + \lambda_t \mu_i + \epsilon_{it}

where y_{it} is the outcome for unit i at time t, \delta_t are common time effects, Z_i are observed covariates with time-varying coefficients \theta_t, \lambda_t are unobserved common factors with \mu_i the unobserved factor loadings, and \epsilon_{it} is an error term. Abadie et al show in this paper that one can derive a set of weights for the outcomes of the control units such that the weighted combination can be used to estimate the post-intervention counterfactual outcomes of the treated unit. The weights are chosen to minimise the distance between the pre-intervention outcomes and covariates of the treated unit and those of the weighted combination of control units. Kreif et al (2016) extended this idea to multiple treated units.
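As a minimal illustration of the weighting idea, here is a sketch in Python (numpy/scipy), not the authors' Synth implementation: weights are constrained to be non-negative and sum to one, and chosen to minimise the squared distance between the treated unit's pre-intervention values and the weighted combination of controls. The function name and data values are made up for the example.

```python
import numpy as np
from scipy.optimize import minimize

def synth_weights(X1, X0):
    """Weights w >= 0 with sum(w) = 1 minimising ||X1 - X0 @ w||^2.

    X1: (k,) pre-intervention outcomes/covariates for the treated unit.
    X0: (k, J) the same quantities for the J control units.
    """
    J = X0.shape[1]
    objective = lambda w: np.sum((X1 - X0 @ w) ** 2)
    res = minimize(
        objective,
        np.full(J, 1.0 / J),                       # start from equal weights
        method="SLSQP",
        bounds=[(0.0, 1.0)] * J,                   # each weight in [0, 1]
        constraints=[{"type": "eq", "fun": lambda w: np.sum(w) - 1.0}],
    )
    return res.x

# Toy data: the treated unit is an equal mix of the first two controls.
X0 = np.array([[1.0, 3.0, 5.0],
               [2.0, 4.0, 2.0],
               [3.0, 5.0, 7.0]])
X1 = 0.5 * X0[:, 0] + 0.5 * X0[:, 1]
w = synth_weights(X1, X0)   # close to [0.5, 0.5, 0.0]
```

In practice the distance is usually computed over both pre-intervention outcomes and covariates, with a weighting matrix governing their relative importance; the sketch above collapses all of that into a single matrix for clarity.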

Inference is difficult in this framework, so 'placebo' methods are proposed to produce confidence intervals. The essence of these is to re-estimate the model using a point in time when no intervention occurred as the intervention date, in order to determine how frequently differences of a given magnitude are observed by chance.
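The placebo logic can be sketched on simulated data. In this toy example a simple pre/post difference in means stands in for the full synthetic control estimator, and the 'placebo p-value' is the share of fake intervention dates that produce a gap at least as large as the observed one. All values are simulated for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated outcome series with a jump of +2 at the true intervention (t = 70).
T, t0 = 100, 70
y = rng.normal(0.0, 1.0, T)
y[t0:] += 2.0

def gap(series, t):
    """Post-minus-pre difference in means, treating t as the intervention date."""
    return series[t:].mean() - series[:t].mean()

observed = gap(y, t0)

# In-time placebos: re-estimate the gap at fake intervention dates that lie
# strictly within the pre-intervention period, then ask how often a placebo
# gap is at least as large as the observed one.
placebo_gaps = np.array([gap(y[:t0], t) for t in range(10, t0 - 10)])
p_placebo = np.mean(np.abs(placebo_gaps) >= abs(observed))
```

The same idea applies 'in space': re-assign the intervention to each control unit in turn and compare the treated unit's estimated effect with the distribution of placebo effects.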

Brodersen et al take a different approach to motivating these models. They begin with a structural time-series model, which is a form of state-space model:

y_t = Z'_t \alpha_t + \epsilon_t

\alpha_{t+1} = T_t \alpha_t + R_t \eta_t

where in this case, y_t is the outcome at time t, \alpha_t is the state vector and Z_t is an output vector with \epsilon_t as an error term. The second equation is the state equation that governs the evolution of the state vector over time where T_t is a transition matrix, R_t is a diffusion matrix, and \eta_t is the system error.
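To make the state-space setup concrete, here is a sketch (assuming numpy is available) that simulates the simplest structural time-series model, the local level model, where Z_t, T_t and R_t all reduce to 1, and runs the standard Kalman filter over it. This illustrates the general equations above; it is not the Brodersen et al implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate the local level model, the simplest structural time-series model:
#   y_t         = alpha_t + eps_t,     eps_t ~ N(0, sigma_eps^2)
#   alpha_{t+1} = alpha_t + eta_t,     eta_t ~ N(0, sigma_eta^2)
T, sigma_eps, sigma_eta = 200, 1.0, 0.1
alpha = np.cumsum(rng.normal(0.0, sigma_eta, T))   # latent state
y = alpha + rng.normal(0.0, sigma_eps, T)          # observed series

# Kalman filter for this model (Z_t = T_t = R_t = 1).
a, P = 0.0, 1e6                 # vague initial state mean and variance
filtered = np.empty(T)
for t in range(T):
    # Update: fold in the observation y_t.
    F = P + sigma_eps**2        # variance of the one-step prediction error
    K = P / F                   # Kalman gain
    a = a + K * (y[t] - a)
    P = (1.0 - K) * P
    filtered[t] = a
    # Predict: the state evolves as alpha_{t+1} = alpha_t + eta_t.
    P = P + sigma_eta**2
```

The filtered state estimate tracks the latent level far more closely than the raw observations do, which is what makes this machinery useful for projecting a counterfactual series forward.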

From this setup, Brodersen et al expand the model to allow for control time series (e.g. Z_t = X'_t \beta), local linear time trends, seasonal components, and dynamic effects of covariates. In this sense the model is perhaps more flexible than that of Abadie et al. Not all of the potentially large number of covariates may be necessary, so they propose a 'spike and slab' prior, which combines a point mass at zero with a weakly informative distribution over non-zero values. This lets the data select the relevant coefficients, as it were.
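As an illustration of the spike-and-slab idea (not the bsts/CausalImpact implementation), here is a sketch of drawing regression coefficients from such a prior. With probability `pi` a coefficient sits in the 'slab' (a wide normal) and otherwise in the 'spike' (an exact point mass at zero), so most coefficients drop out. The function name and parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)

def spike_slab_draw(n_coef, pi=0.1, slab_sd=1.0):
    """One draw of n_coef coefficients from a spike-and-slab prior."""
    in_slab = rng.random(n_coef) < pi          # inclusion indicators
    beta = np.where(in_slab, rng.normal(0.0, slab_sd, n_coef), 0.0)
    return beta, in_slab

beta, in_slab = spike_slab_draw(1000, pi=0.1)  # ~10% of coefficients non-zero
```

In a full Bayesian analysis the inclusion indicators are updated by the data, so the posterior inclusion probability of each covariate tells you how strongly it earns its place in the model.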

Inference in this framework is simpler than above. The posterior predictive distribution can be ‘simply’ estimated for the counterfactual time series to give posterior probabilities of differences of various magnitudes.



  • Synth Implements the method of Abadie et al.
  • CausalImpact Implements the method of Brodersen et al.


Kreif et al (2016) estimate the effect of pay-for-performance schemes in hospitals in England and compare the synthetic control method to DiD. Pieters et al (2016) estimate the effects of democratic reform on under-five mortality. We previously covered this paper in a journal round-up and a subsequent post, for which we also used the Brodersen et al method described above. We recently featured a paper by Lépine et al (2017) in a discussion of user fees, in which the synthetic control method was used to estimate the impact of removing user fees on health care use in various districts of Zambia.


Review: Health Econometrics Using Stata (Partha Deb et al)

Health Econometrics Using Stata

Partha Deb, Edward C. Norton, Willard G. Manning

Paperback, 264 pages, ISBN: 978-1-59718-228-7, published 31 August 2017

Amazon / Google Books / Stata Press

This book is the perfect guide to the various econometric methods available for modelling cost and count data, aimed at the individual who, like myself, understands econometrics best after applying it to a dataset. Prerequisites include a decent knowledge of Stata and a desire to apply econometric methods to a cost or count outcome variable.

It’s important to say that this book does not cover all aspects of econometrics within health economics, but instead focuses on ‘modelling health care costs and counts’ (the title of the short course from which the book evolved). As expected from this range of texts, the vast majority of the book comes with detailed example Stata code for all of the methods described, with illustrations either using a publicly available sample of MEPS data or simulated data.

Like many papers in this field, the book revolves around the non-normal characteristics of health care resource use distributions: the mass point at zero, the right-hand skew, and inherent heteroskedasticity. As such, the book covers the broad suite of models developed to account for these features, ranging from two-part models and transformation of the data (with the problematic retransformation of estimated effects) to non-linear modelling methods such as generalised linear models (GLMs). Unlike many papers in this field, the authors emphasise the need to delve deep into the underlying data in order to identify the most appropriate methods, and provide guidance on how to do so (there is even a chapter on design effects), while encouraging rigorous testing of model specification. In addition, Health Econometrics Using Stata is not solely fixated on distributional issues: it considers the important issue of endogeneity, providing insight and code for estimating non-linear models that control for potential endogeneity (interested readers may wish to heed the published cautionary notes for some of these methods, e.g. Chapman and Brooks). Finally, the book describes more advanced methods for estimating heterogeneous effects, although code is not provided for all of these methods, which is a bit of a shame (but perhaps understandable given their complexity).

This could be a very dry text, but it is emphatically not: the personality of the authors comes through very strongly in the writing. Reading it brought back many pleasant memories of the course 'modelling health care costs and counts', which I sat in 2012. The book also features a dedication to Willard Manning, a fitting tribute to a man who was both a great academic and an outstanding mentor. One particular highlight, with which past course attendees will be familiar, is the section 'top 10 myths in health econometrics'. This straightforward and punchy presentation, backed up by rigorous methodological research, is a great way to get these key messages across in an accessible format. Other great features include the use of simulations to illustrate important features of the econometric models (with code provided to recreate them) and, a personal highlight (granted, a niche interest), the code to generate comparable AIC and BIC statistics across GLM families.

Of course, Health Econometrics Using Stata cannot be comprehensive and there are developments in this field that are not covered. Most notably, there is no discussion of how to model these data in a panel/longitudinal setting, which is crucially important for estimating parameters for decision models, for example. Potential issues around missing data and censoring are also not discussed. Also, this text does not cover advances in flexible parametric modelling, which enable modelling of data that are both highly skewed and leptokurtic (see Jones 2017 for an excellent summary of this literature along with a primer on data visualisation using Stata).

I heartily recommend Health Econometrics Using Stata to interested colleagues who want practical advice – on model selection and specification testing with cost and count outcome data – from some of the top specialists in our field, in their own words.


Widespread misuse of statistical significance in health economics

Despite widespread cautionary messages, p-values and claims of statistical significance continue to be misused. One of the most common errors is to mistake statistical significance for economic, clinical, or political significance. This error may manifest itself in authors interpreting only 'statistically significant' results as important, or even neglecting to examine the magnitude of estimated coefficients. For example, we've written previously about a claim that statistically insignificant results are 'meaningless'. Another common error is to 'transpose the conditional', that is, to interpret the p-value as the posterior probability of the null hypothesis. For example, in a recent exchange on Twitter, David Colquhoun, whose discussions of p-values we've also previously covered, made a statement to this effect.

However, the p-value does not give the probability that a null hypothesis is true, nor direct evidence that an effect 'exists'. P-values are correlated with the posterior probability of the null hypothesis in a way that depends on statistical power, the chosen significance level, and the prior probability of the null. Observing a significant p-value means only that the data were unlikely to be produced by a particular model, not that the alternative hypothesis is true. Indeed, the null hypothesis may be a poor explanation for the observed data, but that does not mean the alternative is a better one. This is the essence of Lindley's paradox.
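The dependence on power and prior can be made explicit with a little arithmetic. The sketch below computes P(null | significant result) under illustrative assumptions; the function name and input values are made up for the example, not taken from any particular study.

```python
# How the posterior probability of the null relates to a 'significant' result:
# P(H0 | p < alpha) depends on the test's power and on the prior probability
# of the null. The function and its inputs are illustrative assumptions.
def prob_null_given_significant(alpha, power, prior_null):
    p_significant = alpha * prior_null + power * (1.0 - prior_null)
    return alpha * prior_null / p_significant

# With a 50:50 prior and 80% power, a significant result still leaves roughly
# a 6% chance that the null is true; with 20% power and a sceptical 90% prior
# on the null, that chance rises to about 69%.
fpr_good = prob_null_given_significant(0.05, 0.80, 0.5)   # ~0.059
fpr_weak = prob_null_given_significant(0.05, 0.20, 0.9)   # ~0.692
```

The point is that the same p < 0.05 can correspond to wildly different posterior probabilities of the null, which is exactly why transposing the conditional is an error.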

So what can we say about p-values? The six principles of the ASA’s statement on p-values are:

  1. P-values can indicate how incompatible the data are with a specified statistical model.
  2. P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
  3. Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.
  4. Proper inference requires full reporting and transparency.
  5. A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.
  6. By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.
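Principle 5 in particular is easy to demonstrate. In the sketch below (assumed summary data, standard-library Python only), a trivially small effect with an enormous sample and a sizeable effect with a small sample both clear the conventional 5% threshold, so 'significant' by itself says nothing about the size of an effect.

```python
import math

def z_test_p(effect, sd, n):
    """Two-sided p-value for a z-test of a mean against zero."""
    z = effect / (sd / math.sqrt(n))
    # Two-sided p-value from the standard normal CDF.
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))

p_tiny_effect = z_test_p(effect=0.02, sd=1.0, n=1_000_000)  # z = 20
p_large_effect = z_test_p(effect=0.50, sd=1.0, n=16)        # z = 2
```

Both results would be reported as 'statistically significant', yet the first effect is 25 times smaller than the second; only the effect size and its context can tell us whether either matters economically or clinically.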


In 1996, Deirdre McCloskey and Stephen Ziliak surveyed economics papers published in the American Economic Review in the 1980s for p-value misuse. Overall, 70% did not distinguish statistical from economic significance and 96% misused a test statistic in some way. Things hadn't improved when they repeated the study ten years later. Unfortunately, these problems are not exclusive to the AER. A quick survey of a top health economics journal, Health Economics, finds similar misuse, as we discuss below. This journal is not singled out for any particular reason beyond being one of the key journals in the field covered by this blog, one that frequently features in our journal round-ups. Similarly, no comment is made on the quality of the studies or authors beyond their claims and use of statistical significance. Nevertheless, where there are p-values, there are problems. For such a pivotal statistic, one on which careers can be made or broken, we should at least get it right!

Nine studies were published in the May 2017 issue of Health Economics. The list below shows some examples of p-value errors in the text of the articles. The most common issue was using the p-value to interpret whether an effect exists or not, or using it as the (only) evidence to support or reject a particular hypothesis. As described above, the statistical significance of a coefficient does not imply the existence of an effect. Some of the statements flagged below as erroneous may be contentious, since in the broader context of the paper they may make sense. For example, claiming that a statistically significant estimate is evidence of an effect may be reasonable where the broader totality of the evidence suggests that the observed data would be incompatible with a particular model. However, this is generally not the way the p's are used.

Examples of p-value (mis-)statements

Even the CMI has no statistically significant effect on the facilitation ratio. Thus, the diversity and complexity of treated patients do not play a role for the subsidy level of hospitals.

the coefficient for the baserate is statistically significant for PFP hospitals in the FE model, indicating that a higher price level is associated with a lower level of subsidies.

Using the GLM we achieved nine significant effects, including, among others, Parkinson’s disease and osteoporosis. In all components we found more significant effects compared with the GLM approach. The number of significant effects decreases from component 2 (44 significant effects) to component 4 (29 significant effects). Although the GLM lead to significant results for intestinal diverticulosis, none of the component showed equivalent results. This might give a hint that taking the component based heterogeneity into account, intestinal diverticulosis does not significantly affect costs in multimorbidity patients. Besides this, certain coefficients are significant in only one component.

[It is unclear what 'significant' and 'not significant' refer to or how they are calculated, but they appear to refer to t > 1.96. It is not clear whether corrections were made for multiple comparisons.]

There is evidence of upcoding as the coefficient of spreadp_pos is statistically significant.

Neither [variable for upcoding] is statistically significant. The incentive for upcoding is, according to these results, independent of the statutory nature of hospitals.

The checkup significantly raises the willingness to pay any positive amount, although it does not significantly affect the amount reported by those willing to pay some positive amount.

[The significance is with reference to statistical significance].

Similarly, among the intervention group, there were lower probabilities of unhappiness or depression (−0.14, p = 0.045), being constantly under strain (0.098, p = 0.013), and anxiety or depression (−0.10, p = 0.016). There was no difference between the intervention group and control group 1 (eligible non-recipients) in terms of the change in the likelihood of hearing problems (p = 0.64), experiencing elevate blood pressure (p = 0.58), and the number of cigarettes smoked (p = 0.26).

The ∆CEs are also statistically significant in some educational categories. At T + 1, the only significant ∆CE is observed for cancer survivors with a university degree for whom the cancer effect on the probability of working is 2.5 percentage points higher than the overall effect. At T + 3, the only significant ∆CE is observed for those with no high school diploma; it is 2.2 percentage points lower than the overall cancer effect on the probability of working at T + 3.

And, just for balance, here are a couple from this year's winner of the Arrow prize at iHEA, which gets bonus points for the phrase 'marginally significant', which can be used either to confirm or refute a hypothesis depending on the inclination of the author:

Our estimated net effect of waiting times for high-income patients (i.e., adding the waiting time coefficient and the interaction of waiting times and high income) is positive, but only marginally significant (p-value 0.055).

We find that patients care about distance to the hospital and both of the distance coefficients are highly significant in the patient utility function.


As we’ve argued before, p-values should not be the primary result reported. Their interpretation is complex and so often leads to mistakes. Our goal is to understand economic systems and to determine the economic, clinical, or policy relevant effects of interventions or modifiable characteristics. The p-value does provide some useful information but not enough to support the claims made from it.