Poor statistical communication means poor statistics

Statistics is a broad and complex field. For a given research question any number of statistical approaches could be taken. In an article published last year, researchers asked 61 analysts to use the same dataset to address the question of whether referees were more likely to give dark skinned players a red card than light skinned players. They got 61 different responses. Each analysis had its advantages and disadvantages and I’m sure each analyst would have defended their work. However, as many statisticians and economists may well know, the merit of an approach is not the only factor that matters in its adoption.

There has, for decades, been criticism about the misunderstanding and misuse of null hypothesis significance testing (NHST). P-values have been a common topic on this blog. Despite this, NHST remains the predominant paradigm for most statistical work. If used appropriately this needn’t be a problem, but if it were being used appropriately it wouldn’t be used nearly as much: p-values can’t perform the inferential role many expect of them. It’s not difficult to understand why things are this way: most published work uses NHST, we teach students NHST in order to understand the published work, students become researchers who use NHST, and so on. Part of statistical education involves teaching the arbitrary conventions that have gone before such as that p-values are ‘significant’ if below 0.05 or a study is ‘adequately powered’ if power is above 80%. One of the most pernicious consequences of this is that these heuristics become a substitute for thinking. The presence of these key figures is expected and their absence often marked by a request from reviewers and other readers for their inclusion.

I have argued on this blog and elsewhere for a wider use of Bayesian methods (and less NHST) and I try to practice what I preach. For an ongoing randomised trial I am involved with, I adopted a Bayesian approach to design and analysis. Instead of the usual power calculation, I conducted a Bayesian assurance analysis (which Anthony O’Hagan has written some good articles on for those wanting more information). I’ll try to summarise the differences between ‘power’ and ‘assurance’ calculations by attempting to define them, which is actually quite hard!

Power calculation. If we were to repeat a trial infinitely many times, what sample size would we need so that in x% of trials the assumed data generating model produces data which would fall in the α% most extreme quantiles of the distribution of data that would be produced from the same data generating model but with one parameter set to exactly zero (or any equivalent hypothesis). Typically we set x%to be 80% (power) and α% to be 5% (statistical significance threshold).

Assurance calculation. For a given data generating model, what sample size do we need so that there is a x% probability that we will be 1-α% certain that the parameter is positive (or any equivalent choice).

The assurance calculation could be reframed in a decision framework as what sample size do we need so that there is a x% probability we will make the right decision about whether a parameter is positive (or any equivalent decision) given the costs of making the wrong decision.

Both of these are complex but I would argue it is the assurance calculation that gives us what we want to know most of the time when designing a trial. The assurance analysis also better represents uncertainty since we specify distributions over all the uncertain parameters rather than choose exact values. Despite this though, the funder of the trial mentioned above, who shall remain nameless, insisted on the results of a power calculation in order to be able to determine whether the trial was worth continuing with because that’s “what they’re used to.”

The main culprit for this issue is, I believe, communication. A simpler explanation with better presentation may have been easier to understand and accept. This is not to say that I do not believe the funder was substituting the heuristic ‘80% or more power = good’ for actually thinking about what we could learn from the trial. But until statisticians, economists, and other data analytic researchers start communicating better, how can we expect others to listen?

Image credit: Geralt

The trouble with estimating neighbourhood effects, part 2

When we think of the causal effect of living in one neighbourhood compared to another we think of how the social interactions and lifestyle of that area produce better outcomes. Does living in an area with more obese people cause me to become fatter? (Quite possibly). Or, if a family moves to an area where people earn more will they earn more? (Read on).

In a previous post, we discussed such effects in the context of slums, where the synergy of poor water and sanitation, low quality housing, small incomes, and high population density likely has a negative effect on residents’ health. However, we also discussed how difficult it is to estimate neighbourhood effects empirically for a number of reasons. On top of this, are the different ways neighbourhood effects can manifest. Social interactions may mean behaviours that lead to better health or incomes rub off on one another. But also there may be some underlying cause of the group’s, and hence each individual’s, outcomes. In the slum, low education may mean poor hygiene habits spread, or the shared environment may contain pathogens, for example. Both of these pathways may constitute a neighbourhood effect, but both imply very different explanations and potential policy remedies.

What should we make then of, not one, but two new articles by Raj Chetty and Nathaniel Henderen in the recent issue of Quarterly Journal of Economics? Both of which use observational data to estimate neighbourhood effects.

Paper 1: The Impacts of Neighborhoods on Intergenerational Mobility I: Childhood Exposure Effects.

The authors have an impressive data set. They use federal tax records from the US between 1996 and 2012 and identify all children born between 1980 and 1988 and their parents (or parent). For each of these family units they determine household income and then the income of the children when they are older. To summarise a rather long exegesis of the methods used, I’ll try to describe the principle finding in one sentence:

Among families moving between commuting zones in the US, the average income percentile of children at age 26 is 0.04 percentile points higher per year spent and per additional percentile point increase in the average income percentile of the children of permanent residents at age 26 in the destination where the family move to. (Phew!)

They interpret this as the outcomes of in-migrating children ‘converging’ to the outcomes of permanently resident children at a rate of 4% per year. That should provide an idea of how the outcomes and treatments were defined, and who constituted the sample. The paper makes the assumption that the effect is the same regardless of the age of the child. Or to perhaps make it a bit clearer, the claim can be interpreted as that human capital, H, does something like this (ignoring growth over childhood due to schooling etc.):


where ‘good’ and ‘bad’ mean ‘good neighbourhood’ and ‘bad neighbourhood’. This could be called the better neighbourhoods cause you to do better hypothesis.

The analyses also take account of parental income at the time of the move and looks at families who moved due to a natural disaster or other ‘exogenous’ shock. The different analyses generally support the original estimate putting the result in the region of 0.03 to 0.05 percentile points.

But are these neighbourhood effects?

A different way of interpreting these results is that there is an underlying effect driving incomes in each area. Areas with higher incomes for their children in the future are those that have a higher market price for labour in the future. So we could imagine that this is what is going on with human capital instead:


This is the those moving to areas where people will earn more in the future, also earn more in the future because of differences in the labour market hypothesis. The Bureau of Labour Statistics, for example, cites the wage rate for a registered nurse as $22.61 in Iowa and $36.13 in California. But we can’t say from the data whether the children are sorting into different occupations or are getting paid different amounts for the same occupations.

The reflection problem

Manksi (1993) called the issue the ‘reflection problem’, which he described as arising when

a researcher observes the distribution of a behaviour in a population and wishes to infer whether the average behaviour in some group influences the behaviour of the individuals that compose the group.

What we have here is a linear-in-means model estimating the effect of average incomes on individual incomes. But what we cannot distinguish between is the competing explanations of, what Manski called, endogenous effects that result from the interaction  with families with higher incomes, and correlated effects that lead to similar outcomes due to exposure to the same underlying latent forces, i.e. the market. We could also add contextual effects that manifest due to shared group characteristics (e.g. levels of schooling or experience). When we think of a ‘neighbourhood effect’ I tend to think of them as of the endogenous variety, i.e. the direct effects of living in a certain neighbourhood. For example, under different labour market conditions, both my income and the average income of the permanent residents of the neighbourhood I move to might be lower, but not because of the neighbourhood.

The third hypothesis

There’s also the third hypothesis, families that are better off move to better areas (i.e. effects are accounted for by unobserved family differences):


The paper presents lots of modifications to the baseline model, but none of them can provide an exogenous choice of destination. They look at an exogenous cause of moving – natural disasters – and also instrument with the expected difference in income percentiles for parents from the same zip code, but I can’t see how this instrument is valid. Selection bias is acknowledged in the paper but without some exogenous variation in where a family moves to it’ll be difficult to really claim to have identified a causal effect. The choice to move is in the vast majority of family’s cases based on preferences over welfare and well-being, especially income. Indeed, why would a family move to a worse off area unless their circumstances demanded it of them? So in reality, I would imagine the truth would lie somewhere in between these three explanations.

Robust analysis?

As a slight detour, we might want to consider if these are causal effects, even if the underlying assumptions hold. The paper presents a range of analyses to show that the results are robust. But these analyses represent just a handful of those possible. Given that the key finding is relatively small in magnitude, one wonders what would have happened under different scenarios and choices – the so-called garden of forking paths problem. To illustrate, consider some of the choices that were made about the data and models, and all the possible alternative choices. The sample included only those with a mean positive income between 1996 to 2004 and those living in commuter zones with populations of over 250,000 in the 2000 census. Those whose income was missing were assigned a value of zero. Average income over 1996 to 2000 is a proxy for lifetime income. If the marital status of the parents changed then the child was assigned to the mother’s location. Non-filers were coded as single. Income is measured in percentile ranks and not dollar terms. The authors justify each of the choices, but an equally valid analysis would have resulted from different choices and possibly produced very different results.


Paper 2The Impacts of Neighborhoods on Intergenerational Mobility II: County-Level Estimates

The strategy of this paper is much like the first one, except that rather than trying to estimate the average effect of moving to higher or lower income areas, they try to estimate the effect of moving to each of 3,000 counties in the US. To do this they assume that the number of years exposure to the county is as good as random after taking account of i) origin fixed effects, ii) parental income percentile, and iii) a quadratic function of birth cohort year and parental income percentile to try and control for some differences in labour market conditions. An even stronger assumption than before! The hierarchical model is estimated using some complex two-step method for ‘computational tractability’ (I’d have just used a Bayesian estimator). There’s some further strange calculations, like conversion from percentile ranks into dollar terms by regressing the dollar amounts on average income ranks and multiplying everything by the coefficient, rather than just estimating the model with dollars as the outcome (I suspect it’s to do with their complicated estimation strategy). Nevertheless, we are presented with some (noisy) county-level estimates of the effect of an additional year spent there in childhood. There is a weak correlation with the income ranks of permanent residents. Again, though, we have the issue of many competing explanations for the observed effects.

The differences in predicted causal effect by county don’t help distinguish between our hypotheses. Consider this figure:


Do children of poorer parents in the Southern states end up with lower human capital and lower-skilled jobs than in the Midwest? Or does the market mean that people get paid less for the same job in the South? Compare the map above to the maps below showing wage rates of two common lower-skilled professions, cashiers (right) or teaching assistants (left):

A similar pattern is seen. While this is obviously just a correlation, one suspects that such variation in wages is not being driven by large differences in human capital generated through personal interaction with higher earning individuals. This is also without taking into account any differences in purchasing power between geographic areas.

What can we conclude?

I’ve only discussed a fraction of the contents of these two enormous papers. The contents could fill many more blog posts to come. But it all hinges on whether we can interpret the results as the average causal effect of a person moving to a given place. Not nearly enough information is given to know whether families moving to areas with lower future incomes are comparable to those with higher future incomes. Also, we could easily imagine a world where the same people were all induced to move to different areas – this might produce completely different sets of neighbourhood effects since they themselves contribute to those effects. But I feel that the greatest issue is the reflection problem. Even random assignment won’t get around this. This is not to discount the value or interest these papers generate, but I can’t help but feel too much time is devoted to trying to convince the reader of a ‘causal effect’. A detailed exploration of the relationships in the data between parental incomes, average incomes, spatial variation, later life outcomes, and so forth, might have been more useful for generating understanding and future analyses. Perhaps sometimes in economics we spend too long obsessing over estimating unconvincing ‘causal effects’ and ‘quasi-experimental’ studies that really aren’t and forget the value of just a good exploration of data with some nice plots.


Image credits:

Method of the month: Semiparametric models with penalised splines

Once a month we discuss a particular research method that may be of interest to people working in health economics. We’ll consider widely used key methodologies, as well as more novel approaches. Our reviews are not designed to be comprehensive but provide an introduction to the method, its underlying principles, some applied examples, and where to find out more. If you’d like to write a post for this series, get in touch. This month’s method is semiparametric models with penalised splines.


A common assumption of regression models is that effects are linear and additive. However, nothing is ever really that simple. One might respond that all models are wrong, but some are useful, as George Box once said. And the linear, additive regression model has coefficients that can be interpreted as average treatment effects under the right assumptions. Sometimes though we are interested in conditional average treatment effects and how the impact of an intervention varies according to the value of some variable of interest. Often this relationship is not linear and we don’t know its functional form. Splines provide a way of estimating curves (or surfaces) of unknown functional form and are a widely used tool for semiparametric regression models. The term ‘spline’ was derived from the tool shipbuilders and drafters used to construct smooth edges: a bendable piece of material that when fixed at a number of points would relax into the desired shape.


Our interest lies in estimating the unknown function m:

y_i = m(x_i) + e_i

A ‘spline’ in the mathematical sense is a function constructed piece-wise from polynomial functions. The places where the functions meet are known as knots and the spline has order equal to one more than the degree of the underlying polynomial terms. Basis-splines or B-splines are the typical starting point for spline functions. These are curves that are defined recursively as a sum of ‘basis functions’, which depend only on the polynomial degree and the knots. A spline function can be represented as a linear combination of B-splines, the parameters dictating this combination can be estimated using standard regression model estimation techniques. If we have N B-splines then our regression function can be estimated as:

y_i = \sum_{j=1}^N ( \alpha_j B_j(x_i) ) + e_i

by minimising \sum_{i=1}^N \{ y_i - \sum_{j=1}^N ( \alpha_j B_j(x_i) ) \} ^2. Where the B_j are the B-splines and the \alpha_j are coefficients to be estimated.

Useful technical explainers of splines and B-splines can be found here [PDF] and here [PDF].

One issue with fitting splines to data is that we run the risk of ‘overfitting’. Outliers might distort the curve we fit, damaging the external validity of conclusions we might make. To deal with this, we can enforce a certain level of smoothness using so-called penalty functions. The smoothness (or conversely the ‘roughness’) of a curve is often defined by the integral of the square of the second derivative of the curve function. Penalised-splines, or P-splines, were therefore proposed which added on this smoothness term multiplied by a smoothing parameter \lambda. In this case, we look to minimising:

\sum_{i=1}^N \{ y_i - \sum_{j=1}^N ( \alpha_j B_j(x_i) ) \}^2 + \lambda\int m''(x_i)^2 dx

to estimate our parameters. Many other different variations on this penalty have been proposed. This article provides a good explanation of P-splines.

An attractive type of spline has become the ‘low rank thin plate spline‘. This type of spline is defined by its penalty, which has a physical analogy with the resistance that a thin sheet of metal puts up when it is bent. This type of spline removes the problem associated with thin plate splines of having too many parameters to estimate by taking a ‘low rank’ approximation, and it is generally insensitive to the choice of knots, which other penalised spline regression models are not.

Crainiceanu and colleagues show how the low rank thin plate smooth splines can be represented as a generalised linear mixed model. In particular, our model can be represented as:

m(x_i) = \beta_0 + \beta_1x_i  + \sum_{k=1}^K u_k |x_i - \kappa_k|^3

where \kappa_k, k=1,...,K, are the knots. The parameters, \theta = (\beta_0,\beta_1,u_k)', can be estimated by minimising

\sum_{i=1}^N \{ y_i - m(x_i) \} ^2 + \frac{1}{\lambda} \theta ^T D \theta .

This is shown to give the mixed model

y_i = \beta_0 + \beta_1 + Z'b + u_i

where each random coefficient in the vector b is distributed as N(0,\sigma^2_b) and Z and D are given in the paper cited above.

As a final note, we have discussed splines in one dimension, but they can be extended to more dimensions. A two-dimensional spline can be generated by taking the tensor product of the two one dimensional spline functions. I leave this as an exercise for the reader.



  • The package gamm4 provides the tools necessary for a frequentist analysis along the lines described in this post. It uses restricted maximum likelihood estimation with the package lme4 to estimate the parameters of the thin plate spline model.
  • A Bayesian version of this functionality is implemented in the package rstanarm, which uses gamm4 to produce the matrices for thin plate spline models and Stan for the estimation through the stan_gamm4 function.

If you wanted to implement these models for yourself from scratch, Crainiceanu and colleagues provide the R code to generate the matrices necessary to estimate the spline function:

sqrt.OMEGA_all<-t(svd.OMEGA_all$v %*%


I will temper this advice by cautioning that I have never estimated a spline-based semi-parametric model in Stata, so what follows may be hopelessly incorrect. The only implementation of penalised splines in Stata is the package and associated function psplineHowever, I cannot find any information about the penalty function used, so I would advise some caution when implementing. An alternative is to program the model yourself, through conversion of the above R code in Mata to generate the matrix Z and then the parameters could be estimated with xtmixed. 


Applications of these semi-parametric models in the world of health economics have tended to appear more in technical or statistical journals than health economics journals or economics more generally. For example, recent examples include Li et al who use penalised splines to estimate the relationship between disease duration and health care costs. Wunder and co look at how reported well-being varies over the course of the lifespan. And finally, we have Stollenwerk and colleagues who use splines to estimate flexible predictive models for cost-of-illness studies with ‘big data’.