Empirical research into health and health-related outcomes is characterised by a predominance of binary outcomes. One of the most popular outcomes (in a statistical sense) is mortality, for example. This type of outcome warrants its own type of model, one in which the outcome is observed with probability p: the outcome is a draw from a Bernoulli distribution with this probability, and p is allowed to vary across individuals. One distinction arises between the economics literature and the medical literature in that the linear probability model (LPM) is fairly popular in the former whereas it is never seen in the latter. The LPM is simply OLS estimation of the binary outcome on the regressors, and it is popular because of its easily interpretable marginal effects (equal to the estimated coefficients) and the ease with which other procedures, such as instrumental variables estimation, can be used. However, the LPM may predict probabilities outside the zero to one range and may even be inconsistent in many cases (see here). The probit model is also popular in the economics literature but is likewise rarely seen in medical studies. It is the logit that is ubiquitous in these analyses. But in the medical literature, studies don’t usually try to address endogeneity, whereas in economics we often do. Usually instrumental variables are employed to tackle this problem, but how do we use them in logit models?
In a linear model the two stage least squares (2SLS) estimator can be used. The endogenous variables are regressed on all the exogenous variables, including the instruments, and the predicted value of each endogenous variable is then used in place of its actual value in the outcome regression. But in a model where the outcome is a nonlinear function of the regressors, such as a logit, this method would be inconsistent. To see why, note that we are trying to estimate a model that assumes:
$$E(y \mid x, z, c) = m(x_1\beta + \gamma z + c) \tag{1}$$
where x is assumed to be exogenous, of which $x_1$ is a subset, c is unobserved, and z is allowed to be correlated with c so that it is potentially endogenous. We model:
$$z = x\pi + v, \qquad c = \rho v + e$$
The issue is that if ρ≠0 then z is endogenous. For our estimates to be consistent we essentially require the conditional mean in (1) to be correctly specified. We have two options: we can estimate $E(z \mid x)$ and substitute it for z in (1), or we can eliminate c. In the latter case, assuming e is independent of x and v so that it can be absorbed into m(.), we can rewrite (1) as:
$$E(y \mid x, z, v) = m(x_1\beta + \gamma z + \rho v) \tag{2}$$
We do not observe v, but we can consistently estimate it as $\hat{v} = z - x\hat{\pi}$, the residual from the first stage regression of z on x, and include these values in our regression. This method is known as two stage residual inclusion (2SRI).
Our other method, however, is inconsistent. We can use estimates of $E(z \mid x)$ in place of z, but this does not eliminate c since, even though $E(v \mid x) = 0$, the expectation does not ‘pass through’ the nonlinear function m(.).
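To make the contrast concrete, here is a minimal simulation sketch. Everything in it is an assumption for illustration rather than part of the argument above: the parameter values, the variable names (`x1`, `w` for the instrument, `v` for the first stage error), the hand-rolled Newton-Raphson logit in `fit_logit`, and the simplification that c is exactly ρv, so that the logit holds exactly once v is included. It compares a naive logit, substitution of the first stage prediction for z (sometimes called two stage predictor substitution), and 2SRI.

```python
import numpy as np

def fit_logit(X, y, n_iter=50):
    """Logit MLE by Newton-Raphson; X must already contain a constant column."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-np.clip(X @ beta, -30, 30)))
        grad = X.T @ (y - p)
        hess = (X * (p * (1.0 - p))[:, None]).T @ X
        step = np.linalg.solve(hess, grad)
        beta += step
        if np.max(np.abs(step)) < 1e-10:
            break
    return beta

rng = np.random.default_rng(0)
n = 50_000
gamma, rho = 1.0, 1.0                 # hypothetical true values

x1 = rng.normal(size=n)               # exogenous covariate
w = rng.normal(size=n)                # instrument (excluded from the outcome model)
v = rng.normal(size=n)                # first stage error, also drives the outcome
z = 0.5 * x1 + w + v                  # endogenous regressor
index = -0.5 + 0.5 * x1 + gamma * z + rho * v
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-index))).astype(float)

# First stage: OLS of z on all exogenous variables, including the instrument.
const = np.ones(n)
X_first = np.column_stack([const, x1, w])
pi_hat, *_ = np.linalg.lstsq(X_first, z, rcond=None)
z_hat = X_first @ pi_hat
v_hat = z - z_hat

naive = fit_logit(np.column_stack([const, x1, z]), y)            # ignores endogeneity
two_sps = fit_logit(np.column_stack([const, x1, z_hat]), y)      # prediction replaces z
two_sri = fit_logit(np.column_stack([const, x1, z, v_hat]), y)   # residual is included

print("true gamma:", gamma)
print("naive logit:", naive[2])
print("predictor substitution:", two_sps[2])
print("2SRI:", two_sri[2], "coefficient on residual:", two_sri[3])
```

Under these assumptions only the 2SRI coefficient on z should land close to the true value; the naive estimate absorbs part of the effect of v, and the substitution estimate is attenuated because the unobserved term is averaged over inside the nonlinear function.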
A further useful feature of 2SRI is that z is exogenous only when ρ=0. We can test this restriction empirically when we estimate (2), using the estimated coefficient on the first stage residual. This is equivalent to a Hausman test for the exogeneity of z.
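The test can be sketched as follows. This is an illustrative simulation under assumed values, not a recipe from any particular package: `fit_logit` and `residual_z_stat` are hypothetical helpers, the data-generating process is invented, and the z-statistic uses the usual inverse-Hessian logit standard error, which is appropriate for this test because under the null hypothesis ρ=0 the first stage estimation does not affect the asymptotic variance.

```python
import numpy as np

def fit_logit(X, y, n_iter=50):
    """Logit MLE by Newton-Raphson; returns coefficients and inverse Hessian."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-np.clip(X @ beta, -30, 30)))
        hess = (X * (p * (1.0 - p))[:, None]).T @ X
        step = np.linalg.solve(hess, X.T @ (y - p))
        beta += step
        if np.max(np.abs(step)) < 1e-10:
            break
    # hess was evaluated one (negligible) step before the final beta
    return beta, np.linalg.inv(hess)

def residual_z_stat(rho, seed, n=20_000):
    """Simulate data with a given rho and return the z-statistic on v_hat."""
    rng = np.random.default_rng(seed)
    x1, w, v = rng.normal(size=(3, n))
    z = 0.5 * x1 + w + v                       # w is the instrument
    index = -0.5 + 0.5 * x1 + 1.0 * z + rho * v
    y = (rng.random(n) < 1.0 / (1.0 + np.exp(-index))).astype(float)

    const = np.ones(n)
    X_first = np.column_stack([const, x1, w])
    pi_hat, *_ = np.linalg.lstsq(X_first, z, rcond=None)
    v_hat = z - X_first @ pi_hat

    beta, cov = fit_logit(np.column_stack([const, x1, z, v_hat]), y)
    return beta[3] / np.sqrt(cov[3, 3])

z_endog = residual_z_stat(rho=1.0, seed=1)     # z is endogenous
z_exog = residual_z_stat(rho=0.0, seed=2)      # z is exogenous
print("z-statistic when rho = 1:", z_endog)
print("z-statistic when rho = 0:", z_exog)
```

A large absolute z-statistic is evidence against exogeneity of z; when ρ=0 the statistic should behave approximately like a standard normal draw.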
The usual standard errors won’t be correct when estimating (2), because the included residual is itself estimated in a first stage. Calculating the correct standard errors is not too difficult. But, often in health econometric applications, we want to adjust for clustering within hospitals or regions. Accommodating this in our standard errors makes the calculation considerably more difficult, if not intractable. In this case bootstrapping is the preferred solution.
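One way to do this is a cluster (block) bootstrap that resamples whole hospitals and repeats both stages on every replicate. The sketch below is only an illustration under assumptions: the simulated data, the cluster structure (a hospital-level component in the first stage error), the number of replications, and the helper names `fit_logit` and `two_sri` are all hypothetical.

```python
import numpy as np

def fit_logit(X, y, n_iter=50):
    """Logit MLE by Newton-Raphson; X must already contain a constant column."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-np.clip(X @ beta, -30, 30)))
        hess = (X * (p * (1.0 - p))[:, None]).T @ X
        step = np.linalg.solve(hess, X.T @ (y - p))
        beta += step
        if np.max(np.abs(step)) < 1e-10:
            break
    return beta

def two_sri(x1, w, z, y):
    """Both stages of 2SRI; returns logit coefficients (const, x1, z, v_hat)."""
    const = np.ones_like(z)
    X_first = np.column_stack([const, x1, w])
    pi_hat, *_ = np.linalg.lstsq(X_first, z, rcond=None)
    v_hat = z - X_first @ pi_hat
    return fit_logit(np.column_stack([const, x1, z, v_hat]), y)

rng = np.random.default_rng(0)
G, m = 100, 40                          # hypothetical: 100 hospitals, 40 patients each
a = rng.normal(size=(G, 1))             # hospital-level shock shared by its patients
x1 = rng.normal(size=(G, m))
w = rng.normal(size=(G, m))             # instrument
v = 0.7 * a + 0.7 * rng.normal(size=(G, m))
z = 0.5 * x1 + w + v
index = -0.5 + 0.5 * x1 + 1.0 * z + 1.0 * v
y = (rng.random((G, m)) < 1.0 / (1.0 + np.exp(-index))).astype(float)

point = two_sri(x1.ravel(), w.ravel(), z.ravel(), y.ravel())

B = 200                                 # bootstrap replications
boot = np.empty((B, point.size))
for b in range(B):
    g = rng.integers(0, G, size=G)      # draw hospitals with replacement
    # Re-estimate BOTH stages so first stage uncertainty is carried through.
    boot[b] = two_sri(x1[g].ravel(), w[g].ravel(), z[g].ravel(), y[g].ravel())

se = boot.std(axis=0, ddof=1)
print("coefficient on z:", point[2], "cluster bootstrap SE:", se[2])
print("coefficient on residual:", point[3], "cluster bootstrap SE:", se[3])
```

Resampling at the hospital level preserves the within-hospital dependence, and re-running the first stage inside the loop is what lets the bootstrap account for the estimated residual.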
I have noted in previous posts that endogeneity could be a serious issue in health econometrics, particularly when these types of studies are used to inform healthcare policy. Clearly there are methods for dealing with this, though having a suitable methodology is only the first hurdle. The next one is convincing non-economists why you are using it.