# Ambulance and economics

I have recently been watching the BBC series Ambulance. It is a fly-on-the-wall documentary following the West Midlands Ambulance Service, interspersed with candid interviews with ambulance staff, much in the same vein as other health care documentaries like 24 Hours in A&E. As much as anything it provides a (stylised) look at the conditions on the ground for staff and illustrates how health care institutions are as much social institutions as essential services. In a recent episode, the cost of a hoax call was noted as some thousands of pounds. Indeed, the media and health services often talk about the cost of hoax calls in this way:

Warning for parents as one hoax call costs public £2,465 and diverts ambulance from real emergency call.

Frequent 999 callers cost NHS millions of pounds a year.

Nuisance caller cost the taxpayer £78,000 by making 408 calls to the ambulance service in two years.

But these are accounting costs, not the full economic cost. The first headline almost captures this by suggesting the opportunity cost was attendance at a real emergency call. However, given the way that ambulance resources are deployed and triaged across calls, it is very difficult to say what the opportunity cost is: what would be the marginal benefit of having an additional ambulance crew for the duration of a hoax call? What is the shadow price of an ambulance unit?

Few studies have looked at this question. The widely discussed study by Claxton et al. in the UK looked at shadow prices of health care across different types of care, but noted that:

Expenditure on, for example, community care, A&E, ambulance services, and outpatients can be difficult to attribute to a particular [program budget category].

One review identified a small number of studies examining the cost-benefit and cost-effectiveness of emergency response services. Estimates of the marginal cost per life saved ranged from approximately \$5,000 to \$50,000. However, this doesn’t really tell us the impact of an additional crew, nor were many of these studies comparable in terms of the types of services they looked at, and these were all US-based.

There does exist the appropriately titled paper Ambulance Economics. This paper approaches the question we’re interested in in the following way:

The centrepiece of our analysis is what we call the Ambulance Response Curve (ARC). This shows the relationship between the response time for an individual call (r) and the number of ambulances available and not in use (n) at the time the call was made. For example, let us suppose that 35 ambulances are on duty and 10 of them are being used. Then n has the value of 25 when the next call is taken. Ceteris paribus, as n increases, we expect that r will fall.

On this basis, one can look at how an additional ambulance affects response times, on average. One might then be able to extrapolate the health effects of that delay. This paper suggests that an additional ambulance would reduce response times by around nine seconds on average for the service they looked at – not actually very much. However, the data are 20 years old, and significant changes to demand and supply over that period are likely to have a large effect on the ARC. Nevertheless, changes in response time of the order of minutes are required in order to have a clinically significant impact on survival, which are unlikely to occur with one additional ambulance.
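To make the idea of reading a marginal effect off the ARC concrete, here is a minimal sketch. The functional form and the `base` and `scale` parameters are invented purely for illustration; they are not taken from the Ambulance Economics paper, and a real ARC would have to be estimated from call-level data.

```python
def response_time(n, base=420.0, scale=1800.0):
    """Hypothetical ARC: expected response time (seconds) with n idle ambulances.

    The shape (falling in n, with diminishing returns) and the default
    parameter values are assumptions made for illustration only.
    """
    if n < 0:
        raise ValueError("n must be non-negative")
    return base + scale / (n + 1)


def marginal_effect(n):
    """Change in expected response time from adding one ambulance at n idle units."""
    return response_time(n + 1) - response_time(n)


if __name__ == "__main__":
    for idle in (5, 15, 25):
        print(idle, round(marginal_effect(idle), 1))
```

Under a curve like this, the gain from one more crew shrinks quickly as the number of idle ambulances rises, which is the kind of pattern that would make the opportunity cost of a single diverted crew small when spare capacity is reasonable.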

Taken all together, the opportunity cost of a hoax call is not likely to be large. This is not to downplay the stupidity of such calls, but it is perhaps reassuring that lives are not likely to be in the balance, and it is a testament to the ability of the service to appropriately deploy its limited resources.


# Widespread misuse of statistical significance in health economics

Despite widespread cautionary messages, p-values and claims of statistical significance are continually misused. One of the most common errors is to mistake statistical significance for economic, clinical, or political significance. This error may manifest itself by authors interpreting only ‘statistically significant’ results as important, or even neglecting to examine the magnitude of estimated coefficients. For example, we’ve written previously about a claim that statistically insignificant results are ‘meaningless’. Another common error is to ‘transpose the conditional’, that is, to interpret the p-value as the posterior probability of a null hypothesis. For example, in an exchange on Twitter recently, David Colquhoun, whose discussions of p-values we’ve also previously covered, made the statement:

However, the p-value does not provide the probability of, or evidence for, a hypothesis (such as that an effect ‘exists’). P-values are correlated with the posterior probability of the null hypothesis in a way that depends on statistical power, choice of significance level, and prior probability of the null. But observing a significant p-value only means that the data were unlikely to be produced by a particular model, not that the alternative hypothesis is true. Indeed, the null hypothesis may be a poor explanation for the observed data, but that does not mean the alternative is a better explanation. This is the essence of Lindley’s paradox.
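The dependence on power, significance level, and prior can be sketched with Bayes' rule. The function below is an illustration under simplifying assumptions: a point null against a single alternative, and conditioning on the event "the test was significant at level alpha" rather than on the exact p-value observed. The function name and the example inputs are ours.

```python
def prob_null_given_significant(prior_null, alpha, power):
    """P(H0 | test significant at level alpha), by Bayes' rule.

    prior_null: prior probability that the null hypothesis is true
    alpha:      significance level, i.e. P(significant | H0)
    power:      P(significant | H1)
    """
    false_positive = alpha * prior_null
    true_positive = power * (1.0 - prior_null)
    return false_positive / (false_positive + true_positive)


if __name__ == "__main__":
    # A well-powered test of a hypothesis that is a coin flip a priori.
    print(round(prob_null_given_significant(0.5, 0.05, 0.8), 3))
    # An underpowered test of a hypothesis that is unlikely a priori.
    print(round(prob_null_given_significant(0.9, 0.05, 0.2), 3))
```

The same threshold of 0.05 leaves the null with very different posterior probabilities in the two cases, which is why a significant result cannot be read as "the effect exists" without further information.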

So what can we say about p-values? The six principles of the ASA’s statement on p-values are:

1. P-values can indicate how incompatible the data are with a specified statistical model.
2. P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
3. Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.
4. Proper inference requires full reporting and transparency.
5. A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.
6. By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.
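Principle 5 is easy to see numerically. The sketch below uses a two-sided z-test for a mean with known standard deviation, which is a simplification we have chosen for illustration; the effect sizes and sample sizes are made up.

```python
import math


def two_sided_p(effect, sd, n):
    """Two-sided p-value from a z-test of a mean against zero, with known sd."""
    z = effect / (sd / math.sqrt(n))
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


if __name__ == "__main__":
    # The same small effect: 'not significant' in a small sample,
    # overwhelmingly 'significant' in a large one.
    print(two_sided_p(effect=0.1, sd=1.0, n=100))
    print(two_sided_p(effect=0.1, sd=1.0, n=10_000))
    # An effect ten times larger can still miss the 0.05 threshold.
    print(two_sided_p(effect=1.0, sd=1.0, n=3))
```

The p-value moves with the sample size while the estimated effect stays put, so it cannot stand in for the magnitude or importance of a result.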

***

In 1996, Deirdre McCloskey and Stephen Ziliak surveyed economics papers published in the American Economic Review in the 1980s for p-value misuse. Overall, 70% did not distinguish statistical from economic significance and 96% misused a test statistic in some way. Things hadn’t improved when they repeated the study ten years later. Unfortunately, these problems are not exclusive to the AER. A quick survey of a top health economics journal, Health Economics, finds similar misuse, as we discuss below. This journal is not singled out for any particular reason beyond that it’s one of the key journals in the field covered by this blog, and frequently features in our journal round-ups. Similarly, no comment is made on the quality of the studies or authors beyond the claims and use of statistical significance. Nevertheless, where there are p-values, there are problems. For such a pivotal statistic, one that careers can be made or broken on, we should at least get it right!

Nine studies were published in the May 2017 issue of Health Economics. The list below shows some examples of p-value errors in the text of the articles. The most common issue was using the p-value to interpret whether an effect exists or not, or using it as the (only) evidence to support or reject a particular hypothesis. As described above, the statistical significance of a coefficient does not imply the existence of an effect. Some of the statements claimed below to be erroneous may be contentious as, in the broader context of the paper, they may make sense. For example, claiming that a statistically significant estimate is evidence of an effect may be correct where the broader totality of the evidence suggests that any observed data would be incompatible with a particular model. However, this is generally not the way the p-values are used.

## Examples of p-value (mis-)statements

Even the CMI has no statistically significant effect on the facilitation ratio. Thus, the diversity and complexity of treated patients do not play a role for the subsidy level of hospitals.

the coefficient for the baserate is statistically significant for PFP hospitals in the FE model, indicating that a higher price level is associated with a lower level of subsidies.

Using the GLM we achieved nine significant effects, including, among others, Parkinson’s disease and osteoporosis. In all components we found more significant effects compared with the GLM approach. The number of significant effects decreases from component 2 (44 significant effects) to component 4 (29 significant effects). Although the GLM lead to significant results for intestinal diverticulosis, none of the component showed equivalent results. This might give a hint that taking the component based heterogeneity into account, intestinal diverticulosis does not significantly affect costs in multimorbidity patients. Besides this, certain coefficients are significant in only one component.

[It is unclear what ‘significant’ and ‘not significant’ refer to or how they are calculated but appear to refer to t>1.96. Not clear if corrections for multiple comparisons.]

There is evidence of upcoding as the coefficient of spreadp_pos is statistically significant.

Neither [variable for upcoding] is statistically significant. The incentive for upcoding is, according to these results, independent of the statutory nature of hospitals.

The checkup significantly raises the willingness to pay any positive amount, although it does not significantly affect the amount reported by those willing to pay some positive amount.

[The significance is with reference to statistical significance].

Similarly, among the intervention group, there were lower probabilities of unhappiness or depression (−0.14, p = 0.045), being constantly under strain (0.098, p = 0.013), and anxiety or depression (−0.10, p = 0.016). There was no difference between the intervention group and control group 1 (eligible non-recipients) in terms of the change in the likelihood of hearing problems (p = 0.64), experiencing elevate blood pressure (p = 0.58), and the number of cigarettes smoked (p = 0.26).

The ∆CEs are also statistically significant in some educational categories. At T + 1, the only significant ∆CE is observed for cancer survivors with a university degree for whom the cancer effect on the probability of working is 2.5 percentage points higher than the overall effect. At T + 3, the only significant ∆CE is observed for those with no high school diploma; it is 2.2 percentage points lower than the overall cancer effect on the probability of working at T + 3.

And, just for balance, here are a couple from this year’s winner of the Arrow prize at iHEA, which gets bonus points for the phrase ‘marginally significant’, which can be used both to confirm and refute a hypothesis depending on the inclination of the author:

Our estimated net effect of waiting times for high-income patients (i.e., adding the waiting time coefficient and the interaction of waiting times and high income) is positive, but only marginally significant (p-value 0.055).

We find that patients care about distance to the hospital and both of the distance coefficients are highly significant in the patient utility function.

***

As we’ve argued before, p-values should not be the primary result reported. Their interpretation is complex and so often leads to mistakes. Our goal is to understand economic systems and to determine the economic, clinical, or policy relevant effects of interventions or modifiable characteristics. The p-value does provide some useful information but not enough to support the claims made from it.


# Hawking is right, Jeremy Hunt does egregiously cherry pick the evidence

I’m beginning to think Jeremy Hunt doesn’t actually care what the evidence says on the weekend effect. Last week, renowned physicist Stephen Hawking criticized Hunt for ‘cherry picking’ evidence with regard to the ‘weekend effect’: that patients admitted at the weekend are observed to be more likely than their counterparts admitted on a weekday to die. Hunt responded by doubling down on his claims:

Some people have questioned Hawking’s credentials to speak on the topic beyond being a user of the NHS. But it has taken a respected public figure to speak out to elicit a response from the Secretary of State for Health, and that should be welcomed. It remains the case though that a multitude of experts do continue to be ignored. Even the oft-quoted Freemantle paper is partially ignored where it notes of the ‘excess’ weekend deaths, “to assume that [these deaths] are avoidable would be rash and misleading.”

We produced a simple tool to demonstrate how weekend effect studies might estimate an increased risk of mortality associated with weekend admissions even in the case of no difference in care quality. However, the causal model underlying these arguments is not always obvious. So here it is:

[Figure: A simple model of the effect of the weekend on patient health outcomes. The dashed line represents unobserved effects.]

So what do we know about the weekend effect?

1. The weekend effect exists. A multitude of studies have observed that patients admitted at the weekend are more likely to die than those admitted on a weekday. This amounts to having shown that $E(Y|W,S) \neq E(Y|W',S)$. As our causal model demonstrates, being admitted is correlated with health and, importantly, with the day of the week. So this is not the same as saying that the risk of adverse clinical outcomes differs by day of the week once you take into account the propensity for admission: we can’t say $E(Y|W) \neq E(Y|W')$. Nor does this evidence imply care quality differs at the weekend, $E(Q|W) \neq E(Q|W')$. In fact, the evidence only implies differences in care quality if the propensity to be admitted is independent of (unobserved) health status, i.e. $Pr(S|U,X) = Pr(S|X)$ (or if health outcomes are uncorrelated with health status, which is definitely not the case!).
2. Admissions are different at the weekend. Fewer patients are admitted at the weekend and those that are admitted are on average more severely unwell. Evidence suggests that the better patient severity is controlled for, the smaller the estimated weekend effect. Weekend effect estimates also diminish in models that account for the selection mechanism.
3. There is some evidence that care quality may be worse at the weekend (at least in the United States), so $E(Q|W) \neq E(Q|W')$, although this has not been established in the UK (we’re currently investigating it!).
4. Staffing levels, particularly specialist to patient ratios, are different at the weekend, $E(X|W) \neq E(X|W')$.
5. There is little evidence to suggest how staffing levels and care quality are related. While the relationship seems evident prima facie, its extent is not well understood. For example, we might expect a diminishing return to increased staffing levels.
6. There is a reasonable amount of evidence on the impact of care quality (preventable errors and adverse events) on patient health outcomes.
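The selection argument in points 1 and 2 can be illustrated with a small simulation. This is a sketch under assumptions of our own, not the tool mentioned above: severity `u` is standard normal, the admission threshold is higher at the weekend, and the risk of death depends on severity alone, so care quality is identical on every day by construction. All parameter values are invented.

```python
import math
import random


def simulate(n=200_000, seed=1):
    """Mortality rates by day type, for everyone and for admitted patients only."""
    rng = random.Random(seed)
    # [deaths, count] for each (is_weekend, group) cell.
    tallies = {(w, g): [0, 0] for w in (False, True) for g in ("all", "admitted")}
    for _ in range(n):
        weekend = rng.random() < 2 / 7
        u = rng.gauss(0.0, 1.0)  # unobserved severity, higher means sicker
        # Assumption: a higher bar for admission at the weekend.
        admitted = u > (1.0 if weekend else 0.5)
        # Death depends on severity only, not on the day or on care quality.
        p_death = 1.0 / (1.0 + math.exp(-(-3.0 + u)))
        died = rng.random() < p_death
        groups = ("all", "admitted") if admitted else ("all",)
        for g in groups:
            tallies[(weekend, g)][0] += died
            tallies[(weekend, g)][1] += 1
    return {key: deaths / count for key, (deaths, count) in tallies.items()}


if __name__ == "__main__":
    rates = simulate()
    for (weekend, group), rate in sorted(rates.items()):
        print("weekend" if weekend else "weekday", group, round(rate, 3))
```

Among admitted patients, mortality comes out higher at the weekend, so $E(Y|W,S) \neq E(Y|W',S)$, while mortality in the whole population is essentially the same on both kinds of day. A "weekend effect" appears purely through who gets admitted.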

But what are we actually interested in from a policy perspective? Do we actually care that it is the weekend per se? I would say no, we care that there is potentially a lapse in care quality. So, it’s a two-part question: (i) how does care quality (and hence avoidable patient harm) differ at the weekend, $E(Q|W) - E(Q|W') = ?$; and (ii) what effect does this have on patient outcomes, $E(Y|Q)=?$. The first question tells us to what extent policy may effect change, and the second gives us a way of valuing that change. Yet the vast majority of studies in the area address neither. Despite there being a number of publicly funded research projects looking at these questions right now, it’s the studies that are not useful for policy that keep being quoted by those with the power to make change.

Hawking is right, Jeremy Hunt has egregiously cherry picked and misrepresented the evidence, as has been pointed out again and again and again and again and … One begins to wonder if there isn’t some motive other than ensuring long run efficiency and equity in the health service.
