One of the advantages of writing blogs is that it can help to refresh and consolidate your thoughts on a topic. And when you spend a lot of time writing stats code, other people’s blogs that discuss how to code specific statistical models can be invaluable. I have recently found myself delving into spatial modelling and geostatistics with the aim of taking survey data from a number of discrete locations to model an underlying continuous distribution. This is typically for something like disease prevalence; the map featured in our post about slum health is an example. However, I have wanted to code my own Bayesian geostatistical model. There is a wealth of literature on these models: we recently featured one on a journal round up. But, moving from these models to usable code is not always obvious, so this post works through an example: mapping childhood stunting in Kenya with data from the Kenya Demographic and Health Survey.
A geostatistical model
A geostatistical model, in its most basic form, analyses spatially discrete data sampled across an area that are assumed to arise from an underlying, unobserved continuous process. We often want to model this continuous process to be able to predict outcomes at a new location.
The survey locations across a given area, $A$, can be denoted as $x_i \in A$ for locations $i = 1, \ldots, n$. At each location there is an underlying prevalence or probability, $p(x_i)$, the number of sampled individuals or households, $m_i$, and the observed outcomes $y_i$. The standard sampling model is then binomial:

$$ y_i \sim \text{Binomial}\big(m_i, p(x_i)\big) $$
To make inferences about the distribution of the prevalence of a disease or probability of some outcome across the area, and to predict the prevalence at new locations, we need to model the link between the $p(x_i)$ at different locations. The following logistic regression model is often used:

$$ \log\left( \frac{p(x_i)}{1 - p(x_i)} \right) = z(x_i)'\beta + u_i + S(x_i) $$

where $z(x_i)$ is a vector of explanatory variables, $u_i$ are independent $N(0, \tau^2)$ variates, and $S(x_i)$ is a Gaussian process explained below.
Gaussian processes
A Gaussian process is a continuous stochastic process that can be interpreted as providing a probability distribution over functions. The key property is that the distribution of the function’s value at a finite number of points is multivariate normal. For our $S(x)$ above we have a zero-mean multivariate normal:

$$ [S(x_1), \ldots, S(x_n)]' \sim MVN(0, \Sigma) $$

with covariance matrix $\Sigma$ whose elements are given by a covariance function $k(x_i, x_j)$. A popular choice of correlation function is the Matérn function. But, we’ll use the following exponential function to keep things simple:

$$ k(x_i, x_j) = \sigma^2 \exp\big( -\phi \lVert x_i - x_j \rVert \big) + \mathbb{1}(i = j)\, \delta^2 $$
where $\lVert x_i - x_j \rVert$ is the Euclidean distance between two points, $\mathbb{1}(i = j)$ is the indicator function equal to one if the indices are the same and zero otherwise, and $\delta^2$ is a ‘noise’ term.
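To get a feel for what this covariance function implies, here is a quick R sketch, separate from the analysis below and with arbitrary values chosen for $\sigma^2$, $\phi$, and $\delta^2$, that simulates a few realisations of a one-dimensional Gaussian process using mvrnorm from the MASS package:

```r
library(MASS)  # for mvrnorm

# illustrative (arbitrary) parameter values
sigma2 <- 1
phi    <- 2
delta2 <- 0.01

# a grid of locations in one dimension
x <- seq(0, 5, length.out = 200)
d <- as.matrix(dist(x))  # Euclidean distances between all pairs of points

# exponential covariance plus a small 'noise' term on the diagonal
Sigma <- sigma2 * exp(-phi * d) + diag(delta2, length(x))

# three draws from the zero-mean Gaussian process
S <- mvrnorm(3, mu = rep(0, length(x)), Sigma = Sigma)
matplot(x, t(S), type = "l", lty = 1, ylab = "S(x)")
```

Smaller values of $\phi$ give smoother, more slowly varying realisations; larger values mean the correlation decays quickly with distance.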
Priors
As this is a Bayesian model, we need to specify priors for our parameters and hyperparameters. For simplicity’s sake we are not going to include any covariates, so that $z(x_i)$ is just a vector of ones, and we drop the $u_i$ term for now. We’ll go for some weakly informative priors: half-normal(0,5) priors for $\sigma^2$ and $\phi$, and a N(0,5) prior for our intercept $\beta$.
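It is worth checking what the half-normal(0,5) prior on $\phi$ implies for the strength of spatial correlation. A small sketch (assuming, as later in the post, that distance is measured in tens of kilometres) plots the exponential correlation function for a handful of prior draws:

```r
set.seed(42)
# draws of phi from a half-normal(0,5) prior
phi_draws <- abs(rnorm(20, 0, 5))

d <- seq(0, 2, length.out = 100)  # distance in tens of km
plot(d, exp(-phi_draws[1] * d), type = "l", ylim = c(0, 1),
     xlab = "distance (tens of km)", ylab = "correlation")
for(p in phi_draws[-1]) lines(d, exp(-p * d), col = "grey")
```

Most prior draws allow the correlation to decay over a few tens of kilometres, which seems reasonable for a regional analysis.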
Bayesian approaches to geostatistical modelling
A great place to start reading about these sorts of models is the paper by Diggle, Tawn, and Moyeed (1998), and the example in this post is very similar to an example in their paper. They estimate a Bayesian model using MCMC. But one of the problems with MCMC is that the parameters in a Gaussian process model are often highly correlated, resulting in very slow convergence of the chains. For this reason we’ll use Stan, which is described below. Another problem with fully Bayesian approaches to geostatistics, and other Gaussian process models, is that they can be prohibitively computationally intensive. For a given sample size $n$, the memory requirements scale with $O(n^2)$ and the computational time scales with $O(n^3)$. New approaches are being developed, such as the use of Kronecker decompositions [PDF] of the covariance matrix, but these can’t be used for non-conjugate models such as ours. The DHS Program itself has a report on geostatistical modelling with DHS data. They use the INLA algorithm to estimate the models, which is another alternative to the approach employed here.
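As a rough illustration of that cubic scaling, you can time the Cholesky factorisation (the main bottleneck in the model below) for increasing $n$; this is only a sketch and the absolute timings will depend on your machine:

```r
# time the Cholesky factorisation of an n x n exponential covariance matrix
ns <- c(250, 500, 1000, 2000)
times <- sapply(ns, function(n) {
  x <- matrix(runif(2 * n), ncol = 2)                    # random 2-d locations
  Sigma <- exp(-5 * as.matrix(dist(x))) + diag(1e-6, n)  # add jitter for stability
  system.time(chol(Sigma))["elapsed"]
})
data.frame(n = ns, seconds = times)
# for large n, the time should increase roughly eightfold per doubling of n
```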
Coding the model
We are going to use Stan to code and sample from our model. Stan is a ‘probabilistic programming language’ that implements a ‘no-U-turn sampler’ – a kind of Hamiltonian Monte Carlo sampler that adaptively tunes its step parameters. We have mentioned Stan before on this blog. I think it’s great and have been adapting most of my models to it. The documentation on the website is in-depth, thorough, and chock full of examples, including, thankfully for me, how to code Gaussian processes! Some of the online forums are also useful, including this discussion, which I should credit. The code for the model described above is as follows:
```stan
data {
  int<lower=1> N1;            // number of data points
  int x1[N1];                 // number of outcomes
  int n1[N1];                 // number of observations
  int<lower=1> N2;            // number of new points
  matrix[N1+N2, N1+N2] dist;  // distances between points
}
transformed data {
  int<lower=1> N;
  N = N1 + N2;
}
parameters {
  vector[N1] y1;
  vector[N2] y2;
  real beta;
  real<lower=0> sigma_sq;     // constrained positive for the half-normal prior
  real<lower=0> phi;          // constrained positive for the half-normal prior
}
transformed parameters {
  vector[N1+N2] mu;
  for(i in 1:N) mu[i] = beta;
}
model {
  vector[N] y;
  matrix[N,N] Sigma;
  matrix[N,N] L;
  for(i in 1:N1) y[i] = y1[i];
  for(i in 1:N2) y[N1+i] = y2[i];
  for(i in 1:(N-1)){
    for(j in (i+1):N){
      Sigma[i,j] = exp((-1)*phi*dist[i,j]);
      Sigma[j,i] = Sigma[i,j];
    }
  }
  for(i in 1:N) Sigma[i,i] = sigma_sq;
  L = cholesky_decompose(Sigma);
  sigma_sq ~ normal(0, 5);
  phi ~ normal(0, 5);
  beta ~ normal(0, 5);
  y ~ multi_normal_cholesky(mu, L);
  x1 ~ binomial_logit(n1, y1);
}
generated quantities {
  vector[N2] y_pred;
  for(i in 1:N2) y_pred[i] = inv_logit(beta + y2[i]);
}
```
Some notes:
- We provide (below) the data and the locations of the survey clusters as well as the locations of all the points we want to predict. We also provide the matrix with the distance between all points both in the data and in the new locations.
- The generated quantities block predicts the probability at the new locations.
- We have used the Cholesky decomposition of the covariance matrix, which provides a speed boost (see the Stan manual for a nice explanation, and the short sketch below).
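To see what the Cholesky decomposition is doing, here is a small, self-contained R check of the identity that Stan’s multi_normal_cholesky relies on: if $\Sigma = LL'$ with $L$ lower triangular, then $\mu + Lz$, with $z$ a vector of independent standard normals, is a draw from $MVN(\mu, \Sigma)$:

```r
set.seed(1)
n <- 4
A <- matrix(rnorm(n * n), n, n)
Sigma <- crossprod(A) + diag(n)  # an arbitrary positive-definite covariance matrix
L <- t(chol(Sigma))              # R's chol() returns the upper triangle, so transpose

# the factor reproduces the covariance matrix
all.equal(L %*% t(L), Sigma)

# a single draw from MVN(mu, Sigma) using the Cholesky factor
mu <- rep(0, n)
y  <- as.vector(mu + L %*% rnorm(n))
```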
Data
We are going to use the most recent Kenya DHS survey. Unfortunately, these data can’t be shared here, but access can be sought at the DHS website. There you’ll also find data definitions and further info. As discussed above, large sample sizes can be prohibitive in the fully Bayesian set up. So for this example we’ll focus on just one province of Kenya, the Western Province (or what was formerly the Western Province, as now Kenya is divided into counties).
We’ll use R for the analysis. We’ll need the individual and geographic survey data, which we’ll read into R
```r
library(foreign)
df  <- read.dta("C:/individual data set file path")
geo <- read.dta("C:/geo dataset file path")
```
From the individual dataset we only need the cluster id (v001) and the child height-for-age percentile variables (hw4). Stunting is defined here as being in the fifth percentile or below for height for age, so we’ll convert these to a binary variable:
```r
vars <- c('hw4_1','hw4_2','hw4_3','hw4_4','hw4_5','hw4_6')
df <- df[, c('v001', vars)]
# percentiles are stored multiplied by 100, so the 5th percentile is 500
for(i in vars){
  df[!is.na(df[,i]), i] <- ifelse(df[!is.na(df[,i]), i] < 500, 1, 0)
}
```
and then we’ll aggregate all the children within each cluster by creating a new data frame (called dfc) with a row per cluster and then looping over each cluster id
```r
dfc <- data.frame(cluster = unique(df$v001), n = NA, x = NA)
for(i in unique(df$v001)){
  tmp <- df[df$v001 == i, 2:7]
  tmp <- as.numeric(unlist(tmp))
  tmp <- tmp[!is.na(tmp)]
  dfc[which(dfc$cluster == i), 'n'] <- length(tmp)
  dfc[which(dfc$cluster == i), 'x'] <- length(tmp[tmp == 1])
}
```
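The same cluster-level counts could be built without the explicit loop; here is an alternative sketch using base R (the column names and cluster variable are assumed to behave as above, so it is worth checking the result matches dfc):

```r
# stack the six child columns into long format, one row per child
long <- data.frame(cluster = rep(df$v001, times = length(vars)),
                   stunted = unlist(df[, vars], use.names = FALSE))
long <- long[!is.na(long$stunted), ]

# number of children measured (n) and number stunted (x) in each cluster
n_per <- tapply(long$stunted, long$cluster, length)
x_per <- tapply(long$stunted, long$cluster, sum)
dfc2  <- data.frame(cluster = names(n_per), n = as.vector(n_per), x = as.vector(x_per))
```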
Now, we assign the longitude and latitude from the geographic dataset (note: the locations in the DHS are randomly displaced by up to 2km; we are assuming that this is small enough not to matter at the regional scale):
```r
dfc$lat  <- geo[match(dfc$cluster, geo$DHSCLUST), 'LATNUM']
dfc$long <- geo[match(dfc$cluster, geo$DHSCLUST), 'LONGNUM']
```
The next step is to construct the grid of new locations we want to estimate the prevalence at. Firstly, we’ll create a rectangular grid that covers the Western province, then we’ll remove the points outside of the province, which is a bit of a faff (although I’ll admit there may be a better way)! We’ll need a shapefile for Kenya, which I got from here, and we’ll use the code from here to put it into a useful format.
```r
require("rgdal")
require("maptools")
require("ggplot2")
require("plyr")
area <- readOGR(dsn = "C:/stan/kenya/County.shp", layer = "County")
area@data$id <- rownames(area@data)
area.points  <- fortify(area, region = "id")
area.df      <- join(area.points, area@data, by = "id")
```
Then we can select the counties in our province and remove the rest
```r
counties <- c('Kakamega','Vihiga','Bungoma','Busia')
area.df  <- area.df[area.df$COUNTY %in% counties, ]
```
We can plot to make sure it’s right:
```r
ggplot(area.df) +
  aes(long, lat, group = group) +
  geom_polygon() +
  geom_path(color = "white", size = 0.05)
```
Looks good. Now, let’s create a grid of new points based on the longitude and latitude at the corners and a step size of 0.06, which could be made smaller or larger:
```r
coords <- c(33.9, 35.2, -0.1, 1.2)
step   <- 0.06
fil <- data.frame(
  x = rep(seq(coords[1], coords[2], by = step),
          times = length(seq(coords[3], coords[4], by = step))),
  y = rep(seq(coords[3], coords[4], by = step),
          each = length(seq(coords[1], coords[2], by = step)))
)
```
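If you prefer, the same grid can be built with expand.grid, which varies its first argument fastest and so matches the ordering above:

```r
fil <- expand.grid(x = seq(coords[1], coords[2], by = step),
                   y = seq(coords[3], coords[4], by = step))
```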
Then we go through each of the counties and determine whether each point is in it using the point.in.polygon function of the sp package. We will also do the same for our data.
```r
fil$isin <- 0
dfc$isin <- 0
for(i in counties){
  fil$isin <- fil$isin + point.in.polygon(fil[,1], fil[,2],
                 area.df[area.df$COUNTY == i, 'long'],
                 area.df[area.df$COUNTY == i, 'lat'])
  dfc$isin <- dfc$isin + point.in.polygon(dfc[,'long'], dfc[,'lat'],
                 area.df[area.df$COUNTY == i, 'long'],
                 area.df[area.df$COUNTY == i, 'lat'])
}
fil <- fil[fil$isin > 0, ]
dfc <- dfc[dfc$isin > 0, ]
```
Almost there! We finally convert the longitude and latitude to eastings and northings
```r
library(OpenStreetMap)  # for projectMercator (and openmap below)
tmp <- projectMercator(fil$y, fil$x)
fil$x2 <- tmp[,1]; fil$y2 <- tmp[,2]
tmp <- projectMercator(dfc$lat, dfc$long)
dfc$lat2 <- tmp[,1]; dfc$long2 <- tmp[,2]
```
and create our distance matrix (dividing by 10,000 so that we’re working on a scale of tens of kms)
```r
df.loc <- data.frame(x1 = c(dfc$lat2, fil$x2), y1 = c(dfc$long2, fil$y2))
d1 <- as.matrix(dist(df.loc))
d1 <- d1 / 10000
```
and all that remains is to run the code! But before we do that let’s just take a look at the locations of the two data frames. I like to use maps with nice features, so we’ll plot a map of Kenya using the OpenStreetMap package and add on the survey locations (blue) and also look at our new grid of points (red) (hence why we needed the eastings and northings):
```r
p2 <- openmap(c(1.2, 33.9), c(-0.1, 35.2), type = "osm")
p2 <- autoplot(p2)
dfc$p <- dfc$x / dfc$n
p2 +
  geom_point(data = dfc, aes(x = lat2, y = long2, color = p)) +
  geom_point(data = fil, aes(x = x2, y = y2), color = "red")
```
We may want to make the grid of points a little more dense, but you can see the principle. In the map, ‘p’ is the crude prevalence at each cluster.
Estimation
We save the Stan code above to, say, “C:/model.stan”; it is then easy to run from R.
```r
library(rstan)
dat <- list(
  N1   = length(dfc[,1]),
  x1   = dfc$x,
  n1   = dfc$n,
  N2   = length(fil[,1]),
  dist = d1
)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
fit <- stan(
  "C:/model.stan",
  data   = dat,
  chains = 3,
  iter   = 500
)
```
Results
Now, let’s take a look at the output. Firstly, let’s look at the parameter estimates for $\beta$, $\phi$, and $\sigma^2$.
```
print(fit, pars = c('beta','phi','sigma_sq'))

Inference for Stan model: geo4.
3 chains, each with iter=500; warmup=250; thin=1;
post-warmup draws per chain=250, total post-warmup draws=750.

          mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
beta     -0.74    0.03 0.63 -2.03 -1.11 -0.74 -0.35  0.52   368 1.00
phi       9.02    0.30 3.20  3.65  6.67  8.92 10.87 15.69   114 1.03
sigma_sq  1.08    0.01 0.06  1.01  1.04  1.07  1.11  1.21    23 1.07

Samples were drawn using NUTS(diag_e) at Tue Dec 06 15:37:55 2016.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains
(at convergence, Rhat=1).
```
The Rhat statistic gives an indication of convergence of the chains; we’re looking for a value typically under 1.01. These results suggest we don’t quite have convergence for $\phi$ and $\sigma^2$. We can also assess convergence by looking at the trace plots of the chains.
```r
traceplot(fit, pars = c('beta','phi','sigma_sq'))
```
The traceplot for $\sigma^2$ does suggest a lack of convergence, which may indicate that there’s no unique posterior mode or some other problem. We can try re-specifying the model or running it for more iterations to see if things improve, but we’ll leave it for now.
Now let’s plot the output on our map of Kenya. There are lots of different ways we can visualise this: the predicted prevalence, the probability the prevalence is above or below a certain threshold, or a combination of the two. We’ll plot the estimated prevalence and add in contour lines for where there is a greater than 80% probability that the prevalence of stunting is greater than 10%, which is considered serious:
```r
pred <- extract(fit, 'y_pred')[[1]]
fil$est  <- colMeans(pred)
fil$prob <- apply(pred, 2, function(x) I(length(x[x > 0.10]) / length(x) > 0.8) * 1)
p2 +
  geom_raster(data = fil, aes(x = x2, y = y2, fill = est),
              alpha = 0.8, interpolate = TRUE) +
  scale_fill_gradient(low = "turquoise", high = "red") +
  geom_contour(data = fil, aes(x = x2, y = y2, z = prob), bins = 1)
```
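As well as the posterior mean, we might map the uncertainty directly, for example the posterior standard deviation, or the exceedance probability itself rather than a single contour. A sketch reusing the objects above (the 10% threshold is the one used in the map):

```r
# posterior standard deviation and exceedance probability at each grid point
fil$sd     <- apply(pred, 2, sd)
fil$exceed <- apply(pred, 2, function(x) mean(x > 0.10))

p2 +
  geom_raster(data = fil, aes(x = x2, y = y2, fill = exceed),
              alpha = 0.8, interpolate = TRUE) +
  scale_fill_gradient(low = "turquoise", high = "red", name = "Pr(prev > 10%)")
```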
Conclusions
Hopefully this post will be useful to those looking to do some geostatistical modelling or for an example of converting statistical models into usable code. I’ve opted for a Bayesian approach, which lets us account for and average over the uncertainty in the model parameters and hyperparameters. There are lots of other ways of estimating these sorts of models: for example, check out the R package PrevMap, which implements Monte Carlo maximum likelihood methods. Finally, check out these two papers (among many) for good, recent discussions of these methods.
P.S.
Some may have wondered why the featured image for this post was of people panning for gold. Well, geostatistics has its origins in the South African gold mining industry. Gold occurs in thin seams between sedimentary strata, often very deep underground, and its quality varies even within a seam. Geostatistics was developed in an attempt to predict the location of gold seams from a small number of samples. One of the pioneers was Danie Krige, hence why geostatistical estimation is often referred to as kriging.
Hi Sam,
This is a helpful post for a beginner like me. Can you explain why you use a mean of beta instead of 0, and then also add beta to y2 while generating y_pred? I’d assume you’d need to only do one of those.
Hi Sam, was going through the code and am getting the following error
Error in FUN(X[[i]], …) : Stan does not support NA (in x1) in data
failed to preprocess the data; sampling not done
Going through the code and it looks like there are no values for x in the dataframe dfc. Could there be an issue with assigning x and n values to dfc?
Thanks
Hi, it’s hard to tell where the error is coming from without seeing the data or code – there seem to be a couple of errors in the R code above where WordPress mangled it a bit. To debug I would just run each line of code and look at what it’s doing and you should see where the NAs are being introduced. It’s possible that your data frames have differently ordered columns or something as well.
Looks like I found where an issue is…
dfc<-data.frame(cluster=unique(df$v001),n=NA,x=NA)
for(i in unique(df$v001)){
tmp<-df[ df$v001==i,2:7]
tmp<-as.numeric(unlist(tmp))
tmp<-tmp[!is.na(tmp)]
dfc[which(dfc$cluster==i),'n']<-length(tmp)
dfc[which(dfc$cluster==i),'x']<-length(tmp[tmp==1])
}
the final dfc has all values in x and n with NAs
It shouldn’t do, as the NA values are removed from tmp. Besides, ‘length’ will work on NA values, so it shouldn’t produce NA outcomes. What is likely happening is that either (i) you don’t have a column called cluster or (ii) the unique values in df$v001 don’t match whatever is in the cluster column.
Okay, will check on it. Do you have the code on GitHub? Might be useful.
Thanks
Hi, Sam Watson,
I have tried to access the base of your example, but I could not. Please could you help me or send me the database
Thank you
What do you mean ‘the base of the example’? The database is DHS data, which you can request access to at https://dhsprogram.com/
Hi Sam,
Been following your code, am new to Stan and I was wondering if this loop has an error
for(i in vars){
df[!is.na(df[,i])&df[,i]<500,i]=500,i]
Yes – that does look like an error, which sometimes happens copying R code on here! I would probably just put something like:
for(i in vars) df[!is.na(df[,i]),i] <- ifelse(df[!is.na(df[,i]),i]<500,1,0)
Thanks, really appreciate the feedback
I’ve been going through this model in careful detail to try to recreate it for a beta-distributed response variable and one thing that is confusing me is the use of beta to describe the mean of the multivariate distribution for y. Am I wrong in thinking that y stands for the entire Gaussian process? If so, I thought beta was additive to y rather than describing its distribution (as it is for the predicted values). I’m considering the case of adding regression coefficients to this model – I’m assuming that they would be added to beta to equal mu. However, that would only be changing the Gaussian process part of the model and not the global mean, correct?
I think what is confusing is that I changed notation for the programming! Focussing on the Stan model – I didn’t include any covariates so just had beta as the global mean (mu[i] = beta) as you say (equivalent to setting the z(x_i) to 1 in the model at the top). To add covariates into the model, you have to decide what to do about predicting the process at a new location, but I’ll assume you have the covariate values at the new locations for now. You would need to read in a matrix of covariates (and include a column of 1s for the intercept), one way of doing this is to read in two new data matrices of the covariate values for locations and new locations by adding the following to the data block (calling it Z just to add more confusion!) along with the number of covariates (including intercept) P:
int P;
matrix[N1,P] Z1;
matrix[N2,P] Z2;
then change the line ‘real beta;’ in the parameters block to
vector[P] beta;
and then in the transformed parameters block change the loop to the following:
for(i in 1:N1) mu[i] = Z1[i]*beta;
for (j in 1:N2) mu[(N1+j)] = Z2[j]*beta;
change the prior declaration in the model block for beta:
for(p in 1:P) beta[p] ~ normal(0,5);
and finally change the generated quantities block final line to:
for(i in 1:N2) y_pred[i] = inv_logit(mu[(N1+i)]+y2[i]);
And of course you could extend it in other ways, like making the parameters hierarchical as required. I’ll also add a disclaimer to say there’s always likely to be a more efficient or nicer way of coding these things, but I hope this will get you on your way.
Thanks! I really appreciate you taking the time to respond to my question. This is helpful. So just to make sure – mu is used both in generating the distribution for y and is also added to it in order to predict the quantities?