# Method of the month: custom likelihoods with Stan

Once a month we discuss a particular research method that may be of interest to people working in health economics. We’ll consider widely used key methodologies, as well as more novel approaches. Our reviews are not designed to be comprehensive but provide an introduction to the method, its underlying principles, some applied examples, and where to find out more. If you’d like to write a post for this series, get in touch. This month’s method is custom likelihoods with Stan.

# Principles

Regular readers of this blog will know that I am a fan of Bayesian methods. The exponential growth in personal computing power has opened up a whole new range of Bayesian models at home. WinBUGS and JAGS were the go-to pieces of software for estimating Bayesian models, both using Markov chain Monte Carlo (MCMC) methods. Theoretically, an MCMC chain will explore the posterior distribution. But MCMC has flaws. For example, if the target distribution has a high degree of curvature, such as many hierarchical models might exhibit, then MCMC chains can have trouble exploring it and get stuck in the ‘difficult’ bit of the space for long periods before moving elsewhere, so the chain’s average oscillates around the true value. Asymptotically these oscillations balance out, but in real, finite time they lead to bias. Furthermore, MCMC chains can be very slow to converge to the target distribution, and for complex models this can take a literal lifetime. An alternative, Hamiltonian Monte Carlo (HMC), provides a solution to these issues. Michael Betancourt’s introduction to HMC is great for anyone interested in the topic.

Stan is a ‘probabilistic programming language’ that implements HMC. A huge range of probability distributions are already implemented in the software; check out the manual for more information. There is also an R package, rstanarm, that estimates a number of standard models using normal R code, which means you can use these tools without learning the Stan language. However, Stan may not have the necessary distributions for more complex econometric or statistical models. It used to be the case that you would have to build your own MCMC sampler – but given the problems with MCMC, this is now strongly advised against in favour of HMC. Fortunately, we can implement our own probability density functions in Stan. So, if you can write down the (log) likelihood for your model, you can estimate it in Stan!
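For example, a Bayesian logistic regression can be fitted with rstanarm using the familiar formula interface (a minimal sketch; the data frame df and variables y and x are hypothetical):

#fit a Bayesian logistic regression without writing any Stan code
library(rstanarm)
fit1 <- stan_glm(y ~ x, data = df, family = binomial(link = "logit"))
summary(fit1)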

The aim of this post is to provide an example of implementing a custom probability function in Stan from the likelihood of our model. We will look at the nested logit model. These models have been widely used for multinomial choice problems. An area of interest among health economists is the choice of hospital provider. A basic multinomial choice model, such as a multinomial logit, requires an independence of irrelevant alternatives (IIA) assumption that says the odds of choosing one option over another are independent of any other alternative. For example, it would assume that the odds of me choosing the pharmacy in town over the hospital in town would be unaffected by a pharmacy opening on my own road. This is likely too strong. There are many ways to relax this assumption, the nested logit being one. The nested logit is useful when choices can be ‘nested’ in groups and assumes there is a correlation among choices within each group. For example, we can group health care providers into pharmacies, primary care clinics, hospitals, and so on.

# Implementation

## Econometric model

Firstly, we need a nesting structure for our choices, like that described above. We’ll consider a 2-level nesting structure with T branches and Rt choices in each branch t=1,…,T. As with most choice models, we start from an additive random utility model (ARUM), which, for individual i=1,…,N choosing option r in branch t, is:

$U_{itr} = V_{itr} + \epsilon_{itr}$

Then the chosen option is the one with the highest utility. The motivating feature of the nested logit is that the hierarchical structure allows us to factorise the joint probability of choosing branch t and option r into a marginal and a conditional model:

$p_{itr} = p_{it} \times p_{ir|t}$

Multinomial choice models arise when the errors are assumed to have a generalised extreme value (GEV) distribution, which gives us the multinomial logit model. We will model the deterministic part of the equation with branch-varying and option-varying variables:

$V_{itr} = Z_{it}'\alpha + X_{itr}'\beta_t$

Then the model can be written as:

$p_{itr} = p_{it} \times p_{ir|t} = \frac{exp(Z_{it}'\alpha + \rho_t I_{it})}{\sum_{k \in T} exp(Z_{ik}'\alpha + \rho_k I_{ik})} \times \frac{exp(X_{itr}'\beta_t/\rho_t)}{\sum_{m \in R_t} exp( X_{itm}'\beta_t/\rho_t) }$

where $\rho_t$ is variously called a scale parameter, correlation parameter, etc. and defines the within-branch correlation (arising from the GEV distribution). We also have the log-sum, also called the inclusive value:

$I_{it} = log \left( \sum_{m \in R_t} exp( X_{itm}'\beta_t/\rho_t) \right)$.

Now that we have our model set up, the log likelihood over all individuals is:

$\sum_{i=1}^N \sum_{t \in T} \sum_{r \in R_t} y_{itr} \left[ Z_{it}'\alpha + \rho_t I_{it} - log \left( \sum_{k \in T} exp(Z_{ik}'\alpha + \rho_k I_{ik}) \right) + X_{itr}'\beta_t/\rho_t - log \left( \sum_{m \in R_t} exp( X_{itm}'\beta_t/\rho_t) \right) \right]$

As a further note, for the model to be compatible with an ARUM specification, a number of conditions need to be satisfied. One of these is that $0<\rho_t \leq 1$, so we will make that restriction. We have also only included alternative-varying variables, but we are often interested in individual-varying variables and in allowing parameters to vary over alternatives; these can simply be added to this specification, but we will leave them out for now to keep things “simple”. We will also use basic weakly informative priors and leave prior specification as a separate issue we won’t consider further:

$\alpha \sim normal(0,5), \beta_t \sim normal(0,5), \rho_t \sim Uniform(0,1)$
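Before turning to Stan, it can help to see the factorised probability written out directly in R. The sketch below is purely illustrative (the function name nlogit_probs and its inputs are mine, not part of any package): Z_i is the T x P matrix of branch-varying variables for one individual, X_i the sum(R) x Q matrix of option-varying variables, R the vector giving the number of options per branch, and alpha, beta and rho the parameters defined above.

#illustrative sketch: nested logit probabilities p_itr for a single individual
nlogit_probs <- function(Z_i, X_i, R, alpha, beta, rho) {
  T <- length(R)
  branch_id <- rep(1:T, R) #branch to which each option belongs
  #option-level utilities X'beta_t / rho_t
  opt_util <- sapply(1:sum(R), function(j) sum(X_i[j, ] * beta[, branch_id[j]]) / rho[branch_id[j]])
  #inclusive values I_it = log(sum_m exp(X'beta_t / rho_t))
  I_val <- sapply(1:T, function(t) log(sum(exp(opt_util[branch_id == t]))))
  #marginal branch probabilities p_it
  branch_util <- sapply(1:T, function(t) sum(Z_i[t, ] * alpha) + rho[t] * I_val[t])
  p_branch <- exp(branch_util) / sum(exp(branch_util))
  #joint probability: conditional p_ir|t times marginal p_it
  exp(opt_util - I_val[branch_id]) * p_branch[branch_id]
}

Applied to the simulated data generated below, nlogit_probs(Z[i,,], X[i,,], R, alpha, beta, rho) should reproduce row i of the probs matrix constructed by hand in the simulation code.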

## Software

DISCLAIMER: This code is unlikely to be the most efficient, nor can I guarantee it is 100% correct – use at your peril!

The following assumes a familiarity with Stan and R.

Stan programs are divided into blocks including data, parameters, and model. The functions block allows us to define custom (log) probability density functions. These take a form something like:

real xxx_lpdf(real y, ...){}

which says that the function outputs a real-valued variable and takes a real-valued variable, y, as one of its arguments. The _lpdf suffix allows the function to act as a log probability density function in the program (and equivalently _lpmf for log probability mass functions for discrete variables). Now we just have to convert the log likelihood above into a function. But first, let’s just consider what data we will be passing to the program:

• N, the number of observations;
• T, the number of branches;
• P, the number of branch-varying variables;
• Q, the number of choice-varying variables;
• R, a T x 1 vector with the number of choices in each branch, from which we can also derive the total number of options as sum(R). We will call the total number of options Rk for now;
• Y, an N x Rk matrix, where Y[i,j] = 1 if individual i=1,…,N chose option j=1,…,Rk;
• Z, an N x T x P array of branch-varying variables;
• X, an N x Rk x Q array of choice-varying variables.

And the parameters:

• $\rho$, a T x 1 vector of correlation parameters;
• $\alpha$, a P x 1 vector of parameters on the branch-varying variables;
• $\beta$, a Q x T matrix of parameters on the choice-varying variables.

Now, to develop the code, we will specify the function for individual observations of Y, rather than the whole matrix, and then perform the sum over all the individuals in the model block. So we only need to feed each individual’s observations into the function rather than the whole data set. The model is specified in blocks as follows (with all the data and parameters as arguments to the function):

functions{
  real nlogit_lpdf(real[] y, real[,] Z, real[,] X, int[] R,
                   vector alpha, matrix beta, vector tau){
    //first define our additional local variables
    real lprob; //variable to hold log prob
    int count1; //keep track of which option in the loops
    int count2; //keep track of which option in the loops
    vector[size(R)] I_val; //inclusive values
    real denom; //sum denominator of marginal model
    //for the variables appearing in sum loops, set them to zero
    lprob = 0;
    count1 = 0;
    count2 = 0;
    denom = 0;

    //determine the log-sum for each conditional model, p_ir|t,
    //i.e. inclusive value
    for(k in 1:size(R)){
      I_val[k] = 0;
      for(m in 1:R[k]){
        count1 = count1 + 1;
        I_val[k] = I_val[k] + exp(to_row_vector(X[count1,])*beta[,k]/tau[k]);
      }
      I_val[k] = log(I_val[k]);
    }

    //determine the sum for the marginal model, p_it, denominator
    for(k in 1:size(R)){
      denom = denom + exp(to_row_vector(Z[k,])*alpha + tau[k]*I_val[k]);
    }

    //put everything together in the log likelihood
    for(k in 1:size(R)){
      for(m in 1:R[k]){
        count2 = count2 + 1;
        lprob = lprob + y[count2]*(to_row_vector(Z[k,])*alpha +
                tau[k]*I_val[k] - log(denom) +
                to_row_vector(X[count2,])*beta[,k]/tau[k] - I_val[k]);
      }
    }
    //return the log likelihood value
    return lprob;
  }
}
data{
  int N; //number of observations
  int T; //number of branches
  int R[T]; //number of options per branch
  int P; //dim of Z
  int Q; //dim of X
  real y[N,sum(R)]; //outcomes array
  real Z[N,T,P]; //branch-varying variables array
  real X[N,sum(R),Q]; //option-varying variables array
}
parameters{
  vector<lower=0, upper=1>[T] rho; //scale-parameters
  vector[P] alpha; //branch-varying parameters
  matrix[Q,T] beta; //option-varying parameters
}
model{
  //specify priors
  for(p in 1:P) alpha[p] ~ normal(0,5);
  for(q in 1:Q) for(t in 1:T) beta[q,t] ~ normal(0,5);

  //loop over all observations with the data
  for(i in 1:N){
    y[i] ~ nlogit(Z[i,,],X[i,,],R,alpha,beta,rho);
  }
}
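As an aside, it can be worth testing a custom density before trying to fit anything with it. rstan can expose the contents of the functions block to R, so the function can be checked against a direct calculation (a sketch, assuming the program above is saved as nlogit.stan):

#expose the user-defined Stan functions to the R session for testing
library(rstan)
expose_stan_functions("nlogit.stan")
#nlogit_lpdf can then be called from R on a single individual's data to check
#that it matches a log likelihood computed directly in R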

## Simulation model

To see whether our model is doing what we’re hoping it’s doing, we can run a simple test with simulated data. It may be useful to compare the results we get to those from other estimators; the nested logit is most frequently estimated using a full information maximum likelihood (FIML) estimator. However, neither Stata nor R provides a package that estimates a model with branch-varying variables – another reason why we sometimes need to program our own models.

The code we’ll use to simulate the data is:

#### simulate 2-level nested logit data ###

N <- 300 #number of people
P <- 2 #number of branch variant variables
Q <- 2 #number of option variant variables
R <- c(2,2,2) #vector with number of options per branch
T <- length(R) #number of branches
Rk <- sum(R) #number of options

#simulate data

Z <- array(rnorm(N*T*P,0,0.5),dim = c(N,T,P))
X <- array(rnorm(N*Rk*Q,0,0.5), dim = c(N,Rk,Q))

#parameters
rho <- runif(3,0.5,1)
beta <- matrix(rnorm(T*Q,0,1), nrow = Q, ncol = T)
alpha <- rnorm(P,0,1)

#option models #change beta indexing as required
vals_opt <- cbind(exp(X[,1,]%*%beta[,1]/rho[1]),exp(X[,2,]%*%beta[,1]/rho[1]),exp(X[,3,]%*%beta[,2]/rho[2]),
exp(X[,4,]%*%beta[,2]/rho[2]),exp(X[,5,]%*%beta[,3]/rho[3]),exp(X[,6,]%*%beta[,3]/rho[3]))

incl_val <- cbind(vals_opt[,1]+vals_opt[,2],vals_opt[,3]+vals_opt[,4],vals_opt[,5]+vals_opt[,6])

vals_branch <- cbind(exp(Z[,1,]%*%alpha + rho[1]*log(incl_val[,1])),
exp(Z[,2,]%*%alpha + rho[2]*log(incl_val[,2])),
exp(Z[,3,]%*%alpha + rho[3]*log(incl_val[,3])))

sum_branch <- rowSums(vals_branch)

probs <- cbind((vals_opt[,1]/incl_val[,1])*(vals_branch[,1]/sum_branch),
(vals_opt[,2]/incl_val[,1])*(vals_branch[,1]/sum_branch),
(vals_opt[,3]/incl_val[,2])*(vals_branch[,2]/sum_branch),
(vals_opt[,4]/incl_val[,2])*(vals_branch[,2]/sum_branch),
(vals_opt[,5]/incl_val[,3])*(vals_branch[,3]/sum_branch),
(vals_opt[,6]/incl_val[,3])*(vals_branch[,3]/sum_branch))

Y = t(apply(probs, 1, rmultinom, n = 1, size = 1))
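Before fitting, a couple of quick sanity checks on the simulated data are worthwhile:

#each row of probs should sum to one and each person should make exactly one choice
range(rowSums(probs))
table(rowSums(Y))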

Then we’ll put the data into a list and run the Stan program with 500 iterations and 3 chains:

data <- list(
y = Y,
X = X,
Z = Z,
R = R,
T = T,
N = N,
P = P,
Q = Q
)

require(rstan)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())

fit <- stan("C:/Users/Samuel/Dropbox/Code/nlogit.stan",
data = data,
chains = 3,
iter = 500)

Which gives the following results (with the 25th percentile column dropped to fit on screen):

> print(fit)
Inference for Stan model: nlogit.
3 chains, each with iter=500; warmup=250; thin=1;
post-warmup draws per chain=250, total post-warmup draws=750.

mean se_mean   sd    2.5%     50%     75%   97.5% n_eff Rhat
rho[1]       1.00    0.00 0.00    0.99    1.00    1.00    1.00   750  1
rho[2]       0.87    0.00 0.10    0.63    0.89    0.95    1.00   750  1
rho[3]       0.95    0.00 0.04    0.84    0.97    0.99    1.00   750  1
alpha[1]    -1.00    0.01 0.17   -1.38   -0.99   -0.88   -0.67   750  1
alpha[2]    -0.56    0.01 0.16   -0.87   -0.56   -0.45   -0.26   750  1
beta[1,1]   -3.65    0.01 0.32   -4.31   -3.65   -3.44   -3.05   750  1
beta[1,2]   -0.28    0.01 0.24   -0.74   -0.27   -0.12    0.15   750  1
beta[1,3]    0.99    0.01 0.25    0.48    0.98    1.15    1.52   750  1
beta[2,1]   -0.15    0.01 0.25   -0.62   -0.16    0.00    0.38   750  1
beta[2,2]    0.28    0.01 0.24   -0.16    0.28    0.44    0.75   750  1
beta[2,3]    0.58    0.01 0.24    0.13    0.58    0.75    1.07   750  1
lp__      -412.84    0.14 2.53 -418.56 -412.43 -411.05 -409.06   326  1

Samples were drawn using NUTS(diag_e) at Sun May 06 14:16:43 2018.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).

Which we can compare to the original parameters:

> beta
[,1]       [,2]      [,3]
[1,] -3.9381389 -0.3476054 0.7191652
[2,] -0.1182806  0.2736159 0.5237470
> alpha
[1] -0.9654045 -0.6505002
> rho
[1] 0.9503473 0.9950653 0.5801372

You can see that the posterior means and quantiles of the distribution provide pretty good estimates of the original parameters. Convergence diagnostics such as Rhat and traceplots (not reproduced here) show good convergence of the chains. But, of course, this is not enough for us to rely on it completely – you would want to investigate further to ensure that the chains were actually exploring the posterior of interest.
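For reference, the diagnostics mentioned here can be produced directly from the fitted object; for example:

#traceplots and split-chain Rhat values from the fitted object
traceplot(fit, pars = c("rho", "alpha", "beta"))
summary(fit)$summary[, "Rhat"]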

# Applications

I am not aware of any examples in health economics of using custom likelihoods in Stan. There are not even many examples of Bayesian nested logit models, one exception being a paper by Lahiri and Gao, who ‘analyse’ the nested logit using MCMC. But given the limitations of MCMC discussed above, one should prefer the implementation in this post to the MCMC samplers of that paper. It’s also a testament to computing advances and Stan that in 2001 an MCMC sampler and analysis could fill a paper in a decent econometrics journal and now we can knock one out for a blog post.

In terms of nested logit models in general in health economics, there are many examples going back 30 years (e.g. this article from 1987). More recent papers have preferred “mixed” or “random parameters” logit or probit specifications, which are much more flexible than the nested logit, and we would generally advise those sorts of models for that reason. The nested logit was used here simply as an illustrative example of estimating a custom likelihood.


# Method of the month: Mobile data collection with a cloud server

Once a month we discuss a particular research method that may be of interest to people working in health economics. We’ll consider widely used key methodologies, as well as more novel approaches. Our reviews are not designed to be comprehensive but provide an introduction to the method, its underlying principles, some applied examples, and where to find out more. If you’d like to write a post for this series, get in touch. This month’s method is mobile data collection with a cloud server.

## Principles

A departure from the usual format of this feature to bring you more of a tutorial-based outline of this method. Surveys are an important data collection tool for social scientists. Among the most commonly used surveys for health economists are the British Household Panel Survey and the Health Survey for England. But we also interrogate patients in trials and at follow-up with well-being and quality of life surveys like the EQ-5D. Sometimes, when there’s funding, we’ll even do our own large-scale household survey. Many studies use paper-based methods of data collection when doing these surveys, but this is more expensive and more error-prone than its digital equivalent.

## Implementation

Rather than printing out survey forms and then having to transcribe the results back into digital format, the surveys can be completed on a tablet or smartphone, uploaded to a central server, automatically error checked, and then saved in a ready-to-use format. However, setting up a system to do digital data collection can seem daunting and difficult. Most people do not know how to run a server or what software to use. In this post, we will describe the process of setting up a server using a cloud server provider and using open source survey software. Where other tutorials exist on individual steps we link to them; otherwise, the information is provided here.

### Software

OpenDataKit (ODK) is a family of programs designed for the collection of data on Android devices using a flexible system for designing survey forms. The programs include ODK Collect, which is installed on phones or tablets to complete forms; ODK Aggregate, which collates responses on a server; and ODK Briefcase, which works on a computer and ‘pulls’ data from the server for use on the computer.

Using ODK, mobile data collection can be implemented using a cloud server in the following 7 steps.

### 1. Set up a cloud server

For this tutorial we’ll be using Digital Ocean, a cloud server provider that is easy to use, permits you to deploy as many ‘droplets’ (i.e. cloud servers) as you want, and enables you to specify your hardware requirements. ODK Aggregate will run on Amazon Web Services and Google App Engine as well and is, in fact, easier to deploy on these platforms. However, we’ve chosen to go with Digital Ocean to make sure we control where our data are being stored – a key issue to ensure we’re compliant with EU data protection regulations, especially the GDPR. Digital Ocean met all our needs for information security.

You will need an account with Digital Ocean with a credit card to pay for services. The server we use costs around $15 a month to run (or 2 cents an hour), but can be ‘destroyed’ when not in use and then rebooted when needed by storing a ‘snapshot’ before you destroy it. Once you are logged in, create a new droplet with Ubuntu 16.04.4 x64. For our purposes, 2 GB of memory and 2 vCPUs will be sufficient. You will receive an email with the root password of the droplet and IP address. More info can be found here. To log into the server from a Windows computer, you can download and run Putty. From Linux based or Mac OS X you can use the ssh command in the terminal. We recommend you follow this tutorial to perform the initial server set up, i.e. creating a user and password. ODK Aggregate requires Tomcat to run, so we will install that. We then provide an optional step to allow access to the server only over https: (i.e. encrypted internet connections). This provides an extra layer of security; however, ODK also features RSA-key encryption of forms when transmitting to the server, so https: can be avoided if required. You will require a registered domain name to use https:.

### 2. Install Tomcat 8

Once you’re logged into the server and using your new user (it’s better to avoid using the root account when possible), run the following code:

sudo apt-get update
sudo apt-get install default-jdk
sudo groupadd tomcat
sudo useradd -s /bin/false -g tomcat -d /opt/tomcat tomcat
cd /tmp

In the next line, you may want to replace the link with one to a more recent version of Tomcat:

wget http://mirror.ox.ac.uk/sites/rsync.apache.org/tomcat/tomcat-8/v8.5.27/bin/apache-tomcat-8.5.27.tar.gz -O apache-tomcat-8.5.27.tar.gz

sudo mkdir /opt/tomcat
sudo tar xzvf apache-tomcat-8*tar.gz -C /opt/tomcat --strip-components=1
cd /opt/tomcat
sudo chgrp -R tomcat /opt/tomcat
sudo chmod -R g+r conf
sudo chmod g+x conf
sudo chown -R tomcat webapps/ work/ temp/ logs/

Now we’re going to open up a file and edit the text using the nano text editor:

sudo nano /etc/systemd/system/tomcat.service

Then copy and paste the following:

[Unit]
Description=Apache Tomcat Web Application Container
After=network.target

[Service]
Type=forking

Environment=JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-amd64/jre
Environment=CATALINA_PID=/opt/tomcat/temp/tomcat.pid
Environment=CATALINA_HOME=/opt/tomcat
Environment=CATALINA_BASE=/opt/tomcat
Environment='CATALINA_OPTS=-Xms512M -Xmx1024M -server -XX:+UseParallelGC'
Environment='JAVA_OPTS=-Djava.awt.headless=true -Djava.security.egd=file:/dev/./urandom'

ExecStart=/opt/tomcat/bin/startup.sh
ExecStop=/opt/tomcat/bin/shutdown.sh

User=tomcat
Group=tomcat
UMask=0007
RestartSec=10
Restart=always

[Install]
WantedBy=multi-user.target

Then save and close (Ctrl+x then Y then Enter). Going on,

sudo systemctl daemon-reload
sudo systemctl start tomcat
sudo systemctl status tomcat
sudo ufw allow 8080
sudo systemctl enable tomcat

Again, we’re going to open a text file, this time to change the username and password to log in:

sudo nano /opt/tomcat/conf/tomcat-users.xml

Now we need to set the username and password. Here we’ve used the username ‘admin’; the password should be changed to something strong and memorable. The username and password are changed in this block:

<tomcat-users . . .>
<user username="admin" password="password" roles="manager-gui,admin-gui"/>
</tomcat-users>

Now, we need to comment out two blocks of text in two different files.
The block of text is

<Valve className="org.apache.catalina.valves.RemoteAddrValve" allow="127\.\d+\.\d+\.\d+|::1|0:0:0:0:0:0:0:1" />

and the two files can be accessed respectively with

sudo nano /opt/tomcat/webapps/manager/META-INF/context.xml
sudo nano /opt/tomcat/webapps/host-manager/META-INF/context.xml

then restart Tomcat

sudo systemctl restart tomcat

### 3. Install an SSL certificate (optional)

Skip this step if you don’t want to use an SSL certificate. If you do, you will need a domain name (e.g. http://www.aheblog.com), and you will need to point that domain name to the IP address of your droplet. Follow these instructions to do that. It is possible to self-sign an SSL certificate and so not need a domain name; however, this will not work with ODK Collect as the certificate will not be trusted by Android. SSL certificates are issued by trusted authorities, a service for which they typically charge. However, Let’s Encrypt does it for free. To use it we need to install and use certbot:

sudo apt-get update
sudo apt-get install software-properties-common
sudo add-apt-repository ppa:certbot/certbot
sudo apt-get update
sudo apt-get install certbot

Now, use certbot to get the certificates for your domain:

sudo certbot certonly

Complete all the questions that are prompted – you want a certificate for a ‘standalone’. Now, we need to convert the certificate files into the Java Key Store format so that they can be used by Tomcat. We will need to do this as the root user (replace the domain name and passwords as appropriate):

su - root
cd /etc/letsencrypt/live/www.domainnamehere.com
openssl pkcs12 -export -in fullchain.pem -inkey privkey.pem -out fullchain_and_key.p12 -name tomcat
keytool -importkeystore -deststorepass PASSWORD -destkeypass PASSWORD -destkeystore MyDSKeyStore.jks -srckeystore fullchain_and_key.p12 -srcstoretype PKCS12 -srcstorepass PASSWORD -alias tomcat
mkdir /opt/tomcat/ssl
cp MyDSKeyStore.jks /opt/tomcat/ssl/

We can then switch back to our user account

su - USER

Open the following file

sudo nano /opt/tomcat/conf/server.xml

and replace the text in the document where you see <!-- Define a SSL Coyote HTTP/1.1 Connector on port 8443 -->, making sure to input the correct password:

<Connector port="8443" protocol="org.apache.coyote.http11.Http11NioProtocol"
maxThreads="150" SSLEnabled="true" scheme="https" secure="true"
clientAuth="false" sslProtocol="TLS"
keystoreFile="/opt/tomcat/ssl/MyDSKeyStore.jks"
keystoreType="JKS" keystorePass="PASSWORD"/>

Exit the file (Ctrl+X, Y, and Enter) and restart the service

sudo systemctl restart tomcat

Now, we want to block unsecured (HTTP) connections and allow only encrypted (HTTPS) connections:

sudo ufw delete allow 8080
sudo ufw allow 8443

If you use an SSL certificate, it will need to be renewed every 90 days. This is an automatic process; however, converting to JKS and saving into the Tomcat directory is not. You can either do it manually each time or write a Bash script to do it.

### 4. Install ODK Aggregate

We’ll firstly install the database software that will hold the collected data; we’ll use PostgreSQL (you can always use MySQL instead). The following code will install the right packages:

sudo apt-get update
sudo apt-get install postgresql-9.6

Now we are going to install ODK Aggregate. This is straightforward as the ODK Aggregate installer provides a walkthrough. It is important to select the correct options during this process, which should be obvious from the above tutorial.
You may need to replace the link in the first command below if the download does not work.

sudo wget https://opendatakit.org/download/4456 -O /tmp/linux-x64-installer.run
sudo chmod +x /tmp/linux-x64-installer.run
sudo /tmp/linux-x64-installer.run

then follow the instructions. Now, we are going to connect the database to ODK Aggregate:

sudo -u postgres psql
\cd '/ODK/ODK Aggregate'
\i create_db_and_user.sql
\q

Now copy the ODK installation to the Tomcat directory

sudo cp /ODK/ODK\ Aggregate/ODKAggregate.war /opt/tomcat/webapps

And the final step is to restart Tomcat

sudo systemctl restart tomcat

To test whether the installation has been successful, go to (replacing the URL as appropriate):

https://www.domainnamehere.com:8443/ODKAggregate

or if you do not have a domain name use:

http://<IP address>:8443/ODKAggregate

Use this URL for accessing ODK Aggregate. To log on, use the ‘admin’ account and the password ‘aggregate’. Once you are logged in you can change the admin account password and create new user accounts as required.

### 5. Programme a survey

Surveys can be complex, with logical structures that skip questions for certain responses, require different types of responses like numbers, text, or multiple choice, and can need signatures or images. Multiple languages are often needed. All of this is possible in ODK as well. To programme your survey into a format that ODK can use, you can use XLSForm, a standard for authoring forms in Excel. The website has a good tutorial. It is important to try to learn as much as possible as it is very flexible. A few key things and tips to note:

• Skip patterns and logical sequences are managed with the ‘Relevant’ column (a small example sheet is sketched after this list);
• If you want to be able to skip big groups of questions, you should use ‘groups’;
• If the same question is to be asked multiple (but an unknown number of) times you should use ‘Repeats’. Note that when you download the data, the multiple responses from repeat-type questions will be grouped in their own .csv files rather than with the rest of the survey;
• Encryption of a survey is managed by putting an RSA public key in the Settings tab;
• There are multiple different question types, learn them!;
• You can use the response from one question as text in another question, for example, someone’s name – using ${} syntax to refer to questions as with the ‘Relevant’ column;
• You can automatically collect default data like start and end time to check there’s been no cheating by data collectors.
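To give a flavour of the format, a minimal, hypothetical survey sheet using the ‘Relevant’ column and the ${} syntax might look something like this (it assumes a yes_no list is defined on the choices sheet):

type               name      label                                    relevant
text               name      What is your name?
select_one yes_no  visited   Did ${name} visit a clinic last month?
integer            n_visits  How many times did ${name} visit?        ${visited} = 'yes'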

Once your form is complete, you can convert it here; the converter will notify you of any errors. Once you have a .xml file ready to go, you can upload it in the ‘Form Management’ section of the ODK Aggregate interface.

### 6. Use ODK Collect

Submissions to the ODK Aggregate server need to be downloaded to a computer in order to be decrypted and the data processed and analysed. Exporting files from the server requires a number of pieces of software on the computer to which they are being downloaded:

• Unlimited Strength JCE Policy Files. These files are necessary if you are using encrypted forms and can be downloaded from here. To install the files, extract the contents of the compressed file, and copy and paste the files in the folder to the following location [java-home]/lib/security/policy/unlimited.

When you first launch ODK Briefcase, you must choose a storage location where it will save all data downloads. Once you have done this, go to the ‘Pull’ tab to download the surveys. Click ‘Connect…’ to input the URL of the ODK Aggregate instance. Select the forms for which you need to download data and click ‘Pull’ in the lower right-hand side of the window.

To download data submissions, go to the ‘Export’ tab. The available forms are listed in this window. Next to the form for which you want to download submissions, input a storage location for the data. In the next row, select the location of the private RSA key if you are using encrypted forms, which must be in .pem format. Many programs will generate .pem keys, but some provide the keys as strings of text that will be saved as text files. If the text begins with ‘— BEGIN RSA PRIVATE KEY —‘, then simply change the file type to .pem.

## Applications

Now you should be good to go. Potential applications are extensive. The ODK system has been employed in evaluative studies (of performance-based financing, for example), to conduct discrete choice experiments, or for more general surveillance of health service use and outcomes.

There are ways of extending these tools further, such as collecting GPS, location, and map data through Open Map Kit, which links to Open Street Map. There are also private companies who use OpenDataKit-based products to offer data collection and management services, like SurveyCTO. However, we have found that a key part of complying with data protection rules involves knowing exactly where data will be stored and having complete control over accessing it, which many services cannot offer. The flexibility of managing your own server permits more control: for example, you can write scripts that check for data submissions, process them, and upload them to another server, or you can host other data collection tools when you need them. Many universities or institutions may provide these services ‘in-house’, but if they do not support the software it can be difficult to use their servers. A cloud server provider gives us an alternative solution that can be up and running in an hour.


# Chris Sampson’s journal round-up for 22nd May 2017

Every Monday our authors provide a round-up of some of the most recently published peer reviewed articles from the field. We don’t cover everything, or even what’s most important – just a few papers that have interested the author. Visit our Resources page for links to more journals or follow the HealthEconBot. If you’d like to write one of our weekly journal round-ups, get in touch.

The effect of health care expenditure on patient outcomes: evidence from English neonatal care. Health Economics [PubMed] Published 12th May 2017

Recently, people have started trying to identify opportunity cost in the NHS, by assessing the health gains associated with current spending. Studies have thrown up a wide range of values in different clinical areas, including in neonatal care. This study uses individual-level data for infants treated in 32 neonatal intensive care units from 2009-2013, along with the NHS Reference Cost for an intensive care cot day. A model is constructed to assess the impact of changes in expenditure, controlling for a variety of variables available in the National Neonatal Research Database. Two outcomes are considered: the in-hospital mortality rate and morbidity-free survival. The main finding is that a £100 increase in the cost per cot day is associated with a reduction in the mortality rate of 0.36 percentage points. This translates into a marginal cost per infant life saved of around £420,000. Assuming an average life expectancy of 81 years, this equates to a present value cost per life year gained of £15,200. Reductions in the mortality rate are associated with similar increases in morbidity. The estimated cost contradicts a much higher estimate presented in the Claxton et al modern classic on searching for the threshold.

A comparison of four software programs for implementing decision analytic cost-effectiveness models. PharmacoEconomics [PubMed] Published 9th May 2017

Markov models: TreeAge vs Excel vs R vs MATLAB. This paper compares the alternative programs in terms of transparency and validation, the associated learning curve, capability, processing speed and cost. A benchmarking assessment is conducted using a previously published model (originally developed in TreeAge). Excel is rightly identified as the ‘ubiquitous workhorse’ of cost-effectiveness modelling. It’s transparent in theory, but in practice can include cell relations that are difficult to disentangle. TreeAge, on the other hand, includes valuable features to aid model transparency and validation, though the workings of the software itself are not always clear. Being based on programming languages, MATLAB and R may be entirely transparent but challenging to validate. The authors assert that TreeAge is the easiest to learn due to its graphical nature and the availability of training options. Save for complex VBA, Excel is also simple to learn. R and MATLAB are equivalently more difficult to learn, but clearly worth the time saving for anybody expecting to work on multiple complex modelling studies. R and MATLAB both come top in terms of capability, with Excel falling behind due to having fewer statistical facilities. TreeAge has clearly defined capabilities limited to the features that the company chooses to support. MATLAB and R were both able to complete 10,000 simulations in a matter of seconds, while Excel took 15 minutes and TreeAge took over 4 hours. For a value of information analysis requiring 1000 runs, this could translate into 6 months for TreeAge! MATLAB has some advantage over R in processing time that might make its cost ($500 for academics) worthwhile to some. Excel and TreeAge are both identified as particularly useful as educational tools for people getting to grips with the concepts of decision modelling. Though the take-home message for me is that I really need to learn R.

Economic evaluation of factorial randomised controlled trials: challenges, methods and recommendations. Statistics in Medicine [PubMed] Published 3rd May 2017

Factorial trials randomise participants to at least 2 alternative levels (for example, different doses) of at least 2 alternative treatments (possibly in combination). Very little has been written about how economic evaluations ought to be conducted alongside such trials. This study starts by outlining some key challenges for economic evaluation in this context. First, there may be interactions between combined therapies, which might exist for costs and QALYs even if not for the primary clinical endpoint. Second, transformation of the data may not be straightforward, for example, it may not be possible to disaggregate a net benefit estimation into its components using alternative transformations. Third, regression analysis of factorial trials may be tricky for the purpose of constructing CEACs and conducting value of information analysis. Finally, defining the study question may not be simple. The authors simulate a 2×2 factorial trial (0 vs A vs B vs A+B) to demonstrate these challenges. The first analysis compares A and B against placebo separately in what’s known as an ‘at-the-margins’ approach. Both A and B are shown to be cost-effective, with the implication that A+B should be provided. The next analysis uses regression, demonstrating that interaction terms are unlikely to be statistically significant for costs or net benefit. ‘Inside-the-table’ analysis is used to separately evaluate the 4 alternative treatments, with an associated loss in statistical power. The findings of this analysis contradict the findings of the at-the-margins analysis. A variety of regression-based analyses is presented, with the discussion focussed on the variability in the estimated standard errors and the implications of this for value of information analysis. The authors then go on to present their conception of the ‘opportunity cost of ignoring interactions’ as a new basis for value of information analysis. A set of 14 recommendations is provided for people conducting economic evaluations alongside factorial trials, which could be used as a bolt-on to CHEERS and CONSORT guidelines.
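As a rough illustration of the distinction (not the paper’s own code), the two regression approaches for a 2×2 factorial trial might be sketched in R as:

#hypothetical data frame 'trial' with individual net benefit 'nb' and treatment indicators 'a' and 'b'
at_the_margins   <- lm(nb ~ a + b, data = trial) #main effects only, ignoring any interaction
inside_the_table <- lm(nb ~ a * b, data = trial) #includes the a:b interaction term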
