Method of the month: Mobile data collection with a cloud server

Once a month we discuss a particular research method that may be of interest to people working in health economics. We’ll consider widely used key methodologies, as well as more novel approaches. Our reviews are not designed to be comprehensive but provide an introduction to the method, its underlying principles, some applied examples, and where to find out more. If you’d like to write a post for this series, get in touch. This month’s method is mobile data collection with a cloud server.

Principles

A departure from the usual format of this feature to bring you more of a tutorial based outline of this method. Surveys are an important data collection tool for social scientists. Among the most commonly used surveys for health economists are the British Household Panel Survey and Health Survey for England. But we also interrogate patients in trials and follow-up to complete well-being and quality of life surveys like the EQ-5D. Sometimes, when there’s funding, we’ll even do our own large-scale household survey. Many studies use paper-based methods of data collection when doing these surveys, but this is more expensive than it’s digital equivalent and more error-prone.

Implementation

Rather than printing out survey forms then having to transcribe the results back into digital format, the surveys can be completed on a tablet or smartphone, uploaded to a central server, automatically error checked, and then saved in a ready to use format. However, setting up a system to do digital data collection can seem daunting and difficult. Most people do not know how to run a server or what software to use. In this post, we will describe the process of setting up a server using a cloud server provider and using open source survey software. Where other tutorials exist on individual steps we link to them, otherwise, the information is provided here.

Software

OpenDataKit (ODK) is a family of programs designed for the collection of data on Android devices using a flexible system for designing survey forms. The programs include ODK Collect, which is installed on phones or tablets to complete forms; ODK Aggregate, which collates responses on a server; and ODK Briefcase, which works on a computer and ‘pulls’ data from the server for use on the computer.

Using ODK, mobile data collection can be implemented using a cloud server in the following 7 steps.

1. Set up a cloud server

For this tutorial we’ll be using Digital Ocean, a cloud server provider that is easy to use, permits you to deploy as many ‘droplets’ (i.e. cloud servers) as you want, and enables you to specify your hardware requirements. ODK Aggregate will run on Amazon Web Services and Google App Engine as well and is, in fact, easier to deploy on these platforms. However, we’ve chosen to go with Digital Ocean to make sure we control where our data are being stored – a key issue to ensure we’re compliant with EU data protection regulations, especially the GDPR. Digital Ocean met all our needs for information security.

You will need an account with Digital Ocean with a credit card to pay for services. The server we use costs around $15 a month to run (or 2 cents an hour), but can be ‘destroyed’ when not in use and then rebooted when needed by storing a ‘snapshot’ before you destroy it. Once you are logged in, create a new droplet with Ubuntu 16.04.4 x64. For our purposes, 2 GB of memory and 2 vCPUs will be sufficient. You will receive an email with the root password of the droplet and IP address. More info can be found here.

To log into the server from a Windows computer, you can download and run Putty. From Linux based or Mac OS X you can use the ssh command in the terminal. We recommend you follow this tutorial to perform the initial server set up, i.e. creating a user and password.

ODK Aggregate requires Tomcat to run, so we will install that. We then provide an optional step to allow access to the server only over https: (i.e. encrypted internet connections). This provides an extra layer of security, however, ODK also features RSA-key encryption of forms when transmitting to the server, so https: can be avoided if required. You will require a registered domain name to use https:.

2. Install Tomcat 8

Once you’re logged into the server and using your new user (it’s better to avoid using the root account when possible), run the following code:

sudo apt-get update
sudo apt-get install default-jdk
sudo groupadd tomcat
sudo useradd -s /bin/false -g tomcat -d /opt/tomcat tomcat
cd /tmp

In the next line, you may want to replace the line of code with a link to a more recent version of Tomcat

wget http://mirror.ox.ac.uk/sites/rsync.apache.org/tomcat/tomcat-8/v8.5.27/bin/apache-tomcat-8.5.27.tar.gz -O apache-tomcat-8.5.24.tar.gz
sudo mkdir /opt/tomcat
sudo tar xzvf apache-tomcat-8*tar.gz -C /opt/tomcat --strip-components=1
cd /opt/tomcat
sudo chgrp -R tomcat /opt/tomcat
sudo chmod -R g+r conf
sudo chmod g+x conf
sudo chown -R tomcat webapps/ work/ temp/ logs/

Now we’re going to open up a file and edit the text using the nano text editor:

sudo nano /etc/systemd/system/tomcat.service

Then copy and paste the following:

[Unit]
Description=Apache Tomcat Web Application Container
After=network.target

[Service]
Type=forking

Environment=JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-amd64/jre
Environment=CATALINA_PID=/opt/tomcat/temp/tomcat.pid
Environment=CATALINA_HOME=/opt/tomcat
Environment=CATALINA_BASE=/opt/tomcat
Environment='CATALINA_OPTS=-Xms512M -Xmx1024M -server -XX:+UseParallelGC'
Environment='JAVA_OPTS=-Djava.awt.headless=true -Djava.security.egd=file:/dev/./urandom'

ExecStart=/opt/tomcat/bin/startup.sh
ExecStop=/opt/tomcat/bin/shutdown.sh

User=tomcat
Group=tomcat
UMask=0007
RestartSec=10
Restart=always

[Install]
WantedBy=multi-user.target

Then save and close (Ctrl+x then Y then Enter). Going on,

sudo systemctl daemon-reload
sudo systemctl start tomcat
sudo systemctl status tomcat
sudo ufw allow 8080
sudo systemctl enable tomcat

Again, we’re going to open a text file, this time to change the username and password to log in:

sudo nano /opt/tomcat/conf/tomcat-users.xml

Now we need to set the username and password. Here we’ve used the username ‘admin’, the password should be changed to something strong and memorable. The username and password are changed in this block:

<tomcat-users . . .>
 <user username="admin" password="password" roles="manager-gui,admin-gui"/>
</tomcat-users>

Now, we need to comment out two blocks of text in two different files. The block of text is

 <Valve className="org.apache.catalina.valves.RemoteAddrValve"
 allow="127\.\d+\.\d+\.\d+|::1|0:0:0:0:0:0:0:1" />

and the two files can be accessed respectively with

sudo nano /opt/tomcat/webapps/manager/META-INF/context.xml
sudo nano /opt/tomcat/webapps/host-manager/META-INF/context.xml

then restart Tomcat

sudo systemctl restart tomcat

3. Install an SSL certificate (optional)

Skip this step if you don’t want to use an SSL certificate. If you do, you will need a domain name (e.g. www.aheblog.com), and you will need to point that domain name to the IP address of your droplet. Follow these instructions to do that. It is possible to self-sign an SSL certificate and so not need a domain name, however, this will not work with ODK Collect as the certificate will not be trusted by Android. SSL certificates are issued by trusted authorities, a service for which they typically charge. However, Let’s Encrypt does it for free. To use it we need to install and use certbot:

sudo apt-get update
sudo apt-get install software-properties-common
sudo add-apt-repository ppa:certbot/certbot
sudo apt-get update
sudo apt-get install certbot

Now, use  to get the certificates for your domain:

sudo cerbot certonly

Completing all the questions that are prompted – you want a certificate for a ‘standalone’.

Now, we need to convert the certificate files into the Java Key Store format so that it can be used by Tomcat. We will need to do this as the root user (replace the domain name and passwords as appropriate):

su - root
cd /etc/letsencrypt/www.domainnamehere.com

openssl pkcs12 -export -in fullchain.pem -inkey privkey.pem -out fullchain_and_key.p12 -name tomcat keytool -importkeystore -deststorepass PASSWORD  -destkeypass PASSWORD -destkeystore MyDSKeyStore.jks -srckeystore fullchain_and_key.p12 -srcstoretype PKCS12 -srcstorepass PASSWORD -alias tomcat

mkdir /opt/tomcat/ssl
cp MyDSKeyStore.jks /opt/tomcat/ssl/

We can then switch back to our user account

su - USER

Open the following file

sudo nano /opt/tomcat/conf/server.xml

and replace the text with in the document where you see

<!-- Define a SSL Coyote HTTP/1.1 Connector on port 8443 -->

making sure to input the correct password:

<Connector port="8443" protocol="org.apache.coyote.http11.Http11NioProtocol"
maxThreads="150" SSLEnabled="true" scheme="https" secure="true"
clientAuth="false" sslProtocol="TLS" keystoreFile="/opt/tomcat/ssl/MyDSKeyStore.jks" 
keystoreType="JKS" keystorePass="PASSWORD"/>

Exit the file (Ctrl+X, Y, and Enter) and restart the service

sudo systemctl restart tomcat

Now, we want to block unsecured (HTTP) connections and allow only encrypted (HTTPS) connections:

sudo ufw delete allow 8080
sudo ufw allow 8443

If you use an SSL certificate, it will need to be renewed every 90 days. This is an automatic process, however, converting to JKS and saving into the Tomcat directory is not. You can either do it manually each time or write a Bash script to do it.

4. Install ODK Aggregate

We’ll firstly install the database software that will hold the collected data, we’ll use PostgreSQL (you can always use MySQL as well). The following code will install the right packages:

sudo apt-get update 
sudo apt-get install postgresql-9.6

Now we are going to install ODK Aggregate. This is straightforward as the ODK Aggregate code provides a walkthrough. It is important to select the correct options during this process, which should be obvious from the above tutorial. In the first line of the following code, you may need to replace the link in the first command if the download does not work.

sudo wget https://opendatakit.org/download/4456 -O /tmp/linux-x64-installer.run
sudo chmod +x /tmp/linux-x64-installer.run
sudo /tmp/linux-x64-installer.run

then follow the instructions. Now, we are going to connect to the database to ODK Aggregate:

sudo -u postgres psql 
\cd '/ODK/ODK Aggregate'
\i create_db_and_user.sql
\q

Now copy the ODK installation to the Tomcat directory

sudo cp /ODK/ODK\ Aggregate/ODKAggregate.war /opt/tomcat/webapps

And the final step is to restart Tomcat

sudo systemctl restart tomcat

To test whether the installation has been successful go to (replacing the URL as appropriate):

https://www.domainnamehere.com:8443/ODKAggregate

or if you do not have a domain name use:

http://<IP address>:8443/ODKAggregate

Use this URL for accessing ODK Aggregate.

To log on, use the ‘admin’ account and the password ‘aggregate’. Once you are logged in you can change the admin account password and create new user accounts as required.

5. Programme a survey

Surveys can be complex with logical structures that skip questions for certain responses, require different type of responses like numbers, text, or multiple choice, and can need signatures or images. Multiple languages are often needed. All of this is possible in ODK as well. To programme your survey into a format that ODK can use, you can use XLSform, a standard for authoring forms in Excel. The website has a good tutorial. It is important to try to learn as much as possible as it is very flexible. A few key things and tips to note:

  • Skip and logical sequences are managed with the ‘Relevant’ column;
  • If you want to be able to skip big groups of questions, you should use ‘groups’;
  • If the same question is to be asked multiple (but an unknown number of) times you should use ‘Repeats’. Note that when you download the data the multiple responses from repeat type questions will be grouped in their own .csv files rather than with the rest of the survey;
  • Encryption of a survey is managed by putting an RSA public key in the Settings tab;
  • There are multiple different question types, learn them!;
  • You can use the response from one question as text in another question, for example, someone’s name – using ${} syntax to refer to questions as with the ‘Relevant’ column;
  • You can automatically collect default data like start and end time to check there’s been no cheating by data collectors.

Once your form is complete, you can convert it here, it will notify you of any errors. Once you have a .xml file ready to go, you can upload it in the ‘Form Management’ section of the ODK Aggregate interface.

6. Use ODK Collect

ODK Collect can be downloaded from the Google Play store on any Android device. It is easy to use and a number of training guides exist. You will firstly need to link the app to the server by going to General Settings -> Server and inputting the URL that directs to your ODK Aggregate interface. The form you uploaded to your server can be downloaded with ‘Get Blank Form’ in the main menu and then data collected with it by selecting ‘Fill Blank Form’. Swiping left or right moves between questions. You can save at any point and come back to the survey at a later time. There are options you may also want to consider such as ‘Auto-finalising’, which means that once a survey is complete it is no longer accessible on the device, and ‘Auto send’ which will automatically send the data to the server when the form is finalised and there’s an internet connection. What you choose depends on your information security requirements.

7. Download the data

Submissions to the ODK Aggregate server need to be downloaded to a computer in order to be decrypted and the data processed and analysed. Exporting files from the server requires a number of pieces of software on the computer to which it is being downloaded:

  • Java 8. Update to the latest version of Java or install it if it is not already on the computer. Java can be downloaded here.
  • Unlimited Strength JCE Policy Files. These files are necessary if you are using encrypted forms and can be downloaded from here. To install the files, extract the contents of the compressed file, and copy and paste the files in the folder to the following location [java-home]/lib/security/policy/unlimited.
  • ODK Briefcase. This can be downloaded from the OpenDataKit website.

When you first launch ODK Briefcase the first thing you must do is choose a storage location where it will save all data downloads. Once you have done this go to the ‘Pull’ tab to download the surveys. Click ‘Connect…’ to input the URL of the ODK Aggregate instance. Select which forms you will need to download data for and click ‘Pull’ in the lower right-hand side of the window.

To download data submissions go to the ‘Export’ tab. The available forms are listed in this window. By the form for which you want to download submissions, input a storage location for the data. In the next row, select the location of the private RSA key if you are using encrypted forms, which must be in .pem format. Many programs will generate .pem keys, but some provide the keys as strings of text that will be saved as text files. If the text begins with ‘— BEGIN RSA PRIVATE KEY —‘, then simply change the file type to .pem.

Applications

Now you should be good to go. Potential applications are extensive. The ODK system has been employed in evaluative studies (of performance-based financing, for example), to conduct discrete choice experiments, or for more general surveillance of health service use and outcomes.

There are ways of extending these tools further, such as collecting GPS and location and map data through Open Map Kit, which links to Open Street Map. There are also private companies who use OpenDataKit-based products to offer data collection and management services like SurveyCTO. However, we have found a key part of complying with data protection rules involves knowing exactly where data will be stored and having complete control over accessing it, which many services cannot offer. The flexibility of managing your own server permits more control, for example, you can write scripts to check for data submissions and to process them and upload them to another server or you can host other data collection tools when you need. Many universities or institutions may provide these services ‘in-house’, but if they do not support the software it can be difficult using company servers. A cloud server provider gives us an alternative solution that can be up and running in an hour.

Credit

Chris Sampson’s journal round-up for 22nd May 2017

Every Monday our authors provide a round-up of some of the most recently published peer reviewed articles from the field. We don’t cover everything, or even what’s most important – just a few papers that have interested the author. Visit our Resources page for links to more journals or follow the HealthEconBot. If you’d like to write one of our weekly journal round-ups, get in touch.

The effect of health care expenditure on patient outcomes: evidence from English neonatal care. Health Economics [PubMed] Published 12th May 2017

Recently, people have started trying to identify opportunity cost in the NHS, by assessing the health gains associated with current spending. Studies have thrown up a wide range of values in different clinical areas, including in neonatal care. This study uses individual-level data for infants treated in 32 neonatal intensive care units from 2009-2013, along with the NHS Reference Cost for an intensive care cot day. A model is constructed to assess the impact of changes in expenditure, controlling for a variety of variables available in the National Neonatal Research Database. Two outcomes are considered: the in-hospital mortality rate and morbidity-free survival. The main finding is that a £100 increase in the cost per cot day is associated with a reduction in the mortality rate of 0.36 percentage points. This translates into a marginal cost per infant life saved of around £420,000. Assuming an average life expectancy of 81 years, this equates to a present value cost per life year gained of £15,200. Reductions in the mortality rate are associated with similar increases in morbidity. The estimated cost contradicts a much higher estimate presented in the Claxton et al modern classic on searching for the threshold.

A comparison of four software programs for implementing decision analytic cost-effectiveness models. PharmacoEconomics [PubMed] Published 9th May 2017

Markov models: TreeAge vs Excel vs R vs MATLAB. This paper compares the alternative programs in terms of transparency and validation, the associated learning curve, capability, processing speed and cost. A benchmarking assessment is conducted using a previously published model (originally developed in TreeAge). Excel is rightly identified as the ‘ubiquitous workhorse’ of cost-effectiveness modelling. It’s transparent in theory, but in practice can include cell relations that are difficult to disentangle. TreeAge, on the other hand, includes valuable features to aid model transparency and validation, though the workings of the software itself are not always clear. Being based on programming languages, MATLAB and R may be entirely transparent but challenging to validate. The authors assert that TreeAge is the easiest to learn due to its graphical nature and the availability of training options. Save for complex VBA, Excel is also simple to learn. R and MATLAB are equivalently more difficult to learn, but clearly worth the time saving for anybody expecting to work on multiple complex modelling studies. R and MATLAB both come top in terms of capability, with Excel falling behind due to having fewer statistical facilities. TreeAge has clearly defined capabilities limited to the features that the company chooses to support. MATLAB and R were both able to complete 10,000 simulations in a matter of seconds, while Excel took 15 minutes and TreeAge took over 4 hours. For a value of information analysis requiring 1000 runs, this could translate into 6 months for TreeAge! MATLAB has some advantage over R in processing time that might make its cost ($500 for academics) worthwhile to some. Excel and TreeAge are both identified as particularly useful as educational tools for people getting to grips with the concepts of decision modelling. Though the take-home message for me is that I really need to learn R.

Economic evaluation of factorial randomised controlled trials: challenges, methods and recommendations. Statistics in Medicine [PubMed] Published 3rd May 2017

Factorial trials randomise participants to at least 2 alternative levels (for example, different doses) of at least 2 alternative treatments (possibly in combination). Very little has been written about how economic evaluations ought to be conducted alongside such trials. This study starts by outlining some key challenges for economic evaluation in this context. First, there may be interactions between combined therapies, which might exist for costs and QALYs even if not for the primary clinical endpoint. Second, transformation of the data may not be straightforward, for example, it may not be possible to disaggregate a net benefit estimation with its components using alternative transformations. Third, regression analysis of factorial trials may be tricky for the purpose of constructing CEACs and conducting value of information analysis. Finally, defining the study question may not be simple. The authors simulate a 2×2 factorial trial (0 vs A vs B vs A+B) to demonstrate these challenges. The first analysis compares A and B against placebo separately in what’s known as an ‘at-the-margins’ approach. Both A and B are shown to be cost-effective, with the implication that A+B should be provided. The next analysis uses regression, with interaction terms demonstrating the unlikelihood of being statistically significant for costs or net benefit. ‘Inside-the-table’ analysis is used to separately evaluate the 4 alternative treatments, with an associated loss in statistical power. The findings of this analysis contradict the findings of the at-the-margins analysis. A variety of regression-based analyses is presented, with the discussion focussed on the variability in the estimated standard errors and the implications of this for value of information analysis. The authors then go on to present their conception of the ‘opportunity cost of ignoring interactions’ as a new basis for value of information analysis. A set of 14 recommendations is provided for people conducting economic evaluations alongside factorial trials, which could be used as a bolt-on to CHEERS and CONSORT guidelines.

Credits

Sam Watson’s journal round-up for 23rd January 2017

Every Monday our authors provide a round-up of some of the most recently published peer reviewed articles from the field. We don’t cover everything, or even what’s most important – just a few papers that have interested the author. Visit our Resources page for links to more journals or follow the HealthEconBot. If you’d like to write one of our weekly journal round-ups, get in touch.

Short-term and long-term effects of GDP on traffic deaths in 18 OECD countries, 1960–2011. Journal of Epidemiology and Community Health [PubMedPublished February 2017

Understanding the relationships between different aspects of the economy or society in the aggregate can reveal to us knowledge about the world. However, they are more complicated than analyses of individuals who either did or did not receive an intervention, as the objects of aggregate analyses don’t ‘exist’ per se but are rather descriptions of average behaviour of the system. To make sense of these analyses an understanding of the system is therefore required. On these grounds I am a little unsure of the results of this paper, which estimates the effect of GDP on road traffic fatalities in OECD countries over time. It is noted that previous studies have shown that in the short-run, road traffic deaths are procyclical, but in the long-run they have declined, likely as a result of improved road and car safety. Indeed, this is what they find with their data and models. But, what does this result mean in the long-run? Have they picked up anything more than a correlation with time? Time is not included in the otherwise carefully specified models, so is the conclusion to policy makers, ‘just keep doing what you’re doing, whatever that is…’? Models of aggregate phenomena can be among the most interesting, but also among the least convincing (my own included!). That being said, this is better than most.

Sources of geographic variation in health care: Evidence from patient migration. Quarterly Journal of Economics [RePEcPublished November 2016

There are large geographic differences in health care utilisation both between countries and within countries. In the US, for example, the average Medicare enrollee spent around $14,400 in 2010 in Miami, Florida compared with around $7,800 in Minneapolis, Minnesota, even after adjusting for demographic differences. However, higher health care spending is generally not associated with better health outcomes. There is therefore an incentive for policy makers to legislate to reduce this disparity, but what will be effective depends on the causes of the variation. On one side, doctors may be dispensing treatments differently; for example, we previously featured a paper looking at the variation in overuse of medical testing by doctors. On the other side, patients may be sicker or have differing preferences on the intensity of their treatment. To try and distinguish between these two possible sources of variation, this paper uses geographical migration to look at utilisation among people who move from one area to another. They find that (a very specific) 47% of the difference in use of health care is attributable to patient characteristics. However, I (as ever) remain skeptical: a previous post brought up the challenge of ‘transformative treatments’, which may apply here as this paper has to rely on the assumption that patient preferences remain the same when they move. If moving from one city to another changes your preferences over healthcare, then their identification strategy no longer works well.

Seeing beyond 2020: an economic evaluation of contemporary and emerging strategies for elimination of Trypanosoma brucei gambiense. Lancet Global Health Published November 2016

African sleeping sickness, or Human African trypanosomiasis, is targeted for eradication in the next decade. However, the strategy to do so has not been determined, nor whether any such strategy would be a cost-effective use of resources. This paper aims to model all of these different strategies to estimate incremental cost-effectiveness threshold (ICERs). Infectious disease presents an interesting challenge for health economic evaluation as the disease transmission dynamics need to be captured over time, which they achieve here with a ‘standard’ epidemiological model using ordinary differential equations. To reach elimination targets, an approach incorporating case detection, treatment, and vector control would be required, they find.

A conceptual introduction to Hamiltonian Monte Carlo. ArXiv Published 10th January 2017

It is certainly possible to drive a car without understanding how the engine works. But if we want to get more out of the car or modify its components then we will have to start learning some mechanics. The same is true of statistical software. We can knock out a simple logistic regression without ever really knowing the theory or what the computer is doing. But this ‘black box’ approach to statistics has clear problems. How do we know the numbers on the screen mean what we think they mean? What if it doesn’t work or if it is running slowly, how do we diagnose the problem? Programs for Bayesian inference can sometimes seem even more opaque than others: one might well ask what are those chains actually exploring, if it’s even the distribution of interest. Well, over the last few years a new piece of kit, Stan, has become a brilliant and popular tool for Bayesian inference. It achieves fast convergence with less autocorrelation between chains and so it achieves a high effective sample size for relatively few iterations. This is due to its implementation of Hamiltonian Monte Carlo. But it’s founded in the mathematics of differential geometry, which has restricted the understanding of how it works to a limited few. This paper provides an excellent account of Hamiltonian Monte Carlo, how it works, and when it fails, all replete with figures. While it’s not necessary to become a theoretical or computational statistician, it is important, I think, to have a grasp of what the engine is doing if we’re going to play around with it.

Credits