“Man is a tool-using animal. Without tools he is nothing, with tools he is all” – Thomas Carlyle
We’ve so far talked about the importance of data, whether it be something you’ve purposely collected as part of a coordinated study, data you’ve amassed as a by-product of some other process, or data you’ve found in one of the many open data repositories on the internet.
The next question is how to analyse such data. Specifically, what tools are out there to help? This post outlines some of the main ones that are available for turning your raw data into insight.
The humble spreadsheet doesn’t fare well in modern data science circles, but I believe their widespread dismissal for something more powerful (yet harder to use) is unjustified. Spreadsheet programs like Microsoft Excel offer a wide range of tools that allow just about anyone to make sense of data.
Beyond the raft of functions, quick visualisation tools and the ability to create macros (a way to automate a task or procedure), one advantage of such a program is the ability to perform ‘What-If’ analysis in a quick and visual way.
For example, setting up a number of calculations to work out the costs of ordering treatment for a number of sick patients. This could be based upon the number of patients, the expected proportion that will need the treatment, and the cost of the treatment. Any of these three values could then be changed and tweaked, and the ultimate cost of ordering the treatment updated in real-time (the blue cell below),
If you don’t have a licence for Excel, there is a great open-source version within the Open Office suite.
Use of Excel in the veterinary literature is extensive. Looking specifically for use of what-if analysis didn’t reveal a great number of modern examples, but one paper from 2016 by Faverjon et al1 used such a technique in Excel to look at Culicoides-borne diseases in horses. That said, they did use an add-on package called @Risk to significantly increase Excel’s capabilities. In their work, they modelled the probabilities of African horse sickness and equine encephalosis entering Europe, and used what-if analysis to look at the effects of different preventative measures.
When to use: When you don’t want to get involved in any programming, and your data isn’t too large
Beyond using spreadsheets, the next level up is that of the commercial statistical package. These include such programs as SAS, Matlab, SPSS, Prism, and Tableau (just to name a few). Such packages offer the user a slick and intuitive way of performing analytics, with the complexity of features ranging from simple drag-and-drop interfaces to more technical programming environments.
Such features come at a price, with SAS costing around £75 per month, Matlab costing £1,800 for a general licence, a Prism licence for two at around £300 per year, Tableau Professional Edition at £50 per month, and SPSS at an eye-watering £6,600 per year.
A search of the literature shows examples in veterinary of all five of these packages. As an example, a paper by Usui et al2 in 2015 used SAS to investigate the characteristics of Japanese dogs with diabetes. From over 5000 cases collected between 2006 and 2013, they found that dogs with diabetes tended to be old and underweight, with a high proportion being toy poodles.
When to use: When you want a relatively easy way to perform analytics without necessarily delving too deep into the technical detail. And when you can afford it!
R is a free, open-source statistical software language, which has grown in popularity year-on-year in recent times. It’s learning curve is steep, but once even partially ascended, R allows you to perform any statistical work you could possibly imagine. This wide-range functionality stems from its open source nature; anyone can create new modules to handle new types of problems, and these new features can be consequently loaded and used by anyone.
To get started, download R here, and then a free program called R-Studio here, (R-Studio is the interface that you use to write your code). Once installed, R comes with many datasets built in for you to learn the ropes. Below are two lines of code to create a boxplot of chick-weights by feed-type,
This line loads the chick weights dataset,
This line says ‘create a boxplot by breaking down the weights by feed-type’,
boxplot(chickwts$weight ~ chickwts$feed)
The resulting boxplot is shown below,
Of course, you can add a title, axis-labels, colours, etc. In addition to these ‘base’ visualisations, R has a package called ggplot2 which allows you to create publication-level visualisations. In addition to graphics, there are packages available for such diverse areas as finance, genetics, image analysis, social sciences, language analysis and machine learning. The easiest way to find out more is simply to Google ‘R’ along with the problem of your choice, and you’ll be almost guaranteed to find other people online who have solved something similar. And because R is open source, they often share their solutions.
The use of R in veterinary papers appears to be fairly wide-spread. A paper by Babyak et al3 from 2017 looked at the prevalence of elevated serum creatinine concentration in dogs, using R for all the associated statistical work. Utilising records from over 115k hospital visits, they concluded that a high proportion of cases had raised creatinine levels, indicating kidney disease.
When to use: When you want to delve into the details and are primarily concerned with the statistical analysis of data (as opposed to create pieces of software)
Another open-source language widely used for data analysis is Python, or more specifically, the ‘ecosystem’ of SciPy. This is a collection of modules (just like in R) that do a lot of the heavy lifting in Python when it comes to data analysis. One module in Python that’s great for data visualisation is Seaborn (based on the more general Matplotlib package). Below is some Python code for loading some built-in data and creating a boxplot.
These first two lines load the iris dataset (iris petal sizes vs type),
from sklearn import datasets
iris = datasets.load_iris()
These next two lines load the Seaborn module, and then say ‘Create a boxplot by breaking down the first column by type’. Note that the type is called ‘target’ here, and the first column, which corresponds to sepal length in centimetres, is referred to by the ‘0’ (which is just the column number in the data),
import seaborn as sns
The resulting boxplot is shown below,
As with the R example, this needs labels, etc to be of any real-world use. But it shows how powerful these languages can be, in that you can create things with just a few lines of code. Beyond the quick results, both languages then allow you to dig deeper into the technical details. One advantage Python has over R is that it is a multi-purpose language, with more mature support for professional software development concepts such as object-oriented programming.
To get started with Python, I recommend the Anaconda distribution, which can be found here.
Searching for the word ‘python’ in the veterinary literature isn’t straightforward, for obvious reasons! Papers relating to the Python software were eventually found, although the general use of Python appeared to be beyond the more straight-forward statistical analysis done in R. One example is a paper by Brunton et al4 in 2015, looking at mapping the spread of endemic bovine tuberculosis in England and Wales. Python was used, but primarily to create scripts that then interfaced with a commercial mapping tool called ArcGIS.
When to use: For general analysis, or for when you’re more concerned about creating software beyond the initial data analysis
Beyond installing something like R and importing some data (for example, from an Excel file), the next step is to learn Structured-query language (SQL). This is very different to everything covered so far, but it’s an essential skill to acquire if you’re going to get seriously involved in data analysis.
SQL is used to manage data held in a relational database, a ubiquitous type of data-store that’s used to hold data whenever it’s well-structured and clearly defined. Relational databases work, in short, by dividing data into tables, and connecting these tables using keys.
You can use SQL to extract, input, update or delete data, and the syntax is as simple or as complex as the nature of your query and the design of your database. For example, the line below will show you all the data from a (made up) table called pet_details,
SELECT * FROM pet_details
Querying multiple tables becomes harder, as you need to tell the database how you want to connect the different (but related) tables together. However, the internet is awash with tutorials examples to help you make a start.
References to SQL in the veterinary literature are widespread, almost always referring to how a certain dataset was accessed. For example, a paper by Summer et al5 in 2014 looked at the prescribing practices of primary-care veterinary practitioners in dogs with bacterial pyoderma. They note that data was both imported and exported using SQL.
When to use: Typically, when you want to get hold of data from a relational database, to then be analysed (in, say, R or Python)
In the next blog, we’ll conclude our look at the tools of data analysis, with an overview of how to handle all things big data.
Written by Rob Harrand – Technology & Data Science Lead
DID YOU FIND THIS USEFUL?
To register your email to receive further updates from Avacta specific to your areas of interest then click here.
- Faverjon, C., et al. “Comparative Risk Analysis of Two Culicoides‐Borne Diseases in Horses: Equine Encephalosis More Likely to Enter France than African Horse Sickness.” Transboundary and emerging diseases (2016).
- Usui, Shiho, Hidemi Yasuda, and Yuzo Koketsu. “Characteristics of dogs having diabetes mellitus; analysis of data from private Japanese veterinary clinics.” Veterinary Medicine and Animal Sciences1 (2015): 5.
- Babyak, J. M., et al. “Prevalence of Elevated Serum Creatinine Concentration in Dogs Presenting to a Veterinary Academic Medical Center (2010–2014).” Journal of Veterinary Internal Medicine (2017).
- Brunton, L. A., et al. “A novel approach to mapping and calculating the rate of spread of endemic bovine tuberculosis in England and Wales.” Spatial and spatio-temporal epidemiology 13 (2015): 41-50.
- Summers, Jennifer F., Anke Hendricks, and David C. Brodbelt. “Prescribing practices of primary-care veterinary practitioners in dogs diagnosed with bacterial pyoderma.” BMC veterinary research1 (2014): 240.