Part 5 – Code-Based Analysis
This blog series has been about the reading, understanding and application of veterinary papers from a statistical perspective. In this final post, we’ll visit some of the more modern ideas around data analysis, and offer a chance to see some examples of literate programming. This, I believe, will provide a means to understand some of the key statistical ideas at a far deeper level than simply reading about them. If at any stage you are involved in veterinary research, or if you’re just curious about the latest developments in data analysis, this post is for you.
Have you ever been sent a spreadsheet that looks like this?
The things wrong with this (made-up) example are so numerous it brings a tear to my eye. The bottom line is this: the analysis is not reproducible. Such an approach is not a good way to organize, analyse, present or communicate data. Nor does it help the original analyst to revisit and quickly understand the work at some future point, and it helps a third party collaborating on the project even less. Beyond the reproducibility issues, the scope for human error is unbounded, with copy-and-paste errors being the most common.
There was a time, not so long ago, when I would happily produce such an abomination. Today, having seen the error of my ways, I want to help you to emerge from the cellar into the light of modern data analysis.
Sharing is Caring
A growing number of journals are now making it obligatory to share data alongside scientific papers. This is an effort to improve transparency and reproducibility, and any researcher that does this is clearly demonstrating a certain level of openness and professionalism.
Behind this growing wave of data sharing are the ripples of another: code sharing. This is the idea that alongside the data should be the computer code that was used to produce the results.
The ultimate example of this is the notebook, as mentioned in blog post 3. There are even some who suggest that the notebook will one day supersede the scientific paper, especially in light of an ever-increasing catalogue of high-profile spreadsheet horror stories.
Notebooks allow you to analyse your data via code, integrate it with text and then output it as a Word file, PDF, slide-deck or webpage.
The challenge, of course, is learning how to code. But beyond that, if you or someone working on a particular project is willing to learn, the advantages are numerous, including total transparency, reproducibility, and the ability to update analysis, plots and tables with a single click (if, for example, new data is acquired or some stage of the analysis is changed).
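To make the "update with a single click" idea concrete, here is a minimal sketch in Python using only the standard library. The data, the column names (`animal_id`, `group`, `weight_kg`) and the function names `summarise` and `report` are all made up for illustration; they are not taken from any real study or from the kernel discussed later. The point is simply that every number in the output is produced by code, so when new rows arrive the whole analysis is rerun rather than patched by hand.

```python
import csv
import io
import statistics

# Hypothetical, made-up data: body weight (kg) for animals in two groups.
# In a real project this text would normally be read from a shared CSV file.
RAW_CSV = """animal_id,group,weight_kg
1,control,21.5
2,control,19.8
3,treated,23.1
4,treated,22.4
"""


def summarise(csv_text):
    """Return {group: (n, mean, sample standard deviation)} from CSV text."""
    groups = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups.setdefault(row["group"], []).append(float(row["weight_kg"]))
    return {
        name: (len(values), statistics.mean(values), statistics.stdev(values))
        for name, values in groups.items()
    }


def report(summary):
    """Print one line per group, in alphabetical order."""
    for name, (n, mean, sd) in sorted(summary.items()):
        print(f"{name}: n={n}, mean={mean:.2f} kg, sd={sd:.2f} kg")


# First run of the analysis.
report(summarise(RAW_CSV))

# New (again made-up) data arrives: append the rows and rerun the same code.
UPDATED_CSV = RAW_CSV + "5,control,20.6\n6,treated,24.0\n"
report(summarise(UPDATED_CSV))
```

Nothing is copied and pasted between the two runs. The second report comes from the same code path as the first, which is what makes the result reproducible by anyone who has both the data and the script.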
Kaggle is a data science platform that hosts both data science competitions and thousands of datasets that practitioners can use to develop their own projects. Access is completely free, and both R and Python notebooks can be created and shared (R and Python being the two most popular languages for open, code-based statistical analysis). Datasets can also be uploaded and shared with the wider data science community.
For me, this is a great way to take any learning from a book or course and put it into action. Veterinary-specific datasets are limited at the time of writing, but many are available in the wider healthcare category, including data on:
- Contagious disease outbreaks
- Mental health surveys
- Alcohol and drug use amongst teens
- Google health searches
- Zika virus outbreaks
- Chronic disease indicators
- Chest radiographs
- Animal bite incidents
I’ve created a dataset on Kaggle, along with a ‘kernel’ (Kaggle’s name for a notebook) showing some basic statistical techniques in R, in particular, techniques and concepts from this blog series.
My kernel can be found here: https://www.kaggle.com/tentotheminus9/understanding-veterinary-papers-example-code
I aim to create further kernels in the future on other statistical topics that are relevant to veterinary science. If you fancy getting hands-on with such analysis, I encourage you to head over to the Kaggle home page, create an account, and start playing.
The aim of this series has been to introduce the veterinary reader to some of the tools and techniques involved in veterinary research papers. Hopefully, by this point, reading a paper and evaluating the contents will be that much clearer in terms of the statistical methods and findings. In this final post, we’ve seen (via Kaggle) some of these ideas in the form of code, which is the emerging best-practice approach to data analysis.
If you have any questions on any subject covered, including coding, feel free to drop me an email at email@example.com
- The Scientific Paper Is Obsolete https://www.theatlantic.com/science/archive/2018/04/the-scientific-paper-is-obsolete/556676/
- Spreadsheet mistakes http://www.eusprig.org/horror-stories.htm
- Kaggle www.kaggle.com
Written by Rob Harrand – Technology & Data Science Lead