Understanding Veterinary Papers – A Data and Statistics Perspective – Part 2
Part 2 – Core Concepts
In the last blog post we saw how Evidence Based Veterinary Medicine offers a guide to forming a question, searching for and appraising evidence, and then acting on the findings. We also touched on the importance of statistical knowledge when appraising such evidence, allowing an understanding of the quantitative nature of the results.
We can’t learn about all of statistics in a single blog post, nor would it be of any real benefit to simply outline a whole slew of tools and techniques. There are people who spend their entire professional careers dedicated to just one small part of it, and if you ever pick up a statistics textbook (no mean feat in itself, given they’re often the size of a small fridge), you’re likely to be faced with strange terms and graphs at one extreme, and a hellscape of incomprehensible equations and esoteric jargon at the other. For breadth and depth, I’d personally recommend online introductory courses.
Instead, in this post I want to talk about what I believe are 5 core concepts that promote statistical thinking. Then, with an understanding of these, the gist of quantitative results in the literature should be clearer and any efforts to then fill in the details of the statistics used will hopefully be that much easier. Later in the series, we’ll go a little deeper with the sorts of things that can go wrong in statistics, and there will be some data analysis examples in the final (and bonus) article.
The Big 5
The following are 5 ideas that, for me, underpin the understanding of medical research papers:
- Sample selection and bias
- The Normal Distribution
- Hypothesis Testing
- P-values
- Confidence Intervals
From these concepts, more specific tools such as sensitivity, specificity, odds ratios, linear regression, etc. that frequent the literature should fall into a wider statistical context, the details of which can then be filled in if needed.
Sample selection and bias
Before even thinking about a t-test or a p-value, something is needed on which to apply statistical techniques, which for veterinary papers tends to be groups of animals. How these groups are put together and analysed is absolutely fundamental to everything that comes next, and if care isn’t taken, bias can be introduced that will nullify any apparent findings.
In one guide we read that “The concept of bias is the lack of internal validity or incorrect assessment of the association between an exposure and an effect in the target population”. What on Earth does that mean? Basically, that the effect we see is a product of how the data were selected or measured, and not a true feature of the overall population we ultimately hope to apply it to.
There are lots of different types of bias, and at first, some of the terms can seem a little confusing. I’m going to focus on 3 types: selection bias, information bias, and confounding.
Imagine you wanted to estimate the prevalence of UK dogs with a grass allergy. If you went out and collected samples from only dogs in a rescue centre, or only dogs under the age of 6 months, or only dogs of a single breed, or only collected samples in December, and then communicated the results as if they could be generalized, you’d have introduced selection bias. Or, imagine you only measured samples from dogs with suspected grass allergy (again, with prevalence in mind). In that case, your prevalence would likely be much higher than the national average of dogs genuinely with a grass allergy.
The way to minimise many sources of selection bias is through randomisation. For example, in the case of only selecting cases from December, you could choose samples randomly from throughout the year (many pieces of software will make random selections for you). Or, apply the same randomisation with age or breed in mind. The authors of papers should be thinking about and discussing efforts to mitigate such biases.
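To make the December example concrete, here is a minimal simulation sketch. Everything in it is invented for illustration: the population size, the idea that positive results are found more often in summer months, and the 15% and 3% rates are all assumptions, not real allergy figures.

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable

# Assumption for illustration only: a dog sampled in May-August tests
# positive 15% of the time, and 3% of the time in any other month.
def positive_rate(month):
    return 0.15 if month in (5, 6, 7, 8) else 0.03

# Hypothetical population of 10,000 (month sampled, tested positive) records.
population = []
for _ in range(10_000):
    month = random.randint(1, 12)
    positive = random.random() < positive_rate(month)
    population.append((month, positive))

def prevalence(dogs):
    """Fraction of dogs in the list that tested positive."""
    return sum(positive for _, positive in dogs) / len(dogs)

# Biased selection: only December samples.
december_only = [dog for dog in population if dog[0] == 12]

# Randomised selection: 500 dogs drawn from the whole year.
random_sample = random.sample(population, 500)

print(f"Whole population: {prevalence(population):.3f}")
print(f"December only:    {prevalence(december_only):.3f}")
print(f"Random sample:    {prevalence(random_sample):.3f}")
```

Under these assumptions the December-only estimate sits well below the figure for the whole population, while the random sample lands close to it.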
Related to selection bias is spectrum bias. This is where groups are selected at the extremes of a spectrum of disease (for example), such as ‘completely healthy’ vs ‘at death’s door’, when in fact many cases in the real-world will be somewhere in between. Or, even worse, ‘healthy’ vs ‘diseased’ when in practice the results would only ever be applied to cases of ‘diseased with the disease in question’ vs ‘diseased with something else’ (the only case where it’s appropriate to investigate healthy vs diseased cases is in a screening scenario).
This is the first thing to look for when reading a paper. Ask yourself … are the subjects in this paper representative of the real world? Are they representative of my situation? Have different groups been analysed in the same way throughout? Have any been removed for inappropriate reasons? If any sort of selection bias has crept in, then the best p-value in the world can’t make it relevant to you.
Information bias, at the simplest level, boils down to misclassification of cases, i.e. non-perfect tests are employed to determine disease status. As few perfect tests exist, some degree of misclassification should always be expected. More nuanced types of information bias arise from the interaction between researcher and participant, such as participants failing to mention socially awkward behaviour (heavy alcohol consumption, for example), or knowing too much about the study and answering interview questions in certain ways. This latter point is not relevant to cats and dogs, of course, but pet owners, perhaps?
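The effect of an imperfect test on a prevalence estimate can be sketched with a little arithmetic. The sensitivity and specificity values below are assumptions chosen for illustration; the relationship itself is the standard one, where the observed positives are the true positives plus the false positives.

```python
def apparent_prevalence(true_prevalence, sensitivity, specificity):
    """Fraction of animals that test positive with an imperfect test."""
    true_positives = sensitivity * true_prevalence
    false_positives = (1 - specificity) * (1 - true_prevalence)
    return true_positives + false_positives

# Hypothetical test: 90% sensitivity, 95% specificity.
for true_prev in (0.02, 0.10, 0.30):
    observed = apparent_prevalence(true_prev, sensitivity=0.90, specificity=0.95)
    print(f"true prevalence {true_prev:.2f} -> observed {observed:.3f}")
```

Note how, with these assumed values, a rare condition is overestimated (false positives dominate) while a common one is underestimated (missed cases dominate).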
One solution to such issues is to use blinding. This is where either the researcher or the participant, or both (single-blinded and double-blinded, respectively) do not know key pieces of information, such as which animals are on certain treatment types, the results of certain tests, etc.
Confounding is where a third variable (known as the confounder) is related to both the factor being studied and the outcome being measured, creating or hiding an apparent association between them.
For example, imagine you want to determine if there is any correlation between the amount of exercise a dog has and some measure of general wellness. You collect samples from two groups of dogs, one where their owners indicate that their dogs exceed some predetermined exercise threshold, and others that don’t.
You run your tests and discover that, contrary to your expectations, the dogs that have little to moderate exercise are generally in better health than the exercising group. You then realise that the ‘high amounts of exercise’ samples were all from older dogs and the ‘low amounts of exercise’ dogs were all from younger dogs. The confounder here is age. Such confounders can be subtle and deadly to a study.
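One way to spot this kind of problem is to compare the crude group averages with the averages inside each age band. The sketch below uses invented records (the wellness scores and group sizes are hypothetical), and the groups are unbalanced by age rather than completely separated, so that the within-band comparison is possible.

```python
from statistics import mean

# Hypothetical records: (age_band, exercise_level, wellness_score).
# High-exercise dogs are mostly old; low-exercise dogs are mostly young.
records = (
    [("young", "low", 80)] * 8
    + [("young", "high", 85)] * 2
    + [("old", "low", 60)] * 2
    + [("old", "high", 65)] * 8
)

def mean_wellness(exercise, age_band=None):
    """Mean wellness for an exercise level, optionally within one age band."""
    scores = [
        score
        for band, level, score in records
        if level == exercise and (age_band is None or band == age_band)
    ]
    return mean(scores)

# Crude comparison, ignoring age.
print("All dogs:  low", mean_wellness("low"), "high", mean_wellness("high"))

# Stratified comparison, within each age band.
for band in ("young", "old"):
    print(f"{band:>5} dogs: low", mean_wellness("low", band),
          "high", mean_wellness("high", band))
```

In this made-up data the low-exercise group looks healthier overall, yet within each age band the high-exercise dogs score higher. Age is doing the work, not exercise.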
The Normal Distribution
Unless the authors of the paper you’re reading had infinite resources, whatever study they’ve performed probably used a sample from a population. For example, the population of interest might be ‘all UK dogs’, or ‘all UK cats with the clinical signs of pancreatitis’ or ‘all US horses on immunotherapy’, etc. The sample is then a number of cases from that population, and what’s often the aim is to infer aspects of the population from this sample.
When this is done poorly, it’s pretty obvious. For example, if a study is trying to say something about ‘healthy UK dogs’ but the sample consists of 5 Chihuahuas, that feels wrong. First, it’s unlikely that Chihuahuas offer a good generalization of all other breeds. Second, it’s unlikely that 5 dogs, even if they were 5 different breeds, fully capture the variations of the population in question. It’s obvious, intuitively, that a decent sample of UK dogs would be in the hundreds (at least) and be made up of dogs of all breeds, ages, male and female, neutered and not neutered, urban and rural, etc.
This is where the idea of a distribution comes in. Any quantitative aspect that’s being measured will form a distribution when plotted as a histogram (a type of plot that shows the quantitative variable in question broken down into sections vs the number that fall into that section). Here is an example:
[Figure: histogram of Labrador heights, with the mean marked by a red line]
This shows the number of Labradors (y-axis) in the different height ranges (x-axis). The red line indicates the mean. This is a distribution, in particular, a normal distribution, which makes a lot of sense when you realise what it’s saying. The shape of the distribution is a consequence of the fact that it’s unlikely you’ll get super-short and super-tall Labradors (the start and the end of the distribution, respectively), but you are likely to get a lot around the average.
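The shape described above can be reproduced with simulated data. This is only a sketch: the mean of 57 cm and standard deviation of 2.5 cm are assumed values picked for illustration, not measured Labrador heights.

```python
import random
from statistics import mean

random.seed(42)  # fixed seed so the sketch is repeatable

# Assumed values for illustration: mean 57 cm, standard deviation 2.5 cm.
heights = [random.gauss(57, 2.5) for _ in range(1000)]

def histogram(values, start, stop, width):
    """Count how many values fall into each bin of the given width."""
    n_bins = int((stop - start) / width)
    counts = [0] * n_bins
    for value in values:
        index = int((value - start) // width)
        if 0 <= index < n_bins:  # ignore anything outside the plotted range
            counts[index] += 1
    return counts

counts = histogram(heights, start=47, stop=67, width=2)

# Text histogram: one '#' per 10 dogs in each height range.
for i, count in enumerate(counts):
    low = 47 + 2 * i
    print(f"{low}-{low + 2} cm | {'#' * (count // 10)}")

print(f"Sample mean: {mean(heights):.1f} cm")
```

The bars are tallest around the mean and tail off on either side, which is the bell shape the text describes.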
As soon as a paper enters the realms of samples and populations, you’re into one of the two main branches of statistics: inferential statistics. With this, the aim is to, as the name suggests, infer something about a population from a sample (the other main branch being descriptive statistics, which is much simpler and involves describing data with things such as the mean, median and standard deviation).
In inferential statistics, you’re trying to estimate aspects of the population from your sample, or, you’re trying to tell if your sample is likely, based upon a theoretical population. More common is for researchers to try and tell if there are statistically significant differences between groups. In such cases, you’ll encounter something known as hypothesis testing. This works by first coming up with something called the null hypothesis, which, despite sounding like a 70s prog-rock album, is actually just the statement that there is no difference between the groups.
For example, let’s say a paper has two groups of horses, young and old, both with signs of allergy. Measurements of some blood marker have been taken (say, environmental IgE levels), and the researchers are trying to tell if the mean values for each group are statistically different. In other words, the question may be something like “do older horses possess higher or lower levels of IgE compared to younger horses, given similar levels of allergen exposure?”. In this case, they may perform a two-sample t-test, which tells you how likely it would be to see a difference at least as large as the one observed if the two groups really had the same mean. The null hypothesis in this case would be ‘there is no difference’, and the alternative hypothesis would be ‘there is a difference’ (which could be refined to ‘older horses will have higher levels’ or ‘older horses will have lower levels’).
Again, intuition helps here and can be a great aid when reading papers. Imagine the groups are both made up of 6 horses (6 young and 6 old) and the mean levels of IgE are 616ug/ml and 625ug/ml, respectively. You can already tell that, with such a low number of horses and such similar results, the test is unlikely to conclude anything significant is going on. On the other hand, if 50 young and 50 old horses were used and the levels were 100ug/ml and 900ug/ml, respectively, it now feels very different.
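That intuition can be put into numbers with the t statistic, which is the difference in means divided by its standard error. The sketch below uses the unequal-variance (Welch) form. The means and group sizes come from the two scenarios above, but the standard deviations (40 and 150 ug/ml) are pure assumptions, since a real paper would report its own.

```python
def welch_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Welch t statistic: difference in means divided by its standard error."""
    standard_error = (sd_a ** 2 / n_a + sd_b ** 2 / n_b) ** 0.5
    return (mean_b - mean_a) / standard_error

# Scenario 1: 6 young vs 6 old horses, means 616 and 625 ug/ml.
# Assumed standard deviation of 40 ug/ml in each group.
t_small = welch_t(616, 40, 6, 625, 40, 6)

# Scenario 2: 50 young vs 50 old horses, means 100 and 900 ug/ml.
# Assumed standard deviation of 150 ug/ml in each group.
t_large = welch_t(100, 150, 50, 900, 150, 50)

print(f"Scenario 1: t = {t_small:.2f}")
print(f"Scenario 2: t = {t_large:.2f}")
```

As a rough guide, a t statistic well below 2 is unlikely to be declared significant at the usual 0.05 level, while one far above 2 almost certainly will be. Under these assumed spreads, the first scenario falls in the former camp and the second in the latter.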
Of course, gut feeling is no substitute for the hard numbers, and once you’ve gotten this far, you’ll soon encounter the p-value.
The ‘p’ in p-value stands for ‘probability’, and what a p-value is telling you in the context of a hypothesis test like the one above is this: the probability that you’d get a difference at least as large as the one observed, purely by chance, if there is really no difference between the groups. So, if the paper that you’re reading does a test with young and old horses like the example above and gets a p-value of, say, 0.03, then they might conclude something like ‘we conclude that young and old horses possess different levels of IgE, given similar levels of allergen exposure’.
The reason such a conclusion would be expected with a p-value of 0.03 is because that value is less than the (completely arbitrary) level of significance of 0.05 that’s commonly used. They may also employ more formal language along the lines of ‘we therefore reject the null hypothesis’, which is to say, they reject the idea that there is no difference in the mean levels of IgE between the young and old horses.
It’s at this point that p-values can cause people to lose their minds. What such a result means is ‘if the mean levels in the two groups were really the same, there would be a 3% chance of seeing a difference at least this large’, but what some translate this into is ‘we have therefore proven biological differences between the two groups’. This latter claim is a huge overreach.
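The ‘if there is no difference’ part of the definition can be made tangible with a permutation test. Note that this is a different technique from the t-test mentioned earlier, swapped in here because it shows the logic directly: if the group labels don’t matter, shuffling them shouldn’t change much. The IgE values below are invented for illustration.

```python
from itertools import combinations
from statistics import mean

# Hypothetical IgE values (ug/ml), invented for illustration only.
young = [590, 602, 611, 618, 630, 645]
old = [640, 655, 662, 671, 689, 700]

def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation p-value for a difference in means."""
    pooled = group_a + group_b
    total = sum(pooled)
    n_a, n_b = len(group_a), len(group_b)
    observed = abs(mean(group_a) - mean(group_b))

    extreme = 0
    splits = 0
    # Try every way of relabelling n_a of the animals as 'group A'.
    for indices in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in indices)
        difference = abs(sum_a / n_a - (total - sum_a) / n_b)
        if difference >= observed - 1e-9:  # small tolerance for float rounding
            extreme += 1
        splits += 1
    return extreme / splits

p_value = permutation_p_value(young, old)
print(f"p = {p_value:.4f}")
```

The p-value is simply the fraction of relabellings that give a difference at least as large as the one actually seen. It says nothing on its own about how big or biologically important the difference is.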
There’s one more idea to have a handle on when reading modern veterinary papers: the confidence interval. This is a concept that can take a while to get your head around, and can easily be ignored or misinterpreted, but it offers a very useful tool in the evaluation of evidence.
So far, we’ve looked at samples from populations and whether or not such samples are likely to be selected, given a particular null hypothesis.
However, any population you’re evaluating your sample against is theoretical and unknown. That’s the whole reason a sample has been taken. Measuring the population is too difficult, time-consuming or expensive.
Imagine the authors of a paper have measured the CRP levels of 50 dogs that have been diagnosed with atopic dermatitis, with the aim of estimating such levels in all allergic dogs (i.e. the mean value they would obtain if they could measure the CRP levels of every allergic dog in the world, AKA the population mean). They obtain a mean value of 5.2mg/ml. How close is that to the actual population value? We have no idea! For all we know, the population value could be 5.3mg/ml and this paper could be extremely close with its value of 5.2mg/ml. On the other hand, it’s possible that the population mean is actually 9.7mg/ml and this random sample of 50 dogs just so happened to result in a mean of 5.2mg/ml.
There is no way of knowing. However, what can be done is to place a range of values around the obtained mean that gives an idea of where the actual population value might sit. This is where the confidence interval comes in, typically used at a 95% confidence level (quoted as ‘95% CI’).
What such a confidence interval is telling you is this: if the authors were to repeat their experiment an infinite number of times and calculate a confidence interval for each sample, 95% of those intervals would contain the population parameter.
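We can’t repeat an experiment an infinite number of times, but a computer can repeat a simulated one a couple of thousand times. In the sketch below the ‘true’ population mean and standard deviation are assumed values that we get to know only because we invented them, and the interval uses the normal approximation (1.96 standard errors either side of the mean) rather than the slightly wider t-based interval.

```python
import random
from statistics import NormalDist, mean, stdev

random.seed(7)  # fixed seed so the sketch is repeatable

TRUE_MEAN = 5.0    # hypothetical population mean (unknown in real life)
TRUE_SD = 2.0      # hypothetical population standard deviation
SAMPLE_SIZE = 50
REPEATS = 2000

z = NormalDist().inv_cdf(0.975)  # roughly 1.96 for a 95% interval

def confidence_interval(sample):
    """Normal-approximation 95% confidence interval for the mean."""
    centre = mean(sample)
    margin = z * stdev(sample) / len(sample) ** 0.5
    return centre - margin, centre + margin

hits = 0
for _ in range(REPEATS):
    sample = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(SAMPLE_SIZE)]
    low, high = confidence_interval(sample)
    if low <= TRUE_MEAN <= high:
        hits += 1

coverage = hits / REPEATS
print(f"Intervals containing the true mean: {coverage:.1%}")
```

The proportion should come out close to 95%, typically a touch under because of the normal approximation.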
The first time I encountered this I was a little disappointed. What’s the point of something that only becomes real with an infinite number of samples? It would be less effort just to measure the actual population! But when you think about it, it’s as good as things can get. As already stated, all we have is a sample and its mean. We don’t know a thing about the population. All we can do is place some upper and lower bounds on the value, based upon the number of samples and their variability (more samples and less variability decrease the width of the confidence intervals).
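The link between sample size, variability and interval width can be sketched from summary statistics alone. The mean of 5.2 and sample size of 50 come from the CRP example above; the standard deviation of 1.5 is an assumption, and the function again uses the normal approximation.

```python
from statistics import NormalDist

def mean_confidence_interval(sample_mean, sample_sd, n, level=0.95):
    """Normal-approximation confidence interval for a mean."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    margin = z * sample_sd / n ** 0.5
    return sample_mean - margin, sample_mean + margin

# Mean 5.2 from the example; standard deviation 1.5 is assumed.
for n in (50, 200, 800):
    low, high = mean_confidence_interval(5.2, 1.5, n)
    print(f"n = {n:>3}: 95% CI {low:.2f} to {high:.2f} (width {high - low:.2f})")
```

Because the margin shrinks with the square root of the sample size, it takes four times as many animals to halve the width of the interval.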
In essence, a confidence interval is a ‘parameter catcher’. We’re trying to catch some population mean using such an interval around our sample mean, and this is the best we can do in terms of expressing where a parameter of interest might be. Note that a confidence interval does not mean there is a 95% chance we have captured the population mean. The sample is taken. We’ve either captured it or we haven’t.
Because such intervals tell you what the number of interest could plausibly be, they should always be read alongside the headline figure. For example, a test accuracy of 98% sounds great, but if the lower confidence limit is, say, 82%, how does that change the clinical applicability of the test?
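For a proportion such as accuracy, one commonly used interval is the Wilson score interval. The sketch below is not the method of any particular paper, and the counts (49 correct out of 50, and 490 out of 500) are hypothetical. Both give a headline accuracy of 98%.

```python
from statistics import NormalDist

def wilson_interval(successes, n, level=0.95):
    """Wilson score confidence interval for a proportion."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    p = successes / n
    denominator = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denominator
    half_width = z * (p * (1 - p) / n + z ** 2 / (4 * n ** 2)) ** 0.5 / denominator
    return centre - half_width, centre + half_width

# Hypothetical studies with the same 98% headline accuracy.
for correct, total in ((49, 50), (490, 500)):
    low, high = wilson_interval(correct, total)
    print(f"{correct}/{total} correct: 95% CI {low:.1%} to {high:.1%}")
```

The same headline accuracy comes with a noticeably lower bound in the smaller study, which is exactly the kind of detail worth checking before relying on a test clinically.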
If some of the above ideas don’t quite click, don’t worry. The aim isn’t to teach you statistics per se, but to give you an overview of core concepts, which you can then build upon if you wish. But, more importantly, the aim is to put some flesh on the bones when you’re reading a paper and see terms such as ‘the alternative hypothesis’, ‘p-value’ and ‘95% CI’. In summary, try to keep the following key points in mind,
- Bias comes in many different forms and can diminish or even destroy a study
- People use samples to try and say something about a population or populations
- Hypothesis testing is used to quantify such inferences, and is where people use statistical tests such as t-tests
- P-values give the probability of obtaining a result at least as extreme as the one observed, given that the null hypothesis (‘nothing is going on’) is true
- Confidence intervals are used to try and put limits on a value, telling you something about where a population value might sit
Of course, the world of statistics stretches far beyond the above 5 ideas, involving other fairly common tools and techniques such as linear regression, odds ratios, sensitivity and specificity, non-parametric tests, etc. And of course, there is a myriad of different plots that can be employed to display and communicate data. But in my experience, once you see all of statistics as falling into either descriptive or inferential categories and have a feel for sampling, hypothesis testing and the quantification of uncertainty, a lot of research begins to feel less opaque. And, critically, you can begin to assess papers for their trustworthiness and applicability to you and your clinical requirements.
To learn more about statistics, I recommend the book ‘Statistics for Animal and Veterinary Science’ by Aviva Petrie & Paul Watson, Nature’s ‘Statistics for Biology’ site, or take a look on Coursera for a variety of statistics courses.
In the next blog post, we’ll build on these ideas further by looking at why such statistical knowledge is important and why there’s never been a more important time to be critical of the literature.
- Bias https://jech.bmj.com/content/58/8/635
- Statistics for Animal and Veterinary Science https://www.amazon.co.uk/Statistics-Veterinary-Animal-Science-Petrie/dp/0470670754
- Statistics for Biology https://www.nature.com/collections/qghhqm/content/statistics-in-biology
- Coursera https://www.coursera.org/
Written by Rob Harrand – Technology & Data Science Lead