Part 4 – Critical Evaluation
So far, we’ve seen why statistical knowledge is important, touched on some core concepts and discussed what can go wrong behind the scenes. This latter aspect of data analysis wasn’t discussed just for the sake of criticizing the literature. It was covered to arm you against such errors and poor practice when reviewing the literature and deciding on a clinical course of action.
Critiquing a Paper
Not all research is created equal.
Unfortunately, due to space limitations in a paper, time limitations on the author, and the simple fact that details get missed, most if not all papers fail to tell you absolutely everything that was done.
You’ll usually get the headlines from a paper: sample numbers, group inclusion criteria, statistical tests run, software used. But we’ve seen several things in this blog series that are prone to be hidden, lurking below the radar of the modern scientific paper.
For example, you read that a t-test has been run and the p-value is significant, but has any p-hacking been done? What about HARKing? (see blog post 3) Or, you have details of, say, the breed, age and clinical signs of the dogs in two groups, but were the samples randomized when tested? Were any blinding procedures used and were they correctly designed and implemented? Or even simpler, were copy-and-paste errors made at the analysis stage? These sorts of questions can be very difficult, if not impossible, to answer from just reading a paper.
The best we can do as readers of the literature is to look out for key things, ask certain questions in our mind, and then make a judgement call on the validity and veracity of the work presented. What follows is a basic guide for this process.
The Good, the Bad and the Ugly
The aim here is not to just criticise other people’s work. Instead, the purpose is to provide a framework for evaluating the quality, reliability and applicability of a particular paper to your clinical situation. Note that this guide mentions a few new statistical ideas that are outlined afterwards in the appendix.
- The first thing you’re going to see is the title of a paper. What does it tell you? Are there any initial indications that the study is applicable? Is it a prospective or retrospective study? Is randomisation mentioned?
- Next is the abstract. Here you can get a sense of the professionalism in terms of the statistics. Does it mention blinding, randomisation, p-values, confidence intervals, statistical tests, etc.? What does the conclusion say in relation to your situation? Do they indicate that more work is needed, suggesting the paper is only the start of a line of work? Can you see any warning signs that bias could have crept in, such as spectrum bias, potentially rendering the results clinically suspect? (see blog post 2)
- Does the introduction clearly set out the study and give detailed background information? Does the introduction cite recent studies?
- Does the method give you all the pertinent information, including how and why the chosen animals were selected? How were follow-ups carried out, if relevant? Does it seem like the experiment could be replicated from the description? Do the authors explicitly mention bias and steps taken to eliminate it, such as randomization and blinding? Have calculations been performed to estimate the number of samples needed (known as power calculations)? Are the data analysis steps detailed, with full disclosure of tests performed, assumptions of the data, the handling of missing data and approaches to dealing with losses to follow-up (for example, in clinical trials)? Are p-values and confidence intervals used and interpreted correctly?
- Are the results clearly visualized and explained? Are any plots used clear and applicable, with error bars shown? Are animal characteristics summarized, such as breed, age and sex? In randomised controlled trials, is ‘intention-to-treat’ analysis used? (see appendix for more on this). Are the outcomes relevant in terms of the nature and magnitude of the findings, especially with confidence intervals taken into account?
- Is there a clear and honest discussion of the study? Is the number of subjects needed to treat mentioned, if relevant? (again, see appendix for more on this). Do the findings make biological sense? Do the lower or upper limits of any confidence intervals change the clinical relevance? (many just focus on the calculated value of the parameter of interest, but if that number could be higher or lower on a repeat of the experiment, could that change your opinion?). Are limitations to the study mentioned?
Other things to note include:
- Are there any hints of p-hacking, such as statistically significant results emerging from very specific subsets of the data?
- Are there any hints of HARKing, such as the results from some subset of the data perfectly lining up with the opening hypothesis?
- If statistical tests have been applied many times, have the p-values been adjusted to avoid the problem of multiple comparisons?
- Has data been shared?
- Has analysis been done via code and is that code shared?
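To see why the multiple comparisons point matters, here is a minimal sketch of how quickly the chance of at least one false positive grows with the number of tests. It assumes the tests are independent and that every null hypothesis is actually true, which is a simplification, and the function name is my own.

```python
def familywise_error_rate(n_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across n_tests independent tests,
    assuming every null hypothesis is actually true (a simplifying assumption)."""
    return 1.0 - (1.0 - alpha) ** n_tests


# Show how the risk grows as more tests are run at the usual 0.05 threshold.
for n_tests in (1, 5, 10, 20):
    print(f"{n_tests:>2} tests: {familywise_error_rate(n_tests):.1%}")
```

Under these assumptions, ten unadjusted tests already carry roughly a 40% chance of at least one spurious 'significant' result.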
I had a look through the literature for papers that matched the search terms ‘canine’ and ‘allergy’, and browsed until I saw one relevant to the above points (i.e. a paper that reported on an outcome from a study rather than any sort of review article).
The one that I chose was Zur, G., Lifshitz, B., & Bdolah‐Abram, T. (2011). The association between the signalment, common causes of canine otitis externa and pathogens. Journal of Small Animal Practice, 52(5), 254-258.
I also made sure the paper I chose was open access, so you can read it for yourself.
I don’t historically come from a veterinary background, but hopefully I should still be able to use the guidance above to get a sense of the statistical quality of the work performed.
Title: The title doesn’t give a lot away in terms of how the study was performed, but it’s pretty clear in terms of the nature of the study.
Abstract: The text reads well, telling us that predetermined inclusion criteria were used, that this is a retrospective study and that correlations were evaluated. The details of these inclusion criteria will be interesting and should allow us to consider any sources of bias. There is transparency regarding an over-representation of certain breeds, and p-values are quoted throughout (including one at 0.098, which is above the typical threshold of 0.05). A couple of things grab my attention. First, there is reference to ‘almost all dogs that were older than five years’ along with a p-value of 0.01. It will be interesting to see when we get further into the text why this figure of five years was chosen. Is there a risk that many different ages were tried until significance was found? Second, there is reference to ‘…all the other parameters examined’. Is there a chance here that multiple tests were run, but p-values not adjusted?
Introduction: The introduction is fairly brief but seems thorough and well referenced. It lays out the problem, common causes (with quantification) and a summary of the aim of the study.
Method: The method opens with a description of the medical records used. We learn that 430 records were used from dogs diagnosed with otitis externa (OE) from a particular teaching hospital in Jerusalem. A previous study is cited for determining the most common primary causes and predisposing factors. As a non-vet, thoughts that occur to me include: how were the dogs diagnosed? Was a strict diagnostic protocol applied or were there elements of subjectivity? Were all 430 dogs diagnosed by the same person? And, are findings from dogs that live in Jerusalem generalizable to other populations?
The description of the inclusion criteria seems clear, and they state that ‘dogs suspected as having more than one underlying condition were excluded from the study’. There appear to be very detailed descriptions, including quantification, of how things were assessed, such as the number of organisms seen in cytology samples (although they quote a paper from 2001 on how the original cases had been diagnosed, despite some of them being from 1999 and 2000, which I don’t quite understand). They also state that all cytology results were assessed by a single person, reducing the chance of any subjective differences between operators. We also see their division of ages for the onset of OE (less than one year, one to five years and more than five years), but where these divisions come from is unclear (perhaps from an earlier reference?).
The method then has a section dedicated to statistical analysis. This part tells us about the many tests performed, the statistical tests used (chi-squared and Fisher’s exact test, which I’ll explain later), the significance level used (0.05) and the name and version of the software. There is a good level of detail, although you’d be hard-pressed to reproduce their analysis even if you had the raw results, as we don’t know precisely how each test was performed. For example, they state that ‘This [the statistical testing] was applied when pathogens were defined as present or absent or combinations between them’, but it’s unclear to me what ‘or combinations between them’ means.
There is also still no mention of any attempt at avoiding the multiple comparison problem.
Results: The results are clearly laid out, although no visualization is used. There are, however, some clear tables. We don’t see any information on when dogs were diagnosed in terms of month of the year. It might be interesting to know whether, for example, the allergic cases were predominantly from the summer? (again, talking from a non-veterinary perspective).
Discussion: The paper concludes with a detailed reiteration of the results, with emphasis on whether or not each of the tests resulted in significant or non-significant findings. There are also suggestions of limitations, such as “the highest levels of rods were found in endocrinopathies suggesting a more severe otitis due to this primary cause. Although this finding was highly significant statistically, there were too few cases of endocrinopathies and further studies are needed”, although the authors don’t state what ‘too few’ means or how many more would be needed (again, back to the idea of power calculations).
Overall, this seems like a clear paper and an interesting study. It’s a shame the data isn’t shared, although that’s not overly surprising as data sharing is still uncommon. The software they used (SPSS) has both graphical and code-driven analysis features, and it is unclear which approach they used in this work.
If I were to evaluate the paper from an EBVM point-of-view, my main two questions would be:
- How would the results appear if a Bonferroni correction had been applied?
- Are the results of 149 dogs from this particular hospital applicable and useful in general?
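As a rough illustration of the first question, a Bonferroni correction simply multiplies each p-value by the number of tests performed (capping the result at 1). The sketch below uses made-up p-values, not figures from the paper, and the function name is my own.

```python
def bonferroni(p_values, alpha=0.05):
    """Return (adjusted p-value, still significant?) for each raw p-value.

    Each p-value is multiplied by the number of tests and capped at 1.
    """
    n_tests = len(p_values)
    adjusted = [min(1.0, p * n_tests) for p in p_values]
    return [(p_adj, p_adj < alpha) for p_adj in adjusted]


# Hypothetical raw p-values for illustration only, NOT taken from the paper.
raw_p_values = [0.001, 0.008, 0.03, 0.04, 0.098]
results = bonferroni(raw_p_values)

for p, (p_adj, significant) in zip(raw_p_values, results):
    print(f"raw p = {p:.3f} -> adjusted p = {p_adj:.3f}, significant: {significant}")
```

With five tests, raw p-values of 0.03 and 0.04 no longer clear the 0.05 threshold once adjusted, which is exactly the kind of change that could alter how a paper's headline findings read. Bonferroni is a conservative choice; other corrections exist.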
We’ve encountered a few new statistical ideas in the section above, any one of which you may commonly encounter in veterinary-related papers. These are described in the appendix.
This concludes the main section of the blog series. We’ve seen how evidence-based medicine can help to evaluate a paper for your clinical needs, some of the core statistical tests that are common to veterinary-related papers, what can go wrong in terms of reproducibility, and how to critically evaluate a paper.
In the next and final blog post, we’ll see examples of how to perform data analysis using computer code in the form of a notebook. This notebook, hosted on the data science platform ‘Kaggle’, offers the reader a way to learn more about modern data analysis.
Appendix
Power calculations
With statistical tests, there are two main things that can go wrong. First is a Type I error, where the null hypothesis is rejected incorrectly (recall that the null hypothesis is the statement that basically ‘nothing is going on’, or that two tested groups are actually one and the same). In this situation, a p-value of <0.05 has been seen and some sort of effect concluded, when actually, the results were due to chance. This is also known as a false positive. Second is a Type II error, where the null hypothesis is incorrectly retained (that is, not rejected when it should have been). In this situation, a p-value of >0.05 has been seen even though the effect in question is real, so the effect gets missed. This is also known as a false negative.
The reason for type II errors is typically a lack of power, which is defined as the probability a test will reject a false null hypothesis. In other words, more power means a greater chance of rejecting a null hypothesis that isn’t right and a lower chance of making a type II error. This basically equates to the chance of a test detecting an effect when there is one.
Two of the main factors that influence power are intuitive: effect size and sample size. For example, if you wanted to tell the difference between, say, the Serum Amyloid A (SAA) levels of cats with and without some minor ailment (assuming for the sake of argument there is some small difference), the effect size would probably be quite small. A study like that with only a handful of cats would have little chance of finding a significant result, ending in a potential Type II error. The greater the effect size and the greater the number of samples, the more power a study has.
Flipping this around, if you can estimate the effect size (from, say, a pilot study or similar studies in the literature) and choose a desired level of power, then the power equations give you the number of samples required. Ideally, a calculation like this would be reported in all relevant papers.
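To make this concrete, here is a minimal sketch of one such calculation, for comparing the means of two equally sized groups with a two-sided test. It uses a normal approximation, so it slightly underestimates what a proper t-test calculation would give, and it assumes the effect size is expressed as a standardised difference (Cohen's d). The function name and default values are my own choices.

```python
from math import ceil
from statistics import NormalDist


def samples_per_group(effect_size: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate number of samples needed in EACH of two groups.

    effect_size: standardised difference between group means (Cohen's d).
    alpha:       significance level for a two-sided test.
    power:       desired probability of detecting the effect if it is real.
    Uses the normal approximation n = 2 * ((z_alpha + z_power) / d) ** 2.
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)
    z_power = standard_normal.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


# Smaller effects need many more animals for the same power.
for d in (0.2, 0.5, 0.8):
    print(f"effect size {d}: about {samples_per_group(d)} per group")
```

The pattern to notice is that halving the effect size roughly quadruples the number of animals needed, which is why small studies of subtle effects so often end in Type II errors.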
Intention-to-treat analysis
Simply put, intention-to-treat (ITT) analysis means analysing all patients who were originally included in a study, in whichever groups they were assigned (this is relevant to randomised controlled trials). This seems like an obvious statement, but how to approach analysis when, for example, patients are withdrawn from a study or die during a study is not always clear. ITT analysis is in contrast to per-protocol analysis, which sticks to those patients who didn’t deviate from the original protocol in any way and remained in the study throughout. ITT analysis is particularly useful for limiting bias, as it retains the randomisation done at the point of group assignment. It’s even described by some as “Once randomised, always analysed”. Many concrete examples are available online for further information, including the effect on something called relative risk when using intention-to-treat vs per-protocol analysis.
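The difference between the two approaches is easiest to see with numbers. The sketch below uses an invented trial (all counts are hypothetical) in which some animals in the treatment group withdrew, and it assumes their outcomes were still recorded. Relative risk here is the risk of a bad outcome in the treatment group divided by the risk in the control group.

```python
# Hypothetical trial, invented purely for illustration.
# 100 animals were randomised to each group; 20 withdrew from the treatment group.
treatment = {"randomised": 100, "completed": 80, "bad_in_completers": 16, "bad_in_withdrawn": 10}
control = {"randomised": 100, "completed": 100, "bad_in_completers": 30, "bad_in_withdrawn": 0}


def itt_risk(group):
    """Risk of a bad outcome using everyone who was randomised (intention to treat)."""
    return (group["bad_in_completers"] + group["bad_in_withdrawn"]) / group["randomised"]


def per_protocol_risk(group):
    """Risk of a bad outcome using only those who completed the protocol."""
    return group["bad_in_completers"] / group["completed"]


itt_rr = itt_risk(treatment) / itt_risk(control)
pp_rr = per_protocol_risk(treatment) / per_protocol_risk(control)

print(f"Intention-to-treat relative risk: {itt_rr:.2f}")
print(f"Per-protocol relative risk:       {pp_rr:.2f}")
```

In this made-up example the per-protocol analysis makes the treatment look noticeably better than the intention-to-treat analysis does, because the animals that withdrew fared worse than those that stayed.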
Number of subjects needed to treat
This is a measure of how many patients would need to receive a certain treatment (usually the subject of a paper) in order to avoid one additional negative outcome. The lower this number needed to treat (NNT), the more effective the treatment; the ideal NNT is 1.
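The usual way to calculate this is as one divided by the absolute risk reduction, that is, the difference in the rate of negative outcomes between the control and treated groups, rounded up to a whole patient. Here is a minimal sketch with hypothetical counts; the function and argument names are my own.

```python
from fractions import Fraction
from math import ceil


def number_needed_to_treat(control_events, control_total, treated_events, treated_total):
    """NNT = 1 / absolute risk reduction, rounded up to a whole patient.

    Fractions are used so that rounding is not thrown off by floating-point error.
    """
    control_risk = Fraction(control_events, control_total)
    treated_risk = Fraction(treated_events, treated_total)
    absolute_risk_reduction = control_risk - treated_risk
    if absolute_risk_reduction <= 0:
        raise ValueError("Treatment does not reduce risk, so NNT is not defined here")
    return ceil(1 / absolute_risk_reduction)


# Hypothetical example: 30 of 100 untreated animals have a bad outcome,
# compared with 20 of 100 treated animals.
nnt = number_needed_to_treat(30, 100, 20, 100)
print(f"Number needed to treat: {nnt}")
```

In this invented example the treatment lowers the risk by ten percentage points, so ten animals would need to be treated to avoid one bad outcome.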
Chi-squared test & Fisher’s exact test
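Both of these are statistical tests applied to categorical, tabulated data, such as species vs outcome (e.g. dog/cat vs cured/not cured). The chi-squared test relies on an approximation that becomes unreliable when the counts expected in any cell of the table are small (a common rule of thumb is fewer than five), so Fisher’s exact test is generally used when sample sizes are small.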
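To show how the two tests can differ on a small table, here is a sketch that works through both by hand for a 2x2 table, using only Python's standard library. The counts are invented, the chi-squared version has no continuity correction, and the function names are my own. In practice you would more likely reach for a statistics package (for example, SciPy provides `scipy.stats.chi2_contingency` and `scipy.stats.fisher_exact`).

```python
from math import comb, erfc, sqrt


def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared test (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With one degree of freedom, the upper-tail probability is erfc(sqrt(chi2 / 2)).
    return chi2, erfc(sqrt(chi2 / 2))


def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def table_probability(x):
        # Probability of seeing x in the top-left cell given the fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    observed = table_probability(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of every table at least as unlikely as the observed one.
    return sum(
        table_probability(x)
        for x in range(lowest, highest + 1)
        if table_probability(x) <= observed * (1 + 1e-9)
    )


# Hypothetical counts: rows are dog/cat, columns are cured/not cured.
chi2_stat, chi2_p = chi_squared_2x2(8, 2, 3, 7)
fisher_p = fisher_exact_2x2(8, 2, 3, 7)
print(f"Chi-squared: statistic = {chi2_stat:.2f}, p = {chi2_p:.3f}")
print(f"Fisher's exact: p = {fisher_p:.3f}")
```

For this made-up table of 20 animals, the chi-squared p-value falls below 0.05 while the Fisher's exact p-value sits above it, which illustrates why the choice of test matters when numbers are small.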
References
- Zur, G., Lifshitz, B., & Bdolah‐Abram, T. (2011). The association between the signalment, common causes of canine otitis externa and pathogens. Journal of Small Animal Practice, 52(5), 254-258. https://onlinelibrary.wiley.com/doi/full/10.1111/j.1748-5827.2011.01058.x
- Understanding the Intention-to-treat Principle in Randomized Controlled Trials. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5654877/
Written by Rob Harrand – Technology & Data Science Lead