Part 3 – The Reproducibility Crisis
So far in this blog series, we’ve seen how evidence-based medicine provides a framework for translating research into clinical action, and that one of the key steps is the evaluation of the literature using statistical insight. We then touched upon some key statistical ideas that are common in the veterinary literature.
In this post, we’ll see why being critical of the literature has never been so important.
Traditional Data Analysis
My first experience of data analysis was when I was around 15 years old. I can’t remember the details, but it was probably some sort of financial data from a school project, or perhaps the results from a simple scientific experiment. What I can recall is the software. It was called ‘Lotus 123’, and its bright, bold colors against a black background seared themselves into my memory (as well as my retinas, when the screen was turned up too bright).
The software was, for that sort of thing, absolutely fine. The data were manually input, simple calculations performed and the results saved to a file. It could even create plots, and all of these features were regularly enhanced and updated until it was eventually overtaken by Microsoft Excel in the late nineties.
Today, many people are using Excel in the exact same way as described above. Or, for those who require the next level of sophistication, something like Graphpad Prism might be used. And at this point, you may wonder why you or anyone else would ever need anything different. You input data, crunch the numbers, throw a few statistical functions at it and produce some graphs. Then, you hit save and move on. Job done, right?
Well, maybe. Like with most things in science, the devil is in the detail. Perhaps you’ve used the right software, but did you use it correctly? And perhaps you shouldn’t have used that software at all? But with data analysis, the problems can begin even sooner.
Where’s that thingy file from what’s his face?
In my experience, the most common mistakes with data analysis are the simplest to make. Have you ever had a conversation like this? …
“Do you know where that data is from that thing we did with those thingy samples?”
“I think I have it saved somewhere. Hang on a sec”
[20 minutes later]
“Oh, here it is. It was in an email”
[Get a file through called ‘final17.xls’]
This isn’t great. First, the data you’re looking for clearly isn’t well defined (‘that thing we did’). Second, there has been a delay in getting hold of it. Third, it was kept as an attachment to an email, and fourth, it looks like you’ve received the 17th ‘final’ version.
The next stage is usually to put the file on your PC somewhere, perhaps in a folder like this,
Don’t pretend you don’t have folders like this. I know I did until very recently. And yet the major problem is blindingly obvious: what are these files? What do they contain? How are they related? Who created them and why? If I wanted to use or share the final or ‘correct’ version, which one is it?
Problems such as these, including poor file naming, poor storage, a lack of information about which files are which, etc., inevitably and painfully lead to an overall process like this,
The smallest and simplest of projects drag out over weeks, mistakes are made, and everyone concerned has to do their jobs against a background of deep frustration.
Alas, poor data management is only the beginning of the issues surrounding modern data analysis. The next area that can hinder the process concerns statistics.
Statistics isn’t Surfing
It’s easy to know whether or not you’re good at surfing. If you can’t stand on the surfboard for more than 2 seconds, then you’re not. Confidently telling people you know what you’re doing isn’t going to work, and you’re unlikely to delude yourself, because your ability (or lack of it) is clear and demonstrable to all.
Statistics isn’t like that. It’s easy to fool yourself into thinking you know what you’re doing. When you load data into, say, Excel or Prism, hit a sequence of buttons and get a p-value out, it can feel like all is right with the world and you’re ready to move on. I used to do this often.
After all, you got a number out, didn’t you? And you’ve never been picked up on it. You’ve had countless papers published, all of which went through peer review, so this can’t apply to you, can it?
The problem with statistical software is that the old adage of ‘garbage in, garbage out’ applies. Sometimes the software might tell you you’ve done something wrong, but more often than not, what you’re doing may be absolutely fine from a computational point of view, but not from a statistical one. Let’s take a look at a case in point.
Death by a thousand Mouse Clicks
In 2006, researchers at Duke University in the US published several papers that claimed to show a relationship between cancer patients’ gene expression signatures and their subsequent response to chemotherapy. These papers generated huge interest and quickly led to clinical trials.
What followed was a long, drawn-out battle between the Duke researchers and two biostatisticians, Keith Baggerly and Kevin Coombes, about the findings of the work. In particular, Baggerly and Coombes couldn’t reproduce the numbers that the Duke researchers had published based on the same data, and slowly began to reveal dozens of flaws in the analysis. These flaws included statistical errors, data analysis errors, and even an unhealthy dose of outright fraud. The clinical trials were eventually stopped, papers retracted and careers ended [1].
Such high-profile cases are rare, but it’s estimated that large swathes of the scientific literature could be wrong [2] thanks to mistakes similar to those made by the Duke team. Such mistakes tend to be either statistical in nature or related to the data analysis in some way. Crucially, such mistakes are often impossible to spot. As one author puts it, “it is generally difficult to know if a statistical test is really the correct one for data presented in a paper. This is because space restrictions on scientific papers preclude detailed descriptions of the data and verification of test or model assumptions” [3].
Baggerly even coined a phrase relating to this: ‘forensic biostatistics’. Basically, given the raw data and the final results, you have to figure out, Sherlock Holmes style, what on Earth the researchers must have done to get from one to the other.
Below we’ll see what sorts of things can go wrong with modern research, and touch on some possible solutions.
What’s the Worst that can Happen?
Back in the day of small data volumes and simple software, you were limited in what you could do in terms of analysis. Today, it’s relatively straightforward to load lots of data into a piece of statistical software and, with a few clicks of the mouse, apply cutting-edge and highly advanced tools and techniques. This ease of analysis, combined with a poor understanding of statistics, brings us to our first statistical abomination: p-hacking.
As we saw in the last blog post, it’s customary to use a p-value of 0.05 to indicate ‘significance’. P-hacking is the practice of repeatedly re-selecting or re-analysing data until a significant result appears. For example, if the authors of a paper expected a treatment group to have a lower serum amyloid A (SAA) level compared to a control group, but failed to obtain a significant p-value, they might be tempted to, say, check again but only with dogs under the age of 5. Or over the age of 5. Or only a certain breed. Or only those with certain clinical signs. Or only those called Bernard. Before you know it, they’ve obtained a p-value under the magical value of 0.05. There is evidence that this is a widespread problem [4], although thankfully, the effects seem to be marginal when research is combined in meta-analyses.
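To get a feel for how much damage this slicing can do, here is a minimal simulation sketch. Everything in it is an assumption made for illustration: the group size of 40, the two made-up attributes (`young` and `breed_a`), and the use of a simple z-test with a known spread in place of whatever test a real paper would use. Crucially, the treatment has no effect at all in the simulated data, so every ‘significant’ result is a false positive.

```python
import random
from statistics import NormalDist, fmean

SIGMA = 1.0  # known spread of the simulated SAA-like measurement (assumption)


def z_test_p(a, b):
    """Two-sided p-value for a difference in means, with known SIGMA."""
    if not a or not b:
        return 1.0  # nothing to compare in an empty subgroup
    se = SIGMA * (1 / len(a) + 1 / len(b)) ** 0.5
    z = abs(fmean(a) - fmean(b)) / se
    return 2 * (1 - NormalDist().cdf(z))


def simulate_study(rng, n_per_group=40):
    """Return (p for the planned test, smallest p after also trying subgroups)."""
    dogs = [
        {
            "treated": treated,
            "young": rng.random() < 0.5,
            "breed_a": rng.random() < 0.5,
            "saa": rng.gauss(0.0, SIGMA),  # same distribution in both groups
        }
        for treated in (True, False)
        for _ in range(n_per_group)
    ]
    subsets = [
        lambda d: True,  # the planned analysis: all dogs
        lambda d: d["young"],
        lambda d: not d["young"],
        lambda d: d["breed_a"],
        lambda d: not d["breed_a"],
    ]
    p_values = []
    for keep in subsets:
        treated = [d["saa"] for d in dogs if keep(d) and d["treated"]]
        control = [d["saa"] for d in dogs if keep(d) and not d["treated"]]
        p_values.append(z_test_p(treated, control))
    return p_values[0], min(p_values)


def false_positive_rates(n_studies=2000, seed=1):
    """Fraction of null studies declared significant, with and without slicing."""
    rng = random.Random(seed)
    results = [simulate_study(rng) for _ in range(n_studies)]
    honest = sum(p < 0.05 for p, _ in results) / n_studies
    hacked = sum(p < 0.05 for _, p in results) / n_studies
    return honest, hacked


honest, hacked = false_positive_rates()
print(f"planned test only: {honest:.1%} false positives")
print(f"best of 5 slices:  {hacked:.1%} false positives")
```

The planned test should come out close to the nominal 5%, while reporting the best of the five slices should give a noticeably higher rate, even though only four extra looks at the data were taken.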
The problem of multiple comparisons is a similar issue. Think back to what a p-value is: the probability of obtaining results at least as extreme as those you observed if the null hypothesis were true. That means that as you increase the number of tests performed, even in the complete absence of any effect, the chance of getting at least one apparently significant result increases. See the cartoon below for a real-world application.
Figure 1 – Source https://xkcd.com/882/
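The arithmetic behind the cartoon is short enough to write down. As a rough sketch, and assuming the tests are independent of one another (which real analyses often are not), the chance of at least one false positive across n tests at a cut-off of 0.05 is 1 - (1 - 0.05)^n. The function name below is my own.

```python
def familywise_error_rate(n_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across n independent tests
    when every null hypothesis is true."""
    return 1 - (1 - alpha) ** n_tests


for n in (1, 5, 20, 100):
    print(f"{n:>3} tests: {familywise_error_rate(n):.0%}")
```

For the 20 jelly bean colours in the cartoon, that works out at roughly a 64% chance of at least one ‘significant’ colour, with no real effect anywhere.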
The solution is to make the test for significance stricter in some way. The simplest approach is the Bonferroni correction, which divides the common p-value cut-off of 0.05 by the number of tests run. This gives a lower, harsher threshold that each p-value must fall below before significance is declared. This is particularly important in high-throughput experiments (such as those using a microarray), where thousands of tests are performed during analysis.
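As a small illustration of the Bonferroni correction, the sketch below applies it to five p-values. The p-values are made up for the example, and the function name is my own rather than anything from a statistics package.

```python
def bonferroni(p_values, alpha=0.05):
    """Return the corrected threshold and, for each p-value, whether it passes."""
    threshold = alpha / len(p_values)
    return threshold, [p < threshold for p in p_values]


# Hypothetical p-values from five separate tests on the same data set.
p_values = [0.001, 0.012, 0.03, 0.04, 0.2]

threshold, significant = bonferroni(p_values)
print(f"corrected threshold: {threshold}")
for p, is_significant in zip(p_values, significant):
    print(f"p = {p}: {'significant' if is_significant else 'not significant'}")
```

Four of these five p-values fall below the usual 0.05, but only the first survives the corrected threshold of 0.05 / 5. Bonferroni is deliberately conservative, and less severe corrections exist, but it shows the principle.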
Next in the list of ‘what not to do with data’ is HARKing, or ‘hypothesising after the results are known’. This phrase was first coined in 1998 and described as “…presenting a post hoc hypothesis in the introduction of a research report as if it were an a priori hypothesis” [5]. In other words, a hypothesis is formed before the data are collected, but then history is rewritten when a different outcome is observed.
For example, in the SAA example above, you might have started with the hypothesis that the dogs on treatment would have, on average, a significantly reduced level of SAA. If, after data collection, the treatment group actually had the same average level, with the exception of some arbitrary subgroup (say, dogs under the age of 2), the correct course of action would be to point this out and state the facts (along with taking steps to avoid p-hacking). However, if instead you began your write-up with a confident explanation of why you expected only the younger dogs in the treatment group to have lower SAA levels and then presented your data supporting it, that would be HARKing. This makes it look like you’re onto something, increasing the chance of falsely rejecting the null hypothesis (also known as a Type I error or false positive, where you state something is going on when really it isn’t).
Finally, there is bias, a topic we encountered in the last post. Specifically, failing to understand, identify or control for it. As a concrete example of this, we turn to another case of Baggerly and Coombes uncovering errors in the literature. This time, a paper investigating protein signatures in the blood of women with ovarian cancer claimed to be able to tell cancer samples apart from normal samples. The results were so stark that a company was rapidly formed to sell the test, with a positive result potentially leading to ovary removal. Baggerly and Coombes pointed out that all the cancer samples had been run months after all of the normal samples, and that the differences the researchers were seeing were nothing but electronic noise. The test didn’t work; the experiment had been confounded by the order in which the samples were run [6].
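The mechanism is easy to reproduce in a toy simulation. The sketch below is not a model of the actual ovarian cancer study: the sample size, the size of the instrument shift (`drift`) and the function name are all assumptions for illustration. The simulated cancer and normal samples are drawn from exactly the same distribution, and the instrument reads higher for the second half of the run.

```python
import random
from statistics import fmean


def measured_difference(randomise_run_order, seed=0, n=100, drift=2.0):
    """Mean(cancer) - mean(normal) when the true biological difference is zero."""
    rng = random.Random(seed)
    run_order = ["normal"] * n + ["cancer"] * n  # all normals first, cancers later
    if randomise_run_order:
        rng.shuffle(run_order)  # mix both groups across the whole run
    readings = {"cancer": [], "normal": []}
    for position, label in enumerate(run_order):
        # Assumed instrument shift affecting the second half of the run.
        batch_offset = drift if position >= n else 0.0
        readings[label].append(rng.gauss(0.0, 1.0) + batch_offset)
    return fmean(readings["cancer"]) - fmean(readings["normal"])


confounded = measured_difference(randomise_run_order=False)
randomised = measured_difference(randomise_run_order=True)
print(f"cancers run last:     difference = {confounded:.2f}")
print(f"randomised run order: difference = {randomised:.2f}")
```

When the cancers are all run last, the instrument shift shows up as an apparent difference between the groups. When the run order is randomised, the same shift is spread across both groups and the apparent difference largely disappears.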
The answers to the above problems, like the issues themselves, fall into two categories. The first is the statistical education of researchers, instilling an understanding that statistics is not just a bag of tools to be thrown at data as an afterthought, but a discipline that should be respected. If data are to be analysed and statistical concepts used, the researcher should either be prepared to learn the relevant concepts in sufficient detail, or seek the assistance of an expert. The answer is not ‘I think a t-test is about right, so I’ll keep shoving the data through that until I get a decent p-value’. This isn’t meant to sound like a holier-than-thou criticism, for I hold my hands up and confess that this is exactly what I did for many years.
This notion, where people believe that their expertise in one area must map onto their statistical ability, has given rise to the phenomenon of cargo-cult statistics [7].
Cargo-cults are religions in the Pacific where people try to create objects that imitate manufactured goods such as radios and aeroplanes [8], having seen such items upon contact with Western civilization. One aim of such cults is to receive more ‘cargo’ (in WW2, payloads were often parachuted in from aeroplanes, further fueling the mysticism). With statistics, it’s all too easy with certain types of modern software to use statistical tools without understanding them, ‘imitating’ experienced statisticians and hoping to get the same results. Thankfully, there are now many good, free sources of statistical knowledge [9-11].
To avoid p-hacking, some researchers are even going to the extent of preregistering their analysis, with a date-stamp, on platforms such as the Open Science Framework [12]. This allows people to check that the analysis was designed before the data were seen, removing the opportunities for researchers to cherry-pick, tweak and fiddle until they get a p-value of less than 0.05.
The second solution is to move away from traditional, graphical data analysis packages like Prism and Excel. Such software has its place, but doesn’t lend itself to reproducible research, as manual steps like copying, pasting and the clicking of the mouse can’t be easily recorded. Today, there are increasing efforts to do something called literate programming, where data analysis is performed via computer code, with the code interspersed with explanatory text. The resulting ‘notebooks’ can then be re-run in the future, taking raw data and reproducing the exact same tables, plots, etc.
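As a taste of the idea, here is a minimal sketch of an analysis written as code rather than as a series of mouse clicks. The data are made up, the file name in the comment is hypothetical, and a real notebook would read the raw file from disk and interleave the code with explanatory text.

```python
import csv
import io
from statistics import fmean

# Made-up raw data standing in for a file such as 'saa_raw.csv' (hypothetical name).
RAW_CSV = """dog_id,group,saa_mg_per_l
1,control,12.1
2,control,9.8
3,control,11.4
4,treatment,8.9
5,treatment,10.2
6,treatment,9.1
"""


def summarise(raw_text):
    """Return the mean SAA per group, computed directly from the raw rows."""
    groups = {}
    for row in csv.DictReader(io.StringIO(raw_text)):
        groups.setdefault(row["group"], []).append(float(row["saa_mg_per_l"]))
    return {name: round(fmean(values), 2) for name, values in sorted(groups.items())}


# Every number in the report comes from this one call, so re-running the
# script regenerates the same table from the same raw data.
for group, mean_saa in summarise(RAW_CSV).items():
    print(f"{group:<10} mean SAA = {mean_saa}")
```

Nothing here is copied or pasted by hand, so anyone with the raw data and the script can see exactly how each reported number was produced.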
In the next post, we’ll take a look at the things to look for when reviewing papers, touching on the topics encountered so far. Then, in the final (bonus) post, we’ll see some examples of literate programming applied to some typical veterinary-related data, along with a few extra statistical ideas.
1. Putting Oncology Patients at Risk https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3474449/
2. Ioannidis, John PA. “Why most published research findings are false.” PLoS medicine 2.8 (2005): e124.
3. Evans, Richard B., and Annette O’Connor. “Statistics and Evidence-Based Veterinary Medicine: answers to 21 common statistical questions that arise from reading scientific manuscripts.” Veterinary Clinics of North America: Small Animal Practice 37.3 (2007): 477-486.
4. Head, Megan L., et al. “The extent and consequences of p-hacking in science.” PLoS biology 13.3 (2015): e1002106.
5. HARKing: Hypothesizing After the Results are Known https://journals.sagepub.com/doi/abs/10.1207/s15327957pspr0203_4
6. How Bright Promise in Cancer Testing Fell Apart https://www.nytimes.com/2011/07/08/health/research/08genes.html
7. Cargo-cult statistics and scientific crisis https://www.significancemagazine.com/2-uncategorised/593-cargo-cult-statistics-and-scientific-crisis
8. Cargo cults https://simple.wikipedia.org/wiki/Cargo_cult
9. Statistics for Biologists https://www.nature.com/collections/qghhqm
10. Statistics notes: The normal distribution https://www.bmj.com/content/310/6975/298.full
11. Coursera https://www.coursera.org/courses?query=statistics&
12. Open Science Framework https://osf.io/
Written by Rob Harrand – Technology & Data Science Lead