“Experts often possess more data than judgment” – Colin Powell
The above quote is a timely warning in the age of big data. Judgement, insight and understanding should always be the final stage when dealing with big data, rather than the blunt results from any analysis in and of itself. That said, if you do have such data, the first challenge before you can engage your brain in the final verdict of your efforts is to crunch the numbers.
The Beasts of Big Data
Everything explained in part 1 will get you going with small to medium sized datasets, but in this day and age, datasets can be monstrous. The definition of ‘big data’ is a moving target, and many use the term when (in my humble opinion) it’s unjustified. However, for the purpose of this blog, you may want to learn about the following technologies if your data is several gigabytes in size, and causes your poor old Excel spreadsheet to curl up into a ball and die.
Apache Hadoop is a stalwart of the big data world, and allows multiple computers to be joined together in order to share the burden of the (big) data processing. This approach is in contrast to the traditional method of buying a single, behemoth of a computer that fills a room.
Hadoop is open-source, and is designed to work on commodity hardware (i.e. affordable, common PCs). If the cluster of PCs is too small, you can simply add another, and Hadoop will adapt. This approach also has the advantage of offering redundancy, where-by copies of data are held by different computers, providing backups in case one or more computers fail.
MapReduce, Hive and Pig
At its heart, Hadoop has something called the Hadoop Distributed File System (HDFS), which is what allows the data to be split between the different computers in the cluster. However, ‘Hadoop’ can be thought of as a bit of an umbrella term covering a wide-range of tools and techniques. Three of the most common are MapReduce, Hive and Pig.
MapReduce is a programming method (which can be written in multiple languages, including Python), that tells the Hadoop cluster how to analyse the data. A common example of how to understand how MapReduce works is to think of how you would perform a survey. For example, let’s say you are in charge of a group of vet practices, and you wanted to know how many diagnostic tests had been used overall. You could ask someone at each practice to count up how many have been used at their respective practice, and you would receive such as,
Test A: 120
Test B: 56
Test C: 30
Test D: 189
Test A: 101
Test B: 45
Test C: 17
Test D: 151
Test A: 98
Test B: 13
Test C: 5
Test D: 104
That, in essence, is the map phase of MapReduce. The number of some item of interest is counted, and ‘mapped’ to its type (in this case, test type).
You would then sum the numbers for each test type, giving,
Test A: 319
That’s the reduce phase.
The beauty of this is that you delegated the hard work to three different people. With Hadoop and MapReduce, you do the same, except you’re delegating the work to different computers.
The problem with MapReduce is it may not easy to frame the question in your mind like the simple example above, let alone code it. Hive does a lot of this hard work for you, by hiding the MapReduce detail underneath an SQL-link syntax. The only requirements are that your data must be stored in a database, and you have to be familiar with SQL.
As an example, I searched the internet for some openly available ‘big data’. I came across some GP practice prescribing data on the UK government’s open data website5. This isn’t really big data in the modern sense of the word, but still weighed in at a hefty 1.4GB for a single month’s worth of data.
Quoting the corresponding website, this data’s description is as follows,
“General practice prescribing data is a list of all medicines, dressings and appliances that are prescribed and dispensed each month”, including,
- the total number of items prescribed and dispensed
- the total net ingredient cost
- the total actual cost
- the total quantity
I created a single table in Hive, and used the following SQL-like command to find out the top 10 highest items by total costs,
SELECT BNF_NAME, SUM(ACT_COST) AS Total_Cost
GROUP BY BNF_NAME
ORDER BY Total_Cost
DESC limit 10;
This may look complex at first, but any introductory SQL course would soon get you up to speed with such syntax. In short, it’s saying ‘select the name and sum the costs, group together by the names, order it by the summed costs, and show me the top 10’. This query took 21 seconds to run. Here are the results,
Rivaroxaban_Tab 20mg 5647167.498436213
Fostair_Inh 100mcg/6mcg (120D) CFF 5425188.610569
Apixaban_Tab 5mg 5085967.126802683
Tiotropium_Pdr For Inh Cap 18mcg 4873361.710469246
Sitagliptin_Tab 100mg 3938509.5228374004
Ins Lantus SoloStar_100u/ml 3ml Pf Pen 2864460.558757782
Sumatriptan_Tab 50mg 2796583.7536940575
Symbicort_Turbohaler 200mcg/6mcg (120 D) 2753414.1420173645
Ezetimibe_Tab 10mg 2691778.329881549
Hydrocort_Tab 10mg 2682227.2477264404
Pig is similar to Hive in that it offers a layer of abstraction on top of MapReduce, but rather than being table based, it provides a programming language (called Pig Latin) for manipulating and analysing big data sets. Common commands include LOAD, FILTER, GROUP, ORDER and STORE, which do exactly what you’d expect.The formatting could use some work, but as you can see, for the month in question (June, 2017), Rivaroxaban (a common anticoagulant) cost the National Health Service the most money in terms of prescriptions.
When to use: All things Hadoop should be considered when your data is huge and you have access to the technical resources to run it (i.e. a Hadoop cluster)
Complimenting Hadoop is a sister project by Apache called Spark. Spark is much faster than MapReduce, as all operations are done in memory, lending it to situations where speed is of the essence. Typically, this means when you have a flow of big data, rather than a static lump of data that isn’t going to change.
Spark can use file systems such as the Hadoop Distributed File System, offers various libraries for performing a variety of tasks, and can be interfaced via languages such as SQL, R, Java, Python or Scala. The interface to run Spark tasks is programmatic, and allows you to tell Spark how to load, split, map, reduce, filter, count, etc, your data.
One relevant library that’s becoming increasingly popular in Spark is called MLlib, which allows users to perform machine learning on a big data scale. However, the justification for deploying such a tool is almost certainly lacking in most veterinary scenarios, at least at the time of writing.
When to use: When you need speed with your big data
Practical Use of Big Data Tools
One significant issue with all of these big data technologies is that to use them, you need some serious hardware and IT efforts to get things up and running. An easier solution is to use a pre-configured cloud based interface, such as Amazon Web Services (AWS). AWS offers a myriad of cloud-based computing solutions, including a large section on all things big data. By using AWS, you can spin up a cluster of computers (usually for a few dollars an hour), analyse your data, and then spin everything down. This is extremely cost effective, and doesn’t require a huge IT undertaking.
Examples of the use of tools such as Hadoop and Spark in the veterinary literature (…no, I didn’t even try searching for Hive or Pig), are virtually non-existent. This is in stark contrast to the literature in human health. This, I believe, reflects the fact that not only are these tools relatively new and fairly niche, but that data is typically not shared or collated amongst vets. Hopefully this will change in the future, giving the need for veterinary professionals to one day use such tools. For now, consider this section an attempt to future-proof the use of big data tools in the veterinary world.
Getting Started with all things Data
There are hundreds of other tools and systems for dealing with data, but hopefully these two posts give a flavour of what’s out there.
In terms of where to go next in terms of learning any of the above, I would strongly recommend a MOOC (Massive Open Online Course). The two platforms that I use the most are Coursera and Udemy, but there are dozens now out there for learning about data analysis. Beyond the MOOCs there are even more paid-for courses, and I would recommend the CloudXLab platform for learning hands-on big data technologies (used for running the Hive code for this blog).In summary, whatever data you’ve got, there are tools out there to allow you to crunch the numbers and extract insight. Excel is great, but know that there is a world beyond the spreadsheet, and learning how to use some of these tools has never been cheaper, quicker or easier.
Written by Rob Harrand – Technology & Data Science Lead
DID YOU FIND THIS USEFUL?
To register your email to receive further updates from Avacta specific to your areas of interest then click here.
GP practice prescribing data – Presentation level https://data.gov.uk/dataset/prescribing-by-gp-practice-presentation-level/resource/222a490b-0445-4dc5-a4d0-25fe0c8b721c (c) NHS Digital, 2017 . This information is licensed under the terms of the Open Government Licence.