Data in Veterinary: Twitter Scraping

“If animals could speak, the dog would be a blundering outspoken fellow; but the cat would have the rare grace of never saying a word too much.”― Mark Twain

The Twitterverse

If you’re reading this blog post, then I’ll assume you know what Twitter is. What you may not know, however, is that tweets can be searched for automatically (or ‘scraped’) by the thousands using computer programs. Not only that, but because most tweets are posted publicly, the results of a search can then be analysed in all sorts of ways, within the limits of Twitter’s terms of service.

Readers of previous blog posts will know that I’m a fan of R, the open-source statistical language used worldwide for data science. There is an R package called rtweet, which allows R users to find tweets in a variety of ways.

I recently decided to test it out, searching for all tweets that had the words ‘dog walk’ in them. Below is what I found.

Walkies!

The ‘search_tweets’ function in the rtweet package searches Twitter and returns results from roughly the last 6 to 9 days. I ran this on 26th January 2018 and, after about 60 seconds, received 12,774 tweets. Checking the range of dates, these tweets were from the period 16th January to 26th January.
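A search like this might be sketched as follows. This is an illustration under assumptions rather than the exact code used: it assumes rtweet is installed and Twitter API credentials are already set up, ‘fetch_dog_walk_tweets’ is a hypothetical helper name, the value of ‘n’ is a guess at a suitable upper limit, and a few made-up timestamps stand in for real results so the date check can run offline.

```r
# Hypothetical helper wrapping rtweet::search_tweets(). Calling it needs the
# rtweet package and valid Twitter API credentials, so it is only defined here.
fetch_dog_walk_tweets <- function(n = 18000) {
  rtweet::search_tweets(q = "dog walk", n = n)
}

# Made-up stand-in for the created_at column of the returned data (UTC).
created_at <- as.POSIXct(
  c("2018-01-16 09:30:00", "2018-01-21 18:05:00", "2018-01-26 07:45:00"),
  tz = "UTC"
)

# Check the range of dates covered by the results.
date_range <- range(as.Date(created_at))
print(date_range)
```

With real results, the same check would be run on the ‘created_at’ column of whatever ‘fetch_dog_walk_tweets()’ returns.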

From the mass of tweets, I first looked at which ones had location information attached to them. The majority, 11,766, did not. Of the 1,008 that did, the top 3 countries in terms of frequency were the US (578 tweets), the UK (312 tweets) and Canada (30 tweets).
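Counting located tweets by country could look something like the sketch below. The data frame is made up, and the ‘country’ column name is an assumption based on what rtweet returned around that time, with NA meaning no location was attached.

```r
# Made-up stand-in for the search results (NA = no location information).
tweets <- data.frame(
  text = c("dog walk in the rain", "morning dog walk", "dog walk done",
           "long dog walk", "snowy dog walk", "dog walk at dusk"),
  country = c(NA, "United States", "United Kingdom",
              NA, "United Kingdom", "Canada"),
  stringsAsFactors = FALSE
)

# How many tweets carry no location information?
n_missing <- sum(is.na(tweets$country))

# Frequency of each country among located tweets, most common first.
# table() ignores NA values by default.
country_counts <- sort(table(tweets$country), decreasing = TRUE)

print(n_missing)
print(head(country_counts, 3))
```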

I decided to restrict what I was looking at to just UK tweets, and after removing any duplicate tweets, I was left with 251. From these, I found that the top 3 sources were Twitter for iPhone (120 tweets), Twitter for Android (68 tweets) and Instagram (41 tweets). Looking at the location tag, the top 3 were the South East (6 tweets), Cardiff (5 tweets) and the West Midlands (4 tweets).
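The filtering and de-duplication steps might be sketched like this. The data is made up, the column names are assumptions, and treating tweets with identical text as duplicates is only one possible definition of ‘duplicate’.

```r
# Made-up stand-in for the search results.
tweets <- data.frame(
  text = c("Cold dog walk", "Cold dog walk", "Muddy dog walk",
           "Dog walk selfie", "Frosty dog walk", "Dog walk in Ohio"),
  source = c("Twitter for iPhone", "Twitter for iPhone", "Twitter for Android",
             "Instagram", "Twitter for iPhone", "Twitter for iPhone"),
  country = c("United Kingdom", "United Kingdom", "United Kingdom",
              "United Kingdom", "United Kingdom", "United States"),
  stringsAsFactors = FALSE
)

# Keep UK tweets only, guarding against missing country values.
uk <- tweets[!is.na(tweets$country) & tweets$country == "United Kingdom", ]

# Drop duplicates, here defined as tweets with identical text.
uk <- uk[!duplicated(uk$text), ]

# Sources ranked by how many tweets came from each.
source_counts <- sort(table(uk$source), decreasing = TRUE)
print(head(source_counts, 3))
```

The same ‘sort(table(...))’ pattern would work for a location tag column.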

Plotting Tweet Data

How about visualising some of this data? One interesting question is when people tweet about dog walks. The answer has to be treated with caution because, of course, people can tweet about walks that happened days or even years ago, or about walks planned for the future. But on the whole, after eyeballing the tweets, most seemed to relate to things that had just happened or were about to happen. Focussing on the hour that the tweets were sent revealed the following:

Figure 1 – Histogram of dog walking tweets – UK

This shows, unsurprisingly, that dog walk tweets start to pick up at about 6am and are fairly well spread out during the day, with a small increase in the late evening. To be sure these really were reflecting walks with a dog, below are 5 examples, along with the hour in which they were tweeted:

(5-6am) … “#Dunblane #snowing heavily since 4.45am #wind cuts you in half #very snowy cold dog walk”

(7-8am) … “Took dog out for walk, got three feet from the house, slipped and went flat on my back! Be careful out there pavements are treacherous! #ice”

(12-1pm) … “Very small signs of Spring on my dog walk today. Lovely to be out under blue skies again!”

(4-5pm) … “The dog (black lab) was reluctant to go out for his evening walk, a true sign of how grim it is out there”

(10-11pm) … “Every Friday night on the dog walk in #carlisle I hear the #carlislecathedral bells being rung”
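An hourly histogram like Figure 1 could be produced along these lines. This is a sketch using base R graphics and made-up timestamps. The ‘created_at’ column name and the conversion to UK local time are assumptions about how the hours were derived.

```r
# Made-up timestamps standing in for the created_at column (UTC).
created_at <- as.POSIXct(
  c("2018-01-20 05:40:00", "2018-01-20 07:10:00", "2018-01-21 07:55:00",
    "2018-01-22 12:30:00", "2018-01-23 16:20:00", "2018-01-26 22:15:00"),
  tz = "UTC"
)

# Extract the hour of day (0-23) in UK local time.
hours <- as.integer(format(created_at, "%H", tz = "Europe/London"))

# One bar per hour: breaks sit on the half-hours so that each whole hour
# falls cleanly inside its own bin.
h <- hist(hours, breaks = seq(-0.5, 23.5, by = 1),
          main = "Dog walking tweets by hour (UK)",
          xlab = "Hour of day", ylab = "Number of tweets")
```

Placing the breaks on the half-hours avoids the default behaviour where values that land exactly on a bin edge are grouped with the bin to their left.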

Next, let’s break the text down to see which words are most frequent. I used an R package called tm for this, which is a collection of tools for text mining. Once the text was analysed, I used another package called wordcloud to produce the following (focussing on words above a certain minimum frequency):

As you can see, words like ‘park’, ‘cold’, ‘snow’ and ‘wet’ appear, reflecting both the nature of the tweets and the time of year!
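The word-frequency step might look roughly like the sketch below, which assumes the tm and wordcloud packages are installed. The three example texts are made up, and the cleaning steps and the ‘min.freq’ value are illustrative choices, not necessarily the ones used for the word cloud above.

```r
library(tm)         # text-mining tools
library(wordcloud)  # word cloud plotting

# Made-up tweet texts standing in for the real data.
texts <- c("Cold wet dog walk in the park",
           "Snow on the dog walk today",
           "Dog walk in the park again")

# Build a corpus and apply some basic cleaning.
corpus <- VCorpus(VectorSource(texts))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))

# Count how often each term appears across all texts.
tdm <- TermDocumentMatrix(corpus)
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)

# Draw only the words that appear at least twice.
wordcloud(words = names(freq), freq = freq, min.freq = 2)
```

With real tweets it would probably also make sense to drop the search terms themselves (‘dog’ and ‘walk’), since they appear in every tweet by construction.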

Finally, these tweets have latitude and longitude data associated with them. Let’s plot them on a map of the UK:

 

A pretty good distribution across the country, although it looks like mid-Wales might be lacking in 3G coverage (or dog walkers with Twitter accounts, but I doubt that)!
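A map like this could be drawn in several ways. The sketch below uses the maps package, which is an assumption, since the post doesn’t say which mapping tool was used. The coordinates are made up for illustration. With real rtweet output, the ‘lat_lng’ function in rtweet can add ‘lat’ and ‘lng’ columns like these.

```r
library(maps)  # assumed mapping package; the post does not name one

# Made-up coordinates standing in for tweet locations.
uk <- data.frame(
  place = c("Cardiff", "Carlisle", "Dunblane", "Unknown"),
  lat = c(51.48, 54.89, 56.19, NA),
  lng = c(-3.18, -2.93, -3.97, NA),
  stringsAsFactors = FALSE
)

# Keep only the rows that have both a latitude and a longitude.
located <- uk[complete.cases(uk[, c("lat", "lng")]), ]

# Draw the UK outline, then overlay one point per tweet.
map("world", regions = "UK", xlim = c(-11, 3), ylim = c(49, 61))
points(located$lng, located$lat, pch = 19, col = "red")
```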

Other Possibilities

This gives a quick introduction to what’s possible with R and Twitter. Using tools and techniques like those shown above, you can easily pull out interesting things from the global, relentless brain-dump that’s happening every second of every day. Could you use this to answer interesting social questions? How about for marketing? Whatever your ideas, I recommend having a go and seeing what you can find. And if you need any help getting things up and running, feel free to drop me an email.

Written by Rob Harrand – Technology & Data Science Lead
