If you read through 100 tweets you are sure to find a few things of interest. If you scour 500 million a day you can develop fascinating insight, generate a patent, and potentially win two awards for your findings.
That was the culmination of a three-year project by Thomson Reuters researchers to see whether people tweet about the side effects they experience from popular pharmaceuticals. In early July, our paper, ‘Quantifying Self-Reported Adverse Drug Events on Twitter: Signal and Topic Analysis,’ won Best Methods Paper at the 2016 International Conference on Social Media & Society.
Our team consisted of three researchers with diverse backgrounds: Vassilis Plachouras, Senior Scientist with a background in information retrieval; Jochen Leidner, Director of Research, who leads the London site of the global Thomson Reuters Research & Development group; and former colleague Andrew Garrow.
Machine learning-based classifier joins the team
However, our fourth team member – a machine learning-based classifier – was perhaps our most important. To tackle the immense amount of data from Twitter, we constructed a cascaded pipeline of system components, ending in a machine learning-based classifier that filtered out tweets that didn’t match our criteria. We were looking for tweets:
- about pharmaceutical drugs;
- that mention drug side-effects; and
- that suggest the tweeter is the patient – using human crowds as a sensor.
In all, our classifier would extract over 90 signals from each tweet, squeezing quite a bit of information out of at most 140 characters. It gave each of these signals its own weighting, constantly re-evaluating which were most useful for finding correct tweets; our classifier was learning.
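To make the idea of weighted signals concrete, here is a minimal sketch in Python. It is not our production classifier: the four signals, the word lists and the example tweets are invented for illustration, and a simple perceptron update stands in for the learning algorithm, which this article does not specify.

```python
import re

# Hypothetical word lists; the real system used over 90 signals.
FIRST_PERSON = {"i", "my", "me"}
SYMPTOM_WORDS = {"headache", "nausea", "dizzy", "rash"}


def extract_signals(tweet: str) -> dict[str, float]:
    """Turn a tweet into a small set of named numeric signals."""
    text = tweet.lower()
    tokens = set(re.findall(r"[a-z']+", text))
    return {
        "bias": 1.0,
        "first_person": float(bool(tokens & FIRST_PERSON)),
        "symptom_word": float(bool(tokens & SYMPTOM_WORDS)),
        "has_url": float("http" in text),
    }


def score(weights: dict[str, float], signals: dict[str, float]) -> float:
    """Weighted sum of the signals."""
    return sum(weights.get(name, 0.0) * value for name, value in signals.items())


def classify(weights: dict[str, float], tweet: str) -> bool:
    return score(weights, extract_signals(tweet)) > 0


def train(examples, epochs: int = 10) -> dict[str, float]:
    """Perceptron training: adjust weights only when a prediction is wrong."""
    weights: dict[str, float] = {}
    for _ in range(epochs):
        for tweet, label in examples:  # label is 1 (relevant) or 0 (not)
            signals = extract_signals(tweet)
            error = label - int(score(weights, signals) > 0)
            if error:
                for name, value in signals.items():
                    weights[name] = weights.get(name, 0.0) + error * value
    return weights


# Invented training examples, for illustration only.
EXAMPLES = [
    ("this pill gave me a terrible headache", 1),
    ("my new meds make me so dizzy", 1),
    ("drug recall announced http://example.com", 0),
    ("study links drug to nausea http://example.com", 0),
]

weights = train(EXAMPLES)
```

On this toy data the learned weights end up positive for first-person and symptom words and negative for URLs, which is the kind of re-weighting described above, only at a much smaller scale.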
The reward was an average of 721 tweets per day, out of 500 million, that matched our initial criteria. This sample was large enough to be relevant, yet small enough for a human analyst to manually double-check and interpret the findings, which suggests the system could be used by, for example, a regulatory body.
Headaches of our own
If you are familiar with the language of today’s social media, you know it is not the most regulated or predictable. Our classifier had to be taught which signals in a tweet suggested irony, how to translate “belly ache” into “acute abdominal pain,” and how to represent alternative (colloquial) names for the more than 2,200 medical drugs investigated.
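The translation step can be pictured as a lookup from colloquial phrases to professional terms. The sketch below is a simplified illustration in Python, not our actual lexicon: only the “belly ache” entry comes from this article, while the other entries and the function name `normalize` are assumptions made for the example.

```python
import re

# Colloquial phrase -> professional term.
SYMPTOM_TERMS = {
    "belly ache": "acute abdominal pain",  # example given in the article
    "tummy ache": "acute abdominal pain",  # assumed entry, for illustration
}

# Brand or colloquial drug name -> generic name (illustrative entry only;
# the project covered more than 2,200 drugs).
DRUG_ALIASES = {
    "tylenol": "acetaminophen",
}

LEXICON = {**SYMPTOM_TERMS, **DRUG_ALIASES}

# Try longer phrases first so multi-word entries win over shorter overlaps.
_PHRASES = sorted(LEXICON, key=len, reverse=True)
_PATTERN = re.compile(r"\b(" + "|".join(re.escape(p) for p in _PHRASES) + r")\b")


def normalize(text: str) -> str:
    """Lower-case the text and replace known colloquial terms."""
    return _PATTERN.sub(lambda m: LEXICON[m.group(1)], text.lower())
```

For instance, `normalize("Belly ache after taking Tylenol")` gives `"acute abdominal pain after taking acetaminophen"`. A dictionary like this cannot detect irony, which is why that part had to be learned from signals instead.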
The volume and velocity of the information we were processing were so great that we had to devise a cascaded architecture (a set of progressively broader conditions that tweets had to meet before further queries were applied) to help our classifier sift the information. These challenges combined to pose a risk to the project, but a risk that we in Research are used to. We’re constantly innovating to overcome challenges like these, so we can nurture ideas, concepts and inventions that provide Thomson Reuters – and by extension our customers – with tools to find trusted and comprehensive answers.
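A cascade of this kind can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not our system: the three stages mirror the three criteria listed earlier, but each is reduced to a keyword check, whereas the real final stage was the machine learning-based classifier. All names and word lists are hypothetical.

```python
import re
from collections import Counter
from typing import Callable, Iterable, Iterator

# Hypothetical word lists, for illustration only.
DRUG_NAMES = {"ibuprofen", "acetaminophen"}
SIDE_EFFECT_WORDS = {"headache", "nausea", "rash"}
FIRST_PERSON = {"i", "my", "me"}


def _tokens(tweet: str) -> set[str]:
    return set(re.findall(r"[a-z]+", tweet.lower()))


def mentions_drug(tweet: str) -> bool:
    return bool(_tokens(tweet) & DRUG_NAMES)


def mentions_side_effect(tweet: str) -> bool:
    return bool(_tokens(tweet) & SIDE_EFFECT_WORDS)


def is_first_person(tweet: str) -> bool:
    return bool(_tokens(tweet) & FIRST_PERSON)


STAGES: list[tuple[str, Callable[[str], bool]]] = [
    ("drug", mentions_drug),
    ("side_effect", mentions_side_effect),
    ("first_person", is_first_person),
]


def cascade(tweets: Iterable[str], calls: Counter) -> Iterator[str]:
    """Yield tweets that pass every stage; stop checking at the first failure."""
    for tweet in tweets:
        for name, stage in STAGES:
            calls[name] += 1  # record how much work each stage does
            if not stage(tweet):
                break
        else:
            yield tweet


TWEETS = [
    "Lovely weather in London today",
    "Ibuprofen is on sale this week",
    "Study finds ibuprofen may cause nausea",
    "Took ibuprofen and now I have a rash",
]

calls: Counter = Counter()
kept = list(cascade(TWEETS, calls))
```

Because a tweet is dropped at the first stage it fails, each later stage sees fewer tweets than the one before it. That is what keeps the expensive steps affordable at Twitter’s scale.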
As well as Twitter’s API, our project relied on comprehensive scientific information, including the professional nomenclature of thousands of drugs.
But the applications for this technique aren’t limited to the medical field. Our methodology could be adapted to search for financial instruments linked to bankruptcy, or government legislation receiving a negative reaction from people who identify as a particular profession.
We built our own server API on top of the hierarchy and classifier so users could design their own applications.
In 2015, our patent received the Thomson Reuters Inventor Award, as it was deemed to have the greatest potential that year to progress Thomson Reuters technology.
You will always find something interesting if you look hard enough, but you can find truly fascinating things when you search smartly enough.
Visit Innovation @ ThomsonReuters.com to learn more about how we are pairing smart data with human expertise and how you can get involved.