One of the biggest fears that pharmacovigilance professionals have when it comes to using social media is the amount of work it will create. How will you find the time needed to "clean" the data and get it fit for analysis? Machines can be trained to identify posts with drug-adverse event combinations, but it is harder for them to confirm that a post accurately represents an adverse event. Thus, Booz Allen's Epidemico team uses human curation to provide the most accurate data.

Curation is the manual review of a subset of data by human annotators immediately following automated processing. This step reduces false positives and false negatives and improves the Bayesian classifier through positive and negative feedback loops, constantly teaching the algorithm to recognize emerging syntax and slang. Human curators are trained never to make medical interpretations beyond the stated social media text and to adhere to MedDRA coding standards and MedDRA's ICH-endorsed guidelines.
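To make the feedback loop concrete, here is a minimal sketch of how a curator's label could be fed back into a Bayesian text classifier. It assumes a simple naive Bayes model over whitespace-separated words with Laplace smoothing; the class name, method names, and example posts (including "DrugX") are hypothetical and are not taken from the Epidemico system.

```python
import math
from collections import Counter


class NaiveBayesPostClassifier:
    """Toy naive Bayes model: is a post an adverse event report (True) or not (False)?"""

    def __init__(self):
        self.word_counts = {True: Counter(), False: Counter()}
        self.doc_counts = {True: 0, False: 0}

    @staticmethod
    def tokenize(text):
        # Hypothetical tokenizer: lowercase and split on whitespace.
        return text.lower().split()

    def learn(self, text, is_adverse_event):
        """Add one labeled post, e.g. a curator's confirmation or correction."""
        self.doc_counts[is_adverse_event] += 1
        self.word_counts[is_adverse_event].update(self.tokenize(text))

    def _log_score(self, tokens, label):
        vocab = set(self.word_counts[True]) | set(self.word_counts[False])
        total_docs = sum(self.doc_counts.values())
        total_words = sum(self.word_counts[label].values())
        # Laplace-smoothed class prior.
        score = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
        for tok in tokens:
            # Laplace-smoothed word likelihood; the extra +1 reserves
            # probability mass for words never seen in training.
            count = self.word_counts[label][tok]
            score += math.log((count + 1) / (total_words + len(vocab) + 1))
        return score

    def predict(self, text):
        tokens = self.tokenize(text)
        return self._log_score(tokens, True) > self._log_score(tokens, False)


clf = NaiveBayesPostClassifier()
clf.learn("drugx gave me a headache", True)
clf.learn("drugx commercial is so annoying", False)
clf.learn("saw drugx ads all day", False)

# The slang term "zonked" is unknown to the model, so the post is missed.
before = clf.predict("so zonked after drugx")  # False

# A curator reviews a similar post and labels it as an adverse event.
clf.learn("zonked all day after drugx", True)

after = clf.predict("so zonked after drugx")  # True
```

A single curated example is enough to flip the prediction here only because the training set is tiny; in practice the effect of each correction would be much smaller.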

Another source of noise could be duplicate or multiplicate posts, like re-tweets or the posting of the same thing to multiple platforms. With this blog, for example, we usually use the same text in our Twitter, LinkedIn, and Facebook notifications.

To remove duplicates, literal duplicates are first identified and consolidated using verbatim matches. For Twitter data specifically, the system further identifies duplicate posts according to characteristics such as the marker “RT” (used in Twitter to denote a “retweet”). Next, a rule-based approach, which requires more computing power and is based on a Bloom filter, treats fuzzy matches as duplicates. If a post is nearly identical to another post but differs by a small number of characters, it is marked as a duplicate. This method results in a recall rate of 100% (i.e., 100% of duplicates will be captured), while the probability of a false positive is only 0.001% (i.e., there is a 0.001% chance that a post will be falsely marked as a duplicate).
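The three steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the production pipeline: the post does not say how the Bloom filter and the fuzzy rules are combined, so here a small hand-rolled Bloom filter handles verbatim matches and Python's standard `difflib.SequenceMatcher` stands in for the fuzzy rules. The 0.9 similarity threshold, the filter size, and all function names are hypothetical.

```python
import hashlib
import re
from difflib import SequenceMatcher


class BloomFilter:
    """Compact set membership test; may give rare false positives, never false negatives."""

    def __init__(self, size_bits=1 << 16, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive several bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


# Assumed retweet convention: "RT @user: original text".
RT_PREFIX = re.compile(r"^\s*RT\s+@\w+:?\s*")


def normalize(text):
    # Lowercase and collapse whitespace before comparing posts.
    return " ".join(text.lower().split())


def label_duplicates(posts, fuzzy_threshold=0.9):
    seen = BloomFilter()
    unique_texts = []
    labels = []
    for post in posts:
        if RT_PREFIX.match(post):          # Twitter-specific rule
            labels.append("retweet")
            continue
        norm = normalize(post)
        if norm in seen:                   # verbatim match
            labels.append("verbatim")
            continue
        if any(SequenceMatcher(None, norm, u).ratio() >= fuzzy_threshold
               for u in unique_texts):     # near-identical match
            labels.append("fuzzy")
        else:
            labels.append("unique")
            unique_texts.append(norm)
        seen.add(norm)
    return labels


posts = [
    "Took DrugX and got a terrible headache",
    "Took DrugX and got a terrible headache",
    "RT @someone: Took DrugX and got a terrible headache",
    "Took DrugX and got a terrible headache!!",
    "DrugY made me feel dizzy all afternoon",
]
labels = label_duplicates(posts)
# ['unique', 'verbatim', 'retweet', 'fuzzy', 'unique']
```

Because a Bloom filter never returns a false negative, every verbatim repeat is caught, and the false positive rate can be tuned through the filter size and the number of hash functions. The pairwise fuzzy comparison shown here is quadratic in the number of unique posts, so a system working at scale would need a more efficient candidate search.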

What this means for Evidex users is that they can feel confident that the posts they are examining and performing analytics on are "noise free". To help our clients get up to speed with using social media for pharmacovigilance, we created a quick start guide. Not yet a client, but interested in seeing Evidex, powered by Booz Allen's Epidemico social monitoring data, in action? Request a demo.

Topics: Pharmacovigilance 2.0

Written by Jim Davis

As Executive Vice President, Jim is responsible for the commercialization strategy for Advera Health Analytics.