Finding Drug Mentions in Social Media



An Improved Method for Generating Misspellings
of Medication Names for Social Media Searches

Robert D. Hogan, PhD¹, Graciela Gonzalez-Hernandez, PhD²

¹Terminologix LLC, Antigo, WI; ²Univ. of Pennsylvania, Philadelphia, PA



Data mining of social media for adverse drug reactions (ADRs) may be a valuable tool for early ADR detection, leading to improved patient safety. Typically, a social media mining pipeline starts by collecting comments that include the name of a given medication as a keyword. However, medication names are routinely misspelled in social media, so generating suitable misspelled variants of the medication name is the first challenge for automated detection. A previous method for generating misspelled drug mentions relied on a purely phonetic match. We systematically analyzed the profile of misspellings in a corpus of postings from a patient support forum to characterize the types of misspellings that are likely to be used, and in particular whether they are phonetic in nature. We also tested an assumption from previous work that the Google Custom Search Engine® (“Google CSE”) could be used as a surrogate vocabulary for selecting spelling variants for use in social media search. We found that misspellings of medication names are not consistently phonetic in nature, and that Google CSE is an excellent vocabulary surrogate for drug misspellings, but only when the search is limited to relevant patient support sites rather than the entire web. Using these findings, we developed a four-stage pipeline for generating medication name variants that can be used in keyword-based social media search queries. In a small sample of medication names, a search query that includes the correct name plus the top 10 variants generated by our method generally recalled 99% or more of all mentions for medication names of up to 10 characters. Longer names are more likely to have misspellings at an edit distance of 3 or more, which our method does not capture, resulting in lower recall of about 96%.
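
The general idea of edit-distance variant generation followed by vocabulary filtering can be illustrated with a minimal sketch. This is not the four-stage pipeline itself, whose stages are not detailed here. It assumes candidates within edit distance 2 of the correct name (counting an adjacent transposition as a single edit, which is our assumption), and it stands in for the site-restricted search engine with a plain dictionary of hit counts. The function names, the hit_counts dictionary, and the counts shown are hypothetical and purely illustrative.

```python
import string

ALPHABET = string.ascii_lowercase


def edits1(word):
    """All strings one edit away from word (delete, transpose, replace, insert)."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right
                for c in ALPHABET]
    inserts = [left + c + right for left, right in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts) - {word}


def edits_up_to_2(word):
    """All strings reachable from word with one or two edits, excluding word."""
    first = edits1(word)
    second = {e2 for e1 in first for e2 in edits1(e1)}
    return (first | second) - {word}


def top_variants(name, hit_counts, k=10):
    """Keep candidates the surrogate vocabulary has seen, ranked by hit count.

    hit_counts is a hypothetical mapping from a spelling to the number of
    hits returned by a search restricted to relevant patient support sites.
    """
    candidates = edits_up_to_2(name.lower())
    seen = [(v, hit_counts[v]) for v in candidates if hit_counts.get(v, 0) > 0]
    # Sort by descending hit count, breaking ties alphabetically.
    seen.sort(key=lambda pair: (-pair[1], pair[0]))
    return [variant for variant, _ in seen[:k]]


def build_query(name, variants):
    """Join the correct name and its variants into one keyword query."""
    return " OR ".join([name.lower()] + variants)


# Made-up counts for illustration only; not measured data.
toy_counts = {"prozak": 40, "prosac": 25, "prozacc": 5, "porzac": 3, "xyz": 100}
variants = top_variants("Prozac", toy_counts, k=3)
print(build_query("Prozac", variants))
```

Because the candidate set stops at edit distance 2, a sketch like this shares the limitation noted above for longer names: misspellings at edit distance 3 or more are never generated, whatever their hit counts.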