A DISSERTATION
DOCTOR OF PHILOSOPHY
By
Kathy Lee
EVANSTON, ILLINOIS
December 2017
ProQuest Number: 10638877
All rights reserved
INFORMATION TO ALL USERS
The quality of this reproduction is dependent upon the quality of the copy submitted.
In the unlikely event that the author did not send a complete manuscript
and there are missing pages, these will be noted. Also, if material had to be removed,
a note will indicate the deletion.
ProQuest 10638877
Published by ProQuest LLC (2018). Copyright of the Dissertation is held by the Author.
All rights reserved.
This work is protected against unauthorized copying under Title 17, United States Code
Microform Edition © ProQuest LLC.
ProQuest LLC.
789 East Eisenhower Parkway
P.O. Box 1346
Ann Arbor, MI 48106 - 1346
© Copyright by Kathy Lee 2017
ABSTRACT
Kathy Lee
Social media such as Twitter has risen as a powerful new communication medium. On
social media, people talk about their lifestyles, health conditions and symptoms, search for
information on treatment options, and connect with people who have been through simi-
lar medical experiences to get emotional support. Such health information generated by
patients or family members is not available in medical documents created by health care
providers and became publicly available only recently with the prevalent use of microblog-
ging sites, which makes social media an invaluable source of health data to mine. However,
social media data is often short, unstructured, and written in colloquial language, which
poses significant challenges for automated analysis.
In this thesis, we focused on mining public Twitter data for healthcare intelligence.
We designed models based on bag-of-words and social network structure features that
classify trending topics into general categories such as sports, technology and health.
This model could help identify trending topics and posts in the health domain and benefit
users searching for health-related information. We
also proposed a real-time digital disease surveillance system that uses spatial, temporal,
and text mining techniques to track disease activities. Our work was motivated by the
fact that, while traditional disease surveillance systems require 1–2 weeks to collect
and process data before it becomes publicly available, Twitter data is available in near
real-time and the aggregated social media data can provide an overall health state of
the general population earlier than the traditional disease surveillance systems can. We
further built a neural network model that combines Twitter data with the observed data
from Centers for Disease Control and Prevention (CDC) to predict current and future
influenza activities. Our system can serve as a proxy for early detection of pandemics and
the resulting insights are expected to help facilitate faster response to and preparation
for epidemics. We also investigated the use of clinical knowledge sources to train deep
learning models for medical concept normalization in which health conditions described
in natural (colloquial) language are mapped to a standard clinical term. The proposed
model can help an automatic system to effectively interpret health concepts written in
layman’s language.
The studies presented in this thesis provide interesting insights into the application
of machine learning and text mining on social media data in the healthcare domain. We
hope our work motivates further study of online user-generated data to gain meaningful
healthcare insights.
Acknowledgements
I would like to thank Prof. Alok Choudhary for advising this research and for his
constant guidance and valuable feedback. He has inspired me to work on research problems
in social media and healthcare domains. He has always given me a steady force to maintain
focus throughout this journey. I would also like to thank my
thesis committee members Prof. Wei-keng Liao and Prof. Ankit Agrawal.
I wish to thank my parents for their endless love and sacrifice, and providing me with
the best education. I dedicate this thesis to them. I also wish to thank my husband Seung
Woo and my two children Daniel and Ashley for patiently supporting me. Without their
encouragement and dedication, I would not have been able to successfully finish this long
journey.
Lastly, I would like to thank all members of the Center for Ultra-scale Computing and
Information Security (CUCIS) at Northwestern University.
Table of Contents
ABSTRACT
Acknowledgements
Table of Contents
List of Tables
List of Figures
Chapter 1. Introduction
2.1. Introduction
2.3.2. Labeling
2.5. Summary
Chapter 3. Allergy Surveillance
3.1. Introduction
3.2.1. Datasets
3.2.2. Methodology
3.5. Summary
4.1. Introduction
4.3. Summary
5.1. Introduction
5.3. Method
5.3.1. Dataset
5.4. Results
5.5. Summary
6.1. Introduction
6.4.1. Data
6.5. Results
References
List of Tables
3.1 Tweets with positive and negative labels. A tweet is positive if it talks
about the author or someone around the author having allergy.
3.3 A list of most frequently used bigrams where the second word is allergy.
3.5 Most prevalent food allergies.
Correlation results when varying learning rates and numbers of hidden layers
and hidden units are used. The highest correlation of 0.9559 was obtained.
Correlation results when varying numbers of hidden units are used. The highest
correlation of 0.929 was obtained using learning rate λ = 0.2 and one hidden
layer with 4 activation units.
6.1 Medical concepts in UMLS and example social media phrases.
6.2 Data statistics after removing duplicates from the combined training data.
6.4 Data statistics after removing concepts that had less than five examples.
6.5 Medical concepts and similar words based on cosine similarity (dimension = 300).
List of Figures
2.3 Web interface deployed for manual labeling. Annotators read the trend
definition and tweets before labeling trending topics as one of the 18 classes.
Allergy tweets are used to create the graph, which illustrates the general allergy
level trend over time: the allergy level is highest in mid-May, goes down in June
and July, and starts rising again in August.
3.2 Monthly average data for allergy tweet count (blue), daily highest
temperature (green), and pollen level (red) for Washington state (correlation 0.668).
3.4 Time-series graph of tweet count for various allergy symptoms throughout
the year, followed by cough (green) and runny nose (sky blue).
Allergy levels for each U.S. state; the tweet count is normalized by state, and
the seasonal variation of allergy levels across the U.S. is clearly visible.
U.S. disease activity maps, timelines, and pie charts on our project
websites [15][16].
4.2 Our Real-Time Digital Flu Surveillance Website [16]. The ‘Daily Flu’
timeline shows the volume changes of tweets mentioning the word ‘flu’ over
time; peaks coincided with the dates when major U.S. newspapers reported the
Boston flu emergency [21] and the deaths of four children from the AH3N2 flu
epidemic. A map displays tweet volumes mentioning ‘flu’ by state.
4.3 Flu Symptoms Timeline. The timeline displays tweet volume changes;
‘cough’ (green line) and ‘fever’ (dark orange line) reach their highest level in
mid-January and decrease as the actual national ILI level reported by CDC
decreases.
Twitter data was smoothed out to align with CDC data for current and 1-week
ahead forecasts.
5.2 Data available at current week t. At the end of week t, all flu-related
Twitter data collected during week t and prior are available; the past two
weeks’ (Wt−1 and Wt) CDC data is not yet available.
5.4 Comparison of our current and 1-week ahead U.S. influenza activity
forecast results against CDC and Google Flu Trends data.
CHAPTER 1
Introduction
Social media has gained popularity as a new means for information sharing in the
last decade. The rise of social media along with advancements of mobile technologies
such as smart phones and tablets has changed communication patterns among friends
and families.
Twitter is one of the largest microblogging social networks, where people post short
text messages called tweets. Users can subscribe to receive tweets by following other
users they are interested in. If user A selects to receive all tweets posted by user B, A
is called a follower and B is called a friend of A. User A can follow user B back, but is
not obligated to do so. Users can select to share information publicly or privately within
small social circles. By default, tweets are publicly viewable by others unless the user
sets his/her Twitter account to private, which makes Twitter a great real-time resource for
information search where the latest news and events can be found faster than in any other
media. People generally like to find out about news at the exact moment it is happening,
read/write information at their convenience, and search for what they want to know.
Users create hashtags, a pound sign (#) followed by a word or un-spaced phrase, to
dynamically tag user-generated posts, which makes searching tweets on a specific topic or
theme easy. Retweeting is a unique feature of Twitter that allows users to conveniently share
information with their followers, thereby letting information propagate at a speed faster
than other traditional media. On social media, users share news, events, experiences,
and opinions on various topics. The language used in Twitter has several characteristics.
While the 140 character limit on tweet text makes it fun and exciting for users to post a
tweet, it also causes the language to be short, noisy, prone to misspelling, and frequent
use of emojis and acronyms. Also, users often use multiple languages within the same post.
These pose many challenges for an automated system to accurately interpret the meanings
of the sentences. These are relatively new problems generated by the unique ways people
communicate on social media.

Social media has a wide scope of applications. In business, it can be used for brand
monitoring and marketing. In politics,
social media has been widely used for presidential campaigns, fundraising, and to measure
public opinions. In healthcare, patients use social media and online health forums to
search for medical answers, seek medical advice on treatments, and connect with other
patients with similar experiences.

Twitter tracks trending topics to identify popular topics of discussion. Trending topics
can be unique to a specific geographic location or time, and their popularity is measured
by the volume of tweets. Classifying trending
topics into general categories such as sports, news, music, science, technology, health, and
so on, provides readers more context and helps narrow down the search space. We explore
such topic classification in this thesis.
Mining social media for healthcare insights is a relatively new research area that
has emerged with the rapid growth of microblogging services in the last decade. We
built a real-time digital disease surveillance system that constantly collects, analyzes, and
visualizes the aggregated data. We studied distribution of disease types, symptoms and
treatments social media users talk about on three common diseases: cancer, allergy, and
influenza.
Cancer is a disease that involves abnormal cell growth and is among the leading causes
of death worldwide.1 In 2017, 1,688,780 new cancer cases and 600,920 cancer deaths are
projected to occur in the United States [94]. Allergy is another common disease, affecting
a large portion of the population, and is triggered by genetic
and various environmental factors. Roughly 7.8% of people 18 and over in the U.S. have
hay fever, a common allergic condition also known as allergic rhinitis.2 Prior studies have
shown that allergy symptoms are highly associated with lost work productivity [64]. Early
detection and treatment support can help reduce lost work productivity and potentially
reduce the health care costs. Influenza is one of the most common viral infections; it
affects the lungs, nose, and throat. It is a contagious disease with symptoms similar to a
cold but usually more severe; it lasts longer and can cause various complications leading
to death. In recent years, influenza activity tracking using social media has been a very
active area of research, following Google Flu Trends, which estimates the prevalence of influenza
activity using aggregated Google search query log data. Early detection of influenza levels
can help reduce the impact caused by a pandemic and provide more time to prepare an
emergency response. The Centers for Disease Control and Prevention (CDC) collects and
reports the prevalence of influenza-like illness (ILI) based on physician visit data across
the country with a two-week time lag. We explored using Twitter posts mentioning
1https://www.cancer.gov/about-cancer/understanding/statistics
2http://www.aaaai.org/about-aaaai/newsroom/allergy-statistics
symptoms of influenza as a real-time resource to track influenza levels, and built neural-
network based real-time and 1-week ahead flu forecast models using both Twitter and
CDC data.
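The nowcasting idea described above can be illustrated with a toy sketch. A simple linear model fitted by gradient descent stands in for the thesis's neural network, and all feature values below are synthetic inventions, not actual Twitter or CDC numbers.

```python
# Toy sketch: predict this week's ILI level from the current week's
# Twitter flu-tweet volume and the latest available (two-week-lagged)
# CDC value. All numbers are synthetic; the thesis uses a neural network.
weeks = [
    # (twitter_volume, cdc_lag2, true_ili)
    (120, 1.1, 1.4), (150, 1.3, 1.7), (200, 1.6, 2.2),
    (260, 2.0, 2.8), (310, 2.5, 3.3), (280, 2.9, 3.0),
]

# Fit y = w1*twitter + w2*cdc_lag2 + b by stochastic gradient descent.
w1 = w2 = b = 0.0
lr = 1e-6  # tiny rate because tweet volumes are large
for _ in range(20000):
    for x1, x2, y in weeks:
        err = w1 * x1 + w2 * x2 + b - y
        w1 -= lr * err * x1
        w2 -= lr * err * x2
        b -= lr * err

nowcast = w1 * 330 + w2 * 2.7 + b  # estimate for a hypothetical new week
print(round(nowcast, 2))
```

The real system replaces this linear fit with a neural network trained on many weeks of data, but the inputs (real-time tweet counts plus lagged CDC observations) play the same role.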
Users describe their health conditions, ask questions related to a certain disease or
treatment on social media. However, the colloquial nature of the languages used in social
media makes it difficult to automatically map the medical concepts present in the text to
standard medical terminologies. In addition, various ways of describing the same medical
condition poses an additional challenge for an automated system to understand the concepts.
If such descriptions were normalized to standard
ontology terms, automatic systems would be able to search relevant clinical resources
and use the aggregated large-scale clinical data to track and detect disease spread for
population health.
This work demonstrates that social media is a useful resource to obtain health-related
information and the aggregated personal health information can be used for population
health management. Our main contributions are building automatic systems that 1)
classify trending topics and posts into general categories to help information search in
a specific domain such as health [1], 2) mine Twitter data as a real-time resource to
monitor disease (allergy, cancer, influenza) activities [2, 3, 4], 3) predict current and
future influenza levels by combining social media data with observed data from CDC
for features [5], and 4) normalize medical concepts described in user-generated texts to
standard clinical terms.
CHAPTER 2
2.1. Introduction
Twitter1 is an extremely popular microblogging site, where users search for timely
and social information such as breaking news, posts about celebrities, and trending topics.
Users post short text messages called tweets, which are limited to 140 characters in length
and can be viewed by the user’s followers. Anyone who chooses to receive another user’s
tweets on his/her timeline is called a follower. Twitter has been used as a medium for real-time
information dissemination and it has been used in various brand campaigns, elections, and
as a news media. Since its launch in 2006, the popularity of its use has been dramatically
increasing. As of June 2011, about 200 million tweets are being generated every day.
When a new topic becomes popular on Twitter, it is listed as a trending topic, which
may take the form of short phrases (e.g., Michael Jackson) or hashtags (e.g., #election).
What the Trend2 provides a regularly updated list of trending topics from Twitter. It is
very interesting to know what topics are trending and what people in other parts of the
world are interested in. However, a very high percentage of trending topics are hashtags,
whose names often give little indication of
what the trending topics are about. It is therefore important to classify these topics into
general categories for easier understanding of topics and better information retrieval.
1http://www.twitter.com
2http://www.whatthetrend.com
The trending topic names may or may not be indicative of the kind of information
people are tweeting about unless one reads the trend text associated with it. For example,
#happyvalentinesday indicates that people are tweeting about Valentine’s Day. A trend
named Boone Logan indicates that the tweets are about a person named Boone Logan.
Anyone who does not follow American Major League Baseball (MLB), however, will not
know that the information is regarding Boone Logan, who is a pitcher for the New York
Yankees, unless a few tweets are read from this trending topic, as shown in Figure 2.1.
We found that trend names were often not indicative of the information being transmitted. To
address this problem, we defined 18 general classes: arts & design, books, business, charity
& deals, fashion, food & drink, health, holidays & dates, humor, music, politics, religion,
science, sports, technology, tv & movies, other news, and other. Our goal was to aid users
searching for information on Twitter to look at only a smaller subset of trending topics by
classifying topics into general classes (e.g., sports, politics, books) for easier retrieval of
information.
To classify trending topics into these predefined classes, we proposed two approaches:
the well-known bag-of-words text classification and a classifier using social network information. In
this paper, we used supervised learning techniques to classify Twitter trending topics.
First, we employed a well-known text classification technique called Naive Bayes (NB)
[73]. In NB, a document is modeled by the presence or absence of particular words.
(2.1)    P (c|d) ∝ P (c) · ∏_{1 ≤ k ≤ n_d} P (tk |c),
where P (c|d) is the probability of a document d being in class c, P (c) is the prior prob-
ability of a document occurring in class c, and P (tk |c) is the conditional probability of
term tk occurring in a document of class c.
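The decision rule in Eq. (2.1) can be sketched with a toy multinomial Naive Bayes implementation. The training snippets and labels below are invented examples, and add-one (Laplace) smoothing is an assumption the chapter does not spell out.

```python
from collections import Counter, defaultdict
import math

# Toy illustration of Eq. (2.1): P(c|d) ∝ P(c) * ∏ P(t_k|c).
# The documents and classes here are invented, not the thesis dataset.
train = [
    ("superbowl touchdown game", "sports"),
    ("election vote senate", "politics"),
    ("game score win", "sports"),
]

class_counts = Counter(c for _, c in train)
word_counts = defaultdict(Counter)
vocab = set()
for text, c in train:
    for w in text.split():
        word_counts[c][w] += 1
        vocab.add(w)

def predict(text):
    best, best_lp = None, float("-inf")
    for c, n_c in class_counts.items():
        # log prior + log likelihoods with add-one smoothing
        lp = math.log(n_c / len(train))
        total = sum(word_counts[c].values())
        for w in text.split():
            lp += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = c, lp
    return best

print(predict("touchdown game"))  # sports
```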
Apart from text-based classification, we also incorporated Twitter social network in-
formation for topic classification. For the latter we made use of topic-specific influential
users [78], which were identified using Twitter friend-follower network. The influence
rank was calculated per topic using a variant of the Weighted Page Rank algorithm [102].
In general, a tweeter is said to have high influence if the sum of the influence of those
following him/her is high. The key idea of the proposed network-based approach was to
predict the category of a topic knowing the categories of its similar topics. Similar topics
were identified using user-similarity metric, which was the cardinality of the intersection
of influential users between two topics ti and tj divided by the cardinality of the top s influ-
encers of topic ti [78]. We experimented with different classifiers, for example, C5.0 (an
improved version of C4.5) [87], k-Nearest Neighbor (kNN) [23], Support Vector Machine
(SVM) [44], Logistic Regression [66], and ZeroR (the baseline classifier), and found that
C5.0 classifier resulted in the best accuracy on our data set. Experimental results showed
that both our approaches effectively classified trending topics with high accuracy, given
that it was an 18-class classification problem. This work was published in [1].
A number of recent papers have addressed the classification of tweets. Sriram et al. [97]
classified tweets to a predefined set of generic classes such as news, events, opinions, deals,
and private messages based on author information and domain-specific features extracted
from tweets such as presence of shortening of words and slangs, time-event phrases, opin-
ionated words, emphasis on words, currency and percentage signs, “@username” at the
beginning of the tweet, and “@username” within the tweet. Genc et al. [49] introduced
a Wikipedia-based classification approach, mapping
messages into their most similar Wikipedia pages and calculating semantic distances be-
tween messages based on the distances between their closest Wikipedia pages. Kinsella
et al. [60] included metadata from external hyperlinks for topic classification on a social
media dataset. Whereas all these previous works used the characteristics of tweet texts or
meta-information from other information sources, our network-based classifier used topic-
specific social network information to find similar topics, and used the categories of similar
topics for prediction.
Sankaranarayanan et al. [91] built a news processing system that identified tweets
corresponding to late breaking news. Issues addressed in their work included removing
the noise, determining tweet cluster of interest using online methods, and identifying
relevant locations associated with the tweets. Yerva et al. [103] classified tweet messages
to identify whether they were related to a company or not using company profiles that
were generated semi-automatically from external web sources. Whereas all these previous
works classified tweets or short text messages into two classes, our work classified tweets
into 18 general classes.
Becker et al. [27] explored approaches for distinguishing tweet messages between real-
world events and non-event messages. The authors used an online clustering technique
to group topically similar tweets together, and computed features that could be used to
identify event clusters.
There has been a lot of research in sentiment classification of short text messages. Go
et al. [51] introduced an approach for automatically classifying sentiment of tweets with
emoticons using distant supervised learning. Pang et al. [80] classified movie reviews by
determining whether a review was positive or negative. But none of these classified
topics into a large number of general categories as our work does.
[Figure 2.2: Overview of the proposed classification system: tweets flow through
Data Collection, Labeling, Data Modeling (text-based and network-based), and
Machine Learning (modeling and validation) stages.]
As shown in Figure 2.2, the proposed classification system consisted of four stages:
Data Collection, Labeling, Data Modeling, and Machine Learning. In our experiments, we
used two data modeling methods: (1) Text-based data modeling, and (2) Network-based
data modeling.
The website What the Trend provides a regularly updated list of the ten most popular topics
called “trending topics” from Twitter. A trending topic may be a breaking news story
or it may be about a recently aired TV show. The website also allows thousands of
users across the world to define, in a few short sentences, why the term is interesting or
important to people, which we refer to as “trend definition”. The Twitter API3 allows
programmatic access to public tweets. We
downloaded trending topics and definitions every 30 minutes from What the Trend and
all tweets that contained trending topics from Twitter while the topic was trending.
All the tweets containing a trending topic constituted a document. For example, while
the topic “superbowl” was trending, we kept downloading all tweets that contained the
word “superbowl” from Twitter, and saved the tweets in a document called “superbowl”.
In case a tweet contained more than one trending topic, the tweet was saved in all
relevant documents. For example, if a tweet contained two trending topics “superbowl”
and “NFL”, the same tweet was saved into two documents called “superbowl” and “NFL”.
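The per-topic document construction described above can be sketched as follows; the topic names and tweet texts are invented examples.

```python
from collections import defaultdict

# Each tweet containing a trending topic is appended to that topic's
# document; a tweet mentioning several topics goes into each document.
trending = ["superbowl", "nfl"]
tweets = [
    "Watching the superbowl with friends!",
    "Great NFL superbowl halftime show",
    "New phone arrived today",
]

documents = defaultdict(list)
for tweet in tweets:
    lowered = tweet.lower()
    for topic in trending:
        if topic in lowered:
            documents[topic].append(tweet)

print(dict(documents))
```

The second tweet mentions both topics, so it lands in both the "superbowl" and "nfl" documents, mirroring the duplication rule in the text.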
From the 23,000+ trending topics that we had downloaded since February 2010, we randomly
selected a sample for labeling.
Figure 2.3. Web interface deployed for manual labeling. Annotators read
the trend definition and tweets before labeling trending topics as one of the
18 classes.
2.3.2. Labeling
We identified 18 classes for topic classification. The classes were art & design, books,
charity & deals, fashion, food & drink, health, humor, music, politics, religion, holidays
& dates, science, sports, technology, business, tv & movies, other news, and other. Since
Twitter is a primary source of news or information, the news related to political events
were classified as politics. If the topic was about news that was not in any of the categories,
it was classified as other news. If the trend definition or tweet text was gibberish or if it
was in a language other than English, then we classified the topic into the other category. The
data was labeled by reading the topic’s trend definition and a few tweets.
We used two annotators to label all topics. In case of disagreement, a third annotator
intervened. For the labeling task, a random sample of 1,000 topics was selected. From
the 1,000, we narrowed the data set down to 768 topics for mainly two reasons: either the
topic had no trend definition, or the third annotator could not finalize the label.
For each of the 768 topics in our dataset, its five most similar topics were also labeled,
which were required for the network-based modeling as described in Section 2.3.3.2. We
ended up manually labeling 3,005 topics because some of the similar topics were common
to more than one topic. Figure 2.3 shows the web interface we deployed for the labeling
task.
The distribution of data over the 18 classes is provided in Figure 2.4. The sports
category had the highest number of topics (19.3%), followed by the other category (12%).
Except for the categories other news, tv & movies, and music, all other categories contained
less than 6.8% of the topics. Figure 2.5 shows examples of trending topics that were
classified as technology.
The data, which comprised each topic’s trend definition, tweets, and label, was processed in
two stages. In the first stage, for each topic, a document was created from trend defini-
tion and varying numbers of tweets (30, 100, 300, and 500). From the document text,
all tokens with hyperlinks were removed. This document was then assigned a label corre-
sponding to the topic. In the next stage, the document was run through a string-to-word
vector kernel, which consisted of two components. The first component was the tokenizer
that removed delimited characters and stop words. We used a customized stop words list
catered to Twitter lingo4. The second component transformed the tokens into tf-idf (term
frequency–inverse document frequency) weights, keeping the
top 500 and 1,000 frequent terms per category. For each of the 18 labels, the top most fre-
quent words with their tf-idf weights were used to build the dataset for machine learning.
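The string-to-word-vector step can be sketched with a minimal stdlib tf-idf computation. The stop-word list and the two documents below are toy stand-ins for the customized Twitter-lingo pipeline described in the text.

```python
import math
from collections import Counter

# Minimal tf-idf over per-topic documents: stop-word removal followed by
# term weighting. Documents are invented; a real pipeline would keep only
# the top-N weighted terms per category.
stop_words = {"the", "a", "is", "rt"}
docs = {
    "superbowl": "the superbowl game is great great",
    "macbook": "the new macbook is fast",
}

tokenized = {
    name: [w for w in text.split() if w not in stop_words]
    for name, text in docs.items()
}

def tfidf(term, doc_name):
    tokens = tokenized[doc_name]
    tf = tokens.count(term) / len(tokens)          # term frequency
    df = sum(term in toks for toks in tokenized.values())  # document frequency
    idf = math.log(len(tokenized) / df)            # inverse document frequency
    return tf * idf

print(round(tfidf("great", "superbowl"), 3))  # 0.347
```

Terms appearing in every document get idf = 0 and thus zero weight, which is why generic Twitter lingo is also removed up front as stop words.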
Unlike text-based data modeling, network-based data modeling used Twitter-specific
social network information. A friend-follower relationship indicates
interest between two users and is directed and asymmetric: user A can freely choose to
follow user B without B’s consent and B does not necessarily have to follow A. We used
the algorithm from User Similarity Model [78] to find five most similar topics for trend-
ing topic X. The algorithm used the class of similar topics that were manually labeled
in section 2.3.2 to predict the class of topic X. In user similarity model, topic-specific
influential users were computed using Twitter social network information such as tweet
time, number of tweets made on a topic, and friend-follower relationship. Then, using
4http://www.twithawk.com
the number of common influential users between two topics, most similar topics were cal-
culated. Although the user similarity model captured different dimensions of similarity
such as temporal and geographical, our assumption was that a majority of the similar
topics would fall into the same category as the target topic and hence we could predict
the category of target topic using the categories of its similar topics.
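The user-similarity computation described above (the number of common influential users between two topics divided by the size of the top-s influencer set of the first topic) can be sketched as follows; the user IDs are hypothetical.

```python
# Sketch of the user-similarity metric from the User Similarity Model:
# |top-s influencers of topic i ∩ influencers of topic j| / s.
# Influencer lists are invented examples, ordered by influence rank.
def user_similarity(influencers_i, influencers_j, s):
    top_i = set(influencers_i[:s])        # top-s influencers of topic i
    common = top_i & set(influencers_j)   # users influential in both topics
    return len(common) / len(top_i)

macbook = ["u1", "u2", "u3", "u4", "u5"]
iwork = ["u2", "u3", "u9"]
print(user_similarity(macbook, iwork, s=5))  # 0.4
```

Note the metric is asymmetric, matching the directed nature of the friend-follower graph: similarity of i to j need not equal similarity of j to i.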
Table 2.1 and Figure 2.6 show an example of the topic “macbook”, its five most similar
topics, and number of common influential users between topic “macbook” and its similar
topics. Trending topic “macbook” was classified as technology by manual labeling, and
its five most similar topics (“iwork”, “magic trackpad”, “#landsend”, “apple ipad” and
“mobileme”) were manually labeled as technology, technology, charity & deals, technology,
technology. The numbers in Fig. 2.6 indicate the number of common influential users who
tweeted about both “macbook” and its similar topic. The resulting data for machine
learning in this case consists of 768 rows and 19 columns. Each row represents a trending
topic. 18 columns represent the 18 classes and the last column represents the class label. Since
topic “macbook” has four similar topics in technology, the sum of the four counts of common
influential users for those topics
becomes the value for row “macbook” and column technology in the table, and the count
corresponding to its similar topic “#landsend” becomes the value for row “macbook” and
column charity & deals.
Table 2.1. Five most similar topics of topic “macbook” in class technology.
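Assuming the row construction just described, the feature row for “macbook” can be sketched as follows. The similar-topic labels follow the example in the text, but the common-influencer counts are invented placeholders rather than the values in Table 2.1, and only 3 of the 18 class columns are shown.

```python
# One row of the network-based dataset: for each class, sum the
# common-influential-user counts of the similar topics labeled with that
# class; the last entry is the topic's own (manually assigned) label.
classes = ["technology", "charity & deals", "sports"]  # 18 classes in the thesis

similar = [  # (similar topic, manual label, common influential users)
    ("iwork", "technology", 12),
    ("magic trackpad", "technology", 9),
    ("#landsend", "charity & deals", 4),
    ("apple ipad", "technology", 15),
    ("mobileme", "technology", 7),
]

row = {c: 0 for c in classes}
for _, label, common in similar:
    row[label] += common
row["label"] = "technology"  # class of "macbook" itself
print(row)
```

With 768 topics this yields the 768 x 19 matrix described in the text, which the C5.0 and other classifiers then consume.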
The two datasets constructed as a result of the two approaches in the Data Modeling
stage were used as inputs to machine learning stage. We built predictive models using
various classification techniques and selected the ones that resulted in the best classification
accuracy.
For our experiments, we used popular tools such as WEKA [100] and SPSS Modeler [56].
WEKA is a widely used machine learning tool that supports various modeling
algorithms for data preprocessing, clustering, classification, regression and feature selec-
tion. SPSS modeler is another popular data mining software with unique graphical user
interface and high prediction accuracy. It is widely used in business marketing, resource
planning, medical research, law enforcement and national security. In all experiments, 10-
fold cross-validation was used to evaluate the classification accuracy. The ZeroR classifier
which simply predicts the majority class was used to get a baseline accuracy.
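The evaluation setup (10-fold cross-validation against a ZeroR majority-class baseline) can be sketched with stdlib Python; the labels below are synthetic, not the thesis dataset.

```python
import random
from collections import Counter

# Minimal 10-fold cross-validation of a ZeroR baseline, which always
# predicts the majority class of the training folds. Labels are synthetic.
random.seed(0)
data = [("x%d" % i, "sports" if i % 3 else "politics") for i in range(100)]
random.shuffle(data)

k = 10
folds = [data[i::k] for i in range(k)]  # 10 folds of 10 examples each

accuracies = []
for i in range(k):
    test_fold = folds[i]
    train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
    majority = Counter(label for _, label in train).most_common(1)[0][0]
    correct = sum(label == majority for _, label in test_fold)
    accuracies.append(correct / len(test_fold))

mean_acc = sum(accuracies) / k
print(mean_acc)  # the majority-class proportion, 0.66 here
```

Any learned classifier is then evaluated with the same fold splits, so its mean accuracy can be compared directly against this baseline.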
[Figure 2.7: Accuracy (%) of text-based classification for NBM, NB, and SVM-L
models with varying numbers of tweets and top frequent terms.]
Using Naive Bayes Multinomial (NBM), Naive Bayes (NB), and Support Vector Ma-
chines with linear kernels classifiers (SVM-L), we found that the accuracy of classification
is a function of the number of tweets and the number of frequent terms. Fig. 2.7 presents
the comparison, where each topic document includes the trend definition. Model(x,y)
represents the classifier model used to classify topics,
with x tweets per topic and the y top frequent terms. For example, NB(100,1000)
represents the accuracy using the NB classifier with 100 tweets per topic and the 1,000 most
frequent terms per category.
In our experiments, the NB model always provided a lower accuracy than the NBM model
because NBM models word counts rather than mere word presence. SVM-L per-
formed better than NB but had slightly lower accuracy compared to NBM. If only the trend
definition was used, irrespective of the most frequent word terms, the accuracy was much
lower for all three classifiers compared to using trend definition plus tweets. The experi-
mental results suggested that NBM classifier using text from trend definition, 100 tweets,
and a maximum of 1,000 word tokens per category gave the best accuracy of 65.36%.
Fig. 2.8 compares the classification accuracy of different algorithms for network-based classifi-
cation. Clearly, the C5.0 decision tree classifier gave the best classification accuracy (70.96%),
followed by k-Nearest Neighbor (63.28%), Support Vector Machine (54.349%), and Logis-
tic Regression (53.457%). C5.0 decision tree classifier achieved 3.68 times higher accuracy
compared to the ZeroR baseline classifier. The 70.96% accuracy was very good consider-
ing that we categorized topics into 18 classes. To the best of our knowledge, the number
[Figure 2.8: Accuracy (%) of network-based classification for the C5.0 decision
tree, k-Nearest Neighbor, Support Vector Machine, Logistic Regression, and
ZeroR classifiers.]
of classes used in our experiment was much larger than the number of classes used in any
prior work on trending topic classification.
2.5. Summary
In this paper, we explored two different classification approaches for Twitter trending
topic classification. Apart from using text-based classification, our key contribution is
the use of social network structure rather than just textual information, which
can often be noisy in the context of social media such as Twitter due to the heavy use
of Twitter lingo and the limit on the number of characters that users are allowed to
generate for their messages. Our results show that the network-based classifier performed
significantly better than the text-based classifier on our dataset. Considering that tweets
are not clean, well-formed text, Naive Bayes Multinomial provides fair results and can be
leveraged in cases where social network information is not available.
CHAPTER 3
Allergy Surveillance
3.1. Introduction
Allergy is the fifth most common chronic disease in the United States1. The complex-
ity and severity of allergic diseases are increasing worldwide [82]. One in five Americans
has either allergy or asthma symptoms. In 2012, 7.5% of adults (17.6 million adults) and
9% of children (6.6 million children) were diagnosed with hay fever [30, 29]. Continuous
use of allergy medication can worsen patients' health conditions and lead to side effects. The growing number of allergy
patients gives rise to allergy-related health care costs and leads to reduced work productivity:
$7.9 million is spent annually on allergy-related health care systems and business,
and 4 million workdays are lost to hay fever each year. Therefore, accurate allergy
surveillance and forecasting are important to minimize health care costs and maximize work productivity.
Twitter, one of the largest social networking websites, allows users to post short text
messages called tweets that can be up to 140 characters in length. Twitter has over
328 million monthly active registered users. Twitter has been used as a valuable real-
time information resource for various applications. For instance, Twitter data have been
1http://www.webmd.com/allergies/allergy-statistics
used to detect earthquakes in Japan [89], predict the stock market [33] and for an in-
depth study of the 2011 Egyptian Revolution [10]. On Twitter, people not only make general
chatter but also share photos, news, opinions, emotions, and even health conditions
including symptoms and medications they are taking for their diseases. In recent years,
many researchers have investigated using Twitter for disease surveillance, especially for
influenza epidemic detection and prediction [81, 39, 22, 96, 26, 38, 65, 93, 69].
In this paper, we mined large-scale Twitter data collected over 28 months to monitor
allergy activities. Specifically, 1) machine learning classifiers were employed to distinguish tweets that mentioned actual incidents of allergy from those that
talked about news or general awareness about allergy, 2) text-mining techniques such as
n-gram extraction and part-of-speech tagging were applied to extract predominant allergy
types, and 3) a spatiotemporal mining was applied to track allergy levels over time and
space.
We believe that our work is the first framework toward real-time allergy surveillance
using a fine-grained spatiotemporal analysis on large-scale social media data. The data
analysis results reveal that Twitter is an excellent resource for detecting allergy prevalence.
Our proposed system can help visualize past and current trends of allergy levels detected
in the social media stream. The real-time analysis results are updated on our allergy project website.
3.2.1. Datasets
3.2.1.1. Twitter dataset. We collected allergy-related tweets from public tweet stream
using Twitter’s streaming API2. We collected over 6.3 million tweets that mentioned
‘allergy’ or ‘allergies’ created by over 3.1 million unique users over 28 months from January
2013 to April 2015. Some talked about their allergy symptoms (e.g., Walked out of my
house confused as to why my eyes felt like they were on fire and then I realized it’s allergy
season.) while others talked about allergy types (e.g., I sneezed like eight times in a row.
This pollen allergy is killing me.) or allergy treatments/medication they took (e.g., sitting
3.2.1.2. Pollen dataset. We collected monthly average pollen levels and 90-day historic pollen
levels for U.S. major cities from pollen.com3. The pollen level is a number between 0
and 12 and divided into five categories: 0.0-2.4 (low), 2.5-4.8 (low-med), 4.9-7.2 (med), 7.3-9.6 (med-high), and 9.7-12.0 (high).
3.2.1.3. Climate dataset. Climate Data Online (CDO)4 provides free access to National
Climatic Data Center (NCDC)’s archive of global historical weather and climate data.
We collected daily and monthly temperature and precipitation data generated since
January 2013 (because the earliest allergy-related Twitter data we had was generated in
January 2013) for major U.S. cities and states. More than a half of the climate data
2https://dev.twitter.com/docs/streaming-apis
3http://www.pollen.com/
4http://www.ncdc.noaa.gov/cdo-web/
collecting stations did not report daily temperatures at all, and many of those that did had missing values.
3.2.1.4. Allergy patients' dataset. We used data from the first Quest Diagnostics Health
Trends allergy report, Allergies Across America5. This report is the largest analysis of
allergy testing of patients in the United States under the evaluation for medical
symptoms associated with allergies. We collected a ranked list of most prevalent food
allergies grouped by patients’ ages and a ranked list of the worst U.S. cities for different
allergy types.
3.2.2. Methodology
3.2.2.1. Data Preprocessing. To identify tweets that mentioned actual
allergy incidents, we removed all retweets (20.51% of our initial dataset) and tweets that
were not written in English (2.9% of our initial dataset). Special HTML character entities were
replaced with human-readable characters (e.g., &lt; was replaced with < (the less-than sign) and
&gt; with > (the greater-than sign)), and all hyperlinks were replaced with the string
'URL'.
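The preprocessing steps above can be sketched in Python. This is a minimal sketch, not the authors' code; the `text` and `lang` field names follow Twitter's JSON schema.

```python
import html
import re

def preprocess(tweets):
    """Sketch of the preprocessing pipeline: drop retweets and
    non-English tweets, unescape HTML character entities, and
    replace hyperlinks with the placeholder 'URL'."""
    cleaned = []
    for t in tweets:
        text = t["text"]
        if text.startswith("RT @"):            # remove retweets
            continue
        if t.get("lang") != "en":              # keep English tweets only
            continue
        text = html.unescape(text)             # e.g. '&lt;' -> '<'
        text = re.sub(r"https?://\S+", "URL", text)  # mask hyperlinks
        cleaned.append(text)
    return cleaned
```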
3.2.2.2. Data Classification. While some tweets talked about a person having allergy
symptoms, other tweets talked about news, questions, or general awareness of the allergy season. It was important
to distinguish tweets that mention actual allergy incidents to infer precise allergy levels.
Hence, we classified tweets into two classes. First, we manually labeled 2,000 randomly
selected tweets into positive and negative. A tweet was labeled as positive if it talked
5
https://www.questdiagnostics.com/dms/Documents/Other/2011_QD_AllergyReport.pdf
Table 3.1. Tweets with positive and negative labels. A tweet is positive if
it talks about the author or someone around the author having allergy. A
tweet is negative if it is a question or talks about news, general awareness
or information about allergies.
Label (Positive +1 / Negative -1)  Tweet
+1 My allergies are going insane today.
(Author has allergy)
+1 Stupid allergies not letting me sleep.
(Author has allergy)
+1 Recently my lovely allergy to cats has led to my throat clos-
ing up n barely being able to breathe.
(Author has allergy)
+1 I never been able to enjoy spring cause my allergies. I hate
having itchy eyes and running nose.
(Author has allergy)
+1 @user1 @user2 and @user3 are all dying because of their
allergies.. and Im just sitting here.. #popapill
(People around author have allergies)
-1 In the United States, around 15 million people have food al-
lergies, according to Food Allergy Research and Education.
(News)
-1 Does anyone know good food near Happy Hollow that has
vegetarian options and is easy for seafood allergies?
(General question)
-1 Notice the increase in allergy ads on TV? Yep, spring is
around the corner.
(Awareness about spring season)
-1 RT @CureAllergies: What You Should Do To Manage Your
Allergies - URL.
(Information for allergy management)
about the author or someone around the author having allergy symptoms. A tweet was
labeled as negative if it was a question or talked about news, general awareness, or information about allergies. Table 3.1 shows example tweets with positive and negative labels. The text
in parentheses indicates the reason for the positive or negative annotation. We
removed common stop words except the pronouns I, me, my, you, and your because we
found that these pronouns were important features in classifying tweets into positive and
negative examples of actual incident of allergy. To create features, we applied Weka [53]’s
StringToWordVector filter. All unigrams, bigrams, and trigrams were used to construct
the feature vector if they appeared at least twice in the training data. Then the filter
converted words into their stems, applied TF-IDF weighting scheme, and kept 500 most
frequently used n-grams in the final feature vector. We then explored four different machine learning algorithms (Naive Bayes (NB), Naive Bayes Multinomial (NBM), Random Forest (RF), and Support Vector Machine (SVM)) that are commonly used for text classification.
In our classification task, both precision and recall were equally important. Thus,
F-measure and ROC area were used to compare performance of classification algorithms.
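The original pipeline used Weka's StringToWordVector filter; the following pure-Python sketch illustrates the same idea (n-gram extraction, a minimum-frequency cutoff, and TF-IDF weighting), omitting the stemming step that Weka performs.

```python
import math
from collections import Counter

def ngrams(tokens, nmax=3):
    """All 1- to nmax-grams of a token list, joined by spaces."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, nmax + 1)
            for i in range(len(tokens) - n + 1)]

def tfidf_features(docs, min_count=2, top_k=500):
    """StringToWordVector-style featurizer: keep n-grams seen at
    least `min_count` times, keep the `top_k` most frequent ones,
    and weight each by TF-IDF."""
    gram_docs = [ngrams(d.lower().split()) for d in docs]
    totals = Counter(g for gd in gram_docs for g in gd)       # corpus counts
    df = Counter(g for gd in gram_docs for g in set(gd))      # document freq
    vocab = [g for g, c in totals.most_common(top_k) if c >= min_count]
    n = len(docs)
    vectors = []
    for gd in gram_docs:
        tf = Counter(gd)
        vectors.append({g: tf[g] * math.log(n / df[g])
                        for g in vocab if g in tf})
    return vocab, vectors
```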
As shown in Table 3.2, the best classification performance (F-measure of 0.811 and
ROC area of 0.905) was obtained using NBM and 10-fold cross validation on labeled data.
We built a model using NBM on our training set, and classified all remaining tweets (after
removing retweets and tweets in non-English) into positive or negative. We used NBM
because it had the best performance on our training data, and several prior works had
shown that NBM outperformed other classification algorithms. For example, McCallum
and Nigam [75] found NBM to outperform simple NB, especially at larger vocabulary
sizes, and Lee et al. [1] showed that the performance of NBM was better than that of NB
or SVM in 18-class tweet text classification. In our entire allergy corpus, 63% of tweets
were classified as positive and 37% were classified as negative. Only tweets in positive class
(i.e., tweets classified as mentions of actual allergy incidents) were used for our analysis.
TF-IDF (term frequency-inverse document frequency) [73]. The tf-idf measure assigns each word a weight that is proportional
to the number of times the word appears in the document but is offset by the frequency of
the word across the corpus. Thus tf-idf is used to filter out common words.
(NBM), which considers the frequency of words and can be denoted as:

(3.1)    P(c|d) ∝ P(c) ∏_{1≤k≤n_d} P(t_k|c),
where P(c|d) is the probability of a document d being in class c, P(c) is the prior probability of a document occurring in class c, and P(t_k|c) is the conditional probability of term t_k occurring in a document of class c.
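Equation (3.1) can be illustrated with a small hand-rolled NBM classifier. This is an illustrative sketch with add-one smoothing, not the Weka implementation.

```python
import math
from collections import Counter

def train_nbm(labeled_docs):
    """Estimate P(c) and per-class term counts from (tokens, label)
    pairs -- the quantities appearing in Eq. (3.1)."""
    prior = Counter(lbl for _, lbl in labeled_docs)
    counts = {c: Counter() for c in prior}
    for tokens, lbl in labeled_docs:
        counts[lbl].update(tokens)
    vocab = {t for c in counts for t in counts[c]}
    return prior, counts, vocab

def classify(tokens, prior, counts, vocab):
    """argmax_c log P(c) + sum_k log P(t_k|c), with add-one smoothing."""
    n = sum(prior.values())
    best, best_score = None, float("-inf")
    for c in prior:
        total = sum(counts[c].values())
        score = math.log(prior[c] / n)
        for t in tokens:
            score += math.log((counts[c][t] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = c, score
    return best

# Toy training data mirroring the positive/negative labels of Table 3.1.
data = [(["my", "allergies", "killing", "me"], "+1"),
        (["allergies", "acting", "up"], "+1"),
        (["allergy", "news", "report"], "-1")]
prior, counts, vocab = train_nbm(data)
```

For instance, `classify(["my", "allergies"], prior, counts, vocab)` favors the positive class, because both tokens are frequent in the positive training tweets.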
3.2.2.3. Allergy Type Extraction. Our goal was to discover the most predominant allergy types that people suffer from or talk about on social
media by examining the texts in Twitter posts. From our allergy-related tweet corpus, we
extracted the most frequently occurring bigrams where the second word was 'allergy'. An n-gram
is a contiguous sequence of n words in a sequence of text. N-gram models are widely used
Table 3.3. A list of most frequently used bigrams where the second word is
allergy, ranked by frequency of use in the entire allergy corpus. It includes
many actual allergy types that are in ‘noun noun’ POS tag.
Rank Most Frequently Used 2-grams POS-tag
1. food allergy noun noun
2. peanut allergy noun noun
3. gluten allergy noun noun
4. nut allergy noun noun
5. natural allergy adjective noun
6. hate allergy verb noun
7. skin allergy noun noun
8. lower allergy comparative-adjective noun
9. cat allergy noun noun
10. milk allergy noun noun
11. issues allergy verb noun
12. worst allergy superlative-adjective noun
13. dog allergy noun noun
14. severe allergy adjective noun
15. pollen allergy noun noun
(lexical category) such as noun, pronoun, verb, adjective, etc. We applied POS tagging
to each bigram. For example, the POS tag for string ‘natural allergy’ is ‘adjective noun’
and the POS tag for string ‘peanut allergy’ is ‘noun noun’. Table 3.3 shows the list of
15 most frequently used bigrams and corresponding POS tags in the descending order of
frequency of use.
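The bigram extraction and POS filtering above can be sketched as follows. A real system would use a trained POS tagger (e.g., NLTK's `pos_tag`); the small noun lexicon here is a stand-in for illustration only.

```python
from collections import Counter

# Toy noun lexicon standing in for a real POS tagger; a real tagger
# would assign parts of speech from context rather than a word list.
NOUNS = {"food", "peanut", "gluten", "nut", "skin", "cat", "milk",
         "dog", "pollen", "allergy"}

def allergy_types(tweets, top_k=15):
    """Rank bigrams whose second word is 'allergy' by frequency, then
    keep only those matching the 'noun noun' POS pattern."""
    grams = Counter()
    for text in tweets:
        toks = text.lower().split()
        for w1, w2 in zip(toks, toks[1:]):
            if w2 == "allergy":
                grams[(w1, w2)] += 1
    # Take the top_k bigrams, then drop those whose first word is not a noun.
    return [(w1, n) for (w1, w2), n in grams.most_common(top_k)
            if w1 in NOUNS]
```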
Our assumption was that the POS tag of all allergy types (e.g., food allergy, nut
allergy, pollen allergy, dust allergy, egg allergy) should be in the form of ‘noun noun’ and,
therefore, we could obtain a list of allergy types by removing all bigrams that were not in
‘noun noun’ form. In other words, we needed to remove all bigrams that contained non-
nouns (e.g., natural allergy (adjective noun), worst allergy (superlative-adjective noun))
to get the final list of allergy types. All bigrams that contained a Twitter screen name were also removed.
3.2.2.4. Spatio-temporal Mining. Every tweet comes tagged with a timestamp that
indicates the time when the tweet is posted. For example, the timestamp ‘Sun Mar 02
05:55:02 +0000 2014’ indicates that the tweet is created on Sunday, March 2, 2014 at
5:55am GMT (Greenwich Mean Time). Since we were interested in tracking allergy levels
over time, we used the timestamps to count the volume of tweets posted each day that mentioned actual allergy incidents.
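Twitter's `created_at` timestamps can be parsed with Python's standard library; the format string below is inferred from the example timestamp above.

```python
from datetime import datetime, timezone

# Twitter timestamp format, e.g. 'Sun Mar 02 05:55:02 +0000 2014'
FMT = "%a %b %d %H:%M:%S %z %Y"

ts = datetime.strptime("Sun Mar 02 05:55:02 +0000 2014", FMT)
day = ts.astimezone(timezone.utc).date()   # the bucket used for daily counts
```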
There are two types of tweet location, a sensor-based geolocation and a text-based user
profile location. A geolocation provides the exact location where the tweet was posted
with latitude and longitude values. This data is available to others only if the Twitter
user chooses to make it publicly available. Twitter users can also identify their home location in their
Twitter user profile. We examined user profile locations and extracted state information.
Examples of users’ home locations that had state information were ‘Riverside, CA’, ‘some-
where in NY’ and ‘Gainesville, Florida’. Examples of home locations that lacked state
information were ‘Home Sweet Home’, ‘Somewhere over the rainbow’ and ‘Traveling’.
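The state-extraction step can be sketched as below. The lookup table is abbreviated for illustration; a full table covering all 50 states would also need care with ambiguous tokens (e.g., the word "in" versus the state code "IN").

```python
import re

# Abbreviated lookup table; a full version would cover all 50 states.
STATES = {"california": "CA", "new york": "NY", "florida": "FL"}
CODES = set(STATES.values())

def state_from_profile(location):
    """Extract a 2-character state code from a free-text profile
    location, matching either the code ('Riverside, CA') or the
    full state name ('Gainesville, Florida')."""
    for tok in re.findall(r"[A-Za-z]+", location):
        if tok.upper() in CODES:
            return tok.upper()
    low = location.lower()
    for name, code in STATES.items():
        if name in low:
            return code
    return None   # e.g. 'Home Sweet Home' carries no state information
```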
We tagged each tweet with a 2-character state code (e.g., CA for California) if we were able to identify the state.
Some tweets had both geolocation and user profile location, some had one or the other,
and the rest did not have any location information. Geolocations were first translated
into human-readable addresses using reverse geocoding API6 and then the state name was
extracted from the address. For tweets that did not have geolocation, we obtained state
6https://developers.google.com/maps/documentation/geocoding/
name from the user profile. Those that did not have either of the two locations were not used in the spatial analysis.
Table 3.5. Most prevalent food allergies. The rank of the most prevalent
food allergies extracted from Twitter data is very similar to that obtained
from actual allergy patients’ data.
Ground Truth                               | Twitter Data
Rank  Most prevalent food allergies (Age>10) | Rank  Most mentioned food allergies
1. peanut allergy                          | 1. food allergy
2. wheat allergy (gluten allergy)          | 2. peanut allergy
3. soybean allergy                         | 3. gluten allergy
4. milk allergy                            | 4. nut allergy
5. egg allergy                             | 5. milk allergy
                                           | 6. dairy allergy
                                           | 7. egg allergy
                                           | 8. wheat allergy
allergy types mentioned in our dataset by using natural language processing methods.
For the ground truth data, we created a list of allergy types by combining data from
multiple online resources7. Table 3.4 lists the top 30 most frequently mentioned allergy
types extracted from our allergy corpus by applying methods described in section 3.2.2.3.
The numbers indicate the rank of frequency (1 means the highest frequency, 30 means the
lowest frequency). The signs in the parenthesis indicate whether the extracted allergy type
is positive (an actual allergy type) or negative (not an actual allergy type). Out of top 30
allergy types, 26 were true positives and only 4 were false positives, leading to precision
of 86.7%. Two of the four false positive cases (claritin, mucinex) were allergy medicines,
and the other two cases were allergy-related disease (asthma) and term (prescription).
The traditional method that uses a pre-defined keyword list often fails to identify new
types of diseases, and new keywords (i.e., new disease types) have to be manually added.
However, with our proposed method that automatically identifies disease types, we would
not need the step where new disease types are manually added.
Most Prevalent Food Allergies. We further evaluated our Twitter data analysis
results by comparing it to the real-world allergy patients’ data. Table 3.5 shows the
ground truth value of the most prevalent food allergies in allergy patients in the first
column and the list of most mentioned food-related allergy types from table 3.4. We
7http://www.foodallergy.org/allergens, http://www.webmd.com/allergies/guide/
allergy-symptoms-types, http://acaai.org/allergies/types, http://www.healthline.com/
health/allergies/alcohol
49
used the data for patients older than age ten because most Twitter users fell into this
age group. The allergy types in the two columns appear in a very similar order
of ranking. Note that gluten and wheat allergy can be considered the same, and
milk and dairy allergy can also be considered the same. This shows not only that the
extracted allergy types precisely identify actual allergy types, but also that the ranking of
prevalent allergy types has a very strong relationship to the real-world allergy patients'
data.
Figure 3.1. Time-series graph of daily allergy levels detected in tweets (Feb-
ruary 2013 - April 2015). Only those allergy-related tweets labeled as posi-
tive are used to create the graph. The graph illustrates the general allergy
level trend over time. The allergy level is the highest in mid–May, goes
down in June and July, starts rising again in August, and reaches its local
maximum point in mid–September. Similar seasonal patterns are observed
in both 2013 and 2014.
Figure 3.2. Monthly average data for allergy tweet count (blue), daily high-
est temperature (green), and pollen level (red) for Washington state (March
2013 – April 2015). Pollen level is highly correlated with ∆temperature
(correlation of 0.776) and ∆tweet count (correlation of 0.706). Tweet count
is very strongly correlated with temperature (correlation of 0.668).
In the temporal model, we tracked activities of allergy, various allergy types, symptoms, and
medications over time using tweet timestamps. Figure 3.1 shows the allergy-related tweet
volume changes over the two-year period from February 2013 through April 2015. The
allergy level reaches its annual global maximum in mid-May and a local maximum in mid-
September and this seasonal pattern is observed in both 2013 and 2014. The increased
number of people chatting about their allergies in May and in September indicates that a
very large population suffers from spring allergies such as tree pollen allergies and there
is also a quite large population that has allergy symptoms in the fall.
To validate our experimental results, we compared our Twitter data against the actual
pollen levels and the weather data. Because pollen levels and temperatures vary depending
on location, we partitioned allergy-related Twitter data into a finer space granularity (U.S.
state level). Figure 3.2 compares three trend-lines: allergy tweet timeline (blue), monthly
average pollen level (red), and monthly mean max temperature (green) for Washington
state. We show the data for Washington state, not just because a large volume of allergy-
related tweets were generated in WA but also because the ground truth temperature
data for WA was available for all dates from March 2013 through April 2015. It is clear
from the graph that all three trend lines illustrate seasonality. An interesting pattern is
that there is an order in time of three trend lines reaching their maximum and minimum
points. The pollen level starts rising first and reaches its peak, followed by tweet counts
and temperature. The trend lines also decrease in the same order.
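The rate-of-change correlations reported in the next paragraph can be computed as a plain Pearson correlation over first differences of the monthly series. This is a sketch; the correlation values in the text come from the authors' data.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def deltas(series):
    """Month-over-month rate of change, e.g. delta-temperature."""
    return [b - a for a, b in zip(series, series[1:])]
```

For example, the pollen-versus-delta-temperature correlation would be `pearson(pollen[1:], deltas(temperature))`.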
Our analysis shows that the pollen level is highly correlated with the rate of tempera-
ture change (correlation of 0.776) as well as the rate of tweet count change (correlation of
0.706). In other words, pollen level reaches its peak point when the temperature sharply
increases in spring and, at the same time, allergy-related tweet volume also sharply in-
creases. Also, tweet count has a strong correlation with daily temperature (correlation
of 0.688), meaning allergy tweet count increases as the temperature increases. The high
correlation values show how well the social media data reflects the real-world allergy activity.
In Figure 3.3, we show how the trend of mentions of two different allergy types differ
over time. The tweet volume mentioning ‘pollen allergy’ (a seasonal allergy) rises very high
during the spring and the fall and remains very low in the summer. However, unlike pollen
allergy, the tweet volume mentioning ‘peanut allergy’ (a food allergy) stays relatively
constant throughout the year. Note that we also carried out the same experiment at the
U.S. state level and observed similar patterns in each state. This observation implies that
the seasonality observed in the overall allergy dataset in Figures 3.2 and 3.3 comes from
seasonal allergy terms such as hay fever, rather than terms related to non-seasonal allergies such as dog, cat, milk, or
egg.
Figure 3.4. Time-series graph of tweet count for various allergy symptoms
(Feb 2013–Sep 2014). The most common allergy symptom is sneezing (blue
line) throughout the year, followed by cough (green) and runny nose (sky
blue).
Figure 3.4 is a time-series graph showing tweet volume changes for different allergy
symptoms. Sneezing (blue) is the most common allergy symptom throughout the year,
followed by cough (green), runny nose (sky blue), watery eyes (red), and itchy throat
(turquoise). It is very interesting that the rank for different allergy symptoms on each day
is consistent throughout the year. Note that the percentage of Twitter users who make
their location publicly available has been steadily increasing since we started collecting
our data.
(a) Feb 2013 (b) May 2013 (c) Aug 2013 (d) Nov 2013
For 20% of the tweets in our allergy data set, we were able to identify U.S. state
names. 11.4% of those had actual geolocation (longitude and latitude) values. For the
remaining 88.6%, state names were extracted from the user profile locations.
Figure 3.5 shows monthly snapshots of tweets with geolocations that help us visualize
allergy levels across the U.S. We show quarterly seasonal maps for 2013. Each red dot
on the map represents a tweet that was posted from the location. This map shows a
general spatiotemporal trend of allergy activities. The allergy level starts increasing in
early spring and gets extremely severe in May. It remains high throughout the summer,
and goes down in the fall. Interestingly, most allergy-related tweets come from the eastern
part of the country although there are some from the west coast.
Next, using the U.S. state information we obtained from geolocations and user profile
locations, we visualized the distribution of tweets that mentioned different allergy types.
Figure 3.6 compares levels of peanut allergy (blue bar) and gluten allergy (red bar) de-
tected by social media sensors for each U.S. state. Because a greater number of tweets
were generated from states that had larger population, tweet counts were normalized by
state population and scaled to range between 0 and 100. Kansas had the highest level of
peanut allergy (94.51). South Dakota had the lowest level of both allergy types (3.85 for
peanut allergy and 0 for gluten allergy). Most states had higher levels of peanut allergy
than gluten with a few exceptions. For example, unlike most other states, Oregon (OR),
Delaware (DE), and Montana (MT) had higher gluten allergy levels.
3.4. Related Work
Before the Internet was widely used, over-the-counter pharmaceutical sales data [72]
and telephone triage data [47] were among the methods used for disease surveillance.
Disease Surveillance using online data. In the past decade, with the dramatic
increase of Internet use, online data has been extensively used to retrieve health
information and to detect disease activities. Web search query data has been studied
to track influenza activities. Ginsberg et al. [50] used flu-related google search queries
data to estimate current flu activity near real time, 1-2 weeks in advance of the records
by the traditional flu surveillance system8. Recent research on public health and disease
surveillance using online data has mostly focused on monitoring and predicting
influenza levels. Researchers have used Twitter data to monitor influenza outbreaks and to
predict flu activities. Signorini et al. [95] attempted estimating current influenza activity
by tracking public sentiment and applying support vector machine algorithm on Twitter
data generated during the Influenza A H1N1 pandemic. Chew et al. [41] analyzed the
contents and sentiment of tweets generated during the 2009 H1N1 outbreak and showed
the potential and feasibility of using social media to conduct infodemiology studies for
public health. There are many others who have used Twitter data for flu outbreak detec-
tion [81, 39, 22, 96, 26, 38, 65, 93, 69]. Unlike earlier researchers who used Twitter
for flu activity detection and prediction, our work was, to the best of our knowledge, the
first attempt at examining allergy activities using a large-scale Twitter stream.
Prior work proposed an influenza epidemics detection method that used Natural Language Processing (NLP) to filter out
negative influenza tweets, and Tuarob et al. [99] used ensemble machine learning techniques.
In our work, we used a bag-of-words model and explored using four different machine learning
algorithms to find the best model to classify tweets into those that mention actual allergy
incidents and those that mention general awareness or information about allergy season.
8http://www.cdc.gov/flu/
Several researchers have studied the relationship between weather and pollen levels and how it affects the severity
of allergy symptoms in patients [45, 101, 46]. In our work, the allergy levels were
extracted from social media data instead of from allergy patients, and we studied the relationship
between the trend of allergy-related tweets and the actual pollen levels and temperatures.
In this paper, we focused on examining only allergy activity using a large Twitter
stream collected over two years and showed in-depth spatiotemporal analysis results.
3.5. Summary
In this work, we proposed a system that monitored allergy levels in near real time by
analyzing streaming Twitter data. We first classified tweets to identify those that mentioned
actual allergy incidents and then used those tweets with positive labels for text and spatiotemporal analysis.
The top thirty allergy types extracted by our algorithm had precision of 86.7%. The
experimental results further showed that the rank of the most prevalent food allergy
types detected from tweet stream was highly correlated to the ground truth value, the
ranked list of prevalent allergies, obtained from real-world allergy patients' data.
Tweets mentioning seasonal allergy-related terms (e.g., pollen) showed clear seasonal patterns (a large volume of tweets in the spring
and a low volume of tweets in the winter) whereas those mentioning non-seasonal allergy
related terms (e.g., peanut) remained relatively constant throughout the year. By studying
relationships between allergy tweets and the pollen and weather data, we showed
that all three data sources had similar seasonal patterns and that the allergy tweet data had a very strong correlation with both.
We believe that our work was the first study that examined large-scale social media
data for in-depth analysis of allergy activities. Although our work specifically focused
on studying allergy activities, the model could be generalized to track activities of other
diseases.
CHAPTER 4
4.1. Introduction
The Internet is usually the first place people turn for health information. People
search for a specific disease, symptoms, and appropriate medical treatments, and often
make decisions whether they should go see a doctor based on the search results. Healthcare
portal sites and the social media are popular online health information resources among
U.S. Internet users [?]. Disease surveillance is the monitoring of clinical syndromes such
as flu and cancer that have a significant impact on medical resource allocation and health
policy. Disease surveillance plays an important role in minimizing the harm caused by the
outbreaks by constantly observing the disease spread. The traditional approach employed
by the Centers for Disease Control and Prevention (CDC) [18] for flu surveillance includes
the collection of Influenza-like Illness (ILI) patients’ data from sentinel medical practices.
The main drawback of this method is the 1-2 weeks time lag between the time of medical
diagnosis and the time when the data becomes available. Early detection of a disease
outbreak is critical because it would allow faster communication between health agencies and the public.
We built a novel real-time disease surveillance system that used Twitter data to track
U.S. influenza and cancer activities. Twitter1 is a popular micro-blogging service where
users can post short messages. Twitter’s popularity as a medium for real-time information
dissemination has been constantly increasing since its launch in 2006. The proposed sys-
tem continuously downloads flu and cancer related Twitter data using Twitter streaming
API [17] and applies spatial, temporal, and text models on this data to discover national
flu and cancer activities and popularity of disease-related terms. The outputs of the three
models are summarized as pie charts, time-series graphs, and U.S. disease activity maps
on our project websites [15][16] in real time. This demonstration was built upon and ex-
tended our previous work [2]. In this work, the text analysis on most frequently occurring
terms was added. We further extended our real-time disease surveillance system to track cancer activities in addition to influenza.
Figure 4.1 shows the architecture of our real-time flu and cancer surveillance system.
Our dataset consisted of all recent tweets that mentioned the keywords ‘flu’ or ‘cancer’.
We collected over 6 million flu-related tweets generated by more than 3.3 million unique
users for 5.5 months since October 16, 2012, and over 3.7 million cancer-related tweets
generated by more than 1.3 million unique users for 3 months since January 7, 2013.
Such big data presents a number of challenges due to its size and complexity, relating
to its storage, retrieval, analysis, and visualization, especially when the whole process is
required to be done in real-time as in this work. Our system was designed to be a disease
surveillance system that is (almost) always available, robust, and easily scalable for big
1https://twitter.com/
[Figure 4.1: system architecture showing the Twitter API feeding spatio-temporal mining, temporal mining, and text mining components backed by data storage and real-time analysis.]
3)>,$=4'4'/$
data. Different from many other related big data projects, which performed analytics on a
massive, static dataset, our system consisted of a cluster of several transactional databases
and high-dimensional data warehouses which were updated in real time. In our proposed
system, the analytics were spatial, temporal, and textual, the results of which were suitably presented pictorially, as described next.
Figure 4.2. Our Real-Time Digital Flu Surveillance Website [16]. The
‘Daily Flu Activity’ chart was an output of the temporal analysis and
showed the volume changes of tweets mentioning the word ‘flu’ over time.
The dramatic increase of flu tweet volume from Jan. 6 to Jan. 12 coin-
cided with the dates when the major U.S. newspapers reported Boston Flu
Emergency [21] and deaths of four children from the AH3N2 influenza out-
break [20]. The 'U.S. Flu Activity Map' was an output of the geographical
analysis and showed the weighted percentage of tweet volumes mentioning
‘flu’ by states. The level of flu activity was differentiated by different colors
for an easy comparison of U.S. regional flu epidemic.
The goal of geographical analysis was to track disease spread in U.S. states by measuring
the volume of flu/cancer tweets generated in the region. For our experiments, we used
users’ home locations in their Twitter profiles. The dataset for geographic analysis was
all users who mentioned 'flu' or 'cancer' and had valid U.S. state information (e.g.,
'Evanston, IL', 'somewhere in NY') in their home location fields. We excluded tweets
generated from outside the U.S. (i.e., tweets from foreign countries) and those with invalid
location information (e.g., ‘travelling’, ‘Wherever the wind blows me’). In our flu dataset,
there were 458,828 users with valid U.S. state information, and in our cancer dataset,
there were 193,797 users with valid U.S. state information. The U.S. Flu Activity Map
is shown in Figure 4.2. The tweet volume mentioning 'flu' generated in each state was
weighted by state population to produce the percentages shown on the map.
The goal of temporal analysis was to track the volume changes of tweets mentioning the disease keywords over time.
Disease Daily Activity Timeline. As shown in Figure 4.2, the Daily Flu Activity chart
shows the tweet volume changes of flu-related tweets over a three-month period from
January through March 2013. The data for the flu/cancer timeline is created by counting the
number of tweets mentioning ‘flu’ or ‘cancer’ generated daily. Our assumption was that
people would talk more about ‘flu’ when they themselves or people around them (e.g.,
family or friends) had flu symptoms and there would be more frequent news feeds when
the epidemic was widespread. Achrekar et al. [22] reported that the volume of flu-related
tweets was highly correlated with the number of reported ILI cases by the CDC. In the flu
timeline, the number of flu related tweets started increasing on January 6 and reached its
peak on January 12, which coincides with the date when The Huffington Post reported
the death of four children from the outbreak of AH3N2 influenza [20]. This showed how
our temporal analysis effectively reflected the wide spread of the epidemic.
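The daily counting behind these timelines is straightforward. The sketch below counts tweets mentioning ‘flu’ per calendar day; the `(created_at, text)` pair format is a hypothetical simplification of the raw Twitter payload, not the thesis code:

```python
from collections import Counter
from datetime import datetime

def daily_flu_counts(tweets):
    """Count tweets mentioning 'flu' per calendar day.

    `tweets` is assumed to be an iterable of (created_at, text) pairs,
    where created_at is an ISO timestamp string -- a simplification of
    the raw Twitter API payload.
    """
    counts = Counter()
    for created_at, text in tweets:
        if "flu" in text.lower():
            day = datetime.fromisoformat(created_at).date()
            counts[day] += 1
    return counts

tweets = [
    ("2013-01-06T09:00:00", "I think I caught the flu"),
    ("2013-01-06T12:30:00", "Flu season is the worst"),
    ("2013-01-07T08:15:00", "Feeling fine today"),
]
timeline = daily_flu_counts(tweets)
```

Plotting these per-day counts over January through March 2013 yields a chart of the kind shown in the Daily Flu Activity timeline.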
Figure 4.3. Flu Symptoms Timeline. The timeline displays tweet volume
changes mentioning different flu symptoms from January through March
2013. ‘Cough’ (green line) and ‘fever’ (dark orange line) reach their highest
level in mid-January and decrease as the actual national ILI level reported by the CDC
decreases.
Types, Symptoms, Treatments Timelines. We not only tracked the overall flu and
cancer activities, but also monitored disease types, symptoms, and treatments over time.
Figure 4.3 shows the daily tweet volume changes for various flu symptoms. From the
timeline chart, we could easily tell the types and levels of flu symptoms in the general
population at a specific point in time. Cough and fever were the two most dominant
symptoms throughout the flu season, and headache and sore throat were the next two
most common flu symptoms. The actual U.S. national influenza activity level (percentage
weighted Influenza-like Illness by the CDC) was plotted as red squares for reference. Tweet
volumes mentioning flu symptoms reached their highest point around mid-January and
decreased as the actual flu activity level from the CDC decreased.
In the text analysis, we derived further health insights by examining the content of the tweets.
We were interested in investigating the popularity of terms used in three categories: (1)
disease types, (2) symptoms, and (3) treatments, and created a keyword list for each category.
For example, the keyword list for cancer types included breast cancer, lung cancer,
skin cancer, brain cancer, etc.; the keyword list for cancer symptoms included lump,
cough, fatigue, weight loss, etc.; and the keyword list for cancer treatments included
surgery, radiation, chemotherapy, Emend, Xeloda, etc. We also had similar keyword lists
for ‘flu’. For ‘flu’, we had 9 flu types, 15 symptoms, and 31 treatments. For ‘cancer’, we
had 58 cancer types, 21 symptoms, and 63 treatments. Figures 4.4, 4.5, and 4.6 show the
keyword lists.
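Counting keyword popularity per category is a simple scan over the tweets; the keyword lists below are abbreviated stand-ins for the full lists in Figures 4.4-4.6:

```python
# Hypothetical, abbreviated keyword lists; the full lists (e.g., 15 flu
# symptoms and 31 treatments) appear in Figures 4.4-4.6.
FLU_KEYWORDS = {
    "symptoms": ["cough", "fever", "headache", "sore throat"],
    "treatments": ["tamiflu", "vaccine"],
}

def keyword_popularity(tweets, keyword_lists):
    """Count how many tweets mention each keyword, grouped by category."""
    counts = {cat: {kw: 0 for kw in kws} for cat, kws in keyword_lists.items()}
    for text in tweets:
        lowered = text.lower()
        for cat, kws in keyword_lists.items():
            for kw in kws:
                if kw in lowered:
                    counts[cat][kw] += 1
    return counts

tweets = ["Awful cough and fever today", "Got my flu vaccine", "Cough again"]
popularity = keyword_popularity(tweets, FLU_KEYWORDS)
```

Running the same scan per day produces the per-symptom timelines of Figure 4.3.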
We also analyzed the most frequent words in tweets mentioning each disease name. After tokenizing tweet texts and removing all stop words, we counted the number of
occurrences of each unique word. Our flu dataset (6,097,406 tweets) consisted of 83,896,915
words and 4,001,445 unique words. Figure 4.7 shows the top 20 most frequent words in the flu dataset.
4.3. Summary
We built a real-time disease surveillance system that used Twitter data to automati-
cally track flu and cancer activities. The experiments showed that our disease detection
system could map U.S. regional influenza and cancer activity levels in near real time, and
discover and compare the popularity of terms related to flu/cancer types, symptoms, and treatments.
The system could also effectively track daily flu/cancer activities and the volume changes
of tweets mentioning disease related terms over time. All of the output data was visualized
as interactive maps, pie charts, and time series graphs on our project websites [15][16].
Our system is highly scalable and can be easily extended to track other diseases. Because
the traditional high-cost disease surveillance systems that collect public health data from
healthcare providers involve significant reporting delays, our system can serve as a timely, low-cost complement.
CHAPTER 5
Forecasting Influenza Levels Using Real-Time Social Media Streams
5.1. Introduction
Seasonal influenza is an acute viral infection that can cause severe illnesses and com-
plications. For instance, the annual epidemics cause about 250,000 to 500,000 deaths
worldwide. Centers for Disease Control and Prevention (CDC) reported 105 pediatric
deaths due to influenza during 2012-2013 flu season1. Monitoring of disease activity en-
ables an early detection of disease outbreaks, which will facilitate faster communication
between health agencies and the public, thereby providing more time to prepare a re-
sponse. Disease surveillance helps minimize the impact of a pandemic and enables better
resource allocation. The traditional influenza surveillance system by CDC reports weekly
national and regional Influenza-Like Illness (ILI) physician visit data collected from sen-
tinel medical practices2. This data is updated once a week and there is typically a two-week
time lag before the data is published. Furthermore, the published data is revised
retrospectively as additional reports arrive. Google Flu Trends was a service
that used flu-related online search engine query data to estimate the current flu activity
with a one-day reporting lag, 1-2 weeks ahead of CDC, and its estimation had been known
1 http://www.cdc.gov/flu/spotlights/children-flu-deaths.htm
2 http://www.cdc.gov/flu
to be reasonably accurate for the most part. However, in February 2013, an article titled
“When Google got flu wrong” [35] reported Google Flu Trends’s over-estimation of the peak
of U.S. flu activity, which was almost double that of CDC’s observations.
During the last decade, the number of internet and social networking site users has
dramatically increased. People share ideas, events, interests and their life stories over the
internet. As of January 2017, Twitter has 100 million daily active users and 5 million
tweets are generated per day3. Experiences and opinions on various topics including
personal health concerns, symptoms and treatments are shared on Twitter. Mining such
publicly available health related data potentially provides valuable healthcare insights.
Furthermore, the increasing number of users that access social media platforms on their
mobile devices makes social media data an invaluable source of real-time information.
In this chapter, we proposed a model that (1) predicted future influenza activities,
(2) provided more accurate real-time assessment than before, and (3) combined real-
time social media data streams and CDC historical datasets for predictive models to
accomplish accurate predictions. The results showed that our model using multilayer
perceptron with back propagation on a large-scale Twitter data could forecast current
and future flu activities with high accuracy. The goal of our work was to predict expected
influenza activity a week or more ahead of time so that it could be used for preparedness
and resource planning; to this end, we exploited social media communication for the prediction. This work was published in [5].
3 https://www.omnicoreagency.com/twitter-statistics/
For an early detection of disease outbreaks, researchers had used different statistical
models on a variety of data sources. Pharmaceutical sales data [72] and telephone triage [47] had been used for surveillance of
ILI. Christakis et al. [43] studied whether monitoring of social friends could provide early
detection of flu outbreaks. Web search queries data had been used for influenza surveil-
lance [48, 55, 84, 104, 50, 93, 83]. Ginsberg et al. [50] used flu-related Google search
query data to estimate current flu activity, and the near real-time estimation was reported
on Google Flu Trends (GFT) website4. Researchers had used GFT data to build an early
detection system for flu epidemics [83, 93]. Shaman et al. [93] used GFT data, which
was then recursively used to optimize a population-based mathematical model that
predicted flu activity. Pervaiz et al. [83] developed FluBreaks5, an early warning system for flu epidemics based on GFT data.
The use of social networking sites for public health surveillance had been steadily
increasing in the past few years [37]. Most disease surveillance work using social media
data focused on Twitter. A unique feature of Twitter is that messages propagate
in real time. Many had used Twitter data to predict various real-world outcomes [89,
26, 32].
For current estimation of influenza activity, Signorini et al. [95] applied a support vector
regression algorithm to the Twitter stream generated during the influenza A H1N1 pandemic
to track disease activity as well as measures related to public sentiment, and Achrekar et al. [22] used an auto-regression with exogenous inputs
4 http://www.google.org/flutrends
5 http://www.newt.itu.edu.pk/flubreaks
(ARX) model on Twitter data. In our previous work, we built a real-time disease surveil-
lance website that tracked U.S. regional and temporal flu activities including popularity
of terms related to flu types, symptoms, and treatments [2, 3]. Aramaki et al. [24] pro-
posed a Twitter-based influenza epidemics detection method that used natural language
processing (NLP) to filter out negative influenza tweets. Chew et al. [41] analyzed con-
tent and sentiment of tweets generated during the 2009 H1N1 outbreak and showed the
potential and feasibility of using social media to conduct infodemiology studies for public
health.
Paul and Dredze [81] applied the Ailment Topic Aspect Model to track illnesses over time,
localize illnesses by geographic region, and analyze symptoms and medication usage, and showed the broad applicability of
Twitter data for public health research. Li [69] proposed the Flu Markov Network (Flu-MN)
for flu activity prediction. Lampos et al. [65] proposed an automated tool that tracked
ILI in the United Kingdom using a regression model and Bolasso, the bootstrapped version
of LASSO, for feature extraction from Twitter data. Lamb et al. [63] classified tweets
into different categories to distinguish those that reported infections versus those that expressed
concerns about flu, and tweets about authors versus tweets about others, in an attempt
to improve the surveillance signal. Other works analyzed properties of tweets [57] and ran real-time spatio-temporal analysis of West Nile virus using Twitter
data [61]. Sugumaran and Voss advised integrating existing epidemic systems, those
that used crowd-sourcing, news media (e.g., GPHIN, MedISys), mobile/sensor network,
and real-time social media intelligence, for an improved early disease outbreak system [98].
Chakraborty et al. [38] combined social indicators and physical indicators and used a machine learning approach for flu forecasting.
Retrospective analysis and current estimates are important as they can describe the
observed trends. However, further prediction of future flu levels can represent a big leap
because such predictions provide actionable insights for public health that can be used for
prevention and resource planning. In this chapter, we proposed a system that not only estimated current flu activity more accurately, but
also forecasted future influenza activities a week in advance beyond the current week
using aggregated ILI data by CDC and real-time Twitter data. The results showed that
our proposed model using multilayer perceptron with back-propagation algorithm could
forecast both current and future influenza activities with high accuracy.
5.3. Method
5.3.1. Dataset
We continuously downloaded publicly available tweets that mentioned ‘flu’ using Twitter
Streaming API6. The dataset used in this paper consisted of 20 million tweets generated
between December 2012 and May 2014. 71 weeks’ data (from week 1, 2013 until week 19,
2014) were used to build the model. Disambiguation of tweets was performed using text
analysis techniques to understand if a tweet was about a person talking about his/her own
flu or about someone else’s or if there were any mentions of common symptoms. Table 5.1
6 https://dev.twitter.com/docs/streaming-apis
[Figure 5.1. System overview: flu-related tweets are disambiguated, filtered, and analyzed (network analysis), then aligned with CDC ILI data collected from sentinel medical practices to produce the weekly flu activity predictions.]
lists examples of flu-related tweets. In the category column, user indicates that the tweet
is about the Twitter user being sick with flu, someone else indicates that the tweet is
about someone else (friends, family, etc.) being sick with flu, and symptom indicates
that the tweet describes one’s flu symptoms. The data was filtered to remove tweets that may
contain product advertisements (or links to websites), and repeated tweets (e.g., spam) were identified and removed using network analysis. The filtered data was then processed as follows:
• Smoothing: We took a 7-day moving average of the daily tweet volume, a
technique often used in financial data analysis (e.g., of stock prices), to identify
the long-term flu activity trend by smoothing out the fluctuations and noise in the data.
• Weekly counts and alignment: Weekly Twitter data was then computed
by summing smoothed daily tweet volumes from Sunday through Saturday. The
dates for weekly Twitter data were aligned with dates in CDC weekly surveillance
reports so that analysis and predictions could be validated with CDC reports.
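These two preprocessing steps can be sketched as follows; the function names and the sample series are illustrative, not the thesis code. The weekly sum assumes the series starts on a Sunday so each 7-day slice is one CDC-aligned week:

```python
def moving_average(daily_counts, window=7):
    """Trailing 7-day moving average to smooth daily tweet volumes."""
    smoothed = []
    for i in range(len(daily_counts)):
        lo = max(0, i - window + 1)          # shorter window at the start
        chunk = daily_counts[lo : i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

def weekly_volume(smoothed, days_per_week=7):
    """Sum smoothed daily volumes Sunday-Saturday (series assumed to
    start on a Sunday, aligning each slice with a CDC reporting week)."""
    return [
        sum(smoothed[i : i + days_per_week])
        for i in range(0, len(smoothed), days_per_week)
    ]

daily = [10, 12, 11, 15, 14, 13, 16, 20, 22, 21, 25, 24, 23, 26]
weekly = weekly_volume(moving_average(daily))
```

Each element of `weekly` then becomes one data point of the weekly Twitter feature aligned with the CDC surveillance reports.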
In order to perform predictive modeling, features from the data were defined and extracted
as described below. Figure 5.2 depicts the data available at the end of week t. Wt denotes
Figure 5.2. Data available at current week t. At the end of week t, all flu-
related Twitter data collected during current week t and prior are available.
At time t, the past two weeks’ (Wt−1 and Wt ) CDC data is not available, as
CDC’s collection, retrospective analysis, and reporting take two weeks.
Table 5.2. CDC and Twitter features used in flu prediction model.
Notation Description
CDC-4-3-2 CDC ILI Data for Wt−4 , Wt−3 , Wt−2
CDC-3-2 CDC ILI Data for Wt−3 , Wt−2
CDC-2 CDC ILI Data for Wt−2
Twitter-4-3-2-1-0 Twitter Data for Wt−4 , Wt−3 , Wt−2 , Wt−1 , Wt
Twitter-3-2-1-0 Twitter Data for Wt−3 , Wt−2 , Wt−1 , Wt
Twitter-2-1-0 Twitter Data for Wt−2 , Wt−1 , Wt
Twitter-1-0 Twitter Data for Wt−1 , Wt
Twitter-0 Twitter Data for Wt
the current week and any time window beyond this represents the future. Wt−n denotes n
week(s) prior to the current week, and Wt+n denotes n week(s) after the current week. Each week
starts on Sunday and ends on Saturday to align with CDC weekly data. CDC data for
current week, Wt , and the week before, Wt−1 , is not available due to the time it takes to
collect patient data from the sentinel practices. The latest available CDC data is the weekly data for Wt−2 .
Since we were able to download publicly available tweets in real time, we had all
Twitter data generated during Wt . We used the most recent 5 weeks’ data for both CDC
and Twitter (shown in Table 5.2) as features of our predictive model to find the best
features for influenza prediction. The model was trained and validated using 10-fold cross
validation on 71 weeks’ data. As shown in Table 5.3, the best feature for the current
flu level forecast model was CDC-4-3-2 Twitter-4-3-2-1-0 (latest 3 weeks’ CDC
plus latest 5 weeks’ Twitter data) with a correlation coefficient of 0.9525, a +2.93%
performance improvement over feature CDC-4-3-2 (latest 3 weeks’ CDC data). The best
feature for the 1-week ahead prediction model was CDC-3-2 Twitter-4-3-2-1-0, which resulted
in a correlation coefficient of 0.9268, a +6.37% improvement over CDC-3-2. This clearly
showed that adding Twitter data significantly improved the performance of both current
and future flu level forecasts compared to that using only past CDC data.
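The lagged feature sets of Table 5.2 can be assembled with a small helper; `build_features` is a hypothetical name, and the weekly series below are made up for illustration. Lag n selects the value for week t−n, so `cdc_lags=(3, 2)` with `twitter_lags=(4, 3, 2, 1, 0)` reproduces the CDC-3-2 Twitter-4-3-2-1-0 feature set:

```python
def build_features(cdc, twitter, t, cdc_lags=(3, 2), twitter_lags=(4, 3, 2, 1, 0)):
    """Assemble one training example for current week t.

    `cdc` and `twitter` are lists of weekly values indexed by week number;
    lag n means week t-n.
    """
    features = [cdc[t - n] for n in cdc_lags]
    features += [twitter[t - n] for n in twitter_lags]
    return features

cdc = [1.0, 1.2, 1.5, 2.1, 2.8, 3.5]        # % weighted ILI per week
twitter = [100, 130, 170, 260, 330, 410]    # weekly smoothed tweet volume
x = build_features(cdc, twitter, t=5)       # features for week t = 5
```

Passing `cdc_lags=(4, 3, 2)` instead yields the CDC-4-3-2 Twitter-4-3-2-1-0 set used by the current-week forecast model.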
The proposed model had two parts. The first estimated current flu activity in terms of
percentage of ILI-related physicians visit (2 weeks ahead of CDC data). The second part
was forecasting future influenza activity a week into the future (3 weeks ahead of CDC
data). We used a multilayer perceptron (MLP) with back propagation as it had the best
performance in forecasting both current and future influenza activities. In our experiments, we
used a 3-layer MLP with 4 activation units in the hidden layer. The network structure for our model is shown in Figure 5.3.
5.4. Results
Tables 5.4 and 5.5 show how the performance of the current and 1-week ahead forecast
models changed with different values of the learning rate and with a varying number of hidden layers
and units in each hidden layer, respectively. In the notation “A-B”, A indicates the number
of activation units in the first hidden layer (layer 2) and B indicates the number of activation
units in the second hidden layer (layer 3). Both the current and the 1-week ahead forecast
models achieved the best performance using learning rate λ = 0.2 and a 3-layer multilayer
[Figure 5.3. Network structure of the flu prediction model: inputs CDC_Wt-4, CDC_Wt-3, CDC_Wt-2 and Tweets_Wt-4, Tweets_Wt-3, Tweets_Wt-2, Tweets_Wt-1, Tweets_Wt; output %WEIGHTED_ILI_Wt.]
perceptron structure (input layer, 1 hidden layer, output layer) with 4 activation units in the hidden layer.
Our current flu forecast model used CDC-4-3-2 Twitter-4-3-2-1-0 (i.e., all currently
available CDC and Twitter data generated in recent 5 weeks) as features because it gave
the highest correlation of 0.9525 when the model was trained and validated using 10-fold
cross validation on 71 weeks’ data. Although our Twitter dataset had been collected for
1.5 years, each weekly data made only one data point for the weekly flu activity forecast
model. To best utilize the number of available data points, we built the initial model
using the first one year data (52 data points for year 2013) with 10-fold cross validation.
Then, each week, we incrementally built a new model with all available data points. For
example, a new model was trained using 52 data points (week 1, 2013 – week 52, 2013) to
make current flu level prediction for week 1, 2014. Then a newer model was built again
using 53 data points (week 1, 2013 – week 1, 2014) to make the current prediction for week
2, 2014. As we continued to collect more Twitter data, the model would be trained on
a larger data set and therefore be more robust.
[Figure 5.4. Time-series comparison of our flu activity predictions against the CDC % weighted Influenza-Like Illness and Google Flu Trends over 2012 week 1 through 2014 week 19; panel (a) shows current U.S. influenza activity.]
Figure 5.4 is a time-series graph that compares our flu activity prediction (red line)
against the actual CDC %ILI (blue line) and Google Flu Trends (GFT) data [50] (green line).
The earliest prediction by our model was for the first week of 2013 because we started
collecting flu-related Twitter data in late 2012. Both our prediction (Fig. 5.4(a)) and GFT data were available two weeks
earlier than the official CDC ILI report. Our model was fitted on 52 weeks’ data (week 1,
2013 – week 52, 2013) with a correlation of 0.9522 and a mean absolute error (MAE) of
0.2383, and was further validated on 19 previously unseen weekly data (week 1, 2014 –
week 19, 2014) with a correlation of 0.929 and MAE of 0.493. Our prediction did as well
or better than the GFT data at most data points, and aligned very well with the CDC
ILI data. Furthermore, our prediction performed significantly better than GFT during
January 2013 when GFT’s algorithm significantly overestimated peak flu levels [35].
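The weekly incremental retraining scheme described above can be sketched as follows; ordinary least squares stands in for the MLP so the sketch stays self-contained, and all data is synthetic. The structure mirrors the text: fit on all weeks observed so far, predict the next week, then refit once that week's data arrives:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((71, 7))                 # 71 weeks x 7 features (CDC + Twitter)
true_w = rng.random(7)
y = X @ true_w                          # synthetic %ILI targets

predictions = []
for week in range(52, 71):              # the 19 held-out weeks of 2014
    # Retrain on every data point observed so far (weeks 0..week-1),
    # then predict the current week.
    w, *_ = np.linalg.lstsq(X[:week], y[:week], rcond=None)
    predictions.append(float(X[week] @ w))
```

Because the synthetic targets are exactly linear in the features, the stand-in model recovers them almost perfectly; the thesis model replaces the least-squares fit with the MLP of Figure 5.3.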
Our 1-week ahead flu forecast model used CDC-3-2 Twitter-4-3-2-1-0 as features. This
feature set provided the highest correlation of 0.9268 on the model trained and validated
using 10-fold cross validation on 71 weeks’ data, which was higher than the correlation
of 0.8952 obtained by using only CDC-3-2. Here, too, adding Twitter data improved
the model performance. An initial model was built using the first one-year data and a
newer model was incrementally rebuilt in the following weeks (in a manner similar to how our
current flu forecast model was built). Our 1-week ahead forecast data (Fig. 5.4(b)) was
available 3 weeks ahead of the official CDC ILI report and 1 week ahead of GFT data. The
model was fitted using 52 data points (week 1, 2013 - week 52, 2013) and incrementally
rebuilt using all available data (including the new weekly data collected during the current
week) thereafter. The final model was validated by measuring a correlation between the
CDC weekly percentage weighted ILI and that predicted by our model on 19 additional
previously unseen weekly data points (week 1, 2014 through week 19, 2014). A correlation
of 0.895 and MAE of 0.3846 were obtained on the training data and a correlation of 0.71
and MAE of 0.662 were obtained on the previously unseen test data. These results were
very good considering our forecast data was available 3 weeks faster than the official CDC
data.
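The two metrics reported throughout this section, Pearson correlation and mean absolute error, can be computed directly; the series below are made-up values for illustration:

```python
def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sum((x - ma) ** 2 for x in a) ** 0.5
    sb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (sa * sb)

def mae(a, b):
    """Mean absolute error between predictions and ground truth."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

cdc_ili = [1.2, 2.0, 3.1, 4.0, 3.2, 2.1]    # hypothetical CDC %ILI values
forecast = [1.0, 2.2, 3.0, 4.3, 3.1, 2.4]   # hypothetical model forecasts
r, err = pearson_r(cdc_ili, forecast), mae(cdc_ili, forecast)
```

These are the same quantities behind the reported 0.71 correlation and 0.662 MAE on the unseen 2014 test weeks.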
5.5. Summary
We built a flu forecasting model that combined CDC historical data with a large-scale social media stream. Adding recent flu-related Twitter data as features
improved the model’s performance for both current and future forecasts. Our proposed
model could predict current and future influenza activities with high accuracy 2-3 weeks
faster than the traditional flu surveillance system could. The performance for the current
prediction was comparable to or better than GFT (notably in January 2013). We believe
these results present a very important step not only in accurately forecasting future flu activity
for prevention and resource planning, but also in demonstrating a technique that
can combine social media, an unstructured communication data source, with observational data for
prediction.
CHAPTER 6
Medical Concept Normalization for Online User-Generated Texts
6.1. Introduction
On social media and online health communities, people often share their experiences
and opinions on various health topics including personal health issues and symptoms.
In particular, on medical forums, consumers ask health-related questions, write reviews
on medications and describe negative side effects they experience while taking a drug.
Moreover, patients and their families can get emotional support by sharing their stories
of overcoming illnesses.
Medical concept normalization is the task of mapping phrases in user-generated health text
to formal concepts in standard medical ontologies such as the Unified Medical Language System (UMLS) [71] via concept unique identifiers (CUIs).
This task has many applications for improving patient care, such as: 1) understanding the
health conditions that consumers describe in their own words, and 2) detection of patients who need immediate attention and medical support (e.g., people with
severe symptoms), by linking consumer language to the terminology used in clinical reports.
While consumers describe their health conditions in colloquial language, clinical knowl-
edge sources such as biomedical literature present medical terms in scientific language.
Table 6.1. Medical concepts in UMLS and example social media phrases
that describe the medical concept
This gap in the use of languages between patients/consumers and clinicians requires map-
ping of one to the other. In order to generate solutions to a given medical problem (e.g.,
retrieving relevant clinical knowledge), consumer phrases must first be mapped to formal medical concepts; once
the solution is generated, it needs to be translated back to colloquial language for users
to easily understand.
Table 6.1 shows examples of user-generated texts from social media that describe
medical concepts. The labels in the top row are medical concepts from the standard
medical ontologies and the phrases in the same column denote example phrases from social
media that describe the concept. The examples illustrate well the characteristics of
health-related user-generated content on social media. As can be seen in the table, the challenges for medical concept normalization
include: 1) alternative descriptions for health conditions in colloquial language (e.g., ‘sore
and stiff ankles’, ‘terrible pain in my ankles’, ‘ankles ache so bad’ → ankle pain; ‘trouble
sleeping’, ‘cannot sleep’, ‘hard time sleeping’ → difficulty sleeping), and 2) no overlap
of terms between colloquial language and scientific/medical terms describing the same
health condition (e.g., ‘couldn’t remember’ → memory impairment, ‘sight loss’ → visual
impairment). In the latter case, basic string matching approaches that do not understand the semantics of the text will fail.
In this work [6], we aimed to address the aforementioned challenges using deep learning-
based architectures and studied the impact of different types of input data used to build
neural embeddings. Our main contributions are:
• We investigated the use of various domain-specific text data to build neural embeddings for medical concept normalization.
• We demonstrated that two deep learning models (CNN and RNN) could better
predict the medical concepts when we used neural embeddings trained on domain-
specific clinical texts compared to those trained on a larger general domain text
corpus.
• Our best results presented the new state-of-the-art for the two benchmark datasets,
with accuracy improvements on the Twitter data set and of up to +21.28% on the AskAPatient data set.
This chapter is organized as follows. In section 6.2, we present related works on deep
neural network models, social media for healthcare, and medical concept normalization.
In section 6.3, we describe CNN and RNN models we used for concept normalization. In
section 6.4, we describe how we re-created the social media datasets and present the details
of text data from various clinical knowledge sources used to build neural embeddings. In
section 6.5, we present our experimental results, followed by conclusion in section 6.6.
6.2. Related Work
Social media had been widely used as a new medium for real-time information transmis-
sion in various domains including health to track volume of mentions of disease, drugs,
and symptoms [3, 4], predict influenza activities, and detect adverse drug events (ADE)
earlier than the traditional influenza or ADE surveillance systems that had significant
time delays in data processing [40, 68]. For automatic extraction of medical concepts
from social media, researchers had used machine learning approaches such as CRF (Condi-
tional Random Fields) and HMM (Hidden Markov Model) to extract phrases that describe
medical concepts (e.g., disease, drugs, symptoms) [79, 90], identify relationships between
two medical concepts (e.g., duration, frequency, dosage, route for a drug, indication, side
effects, etc.), and to classify texts into different categories (e.g., health vs. non-health).
Recurrent neural network (RNN) models have been shown to be very effective in many natural
language processing (NLP) tasks. Unlike traditional neural network models, RNNs use
sequential information. Hence they are well-suited for tasks such as machine transla-
tion, speech recognition, language modeling and image caption generation. Traditionally,
convolutional neural network (CNN) models have been widely used in image processing due to
their ability to learn task-relevant features. However, with the recently proposed word
embedding models (word2vec) by Mikolov et al. [76, 77], deep neural network models for
NLP tasks have gained popularity. Kim [59] showed that a simple one-layer CNN model
trained on top of pre-trained word vectors outperformed several state-of-the-art models for
text classification such as sentiment analysis and question classification. Lee et al. [68]
explored semi-supervised CNN models to detect adverse drug events in tweets and demon-
strated that neural word embeddings trained on a smaller domain-specific dataset helped
more than the one trained on a larger random dataset for ADE classification. Deep learning
models have also been shown to be highly effective in other healthcare tasks such as clinical
diagnostic inferencing [86] and clinical neural paraphrase generation [54, 85].
Earlier medical concept normalization systems relied on techniques such as exact
string matching, heuristic string matching, and rule-based text mapping to a set of predefined
variants of terms [88, 25, 74]. DNorm [67] is a state-of-the-art concept (disease
name) normalization system that is based on pairwise learning to rank that learns similari-
ties between mentions and concept names. Limsopatham et al. used a machine translation
approach in which a social media phrase is translated into a formal medical concept. More
recently, Limsopatham et al. [70] showed that simple deep learning models, convolutional
neural network (CNN) and recurrent neural network (RNN), with pre-trained word em-
beddings induced from a large collection of Google News (GNews) and BioMed Central
(BMC) articles improved the performance over previous state-of-the-art concept normal-
ization models, and reported that GNews was more effective than BMC for both CNN and RNN models.
Our work significantly improved on the results from Limsopatham et al. [70] by refining
their original datasets and leveraging neural embeddings of various health-related texts to
better learn the semantic characteristics of medical concepts, providing a new state-of-the-art for both benchmark datasets.
6.3. Deep Learning Models
In this section, we describe two deep learning models, convolutional neural network
(CNN) and recurrent neural network (RNN), we use for medical concept normalization.
[Figure 6.1. A simple CNN for phrase classification: an input matrix of word vectors, a convolutional layer with multiple filters, a pooling layer, and a final softmax classifier.]
CNN is a feed-forward neural network model that learns task-relevant semantic features
for text classification. Figure 6.1 depicts a simple CNN with an input layer, followed by
88
a convolutional layer with multiple filters, a pooling layer, and a final softmax classifier.
The input layer of the CNN is a phrase or sentence represented as a matrix. Each row of the
matrix is a k-dimensional word vector xi . Given a sequence of words x1:n , a filter w ∈ Rhk is applied
to a window of h words xi:i+h−1 to produce a feature
(6.1) ci = f (w · xi:i+h−1 + b)
where b is a bias and f is a nonlinear activation function. Each filter is applied to every
window of the input matrix to produce a feature map. Then the features
are passed to a fully connected softmax layer to output the most probable label [59].
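The convolution of Eq. 6.1 plus max-over-time pooling can be sketched with plain NumPy; dimensions are shrunk for illustration (k = 4 instead of 300), and tanh stands in for the unspecified nonlinearity f:

```python
import numpy as np

def conv_features(X, w, b, h):
    """Apply one filter per Eq. 6.1.

    X: n x k phrase matrix (one word vector per row);
    w: filter of shape (h*k,); returns the feature values c_i.
    """
    n, k = X.shape
    feats = []
    for i in range(n - h + 1):
        window = X[i : i + h].reshape(-1)       # x_{i:i+h-1} concatenated
        feats.append(np.tanh(w @ window + b))   # c_i = f(w . x + b)
    return np.array(feats)

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 4))    # 8-word phrase, 4-dim embeddings
w = rng.standard_normal(3 * 4)     # one filter spanning h = 3 words
c = conv_features(X, w, b=0.1, h=3)
pooled = c.max()                   # max-over-time pooling
```

In the full model many such filters run in parallel, and the pooled values feed the softmax classifier.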
For example, for the eight-word phrase ‘my feet feel like I have stone bruises’ using 300-
dimensional embeddings, the input to the CNN would be an 8 × 300 matrix, and the output would be a probability distribution over the medical concept labels.
RNNs are a family of artificial neural networks that use internal memory to process
variable-length sequential data. Figure 6.2 shows an unrolled RNN architecture, where
xt , yt , ht are the input, output, and hidden states at time step t, and W , U , V are model
parameters corresponding to input, hidden, and output layer weights shared across all
time steps.
[Figure 6.2. An unrolled RNN with input layer (x0 , x1 , . . . , xt ), hidden layer (h0 , h1 , . . . , ht ), and output layer (y0 , y1 , . . . , yt ), connected by the shared weights W , U , and V .]
The hidden state at each time step is computed as
(6.2) ht = f (W xt + U ht−1 ),
where ht−1 is the previous hidden state, xt is the current input, and f is an activation function (e.g., tanh).
Although the RNN is a powerful model for encoding sequences, it suffers from the vanishing
gradient problem when trying to learn long-range dependencies [28]. We used
a gated recurrent unit (GRU) [42], which is known to be a successful remedy to the
vanishing gradient problem. The hidden state of GRU ht can be formulated as follows:
zt = σ(W z xt + U z ht−1 )
rt = σ(W r xt + U r ht−1 )
kt = tanh(W xt + U (rt ⊙ ht−1 ))
(6.3) ht = (1 − zt ) ⊙ kt + zt ⊙ ht−1 ,
where ⊙ denotes element-wise multiplication. The GRU cell has two gates, an update gate zt and a reset gate rt ; kt is the candidate hidden
state, computed from the current input and the reset-gated previous state. zt and rt are computed using different weight parameters, where zt determines how
much of the old memory to keep, while rt determines how to combine the new input with the previous memory.
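A single GRU step following Eq. 6.3 can be written out with NumPy as a sanity check; the sizes and random weights here are arbitrary, not trained values:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, Wz, Uz, Wr, Ur, W, U):
    z_t = sigmoid(Wz @ x_t + Uz @ h_prev)        # update gate
    r_t = sigmoid(Wr @ x_t + Ur @ h_prev)        # reset gate
    k_t = np.tanh(W @ x_t + U @ (r_t * h_prev))  # candidate state
    return (1 - z_t) * k_t + z_t * h_prev        # Eq. 6.3 hidden state

rng = np.random.default_rng(0)
d_in, d_h = 4, 3
# Input-to-hidden matrices (d_h x d_in) alternate with hidden-to-hidden
# matrices (d_h x d_h), in the argument order Wz, Uz, Wr, Ur, W, U.
params = [rng.standard_normal((d_h, d_in)) if i % 2 == 0
          else rng.standard_normal((d_h, d_h)) for i in range(6)]

h = np.zeros(d_h)
for _ in range(5):                               # run a few steps
    h = gru_step(rng.standard_normal(d_in), h, *params)
```

Because the new state is a convex combination of the bounded candidate and the previous state, the hidden values stay in (−1, 1), which is part of what tames the vanishing-gradient behavior.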
6.4. Experimental Setup
6.4.1. Data
We used two data sets, TwADR-L (from Twitter) and AskAPatient, used by Limsopatham
et al. [70] for medical concept normalization1. TwADR-L was created by the authors of
[70], and the AskAPatient dataset was created by Karimi et al. [58] for ADR (adverse drug
reaction) research, from which the authors extracted the gold standard mappings of phrases to
medical concepts.
Table 6.2. Data Statistics after removing duplicates from the combined
training, validation, and test data
TwADR-L AskAPatient
# unique phrases 2,944 4,469
# unique labels 2,220 1,036
# unique phrase-label pairs 3,157 4,496
# phrases with multiple labels 173 26
Min # examples per label 1 1
Max # examples per label 36 141
Avg # examples per label 1.42 4.35
In the original dataset, the TwADR-L had 48,057 training, 1,256 validation and 1,427
test examples. The test set (all test samples from 10 folds combined) consisted of 765
unique phrases and 273 unique classes (or medical concepts). The AskAPatient dataset
contained 156,652 training, 7,926 validation, and 8,662 test examples. The entire test
set (all test samples from 10 folds combined) consisted of 3,749 unique phrases and 1,035
¹ Available at https://zenodo.org/record/55013#.WKXwdxIrLde
unique classes (medical concepts). The authors randomly split each dataset into ten equal
folds, ran 10-fold cross-validation, and reported the accuracy averaged across the ten folds.
We found that, in the original data set, many phrase-label pairs appeared multiple
times within the same training data file and also across the training and test data sets in
the same fold. In the AskAPatient data set, on average 35.82% of the test data overlapped
with training data in the same fold. In the Twitter (TwADR-L) dataset, on average
8.62% of the test set had an overlap with the training data in the same fold. Having
a large overlap between the training and the test data could potentially introduce bias
in the model and contribute to high accuracy. Therefore, to remove this bias, we further cleaned and recreated the training, validation, and test sets such that each phrase-label pair appeared only once in the entire dataset (either in the training, validation, or test set).
First, we combined all examples in training, validation and test data from the original
data set and then removed all duplicate phrase-label pairs (examples that had the same
phrase and label pair and appeared more than once in training/validation/test datasets).
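The de-duplication step above can be sketched with pandas; the toy rows below are illustrative (drawn from the 'sore'/'mad' examples discussed later), not the actual data files.

```python
import pandas as pd

# combine the original train/validation/test splits, then keep each
# (phrase, label) pair exactly once
train = pd.DataFrame({"phrase": ["sore", "mad", "sore"], "label": ["pain", "anger", "pain"]})
valid = pd.DataFrame({"phrase": ["mad"], "label": ["rage"]})
test  = pd.DataFrame({"phrase": ["sore"], "label": ["myalgia"]})

combined = pd.concat([train, valid, test], ignore_index=True)
unique_pairs = combined.drop_duplicates(subset=["phrase", "label"]).reset_index(drop=True)

# statistics of the kind reported in Table 6.2
n_pairs = len(unique_pairs)                                              # unique phrase-label pairs
n_multi = (unique_pairs.groupby("phrase")["label"].nunique() > 1).sum()  # phrases with multiple labels
```

On this toy input, 'sore'→'pain' appears twice and survives once, so four unique pairs remain and both phrases count as multi-label.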
Table 6.2 shows statistics of the new dataset after removing duplicates. The Twitter data
set had 3,157 unique phrase-label pairs and 2,220 unique labels (medical concepts) while
173 phrases had multiple labels (i.e., they were assigned to more than one label). Many concepts had only one example, and the concept with the largest number of examples had 36 phrases. On average, each concept had 1.42 examples. The AskAPatient data set
had 4,496 unique phrase-label pairs and 1,036 unique labels while 26 phrases had multiple
labels. Table 6.3 shows examples of phrases that had multiple labels. For example, ‘mad’
could be mapped to ‘anger’ or ‘rage’, and ‘sore’ could be mapped to ‘pain’ or ‘myalgia’.
Second, we removed all concepts that had fewer than five examples. The statistics of the final data are shown in Table 6.4. Third, we divided all examples without multiple labels into 10 random folds such that each unique phrase-label pair appeared once in one of the
10 test sets. We added the pairs with multiple labels to the training data. This final dataset was used in all subsequent experiments.
Table 6.4. Data Statistics after removing concepts that had fewer than five examples

                                  TwADR-L   AskAPatient
# unique phrases                      543         2,494
# unique labels                        65           228
# unique phrase-label pairs           617         1,427
# phrases with multiple labels        173            26
Min # examples per label                5             5
Max # examples per label               36            78
Avg # examples per label              9.5            11
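A minimal sketch of the filtering and fold-creation procedure described above (the label counts and fold count below are toy values; the thesis uses 10 folds and a five-example minimum):

```python
import random
from collections import Counter

def make_folds(pairs, min_examples=5, n_folds=10, seed=42):
    """Drop labels with fewer than `min_examples` pairs, then deal the
    remaining unique (phrase, label) pairs into `n_folds` random test folds."""
    counts = Counter(label for _, label in pairs)
    kept = [p for p in pairs if counts[p[1]] >= min_examples]
    rng = random.Random(seed)
    rng.shuffle(kept)
    # stride-slicing guarantees each pair lands in exactly one fold
    return [kept[i::n_folds] for i in range(n_folds)]

# toy data: label "a" has 6 examples, label "b" only 2 (so "b" is removed)
pairs = [(f"p{i}", "a") for i in range(6)] + [("q0", "b"), ("q1", "b")]
folds = make_folds(pairs, n_folds=3)
```

Because the folds partition the shuffled list, each unique pair appears in exactly one test fold, matching the splitting constraint described above.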
6.4.2. Unlabeled Text Data

In this section, we describe the different types of unlabeled text data we used for building neural embeddings.
6.4.2.1. Thesaurus (TH). For each word in the TwADR-L and AskAPatient datasets (both phrases and labels), we obtained the following six types of information from the Merriam-Webster thesaurus²: definition, example sentence, synonyms, related words, near antonyms, and antonyms. Figure 6.3 illustrates the information obtained for the word ‘sore’, the second-to-last example shown in Table 6.3. The definition of ‘sore’ included the label ‘pain’, and the list of synonyms also included ‘painful’ (an adjective form of the label ‘pain’). Therefore, the word embeddings built with the thesaurus would help the model map such phrases to the correct concepts.
Figure 6.4. Medical definition of the term ‘myalgia’ obtained from Merriam-
Webster Medical Dictionary.
² https://www.merriam-webster.com/thesaurus
6.4.2.2. Medical Dictionary (MD). We used the Merriam-Webster Medical Dictionary³, which contains 60,000 words and phrases used by healthcare professionals. It is also used in the National Library of Medicine’s consumer health website to help consumers with the spelling of medical words and the understanding of medical notes written by physicians⁴. For each unique word in the TwADR-L and AskAPatient datasets, we
obtained a medical definition (if present) using the Merriam-Webster medical dictionary API⁵. The dictionary contains clinical terms that may not be found in the thesaurus. We found that while the definitions for some terms were the same in both the thesaurus and the medical dictionary, for other terms either they used slightly different words/phrases or one or both did not have a definition at all. For example, the word ‘myalgia’ was in the
medical dictionary, but not in the thesaurus. As shown in Figure 6.4, we were able to
collect the definition for the word ‘myalgia’, a medical term that was not found in the
thesaurus.
6.4.2.3. Clinical Texts (CT). Clinical Texts is a collection of sentences from the following sources. ADReCS⁶ is an ADR ontology database that provides both standardization and hierarchical classification of ADR terms [36]. The database integrates ADR and drug information collected from
³ https://www.merriam-webster.com/medical
⁴ https://www.nlm.nih.gov/news/mplusdictionary03.html
⁵ https://www.dictionaryapi.com/products/api-medical-dictionary.htm
⁶ http://bioinf.xmu.edu.cn/ADReCS/
various public medical repositories like DailyMed⁷, MedDRA [34], SIDER2 [62], DrugBank⁸, PubChem⁹, and UMLS. It contains 6.7K unique ADR terms, 1,698 drug names, and 154K drug-ADR pairs. For each term in the ADReCS database, we collected its definition and synonyms. For example, the definition of the word ‘myalgia’ is ‘painful sensation
in the muscles’ and its synonyms are ‘myalga’, ‘myaigia’, ‘soreness’, ‘muscle pain’, ‘muscle
ache’, etc.
We collected sentences from Wikipedia articles that were under the category of clinical medicine¹⁰. We also collected 4,271 sentences from PubMed articles from the adverse drug events benchmark corpus [52].
Medical Concept to Lay Term Dictionaries: We used two medical-to-lay-term dictionaries¹¹ ¹² that contain medical terms and their definitions described in lay language. For example, the medical
term ‘anesthesia’ is defined in lay language as ‘loss of sensation or feeling’, the term
‘cephalalgia’ as ‘headache’, and the term ‘dyspnea’ as ‘hard to breathe’ or ‘short of breath’.
From these dictionaries, we generated sentences (e.g., ‘Anesthesia refers to loss of sensation
or feeling’, ‘cephalalgia means headache’) by combining a term and its definition with a
connecting phrase randomly chosen from a small preselected set (e.g., stands for, refers
to, indicates, means, etc.). We created a total of 1,556 sentences from these sources.
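The sentence-generation step can be sketched as follows; the connector list mirrors the examples above, and `make_sentence` is our own illustrative helper, not code from the thesis.

```python
import random

# connecting phrases from the small preselected set described in the text
CONNECTORS = ["stands for", "refers to", "indicates", "means"]
rng = random.Random(0)

def make_sentence(term, definition):
    """Join a medical term and its lay definition with a random connector."""
    return f"{term.capitalize()} {rng.choice(CONNECTORS)} {definition}."

pairs = [("anesthesia", "loss of sensation or feeling"),
         ("cephalalgia", "headache"),
         ("dyspnea", "hard to breathe")]
sentences = [make_sentence(t, d) for t, d in pairs]
```

Randomizing the connector adds mild lexical variety to the generated corpus so the embedding model does not overfit a single template.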
⁷ https://dailymed.nlm.nih.gov/dailymed/
⁸ https://www.drugbank.ca/
⁹ https://pubchem.ncbi.nlm.nih.gov/
¹⁰ https://en.wikipedia.org/wiki/Category:Clinical_medicine
¹¹ http://gsr.lau.edu.lb/irb/forms/medical_lay_terms.pdf
¹² https://depts.washington.edu/respcare/public/info/Plain_Language_Thesaurus_for_Health_Communications.pdf
We also included sentences that defined medical terms in the UMLS Metathesaurus [31], a large biomedical thesaurus consisting of millions of medical concepts and used by professionals for patient care and public health.
Table 6.5. Medical concepts and similar words based on cosine similarity
obtained from word embeddings built with different health-related text cor-
pora.
6.4.2.4. Health-Related Tweets (HT). We collected health-related tweets that mentioned 116 common diseases and symptoms (e.g., flu, depression, insomnia, diabetes, obesity, heart disease, anxiety disorder, etc.) using the Twitter streaming API¹³, which provided approximately 1% of all publicly available tweets. We converted all tweets to lowercase, and replaced hyperlinks, numerics, and Twitter screen names with special tokens.
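This preprocessing can be sketched with a few regular expressions; the token strings <url>, <user>, and <num> are our own choice, since the thesis does not name its special tokens.

```python
import re

def preprocess_tweet(text):
    """Lowercase a tweet and replace hyperlinks, @screen-names, and numbers
    with special tokens (token names here are illustrative)."""
    text = text.lower()
    text = re.sub(r"https?://\S+", "<url>", text)   # hyperlinks
    text = re.sub(r"@\w+", "<user>", text)          # Twitter screen names
    text = re.sub(r"\d+", "<num>", text)            # numerics
    return text

out = preprocess_tweet("Got the Flu :( 3 days in bed @doc_smith http://t.co/abc")
# → "got the flu :( <num> days in bed <user> <url>"
```

Replacing numbers last keeps the earlier substitutions (whose tokens contain no digits) intact.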
Table 6.5 shows medical concepts and examples of the top 20 similar words by cosine similarity based on the word embeddings built with each individual data source.
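The similar-word lookup behind Table 6.5 can be sketched with plain cosine similarity over toy vectors; in the thesis, the vectors come from word2vec models trained on each corpus, whereas here they are random stand-ins.

```python
import numpy as np

def most_similar(word, embeddings, topn=3):
    """Rank vocabulary words by cosine similarity to `word`'s vector."""
    v = embeddings[word]
    scores = {w: float(np.dot(v, u) / (np.linalg.norm(v) * np.linalg.norm(u)))
              for w, u in embeddings.items() if w != word}
    return sorted(scores, key=scores.get, reverse=True)[:topn]

rng = np.random.default_rng(1)
vocab = ["flu", "influenza", "fever", "laptop"]
emb = {w: rng.standard_normal(8) for w in vocab}
emb["influenza"] = emb["flu"] + 0.01 * rng.standard_normal(8)  # make these two nearly parallel

neighbors = most_similar("flu", emb)
```

Since 'influenza' was constructed as a tiny perturbation of 'flu', its cosine similarity is close to 1 and it ranks first, which is the behavior the table illustrates with real embeddings.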
6.5. Results
Table 6.6. Classification Accuracy (%) using 10-fold cross-validation (TH = thesaurus, MD = medical dictionary, CT = clinical texts, HT = health-related tweets; batch size = 50, number of epochs = 100, vector dimension = 300)
Table 6.6 shows the accuracy of the classification models using 10-fold cross-validation, averaged over the ten folds. The first two rows are our baseline models¹⁴ [70], where the CNN and
¹³ https://dev.twitter.com/streaming/public
¹⁴ Code available at https://github.com/nutli/concept_normalisation
RNN models use randomly generated embeddings (Rand) and publicly available pre-trained word embeddings generated from 100 billion words of Google News (GNews) using word2vec [77] as inputs. The next four rows (rows 3-6) present the performance of the same CNN and RNN architectures as the baseline models, but using word embeddings we built on top of the various clinical texts described in Section 6.4.2. The last row presents the performance when the models use word embeddings built using a combination of all four data sources as input. All experiments, including the baseline models, were trained and tested on the recreated dataset described in Section 6.4.1.
Among the individual datasets (TH, MD, CT, HT), the health-related tweets (HT) had the most significant impact on classification performance. Both the CNN and the RNN models performed comparably to (on the AskAPatient dataset) or better than (on the TwADR-L dataset) the best baseline models. When we combined all individual datasets, the classification accuracy improved substantially over all baseline models and all our individual-source models. Compared to the best baseline accuracy, the improvement was +21.17% on TwADR-L and +21.28% on AskAPatient (RNN); the improvement was also substantial for the CNN. For all models, we used the following hyperparameters: batch size = 50, number of epochs = 100, vector dimension = 300, number of neurons in the hidden layer = 100, dropout rate = 0.5, and a non-linear activation function.
Next, we conducted experiments to study the effect of removing one dataset from training. Table 6.7 presents the performance loss when each dataset is removed from the set of all
possible resources (TH + MD + CT + HT). Interestingly, each of the four data sources
appeared to be the most important for different deep learning models and datasets. The
performance dropped by 3.88% (from 19.46% to 15.58%) when clinical texts (CT) was
removed, indicating that CT is the most important feature for TwADR-L CNN among
the four individual features. For TwADR-L RNN, health-related tweets (HT) was the
most helpful feature, indicated by the performance drop of 2.76% when removed.
While the definitions from the medical dictionary (MD) contributed the most to the AskAPatient CNN model (with a 10.96% performance drop when removed), the definitions, synonyms, and antonyms from the thesaurus (TH) were the most significant feature for the AskAPatient RNN model (with a 2.08% performance drop when removed). These results indicate that text data from each healthcare domain helps the deep learning models learn clinical semantics for normalization. Word embeddings built with the larger dataset that combined texts from multiple healthcare domains contributed significantly to improving model performance across both the Twitter and AskAPatient datasets compared to embeddings built from a larger general-domain corpus like Google News.
Table 6.8 shows examples that our best model incorrectly predicted. The first column
shows example phrases of social media posts that describe medical conditions, the second and the third columns show the annotated CUIs (concept unique identifiers) and
corresponding medical concept descriptions, and the fourth and fifth columns show the
predicted CUIs and corresponding concept descriptions by our best model (TH + MD +
CT + HT). These examples are false positives based on the ground truth labels (i.e., the
predicted CUIs do not match the labeled CUIs). However, we can observe that, although
the CUIs are different, the social media phrases can actually be mapped to both predicted
and labeled concepts. For example, the predicted concept ‘decrease in appetite’ and the
label ‘loss of appetite’ have similar meanings, therefore predicting the phrase ‘not being
able to eat’ as the concept ‘decrease in appetite’ should be considered correct. While some phrases in the dataset have multiple labels, there are still many more that should be mapped to multiple concepts.
This suggests several future directions for designing a normalization system. First, it
is necessary to have a list of CUIs that represent similar medical concepts so that, when a
normalization system predicts a CUI, the mapping can automatically be associated with
other CUIs in the same set. Second, the normalization task should be cast as a multi-class multi-label classification problem, since each phrase can be mapped to multiple concepts (as shown in Tables 6.3 and 6.8) and each concept can have many social media phrases.
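The first direction could be realized as a relaxed accuracy metric, where a prediction counts as correct if it falls in the same CUI equivalence set as the gold label. A sketch follows; the CUI strings and groupings are made up for illustration and are not real UMLS identifiers.

```python
# Illustrative CUI equivalence sets; identifiers are invented, not real UMLS CUIs.
EQUIV_SETS = [
    {"C111", "C112"},  # e.g., 'loss of appetite' / 'decrease in appetite'
    {"C200"},          # a concept with no near-equivalents
]
CUI_TO_SET = {cui: i for i, s in enumerate(EQUIV_SETS) for cui in s}

def relaxed_match(pred, gold):
    """Exact match, or both CUIs belong to the same equivalence set."""
    return pred == gold or CUI_TO_SET.get(pred, -1) == CUI_TO_SET.get(gold, -2)

def relaxed_accuracy(preds, golds):
    return sum(relaxed_match(p, g) for p, g in zip(preds, golds)) / len(golds)

acc = relaxed_accuracy(["C112", "C200"], ["C111", "C300"])  # → 0.5
```

The distinct default values (-1 and -2) ensure two CUIs that are both missing from the map never spuriously match.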
6.6. Summary
In this work, we explored building neural word embeddings using unlabeled text data
from various clinical knowledge sources for medical concept normalization from user-
generated social media texts. We showed that two deep learning models (CNN and
RNN) could better predict the medical concepts when we used word embeddings built from various clinical-domain text corpora rather than a general text corpus. Our experiments showed that the proposed models with neural embeddings trained on the combined clinical data sources could improve the accuracy by up to 21.17% on the Twitter data set and up to 21.28% on the AskAPatient data set.
CHAPTER 7
Social media is an invaluable resource for mining healthcare insights. In this thesis, we presented intelligent systems we built using Twitter data for retrieving health-related information, monitoring and predicting disease activities, and normalizing medical concepts. We first designed models that classify trending Twitter posts into 18 general categories. Although both our approaches, bag-of-words and network-based, effectively classified topics with high accuracy, the network-based model using categories of similar topics was shown to achieve superior classification performance. This model could help users search for information in a specific domain such as health. We also
discussed our contributions towards building a real-time disease surveillance system using
spatial, temporal, and text mining on Twitter data. The proposed system could effectively track daily disease activities and map U.S. regional disease levels in near real-time.
Although our work focused on tracking three diseases (allergy, cancer, flu), the model could easily be adapted to track other diseases. We further built a neural network model that predicted current and future influenza activities with high accuracy by combining big real-time social media data with observed CDC data. Finally,
we proposed models that normalize social media phrases to standard medical terminologies in the Unified Medical Language System (UMLS). By training two deep learning models, a convolutional neural network (CNN) and a recurrent neural network (RNN), on various clinical knowledge sources, we were able to achieve significantly better performance than the baseline models.
Although some pioneering work has been done, there still remain many challenges in
mining social media to gain healthcare insights. Medical concept extraction - identifying
phrases that describe health conditions - is a challenging task due to many different
ways of describing the same condition and the colloquial language used in social media.
Generating novel models for medical concept extraction using advanced natural language
processing (NLP) techniques and deep learning would be an interesting research area for
future work. Such models would be helpful for automatic systems to understand users’ health issues or clinical questions accurately such that they can provide more relevant information. Another challenging direction is the detection of adverse drug events (ADE), negative side effects that occur as a result of medical treatment. In summary, this thesis demonstrated the feasibility of retrieving, monitoring, and predicting health-related information from real-time social media. However, there still remain many challenging problems related to mining user-generated content for healthcare insights. We hope that the techniques we proposed in this thesis can be used to advance future research in this area.
References
2011 IEEE 11th International Conference on, pages 251–258. IEEE, 2011.
[2] K. Lee, A. Agrawal, and A. Choudhary. Real-time digital flu surveillance using
twitter data. In The 2nd Workshop on Data Mining for Medicine and Healthcare,
2013.
[3] K. Lee, A. Agrawal, and A. Choudhary. Real-time disease surveillance using twitter
data: Demonstration on flu and cancer. In Proceedings of the 19th ACM SIGKDD
[4] K. Lee, A. Agrawal, and A. Choudhary. Mining social media streams to improve
Advances in Social Networks Analysis and Mining (ASONAM), pages 815–822, Aug
2015.
[5] K. Lee, A. Agrawal, and A. Choudhary. Forecasting influenza levels using real-
[6] K. Lee, S. A. Hasan, O. Farri, A. Choudhary, and A. Agrawal. Medical concept nor-
[7] T. Zhu, H. Gao, Y. Yang, K. Bu, Y. Chen, D. Downey, K. Lee, and A. N. Choudhary.
Beating the artificial chaos: Fighting osn spam using its own templates. IEEE/ACM
[8] H. Gao, Y. Yang, K. Bu, Y. Chen, D. Downey, K. Lee, and A. Choudhary. Spam
[10] A. Choudhary, W. Hendrix, K. Lee, D. Palsetia, and W.-K. Liao. Social media
[11] H. Gao, Y. Chen, K. Lee, D. Palsetia, and A. Choudhary. Towards online spam fil-
tering in social networks. In Proceedings of the 19th Annual Network and Distributed
and A. Choudhary. Ses: Sentiment elicitation system for social media data. In 2011
Dec 2011.
[13] H. Gao, Y. Chen, K. Lee, D. Palsetia, and A. Choudhary. Poster: Online spam
and Communications Security, CCS ’11, pages 769–772, New York, NY, USA, 2011.
ACM.
~kml649/allergy/.
~kml649/cancer/.
~kml649/flu/.
[18] Centers for Disease Control and Prevention, seasonal influenza (flu). http://www.
cdc.gov/flu, 2012.
[19] World of DTC Marketing.com, web first place people go for health information. but
[20] The Huffington Post, michigan flu season 2013: Four children die in
michigan-flu-season-2013-ah3n2_n_2458916.html, 2013.
[21] USA Today, 700 cases of flu prompt boston to declare emer-
gency. http://www.usatoday.com/story/news/nation/2013/01/09/
boston-declares-flu-emergency/1820975, 2013.
[22] H. Achrekar, A. Gandhe, R. Lazarus, S.-H. Yu, and B. Liu. Predicting flu trends us-
[24] E. Aramaki, S. Maskawa, and M. Morita. Twitter Catches the Flu: Detecting In-
the MetaMap program. Proceedings / AMIA ... Annual Symposium. AMIA Sympo-
[26] S. Asur and B. A. Huberman. Predicting the Future with Social Media. In Proceed-
[27] H. Becker, M. Naaman, and L. Gravano. Beyond trending topics: Real-world event
166, 1994.
[29] D. L. Blackwell, J. W. Lucas, and T. C. Clarke. Summary health statistics for u.s.
series/sr_10/sr10_260.pdf, 2013.
[30] B. Bloom, L. I. Jones, and G. Freeman. Summary health statistics for u.s. children:
sr_10/sr10_258.pdf, 2012.
[31] O. Bodenreider. The unified medical language system (umls): integrating biomedical
[32] J. Bollen and H. Mao. Twitter mood as a stock market predictor. Computer,
44(10):91–94, 2011.
[33] J. Bollen, H. Mao, and X. Zeng. Twitter mood predicts the stock market. Journal
[34] E. G. Brown, L. Wood, and S. Wood. The medical dictionary for regulatory activities
[35] D. Butler. When Google got flu wrong. Nature, 494(7436):155–156, Feb. 2013.
[36] M. Cai, Q. Xu, Y. Pan, W. Pan, N. Ji, Y. Li, H. Jin, K. Liu, and Z. Ji. Adrecs:
2015.
[37] D. Capurro, K. Cole, I. M. Echavarrı́a, J. Joe, T. Neogi, and M. A. Turner. The Use
of Social Networking Sites for Public Health Practice and Research: A Systematic
[39] L. Chen, H. Achrekar, B. Liu, and R. Lazarus. Vision: Towards real time epidemic
vigilance through online social networks: Introducing sneft – social network enabled
flu trends. In Proceedings of the 1st ACM Workshop on Mobile Cloud Computing
& Services: Social Networks and Beyond, MCS ’10, pages 4:1–4:5, New York,
gone viral: Syndromic surveillance of flu on twitter using temporal topic models.
[41] C. Chew and G. Eysenbach. Pandemics in the Age of Twitter: Content Analysis of
Tweets during the 2009 H1N1 Outbreak. PLoS ONE, 5(11):e14118, 11 2010.
[42] K. Cho, B. van Merrienboer, D. Bahdanau, and Y. Bengio. On the Properties of Neu-
[43] N. A. Christakis and J. H. Fowler. Social network sensors for early detection of
S. Jones. The trend to earlier birch pollen seasons in the uk: a biotic response to
[48] G. Eysenbach. Infodemiology: tracking flu-related searches on the web for syndromic
tional, 2011.
Detecting influenza epidemics using search engine query data. Nature, 457:1012–
1014, 2009.
[51] A. Go, R. Bhayani, and L. Huang. Twitter sentiment classification using distant
supervision, 2009.
Nov. 2009.
[54] S. A. Hasan, B. Liu, J. Liu, A. Qadir, K. Lee, V. Datla, A. Prakash, and O. Farri.
Neural clinical paraphrase generation with attention. ClinicalNLP 2016, page 42,
2016.
[55] A. Hulth, G. Rydevik, and A. Linde. Web queries as a source for syndromic surveil-
products/modeler/.
[57] N. Kanhabua and W. Nejdl. Understanding the Diversity of Tweets in the Time
[60] S. Kinsella, A. Passant, and J. G. Breslin. Topic classification in social media using
[62] M. Kuhn, I. Letunic, L. J. Jensen, and P. Bork. The SIDER database of drugs and
[63] A. Lamb, M. J. Paul, and M. Dredze. Separating Fact from Fear: Tracking Flu
allergic rhinitis compared with select medical conditions in the united states from
2010.
[66] S. Le Cessie and J. Van Houwelingen. Ridge estimators in logistic regression. Applied
[67] R. Leaman, R. I. Dogan, and Z. Lu. Dnorm: disease name normalization with
[68] K. Lee, A. Qadir, S. A. Hasan, V. Datla, a. prakash, J. Liu, and O. Farri. Ad-
verse drug event detection in tweets with semi-supervised convolutional neural net-
[69] J. Li and C. Cardie. Early Stage Influenza Detection from Twitter. arXiv preprint
arXiv:1309.7340, 2013.
[70] N. Limsopatham and N. Collier. Normalising medical concepts in social media texts
of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016,
[71] D. Lindberg, B. Humphreys, and A. McCray. The Unified Medical Language System.
warning indicator of human disease. Johns Hopkins APL technical digest, 24(4):349–
53, 2003.
2012.
[75] A. McCallum and K. Nigam. A comparison of event models for naive bayes text
[76] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
[77] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Proceedings of the 27th Annual Conference on Neural Information Processing Systems, 2013.
[78] R. Narayanan. Mining Text for Relationship Extraction and Sentiment Analysis.
ilance from social media: mining adverse drug reaction mentions using sequence
labeling with word embedding cluster features. Journal of the American Medical
[80] B. Pang, L. Lee, and S. Vaithyanathan. Thumbs up?: sentiment classification using
[81] M. J. Paul and M. Dredze. You Are What You Tweet: Analyzing Twitter for Public
WAO-White-Book-on-Allergy_web.pdf, 2011.
[83] F. Pervaiz, M. Pervaiz, N. Abdur Rehman, and U. Saif. FluBreaks: Early Epidemic
Detection from Google Flu Trends. J Med Internet Res, 14(5):e125, Oct 2012.
1448, 2008.
the Conference: Technical Papers, December 11-16, 2016, Osaka, Japan, pages
2923–2934, 2016.
[86] A. Prakash, S. Zhao, S. A. Hasan, V. Datla, K. Lee, A. Qadir, and O. F. Joey Liu.
Condensed memory networks for clinical diagnostic inferencing. In The 31st AAAI
cs/9603103, 1996.
[89] T. Sakaki, M. Okazaki, and Y. Matsuo. Earthquake shakes Twitter users: real-time
[90] H. Sampathkumar, X. Chen, and B. Luo. Mining adverse drug reactions from on-
line healthcare forums using hidden markov model. BMC Medical Informatics and
2009.
[92] A. Sarker and G. Gonzalez. Portable automatic text classification for adverse drug
2013.
[94] R. L. Siegel, K. D. Miller, and A. Jemal. Cancer statistics, 2017. CA: A Cancer
[95] A. Signorini, A. M. Segre, and P. M. Polgreen. The Use of Twitter to Track Levels
of Disease Activity and Public Concern in the U.S. during the Influenza A H1N1
[96] M. Sofean and M. Smith. A real-time architecture for detection of diseases using
ACM Conference on Hypertext and Social Media, HT ’12, pages 309–310, New York,
[98] R. Sugumaran and J. Voss. Real-time Spatio-temporal Analysis of West Nile Virus
ment, CIKM ’13, pages 1685–1690, New York, NY, USA, 2013. ACM.
atmospheric tree pollen levels with three weather variables during 2002-2004 in a
2011.
[103] S. R. Yerva, Z. Miklós, and K. Aberer. What have fruits to do with technology?: the
itoring Influenza Epidemics in China with Search Query from Baidu. PLoS ONE,
8:e64323, 05 2013.