ABSTRACT 
 
 
 
 
Title of Document: GOING VIRAL: INTERNET AND SOCIAL 
MEDIA BASED SURVEILLANCE SYSTEMS 
FOR DETECTING INFLUENZA ACTIVITY 
IN MARYLAND   
  
 Lisa Marie Bowen, MPH Epidemiology, 2015 
  
Directed By: Professor and Chair, Dr. Robert S. Gold, 
Department of Epidemiology and Biostatistics 
 
 
Influenza surveillance is essential for detecting and managing outbreaks. The 
Maryland Department of Health and Mental Hygiene (DHMH) currently uses the 
number of emergency room and physician visits for influenza-like-illness (ILI) to 
track flu activity. Recently, internet and social media based surveillance methods 
have emerged as useful in detecting outbreaks. This study aims to determine if 
internet and social media based surveillance methods are useful in monitoring ILI in 
Maryland through assessing how Google Flu Trends (GFT) and tweets compare to 
portions of DHMH’s formal reporting system. Innovations of this study include using 
symptom based keywords and incorporating a variety of sources of surveillance data. 
Results show tweets had a strong positive correlation with all other surveillance 
sources; Pearson’s correlation coefficients ranged from 0.62 to 0.68. GFT data were more 
highly correlated with DHMH data. Further research should investigate automating 
collection of tweets, application to other diseases, and standardized methods for 
location determination. 
  
 
 
 
 
 
 
 
 
 
 
GOING VIRAL: INTERNET AND SOCIAL MEDIA BASED SURVEILLANCE 
SYSTEMS FOR DETECTING INFLUENZA ACTIVITY IN MARYLAND 
 
 
 
 
By 
 
 
Lisa Marie Bowen 
 
 
 
 
 
Thesis submitted to the Faculty of the Graduate School of the  
University of Maryland, College Park, in partial fulfillment 
of the requirements for the degree of 
Master of Public Health 
2015 
 
 
 
 
 
 
 
 
 
 
Advisory Committee: 
Professor Dr. Robert S. Gold, Chair 
Dr. Sandra C. Quinn 
Dr. Xin He 
 
  
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
© Copyright by 
Lisa Marie Bowen 
2015 
 
 
 
 
 
 
 
 
 
 
 
Dedication 
This thesis is dedicated to Blaine, who was (almost) always willing to listen to my 
progress and frustrations, provided encouragement, and was nice enough to resist the 
temptation of watching our television shows without me. 
 
Acknowledgements 
Research reported in this paper was supported by Leidos Research Fellowship 
Program. The content is solely the responsibility of the author and does not 
necessarily represent the official views of the Leidos Corporation. Thank you to the 
Leidos team members who provided guidance for my research. I would also like to 
thank my committee members for their assistance and mentoring during my thesis 
process. I appreciate the efforts of David Blythe, Anikah Salim, and Andrea Bankoski 
at the Maryland Department of Health and Mental Hygiene in helping me obtain 
influenza surveillance data.   
 
Table of Contents 
 
 
 
Dedication 
Acknowledgements 
Table of Contents 
List of Tables 
List of Figures 
Chapter 1: Introduction to Influenza Surveillance 
Chapter 2: Research Design and Methods 
Chapter 3: Results 
Chapter 4: Discussion 
Definition of Terms 
Bibliography 
 
 
 
 
 
 
 
List of Tables 
 
 
Table 1: Dataset of frequency per week for Aim 3 data analysis 
Table 2: Pearson’s correlation coefficient between the 2014-2015 and past flu seasons 
Table 3: Linear relationship between Google Flu Trends and DHMH data for flu 
seasons 2008-2015 
Table 4: Examples of tweets from keywords fever AND (cough OR sore throat) 
Table 5: Pearson’s correlation coefficients for cleaned and raw Twitter data and 
Google Flu Trends with DHMH surveillance data for the 2014-2015 flu season 
 
List of Figures 
Figure 1: Discrepancy in data from Weekly Influenza Activity Reports 
Figure 2: Flow chart for determining location of tweets 
Figure 3: Graphical representation of influenza-like-illness activity from all 
surveillance sources used in this study 
 
 
 
Chapter 1: Introduction to Influenza Surveillance 
The influenza pandemic of 1918 killed, by conservative estimates, 21 million people 
worldwide, more people than the Black Plague, and the majority of deaths occurred 
within 24 weeks. Rapid mutation and antigenic shift of influenza make novel strains of 
the virus a continuous threat. Since 1918, six other pandemic influenzas have 
emerged, although none as lethal as the “Spanish flu” (1).  
In 2005, the Federal Government developed a strategy for pandemic 
influenza. This strategy stresses the importance of real time (at onset of illness) 
surveillance in detecting and efficiently managing outbreaks (2). Currently, the 
Centers for Disease Control and Prevention (CDC) and the Maryland Department of 
Health and Mental Hygiene (DHMH) provide weekly reports on a variety of clinical 
data, including the number of emergency room visits due to influenza-like-illness 
(ILI). Recently, other forms of surveillance, such as Google search queries, have 
emerged as useful in detecting outbreaks. Google Flu Trends detects 
increases in ILI symptoms 1-2 weeks ahead of ILI surveillance reports by the CDC 
(3). Many studies on disease surveillance mention the advantage of using multiple 
sources of surveillance to enhance effectiveness in early detection of outbreaks (4–8). 
For instance, at the onset of the H1N1 outbreak, informal internet based surveillance 
systems were reporting events before health organizations (9).  
Social media provides another form of internet surveillance to track outbreaks 
(5–7,10,11). Social media supplies unique information for disease surveillance apart 
from formal reporting and Google searches by providing access to real time data from 
individuals themselves, who may not be seeking medical care, or searching for online 
health information related to their symptoms.  
Specific Aims 
The long term goal of this project is to improve preparedness for influenza 
pandemics by using surveillance techniques that provide the earliest detection of 
outbreaks.  This is an exploratory study that will illustrate challenges, priorities, and 
strategies associated with utilizing Twitter for ILI surveillance and contribute to the 
growing body of research on using social media for disease surveillance. Twitter is a 
social media platform where users share messages, called tweets, which are a 
maximum of 140 characters in length. The terms Twitter data, tweets, and Twitter 
messages will be used interchangeably throughout the manuscript. This study goes 
beyond measuring an association; its purpose is to explore new 
datasets that were not available a decade ago in order to investigate new strategies to 
improve upon and strengthen standard practice in the field of epidemiologic 
surveillance. The objective of this study is to determine if internet and social media 
surveillance methods are useful in monitoring ILI in Maryland. The objective is 
further divided into three specific aims. 
Aim 1: Determine the similarity between influenza-like-illness emergency 
department and physician visits for the 2014-2015 flu season and past flu seasons in 
Maryland. This will show how comparable the 2014-2015 flu season is to other flu 
seasons. Since only one season of Twitter data will be used in this study, this aim will 
provide evidence on whether the correlation between DHMH and Twitter data can be 
generalized to a typical flu season. 
Aim 2: Assess how Google Flu Trend data for influenza in Maryland compare to 
portions of DHMH’s formal reporting system. This aim will investigate whether or not 
Google Flu Trend data are a useful tool for detecting ILI in Maryland. 
Aim 3: Examine Twitter (a widely used social media source) messages to 
determine if they could be used as a source of influenza surveillance data by assessing 
the correlation with DHMH and Google data, and determine if they provide more 
timely information on influenza outbreaks. 
Subaim 3.1: Analyze the characteristics of Twitter users to investigate whether or 
not certain sub-portions of the population are under- or over-represented. 
Subaim 3.2: Explore characteristics of tweets to determine how correlation varies, 
and compare Twitter data to DHMH data on laboratory confirmed cases to gain better 
insight into the variety of ways Twitter data can be used as a surveillance tool. 
Since traditional 
surveillance is limited to people seeking health care, internet based surveillance 
methods provide a way to strengthen current systems by overcoming this limitation 
(3,7,11). For instance, if the majority of people who have the flu self-treat at home, 
traditional surveillance methods will miss the majority of cases. In addition, a 
retrospective study on the use of social media and internet surveillance methods in 
tracking the 2010 Haiti cholera outbreak found informal sources were highly 
correlated with official data, but provided more immediate access to information due 
to delays in obtaining official reports (5). A multi-faceted approach to influenza 
surveillance has the potential to improve and provide more rapid response to 
outbreaks (6–9). This study is important because it aims to determine if internet based 
surveillance methods (Google Flu Trends and Twitter) are useful in monitoring 
influenza-like-illness in Maryland. Favorable results of this study have important 
implications for emergency preparedness and planning procedures.  
 
Literature Review 
Transmission of influenza through droplets that can infect people up to six 
feet away means pandemic-causing strains can spread quickly, especially in the 
interconnected world we live in today (12). This makes early detection vital to saving 
the most lives by limiting outbreaks and identifying the causative strain to 
manufacture vaccines. 
 Many studies have begun to mention the importance of social media and 
internet surveillance in tracking outbreaks (3,5–8,13). Specifically, using social media 
can provide information on early outbreaks, as well as monitor public concerns (7). 
Some social media, such as Twitter, are also easier for researchers and 
professionals to use than Google data, which are proprietary (10). Twitter was used by a 
Chicago health department during food borne illness outbreaks to link possible cases 
to an internet reporting form. Subsequently, researchers found the majority of 
potential cases who filled out forms did not seek medical treatment, and would not 
have been included if only traditional surveillance methods had been used (11).  
 All studies that have evaluated social media and internet surveillance have 
found a correlation with CDC data and a more immediate detection of outbreaks 
(3,5,13,14). Corley et al. searched all internet blogs for keywords related to influenza, 
and when compared to CDC influenza-like-illness data, researchers calculated a 
Pearson’s correlation coefficient of r = 0.63 (13). Ginsberg et al.’s comparison of 
Google Flu Trends and CDC influenza-like-illness data resulted in a very high (r 
=0.90) correlation (3). Achrekar and colleagues assessed mentions of influenza 
related keywords on Twitter and calculated a correlation of r = 0.98 with CDC data 
(14). An investigation of Twitter content during the H1N1 pandemic found data from 
Twitter predicted outbreaks 1-2 weeks ahead of the CDC on average (10). A report by 
Ginsberg et al. found Google Flu Trends was also ahead of the CDC by 1-2 weeks in 
terms of estimating weekly influenza activity (3).  
A variety of limitations in using these informal surveillance methods have 
also been revealed. Schmidt pointed out that surveillance relying on Google search 
queries may be susceptible to noise, like graduate students researching the flu, 
decreasing its reliability as a method to detect outbreaks (10). While Google has been 
shown to be highly correlated with CDC data for seasonal influenza, it was found to 
have low correlation with formal data during the onset of the 2009 H1N1 pandemic 
(7). One challenge in using social media to track outbreaks is that social media 
contains a large number of news reports, instead of “self-identified” influenza 
information (6). Twitter is more popular among young, college educated people. 
Therefore, analyses using Twitter data have the potential to over-represent these 
groups and under-represent other sub-groups such as minorities and the elderly (15). 
In addition, the correlation between Twitter data and confirmed influenza cases has 
yet to be established (6,10,13). Another limitation in using Twitter is location 
estimation: only approximately 1% of tweets contain geo-coded location 
information (16,17). Therefore, other information should be used to determine the 
location of Twitter users. While no standardized method for location estimation 
exists, previous studies have concluded that time zone information is more reliable 
than location entries in determining the location of a user/tweet (17,18). 
All studies using Twitter data have focused on keywords associated with a 
certain influenza strain or the words “influenza” and “flu”. A recent keyword search of 
Tweets using “influenza” and “flu” found a multitude of Tweets related to flu 
vaccines and news/information. This relates to the challenge mentioned by Salathe et 
al. and Corley et al. that many Tweets do not contain “self-identified” influenza 
information (6,13). An innovation of this proposed study is using keywords 
consistent with the influenza-like-illness case definition, fever AND (cough OR sore 
throat) (19). Using this combination of keywords should provide more data on 
“self-identified” illness and help eliminate Tweets on general flu information. A study in 
2010 found no income or racial disparities in the general use of social networking 
sites, though strong disparities remained in internet access (20). More specifically, the 
PewResearch Internet Project shows a significant increase in Twitter usage among the 
65+ population in 2014. In addition in 2014, 25% of online Hispanics and 27% of 
online African Americans used Twitter, compared to 21% of online whites (15). The 
increasing popularity of Twitter with a variety of demographic groups should reduce 
under-representation. However, an analysis of Twitter users will be included in this 
study to determine the demographic characteristics of the users included in this data set. 
Twitter data will also be compared to laboratory confirmed cases, a current gap in 
knowledge.  
 
 
  
 
Chapter 2: Research Design and Methods 
The purpose of this study was to investigate the association between different 
methods of influenza surveillance and to assess the usefulness of internet and social 
media based surveillance systems in monitoring influenza activity. This project was 
approved by the University of Maryland Institutional Review Board and was not 
considered human subjects research. All statistical analysis was performed using SAS 
software, version 9.3 of the SAS System for Windows (SAS Institute Inc., Cary, NC). 
Pearson’s product moment correlation coefficient (referred to as Pearson’s correlation 
coefficient through the remainder of the manuscript) was used to calculate the level of 
linear relationship of frequency per week reported by different surveillance methods. 
Pearson’s correlation coefficient was chosen to enable comparisons as previous 
studies have set a precedent for using Pearson’s correlation coefficient when 
analyzing Twitter data. Sample size was limited by the time period of official 
reporting of ILI symptoms (October to mid-May, or more precisely Morbidity and Mortality 
Weekly Report (MMWR) weeks 40-20) and the start of Twitter data collection (October 
30, 2014). However, at least 20 weeks of data were collected for all surveillance 
sources. With 20 weeks of data and a type I error rate of 0.05, Pearson’s correlation 
coefficients of 0.55 or higher can be detected with a power of 81.72%.  
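The power figure above can be approximated with Fisher's z transformation. The sketch below is not the SAS procedure used in this study. It assumes a one-sided test of the null hypothesis of zero correlation and a normal approximation, and the function name is illustrative. Under those assumptions, 20 weekly observations and a true correlation of 0.55 give a power of roughly 82%, close to the value reported above.

```python
from math import atanh, sqrt
from statistics import NormalDist


def correlation_power(r, n, alpha=0.05):
    """Approximate power to detect a true Pearson correlation r with n pairs.

    Assumptions (not taken from the thesis): one-sided test of H0: rho = 0,
    Fisher z transformation, normal approximation with SE = 1 / sqrt(n - 3).
    """
    std_normal = NormalDist()
    z_effect = atanh(r) * sqrt(n - 3)           # expected test statistic
    z_critical = std_normal.inv_cdf(1 - alpha)  # one-sided critical value
    return std_normal.cdf(z_effect - z_critical)


power = correlation_power(r=0.55, n=20)
print(f"Approximate power: {power:.1%}")
```

A two-sided test or an exact method would give a somewhat different value, so this sketch is best read as a plausibility check rather than a reproduction of the original calculation.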
Data collection 
Data were collected throughout the study as they became available from all 
sources. Data collection ended on March 28, 2015 due to project timeline 
requirements. Since there is a one week delay in the release of influenza activity 
reports from DHMH, the last week of DHMH data is for the week ending March 21, 
2015. While the flu season does not officially end until the beginning of May, flu 
activity was considered minimal for seven consecutive weeks prior to the end of data 
collection according to DHMH weekly surveillance indicators (23). 
Aim 1: 
Maryland influenza-like-illness (ILI) surveillance data on emergency 
department visits, physician visits, and laboratory confirmed cases for the 2014-2015 
flu season were obtained from the Maryland Weekly Influenza Surveillance Activity 
Reports available from the Maryland Department of Health and Mental Hygiene 
(DHMH) website. Weekly reports included activity from the previous week (“last 
week number”) which usually differed from the activity level documented in the 
initial report (“this week number”) due to delays in obtaining data. Figure 1 contains  
portions of the Weekly Influenza Surveillance Activity Reports from two consecutive 
weeks. Notice the columns marked with the arrows. The total ILI visits listed in “this 
week number” in the report for week ending March 7, 2015 corresponds to the 
number of total ILI visits in “last week number” for the report for week ending March 
14, 2015. Since early detection is of primary interest in this study, in the event of a 
discrepancy between the numbers in the “this week” and “last week” columns, as in 
Figure 1, the number from the initial report (“this week number”) was used. Last 
week numbers were used during weeks when no reports were released due to federal 
and state holidays. 
 
 
 
Figure 1: Discrepancy in data from Weekly Influenza Activity Reports 
 
 
 
 
 
Total number of positive rapid flu tests was used for laboratory confirmed 
cases. Data for number of positive rapid flu tests were from 32 clinical labs, rather 
than the DHMH lab administration, which resulted in a larger sample size (21). 
Influenza-like-illness surveillance data on emergency department visits and 
physician visits from the 2013-2014, 2012-2013, 2011-2012, 2010-2011, 2009-2010, 
and 2008-2009 flu seasons were provided by DHMH from the Electronic 
Surveillance System for the Early Notification of Community-based Epidemics 
(ESSENCE) system. DHMH data were combined into a Microsoft Excel file for 
statistical analysis.  
Aim 2: 
Google Flu Trend data have been approved for re-use and were downloaded 
from Google.org for use in this study. The downloaded dataset from Google.org had 
data from all states, and began in 2003. Only Google Flu Trend data from Maryland 
and from years that had corresponding DHMH data (2008-2015) were used.  
Aim 3: 
Tweets were collected from Twitter’s Streaming API (Application 
Programming Interface) service via Tweetarchivist.com, a company offering 
subscriptions to provide publicly available streaming Twitter data on specified 
keywords. The keyword combination “fever AND (cough OR sore throat)” was used 
to gather tweets related to influenza-like-illness. Data included characteristics such as 
username, location, time zone, date and time, and full Tweet text for each Tweet 
returned. The dataset of returned tweets was downloaded into a Microsoft Excel file 
from Tweetarchivist.com four times throughout the data collection period, resulting in 
four rounds of data cleaning that spread out the workload as data became available. A 
limitation of the Streaming API is that it does not provide access to all of the Tweets 
related to the keywords. However, if the Tweets matching the keywords represent less 
than 1% of the total volume of Tweets, the Streaming API returns 100% of the matching 
Tweets (22). Since it is unlikely that the number of Tweets matching the keywords 
“fever AND (cough OR sore throat)” exceeded 1% of total Tweets, this was not a 
limitation of the current study. 
 Data cleaning 
Aim 3: 
All data cleaning was performed in Microsoft Excel (2010). Twitter data were 
cleaned to remove re-tweets, multiple tweets from one user in a 6-week time frame, 
and tweets occurring outside of the United States. Since incidence was of primary 
interest in this study, only original tweets were used.  Re-tweets were identified and 
removed by searching for tweets containing “RT @” in the tweet text. Multiple 
tweets containing the same text from the same user were removed from the dataset as 
they were suspected to come from bots (automated programmed posts) and not to 
provide any information on an actual influenza case. If users had multiple original 
Tweets returned, Tweets were broken down into 6-week periods and only the first 
Tweet for each period was included in the final dataset. Six-week periods were chosen 
based on how the CDC classifies new episodes of illness for surveillance reporting 
(14). This was done to help eliminate prevalence data and instead focus on the first 
incidence of illness per user. If a user had a six-week time span between posts, then 
both posts were kept because a person can be re-infected with the influenza virus.   
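The cleaning in this study was done by hand in Microsoft Excel, but the three steps described above could in principle be automated. The following is a minimal sketch only. The field names (`user`, `text`, `time`) are hypothetical, and two readings of the rules are assumed: every copy of a text repeated by the same user is dropped, and a later tweet from a user is kept only if it falls at least six weeks after that user's previously kept tweet.

```python
from collections import Counter
from datetime import datetime, timedelta

SIX_WEEKS = timedelta(weeks=6)


def clean_tweets(tweets):
    """Return tweets that survive the three cleaning steps.

    Each tweet is a dict with hypothetical keys 'user', 'text', and
    'time' (a datetime); a real export may use different field names.
    """
    # Step 1: drop re-tweets, identified by "RT @" in the tweet text.
    originals = [t for t in tweets if "RT @" not in t["text"]]

    # Step 2: drop text repeated by the same user (suspected bots).
    # Assumption: every copy of the repeated text is dropped.
    counts = Counter((t["user"], t["text"]) for t in originals)
    unique = [t for t in originals if counts[(t["user"], t["text"])] == 1]

    # Step 3: keep a user's tweet only if it falls at least six weeks
    # after that user's previously kept tweet (a possible new episode).
    kept, last_kept = [], {}
    for t in sorted(unique, key=lambda t: t["time"]):
        previous = last_kept.get(t["user"])
        if previous is None or t["time"] - previous >= SIX_WEEKS:
            kept.append(t)
            last_kept[t["user"]] = t["time"]
    return kept


# Made-up tweets, for illustration only.
sample = [
    {"user": "a", "text": "fever and cough", "time": datetime(2014, 11, 3)},
    {"user": "a", "text": "fever, sore throat", "time": datetime(2014, 11, 10)},
    {"user": "a", "text": "fever and a cough again", "time": datetime(2015, 1, 5)},
    {"user": "b", "text": "RT @a: fever and cough", "time": datetime(2014, 11, 4)},
    {"user": "c", "text": "fever cough remedy", "time": datetime(2014, 11, 5)},
    {"user": "c", "text": "fever cough remedy", "time": datetime(2014, 11, 6)},
]
cleaned = clean_tweets(sample)
```

In this made-up sample only the first and third tweets from user "a" remain: the second falls within six weeks of the first, user "b" posted a re-tweet, and user "c" repeated the same text.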
A previous study comparing Twitter Streaming API and Twitter firehose (full 
repository of Tweets) found the Streaming API returned a high percentage (90%) of 
geo-coded Tweets. However, geo-coded Tweets only represent a small minority of 
total Tweets, and can introduce bias (22,24). Therefore, data for the current study 
included tweets that were identified as occurring in the United States, not just 
Maryland. Time zone and location information were used to determine the location a 
tweet originated from. Previous studies have shown that time zone is more reliable 
than location in determining a user’s location in the absence of geolocated data 
(17,18). Since only 0.97 percent of tweets returned were geolocated, location and 
time zone information were the main pieces of information used for location 
determination. The following rules were applied for determining which tweets most 
likely occurred in the United States and were therefore kept in the dataset. A flow chart 
containing the rules used for determining location can be found in Figure 2.  
 
 
Figure 2: Flow chart for determining location of tweets 
 
No standardized way to determine location from location and time zone data has been 
established. The rules used in this study were developed after reading existing 
literature and examining the data to create a standardized method to ensure the 
majority of tweets actually occurring in the United States were included in the dataset 
with minimal tweets from other countries being included (17,18). The process of 
location determination by hand was time-consuming, which limits the application of 
Twitter data in public health settings unless automated procedures are 
developed. Therefore, the original data set underwent a separate data cleaning. For 
the second data cleaning method, re-tweets were removed and only one Tweet per 
user was included, which reduced the amount of time needed to clean the data. The 
correlation of frequency of tweets per day between the two data sets was then 
calculated using Pearson’s correlation coefficient to determine if the data set which 
had minimal cleaning (referred to as the raw dataset) could be used as a proxy for 
tweets occurring in the United States. 
 After data cleaning, the remaining tweets were combined into one dataset, which 
was then reformatted to allow comparisons to the other forms of surveillance 
data, which are recorded in frequency per week. Local time zone information was 
used to calculate the frequency of tweets per day. Frequency per day was then 
translated into frequency of tweets per week, based on MMWR weeks. A final dataset 
containing week ending date and frequency of tweets was then used in the analysis 
(see Table 1). This same formatting method was performed on the raw dataset in order 
to calculate the correlation coefficient between the two Twitter datasets. 
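The conversion from daily to weekly frequencies can be sketched as follows. MMWR weeks run from Sunday through Saturday, so each day is assigned to the Saturday that closes its week. The function names and the daily counts are illustrative and are not taken from the study data.

```python
from collections import Counter
from datetime import date, timedelta


def week_ending(day):
    """Return the Saturday that closes the Sunday-to-Saturday week of `day`."""
    # date.weekday(): Monday is 0, Saturday is 5, Sunday is 6.
    return day + timedelta(days=(5 - day.weekday()) % 7)


def weekly_frequency(daily_counts):
    """Sum a {date: tweets per day} mapping into {week ending date: tweets}."""
    weekly = Counter()
    for day, count in daily_counts.items():
        weekly[week_ending(day)] += count
    return dict(weekly)


# Made-up daily counts, for illustration only.
daily = {date(2014, 11, 2): 90, date(2014, 11, 8): 85, date(2014, 11, 9): 70}
weekly = weekly_frequency(daily)
```

Here Sunday, November 2 and Saturday, November 8 fall in the week ending 11/8/2014, while Sunday, November 9 starts the week ending 11/15/2014.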
Statistical Analysis 
Emergency department visits, physician visits, and Google Flu Trend data 
were compared on a weekly basis from the first week of October until the end of May 
(MMWR week 40-20) for past flu seasons, 2008-2014, and until the week ending 
March 21 (MMWR week 11) for DHMH data and the last week in March (MMWR 
week 12) for Google Flu Trends and Twitter data for the 2014-2015 flu season. The 
final dataset used for analyzing the linear relationship between the different forms of 
influenza surveillance contained frequency per week for each surveillance method: 
tweets (raw and cleaned), Google Flu Trends, physician visits, emergency department 
visits, and laboratory confirmed cases (Table 1). A similar dataset containing 
frequency per week for physician visits, emergency department visits, and Google Flu 
Trends from years 2008-2015 was used to calculate the correlation between the 2014-
2015 flu season with past flu seasons. The same dataset was used in analysis of the 
linear relationship between Google Flu Trends and DHMH surveillance data 
(physician visits and emergency department visits) for each flu season dating back to 
the 2008-2009 flu season.  
 
Table 1: Dataset of frequency per week for Aim 3 data analysis 

Week Ending   Raw Twitter   Cleaned Twitter   Google Flu   Physician   Emergency Department   Laboratory Confirmed
Date          Data          Data              Trends       Visits      Visits                 Cases
11/8/2014     939           623               2129         122         642                    24
11/15/2014    913           607               1602         100         709                    38
11/22/2014    879           559               1885         116         703                    52
11/29/2014    835           552               2186         131         947                    175
12/6/2014     942           650               2698         197         1114                   301
12/13/2014    950           668               3340         254         1357                   652
12/20/2014    987           706               4941         406         2265                   2100
12/27/2014    1017          758               7536         293         3538                   3307
1/3/2015      1030          717               8057         445         3394                   2423
1/10/2015     989           689               6346         326         2298                   1442
1/17/2015     860           594               5389         249         1494                   920
1/24/2015     928           620               4358         282         1332                   788
1/31/2015     951           654               4405         241         994                    565
2/7/2015      913           600               3428         218         1028                   514
2/14/2015     855           549               3139         167         926                    312
2/21/2015     813           533               2516         132         771                    258
2/28/2015     790           508               2228         159         723                    203
3/7/2015      429           276               2099         95          620                    136
3/14/2015     399           266               2125         132         744                    161
3/21/2015     755           495               2134         120         802                    183
 
 
Aim 1: 
In order to determine if the 2014-2015 flu season was a typical flu season, 
emergency department visits and physician visits data from the 2014-2015 flu season 
were compared to each past flu season by calculating the Pearson’s correlation 
coefficient which resulted in six different correlation coefficients. The correlation 
coefficients were then rank ordered.  
Aim 2: 
Pearson’s correlation coefficient was also calculated to determine the 
correlation between Google Flu Trends and DHMH data. Since Google makes 
revisions to the algorithm used in Google Flu Trends, correlation coefficients were 
calculated for each flu season (3).  
Aim 3: 
Tweets were aggregated into frequency per week to be consistent with DHMH 
and Google Flu Trends’ reporting methods. Twitter data were analyzed starting with 
MMWR week 45 (week ending 11/8/2014) as this was the first full week of Twitter 
data collected. Pearson’s correlation coefficients were calculated for the correlation 
between Twitter data and emergency department visits, physician visits, Google Flu 
Trends, and laboratory confirmed cases (Table 1). Due to a lack of tweets with 
location and/or geo-coded information in Maryland, no separate analysis was 
performed comparing Maryland tweets to the full dataset.  
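As an illustration of the calculation, the sketch below computes Pearson's correlation coefficient between two columns of Table 1, the cleaned Twitter data and emergency department visits. It is written in plain Python for readability and is not the SAS code used in the analysis. The function and variable names are illustrative.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson's product moment correlation coefficient for paired lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Weekly frequencies copied from Table 1 (weeks ending 11/8/2014 to 3/21/2015).
cleaned_tweets = [623, 607, 559, 552, 650, 668, 706, 758, 717, 689,
                  594, 620, 654, 600, 549, 533, 508, 276, 266, 495]
ed_visits = [642, 709, 703, 947, 1114, 1357, 2265, 3538, 3394, 2298,
             1494, 1332, 994, 1028, 926, 771, 723, 620, 744, 802]

r = pearson_r(cleaned_tweets, ed_visits)
print(f"Cleaned tweets vs. emergency department visits: r = {r:.2f}")
```

The same function can be applied to any other pair of columns in Table 1, for example raw against cleaned Twitter data.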
 
Chapter 3: Results 
The results demonstrate that internet and social media influenza surveillance 
methods are correlated with DHMH surveillance data on physician visits, emergency 
department visits, and laboratory confirmed cases. Results are further broken down 
and reported according to each aim of the study.  
Aim 1 
Aim 1 investigated the similarity between the 2014-2015 flu season and 
previous flu seasons. The objective of this aim was to determine if the linear 
relationship between tweets and DHMH data is generalizable to a typical flu season. 
The correlation coefficients between the 2014-2015 flu season and previous flu 
seasons varied dramatically; results are reported in ranked order by p-value according 
to physician visits in Table 2. 
Table 2: Pearson’s correlation coefficient between the 2014-2015 and past flu seasons  
 
Flu Season   Physician Visits    Emergency Department Visits 
2012-2013    0.655 (p=0.0004)    0.820 (p<0.001) 
2013-2014    0.485 (p=0.01)      0.546 (p=0.55) 
2009-2010    -0.450 (p=0.02)     -0.303 (p=0.14) 
2010-2011    0.458 (p=0.46)      0.400 (p=0.05) 
2008-2009    -0.079 (p=0.71)     -0.129 (p=0.54) 
2011-2012    -0.071 (p=0.74)     0.708 (p<0.001) 
 
The 2014-2015 flu season was most highly correlated with the 2012-2013 flu 
season, showing a strong positive linear relationship for both physician visits 
(r=0.655) and emergency department visits (r=0.82). The 2009-2010 season had a 
strong negative correlation for physician visits (r=-0.45) and moderate negative 
correlation for emergency department visits (r=-0.303). The 2008-2009 flu season 
showed no association with the 2014-2015 season. The correlation coefficient for 
physician visits and emergency visits generally followed the same trend, except for 
the 2011-2012 season. For the 2011-2012 flu season, physician visits were not 
correlated with physician visit data from 2014-2015 (r=-0.071). However, the emergency 
department visit data for 2011-2012 showed a very strong positive correlation 
(r=0.708) with emergency department visits for 2014-2015. 
Aim 2 
Aim 2 assessed the usefulness of Google Flu Trends in detecting ILI activity 
in Maryland. The level of linear association between DHMH data, represented by 
physician visits and emergency department visits and Google Flu Trend data varied. 
Results are presented in Table 3 in ranked order according to physician visits. Unlike 
the results from Aim 1, the correlation between Google and DHMH surveillance data 
always had a positive relationship and the lowest level of correlation still represented 
a moderate relationship between the data sources.  
Table 3: Linear relationship between Google Flu Trends and DHMH data for flu seasons 2008-2015 

Flu Season   Physician Visits   Emergency Department Visits 
2009-2010    0.952 (p<0.001)    0.980 (p<0.001) 
2010-2011    0.902 (p<0.001)    0.965 (p<0.001) 
2014-2015    0.897 (p<0.001)    0.947 (p<0.001) 
2013-2014    0.874 (p<0.001)    0.967 (p<0.001) 
2012-2013    0.862 (p<0.001)    0.974 (p<0.001) 
2008-2009    0.745 (p<0.001)    0.393 (p=0.02) 
2011-2012    0.394 (p=0.02)     0.724 (p<0.001) 
 
The weakest correlation was seen in the 2011-2012 flu season for physician 
visits (r=0.394), and the 2008-2009 flu season for emergency department visits 
(r=0.393). Apart from the 2008-2009 flu season, the relationship was consistently 
stronger between Google Flu Trends and emergency department visits. The strongest 
correlation was observed for the 2009-2010 flu season for both physician (r=0.952) 
and emergency department visits (r=0.980).  
Aim 3 
Aim 3 examined whether tweets from a symptom based keyword combination were
correlated with Google Flu Trends and DHMH influenza surveillance data, to see if
tweets could be used as a mechanism for influenza surveillance. The fully cleaned
Twitter dataset had a very strong correlation (r=0.98) with the Twitter dataset that
was cleaned only for re-tweets and multiple tweets from the same user (referred to as
the raw dataset). However, when calculating the Pearson’s correlation coefficient
between Twitter data and other sources of influenza surveillance, the fully cleaned
dataset had a stronger relationship with all other sources (see Table 5). The raw
dataset contained 18,112 tweets. Only 0.97% of the tweets contained geo-coded
information. Due to the lack of geo-coded tweets and tweets containing Maryland
location identifiers, no separate analysis was done comparing Maryland tweets to the
full dataset. After cleaning the data to include only tweets suspected to have occurred
in the United States, the sample size was reduced to n=12,268; that is, 67.7% of the
tweets returned for keywords “fever AND (cough OR sore throat)” were determined to
have occurred in the United States based upon the location determination system
developed in this study. From the cleaned dataset, only 952 tweets, or 7.8%,
contained the words influenza or flu within the full tweet text. Examples of
tweets in the final dataset can be found in Table 4. The examples show that
while most tweets focused on experiencing symptoms, some noise still existed in the
dataset.
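The keyword combination and the first cleaning step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure used in this study: the function names, the (user, text) record layout, the "RT @" re-tweet convention, plain substring matching, and the rule of keeping only the first tweet per user are all hypothetical choices. The last function simply reproduces the percentages reported in the text from the counts given there.

```python
def matches_ili_keywords(text):
    """True if the text satisfies: fever AND (cough OR sore throat).

    Assumption: case-insensitive substring matching, so "coughing" also matches.
    """
    lowered = text.lower()
    return "fever" in lowered and ("cough" in lowered or "sore throat" in lowered)


def drop_retweets_and_repeat_users(tweets):
    """Keep the first matching, non-re-tweet message from each user.

    `tweets` is an iterable of (user, text) pairs (hypothetical layout).
    Assumption: re-tweets are recognised by a leading "RT @".
    """
    seen_users = set()
    kept = []
    for user, text in tweets:
        if text.startswith("RT @"):
            continue  # re-tweet, not a first-hand report
        if not matches_ili_keywords(text):
            continue  # does not satisfy the keyword combination
        if user in seen_users:
            continue  # a tweet from this user has already been kept
        seen_users.add(user)
        kept.append((user, text))
    return kept


def percent(part, whole):
    """Percentage of `whole` made up by `part`, rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Counts taken from the text: 18,112 raw tweets, 12,268 after location cleaning,
# 952 of which mention "flu" or "influenza".
share_in_us = percent(12268, 18112)
share_mentioning_flu = percent(952, 12268)
```

Determining whether a tweet occurred in the United States is deliberately left out of the sketch, because that step relied on the location determination system developed in this study.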
 
 
 
Table 4: Examples of tweets for keywords fever AND (cough OR sore throat)
Please pray for healing. I have a bad fever and super sore throat.
why would u come to school w a fever, stuffy nose, sore throat, and aching body?
High fever and sore throat and all I want is a chocolate frosty
This sore throat, fever, runny nose, and back pains are already calling for a great night at work! ~feeling miserable~
#WheatgrassJuice can be used for treatment of respiratory tract complaints, including the common cold, cough, fever, and sore throat.
Fever, chills, sore throat...where did this come from? Is February over yet? #IHateFebruary #WorstMonthOfTheYear
Going to school with a fever and sore throat sucks ):
Way to start my birthday month! sore throat, chills, headache, I feel the fever coming!!!! Google scares me
What is swine flu?C)Symptoms similar to those produced by standard, seasonal flu - fever, cough, sore throat, body aches and chills
 
 
 
Table 5: Pearson’s correlation coefficients for cleaned and raw Twitter data and Google Flu Trends with DHMH surveillance data for the 2014-2015 flu season

                              Cleaned Twitter Data   Raw Twitter Data   Google Flu Trends
Physician Visits              0.675 (p=0.001)        0.593 (p=0.006)    0.897 (p<0.0001)
Emergency Department Visits   0.642 (p=0.002)        0.530 (p=0.02)     0.947 (p<0.0001)
Lab Confirmed Cases           0.616 (p=0.004)        0.494 (p=0.03)     0.927 (p<0.0001)
Google Flu Trends             0.642 (p=0.002)        0.536 (p=0.01)     1.00
 
Results show that tweets had a strong positive relationship with all other
sources of surveillance data. Pearson’s correlation coefficients for frequency of ILI
activity per week ranged from r=0.616 with laboratory confirmed cases to r=0.675
with physician visits (Table 5). Tweets had a lower correlation with all sources of
DHMH influenza surveillance data than Google Flu Trends did for the 2014-2015 flu
season. It is interesting to note that Twitter and physician visit data lacked the strong
peak in activity that is usually seen during the flu season and that can be observed in
the other forms of surveillance data (see Figure 3).
Sub-aim 3.1:  
 
 No racial indicators were included in the Twitter dataset, and therefore no
separate analysis could be performed to investigate whether sub-portions of the
population were under- or over-represented in the sample.
Sub-aim 3.2: 
 Tweets had a strong positive association with laboratory confirmed influenza
cases. However, the Pearson’s correlation coefficient between tweets and laboratory
confirmed cases was the lowest among the surveillance sources analyzed.
 
 
Figure 3: Graphical representation of influenza-like-illness activity from all surveillance sources used in this study
[Figure: line chart titled “Influenza-like-illness Activity per Week from Multiple Sources of Influenza Surveillance Data”. Horizontal axis: MMWR Week 2014-2015 (weeks 40 through 11). Vertical axis range: 0 to 9000. Series: Twitter, Google Flu Trends, Physician Visits, Emergency Department Visits, Laboratory Confirmed Cases.]
 
Chapter 4: Discussion 
 
Aim 1 
Aim 1 was performed to assess the linear relationship between the 2014-2015
flu season and past flu seasons. Results show that the 2014-2015 flu season was only
comparable to three other seasons (two moderately, one strongly). Therefore, the
results assessing the relationship between Twitter and DHMH influenza surveillance
data are not generalizable to all flu seasons; Twitter data may be more or less
correlated with DHMH data in a given season depending on the specific characteristics
of that flu season. Differences in the relationship between the 2014-2015 flu season
and other flu seasons could be due to the severity of the most prominent strain and
the timing of when activity becomes more widespread. For instance, the 2014-2015
season was expected to be more severe due to a vaccine mismatch with the circulating
influenza strains (25). This could be a reason why the 2014-2015 flu season had a low
correlation with past seasons. The 2014-2015 flu season had the weakest relationship
with the 2009-2010 and 2008-2009 flu seasons, which may be due to unique flu activity
resulting from the 2009 Swine Flu (H1N1) pandemic (26). The 2012-2013 flu season was
the most highly correlated with the 2014-2015 flu season. According to DHMH’s flu
season summary, the 2012-2013 flu season was the most active season since the 2009
H1N1 pandemic (27). Similarities between the 2014-2015 and 2012-2013 seasons include
AH3 as the most prominent strain and a high level of flu activity (23,25,27).
Aim 2 
Google Flu Trends data were compared to DHMH physician and emergency
department visits from flu seasons 2008-2015 in order to determine if Google has
been useful in tracking influenza activity in Maryland. The magnitude of Google data
compared to the other surveillance sources demonstrated that far more people seek
information than care, and confirmed that using Google Flu Trends for influenza
surveillance provides information on cases that would normally be missed in
surveillance relying only on people accessing healthcare (3,7,11). Results from Aim 2
were consistent with previous studies, which showed an initial low correlation with
clinical data at the beginning of the 2009 Swine flu pandemic, but found that changes
to the algorithm used in Google Flu Trends drastically improved the correlation between
official data and Google Flu Trends for the remainder of the pandemic (7). The Pearson’s
correlation coefficient between Google Flu Trends and DHMH data for the 2008-2009
season ranked second lowest for physician visits and lowest for emergency
department visits, while the 2009-2010 season had the highest correlation coefficient
for both physician and emergency department visits.
The results of the linear relationship between Google and DHMH surveillance
data were different from what was expected. Since Google revises the algorithm used to
track flu activity, it was expected that the most recent years would have the highest
correlation coefficients. Variation may be due to differing characteristics of each flu
season, or the current algorithm may be tuned to pandemic H1N1 conditions.
Google states that flu trend data should be interpreted as ILI cases per 100,000
physician visits (28). Interestingly, in this study, apart from the 2008-2009 season,
Google data were consistently more highly correlated with emergency department 
visits. Based on the data in this study, in Maryland, the same population that uses 
Google to search their symptoms and influenza information might also be more likely 
to visit the emergency department rather than a physician’s office for care. However,
it is hard to differentiate who is represented and how different groups use Google.
Aim 3 
In this study tweets were found to be positively associated with influenza 
surveillance data on physician visits, emergency department visits, and laboratory 
confirmed cases, as well as with Google Flu Trends. This study went beyond methods 
used in previous studies researching social media for influenza surveillance and took 
a different approach to better capture incident data. The higher Pearson’s correlation 
coefficients reported in previous studies using keywords such as flu and influenza 
were likely heavily influenced by noise produced by tweets from public health 
organizations, news, and tweets related to flu vaccines. Only 7.8% of tweets
returned on ILI symptoms contained the words flu or influenza, which provides further
evidence that the use of flu and influenza as keywords for disease surveillance fails to
identify the majority of self-reported ILI cases.
A sub-aim of this study was to explore user characteristics to determine whether
certain sub-portions of the population were under- or over-represented. Twitter is more
popular with certain portions of the population, such as college students. However, it
is becoming more diverse; a larger percentage of online African Americans use
Twitter than online Hispanics or Whites (15). It is possible that the sample of tweets
used could be representative of all Twitter users and therefore a relatively
heterogeneous sample. However, since there are no racial indicators on Twitter profiles,
no comment can be made with certainty on the racial identities of the persons
generating the tweets used in this study.
There were many challenges in using Twitter data for research purposes. No
standardized method for determining the location of users or tweets has been developed.
Even with a flow chart guiding decisions on a user’s location, data cleaning was a
time-consuming endeavor. This limits the ability to use Twitter data in public health
settings, where time is constrained. However, this may be overcome by using tweets on
ILI that occurred worldwide, represented in this study by the raw dataset. The raw
and cleaned datasets had a very strong positive correlation. While the relationship
between raw tweets and DHMH surveillance data was weaker, there was still a
positive association. Thus, the raw dataset can provide a rough estimate of activity,
but ultimately the cleaned dataset is the better choice when using tweets for disease
surveillance. Research is currently being conducted on algorithms that estimate the
location of Twitter users and tweets (16). While time constraints currently exist, this
limitation may be overcome in the near future with continued research and
development.
While results of this study show that Twitter is correlated with DHMH data, there
was no evidence that Twitter or Google Flu Trends showed increases in flu activity
earlier than other surveillance sources. The main advantage of Google Flu Trends and
Twitter for influenza surveillance is being able to access real-time data. Activity
reports produced by DHMH were released a full week after the week being reported.
Even after this delay in reporting, data were often still missing, resulting in a two
week delay in obtaining complete ILI surveillance data. Public health officials
themselves may not have to wait the entire two weeks to view surveillance activity,
but they are still limited by how many physicians’ offices and hospitals report ILI
visits and how quickly they do so. Thus, while Google Flu Trends and Twitter might not
show activity increasing earlier than DHMH surveillance, the data can be accessed sooner,
which is important for emergency management and public health officials preparing
for and responding to outbreaks.
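One way to check whether one weekly series rises earlier than another is to correlate the two series at several week offsets. The Python sketch below is an illustration only and is not necessarily the procedure used in this study; the function names and the weekly values are hypothetical, and the reference series is constructed to trail the candidate series by exactly one week.

```python
import math


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def lagged_correlations(candidate, reference, max_lag):
    """Correlate candidate[t] with reference[t + lag] for lag = 0..max_lag.

    Both series are assumed to have the same length and week ordering.
    A maximum at lag k > 0 suggests the candidate leads the reference by k weeks.
    """
    results = {}
    for lag in range(max_lag + 1):
        n = len(reference) - lag  # number of overlapping weeks at this lag
        results[lag] = pearson_r(candidate[:n], reference[lag:])
    return results


# Made-up weekly activity (illustration only): the reference series is the
# candidate series delayed by one week.
candidate = [1, 3, 8, 20, 12, 5, 2, 1]
reference = [0, 1, 3, 8, 20, 12, 5, 2]
by_lag = lagged_correlations(candidate, reference, max_lag=3)
best_lag = max(by_lag, key=by_lag.get)
```

If the largest coefficient occurs at lag 0, as the results of this study would suggest, the candidate source offers no lead in the timing of activity, and its advantage lies only in how soon the data become available.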
 The interconnectedness of our world means that influenza outbreaks occurring
across the country, or even the world, can easily spread to Maryland. Not only can the
data from Google Flu Trends and Twitter be accessed sooner, but an additional
benefit of using these surveillance methods is being able to track activity outside of a
health department’s jurisdiction. Knowledge of increases in influenza activity in other
parts of the country can help preparedness efforts for local health departments.
Limitations 
 There were several limitations in this study. The method developed to
determine the location of tweets has not been validated, and no standardized method
currently exists. This resulted in a time-consuming data cleaning
process that limits the application of Twitter outside of research settings, unless
automated tools are developed to streamline this process. Since there were no
racial/ethnic identifiers, the representativeness of the tweets cannot be verified. It is
possible that the dataset over- or under-represents certain sub-groups and is
therefore not representative of the entire population. Lastly, Google Flu Trends and
DHMH data were collected from Maryland, while tweets were collected from the
entire United States. This means there is a difference in the base populations used in
this study. However, it is hypothesized that the correlation would increase if only
tweets from Maryland were used. Consequently, the true correlations between tweets,
DHMH, and Google Flu Trends might be higher than the correlations reported in this
study.
Conclusions 
In general, Google Flu Trends and ILI symptom based tweets were positively
correlated with current surveillance methods used by Maryland’s Department of
Health and Mental Hygiene. Since every flu season was found to be unique, the
overall relationship between Google Flu Trends, tweets, and DHMH surveillance data
may vary from year to year. In
conclusion, the results of this study reinforce that influenza surveillance data should
be gathered from a variety of sources in order to provide the greatest understanding of
influenza outbreaks. These different sources of surveillance represent different
portions of the population, such as those not seeking healthcare, and provide earlier
access to data on influenza activity in order to best prepare for and manage an
outbreak (7). Future work should focus on development of a tool which automatically
collects tweets based on ILI keywords and cleans the dataset, application of internet
and social media surveillance to other diseases, and standardized methods for
determining location from Twitter data.
Definition of Terms 
Bot: an application that is programmed to produce tweets 
Geo-coded: contains a geographic reference point 
Influenza-like-illness: illness with symptoms of fever, and cough and/or sore throat 
used to estimate influenza activity 
Morbidity and Mortality Weekly Report (MMWR): Weekly series containing timely 
public health information prepared by the Centers for Disease Control and Prevention 
Real time surveillance: surveillance that occurs at or very close to the onset of the 
disease 
Re-tweet: a re-post of a tweet
Tweet: a message/post on Twitter, also referred to as a Twitter message
Twitter: social media platform where users share 140-character messages called
tweets
 
 
 
 
Bibliography 
 
1.  Barry J. The Great Influenza: The Story of the Deadliest Pandemic in History.
New York, New York: Penguin Books; 2009. 546 p.
2.  U.S. Government. National Strategy for Pandemic Influenza. 2005 Nov.  
3.  Ginsberg J, Mohebbi MH, Patel RS, Brammer L, Smolinski MS, Brilliant L. 
Detecting influenza epidemics using search engine query data. Nature. 2009 Feb 
19;457(7232):1012–4.  
4.  Denecke K, Krieck M, Otrusina L, Smrz P, Dolog P, Nejdl W, et al. How to 
exploit twitter for public health monitoring? Methods Inf Med. 2013;52(4):326–
39.  
5.  Chunara R, Andrews JR, Brownstein JS. Social and news media enable 
estimation of epidemiological patterns early in the 2010 Haitian cholera 
outbreak. Am J Trop Med Hyg. 2012 Jan;86(1):39–45.  
6.  Salathé M, Freifeld CC, Mekaru SR, Tomasulo AF, Brownstein JS. Influenza A 
(H7N9) and the importance of digital epidemiology. N Engl J Med. 2013 Aug 
1;369(5):401–4.  
7.  Milinovich GJ, Williams GM, Clements ACA, Hu W. Internet-based 
surveillance systems for monitoring emerging infectious diseases. Lancet Infect 
Dis. 2014 Feb;14(2):160–8.  
8.  Salathé M, Bengtsson L, Bodnar TJ, Brewer DD, Brownstein JS, Buckee C, et 
al. Digital epidemiology. PLoS Comput Biol. 2012;8(7):e1002616.  
9.  Brownstein JS, Freifeld CC, Madoff LC. Influenza A (H1N1) virus, 2009--
online monitoring. N Engl J Med. 2009 May 21;360(21):2156.  
10.  Schmidt CW. Trending now: using social media to predict and track disease 
outbreaks. Environ Health Perspect. 2012 Jan;120(1):A30–33.  
11.  Harris JK, Mansour R, Choucair B, Olson J, Nissen C, Bhatt J, et al. Health 
department use of social media to identify foodborne illness - Chicago, Illinois, 
2013-2014. MMWR Morb Mortal Wkly Rep. 2014 Aug 15;63(32):681–5.  
12.  Centers for Disease Control and Prevention. How Flu Spreads [Internet]. 
Centers for Disease Control and Prevention. 2013 [cited 2014 Nov 10]. 
Available from: http://www.cdc.gov/flu/about/disease/spread.htm 
13.  Corley CD, Cook DJ, Mikler AR, Singh KP. Using Web and social media for 
influenza surveillance. Adv Exp Med Biol. 2010;680:559–64.  
14.  Achrekar H, Gandhe A, Lazarus R, Yu S-H, Liu B. Predicting Flu Trends using 
Twitter data. 2011 IEEE Conference on Computer Communications Workshops 
(INFOCOM WKSHPS). 2011. p. 702–7.  
15.  Duggan M, Ellison NB, Lampe C, Lenhart A, Madden M. Social Media
Update 2014 [Internet]. Pew Research Center’s Internet & American Life 
Project. [cited 2015 Jan 20]. Available from: 
http://www.pewinternet.org/2015/01/09/social-media-update-2014/ 
16.  Mahmud J, Nichols J, Drews C. Home Location Identification of Twitter Users.
ACM Trans Intell Syst Technol. 2014;5(3):1–21.
17.  Graham M, Hale SA, Gaffney D. Where in the World Are You? Geolocation 
and Language Identification in Twitter. Prof Geogr. 2014;66(4):568–78.  
18.  Burton SH, Tanner KW, Giraud-Carrier CG, West JH, Barnes MD. “Right time, 
right place” health communication on Twitter: value and accuracy of location 
information. J Med Internet Res. 2012;14(6):e156.  
19.  Centers for Disease Control and Prevention (CDC). Overview of Influenza 
Surveillance in the United States [Internet]. 2015. Available from: 
http://www.cdc.gov/flu/weekly/overview.htm 
20.  Kontos E, Emmons K, Puleo E, Viswanath K. Communication Inequalities and 
Public Health Implications of Adult Social Networking Site Use in the United 
States. J Health Commun. 2010;15(Supplement):216–35.  
21.  Office of Infectious Disease Epidemiology and Outbreak Response Infectious 
Disease Bureau Prevention and Health Promotion Administration Maryland 
Department of Health and Mental Hygiene. Maryland Weekly Influenza 
Surveillance Activity Report [Internet]. 2014 Oct. Available from:
http://phpa.dhmh.maryland.gov/influenza/fluwatch/SiteAssets/SitePages/Home/
Weekly%20Influenza%20Report%202014-10-4.pdf 
22.  Morstatter F, Pfeffer J, Liu H, Carley KM. Is the Sample Good Enough? 
Comparing Data from Twitter’s Streaming API with Twitter’s Firehose.
ICWSM. 2013 Jul;  
23.  Maryland Department of Health and Mental Hygiene. fluwatch [Internet]. 
Department of Health and Mental Hygiene. 2015. Available from: 
http://phpa.dhmh.maryland.gov/influenza/fluwatch/SitePages/Home.aspx 
24.  Freelon D. Twitter geolocation and its limitations [Internet]. 2013 [cited 2014 
Nov 11]. Available from: http://dfreelon.org/2013/05/12/twitter-geolocation-
and-its-limitations/ 
25.  Roos R. CDC’s flu warning raises questions about vaccine match
[Internet]. CIDRAP. [cited 2015 Apr 7]. Available from: 
http://www.cidrap.umn.edu/news-perspective/2014/12/cdcs-flu-warning-raises-
questions-about-vaccine-match 
26.  CDC Novel H1N1 Flu | The 2009 H1N1 Pandemic: Summary Highlights, April 
2009-April 2010 [Internet]. [cited 2015 Apr 7]. Available from: 
http://www.cdc.gov/h1n1flu/cdcresponse.htm 
27.  Maryland Department of Health and Mental Hygiene. Influenza in Maryland 
2012-2013 Season Report [Internet]. Available from: 
http://phpa.dhmh.maryland.gov/influenza/fluwatch/Shared%20Documents/FIN
AL%20FLU%20REPORT%202012_13_9SEP13_Final.pdf 
28.  Google Inc. Frequently asked questions [Internet]. Google.org flu trends. 2011. 
Available from: http://www.google.org/flutrends/about/faq.html