ABSTRACT 
 
 
 
 
Title of Dissertation: DESCRIBING URGENT EVENT DIFFUSION 
ON TWITTER USING NETWORK 
STATISTICS 
  
 Hechao Sun, Doctor of Philosophy, 2017 
  
Dissertation directed by: Assistant Professor, Bill Rand, Department of 
Business Management, NC State University 
 
 
In this dissertation, I develop a novel framework to study the diffusion of urgent 
events through the popular social media platform—Twitter. Based on my literature 
review, this is the first comprehensive study on urgent event diffusion through 
Twitter. I observe similar diffusion patterns among different data sets and adopt the 
"cross prediction" mode to handle the early time prediction problem. I show that the 
statistics from the network of Twitter retweets can not only provide profound insights 
about event diffusion, but also can be used to effectively predict user influence and 
topic popularity. The above findings are consistent across various experiment 
settings. I also demonstrate that linear models consistently outperform state-of-art 
nonlinear ones in both user and hashtag prediction tasks, possibly implying the strong 
log-linear relationship between selected prediction features and the responses, which 
potentially could be a general phenomenon in the case of urgent event diffusion.  
 
  
 
 
 
 
 
 
 
 
DESCRIBING URGENT EVENT DIFFUSION ON TWITTER USING NETWORK 
STATISTICS 
 
 
 
by 
 
 
Hechao Sun 
 
 
 
 
 
Dissertation submitted to the Faculty of the Graduate School of the  
University of Maryland, College Park, in partial fulfillment 
of the requirements for the degree of 
Doctor of Philosophy 
2017 
 
 
 
 
 
 
 
 
Advisory Committee: 
Professor William Rand, Chair  
Professor Michelle Girvan 
Professor Jeffrey Herrmann 
Professor Shawn Mankad 
Professor Paul Smith 
 
 
  
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
© Copyright by 
Hechao Sun 
2017 
 
 
 
 
  
 ii 
 
Dedication 
This thesis is dedicated to everyone who has helped and supported me during my 
whole degree.   
  
 iii 
 
Acknowledgements 
Foremost, I would like to express my sincere gratitude to my advisor Dr. Bill Rand 
for the continuous support of my Ph.D. study and research, for his patience, 
motivation, enthusiasm, immense knowledge and expertise. His guidance helped me 
through the research and the writing of this thesis. I could not have imagined having a 
better advisor and mentor for my Ph.D study. 
 
Besides my advisor, I would like to thank the rest of my thesis committee: Dr. Shawn 
Mankad, Dr. Paul Smith, Dr. Jeffrey Herrmann, and Dr. Michelle Girvan, for their 
insightful comments and encouragement, but also for the hard questions that 
incentivized me to broaden my research from various perspectives. 
 
Also I would like to express my heartfelt thanks to my parents who have supported 
me spiritually throughout the writing process and throughout my life in general. 
 
Last, but not least, I want to thank all of the colleagues in my research group who 
gave me suggestions and enlightened my research, including but not limited to: Dr. 
David Darmon, Dr. Keith Burghardt, Mr. Arjuna Ariyaratne, and Mr. Chen Wang. 
  
 iv 
 
Table of Contents 
 
Dedication ..................................................................................................................... ii 
Acknowledgements ...................................................................................................... iii 
Table of Contents ........................................................................................................ iiii 
List of Tables ............................................................................................................. viv 
List of Figures ............................................................................................................ viii 
List of Abbreviations ................................................................................................... xi 
Chapter 1: Introduction ................................................................................................. 1 
1.1 Background ......................................................................................................... 1 
1.1.1 Motivation .................................................................................................... 1 
1.1.2 Why retweets on Twitter .............................................................................. 2 
1.2 Related work ....................................................................................................... 5 
1.2.1 Information diffusion study ......................................................................... 5 
1.2.2 Prediction during urgent events diffusion .................................................... 6 
1.3 Data sets ............................................................................................................ 10 
1.3.1 15k_user data ............................................................................................. 10 
1.3.2 Events data ................................................................................................. 11 
1.4 Network analysis ............................................................................................... 12 
1.4.1 Retweet network ........................................................................................ 12 
1.4.2 Related network statistics .......................................................................... 13 
1.4.3 Dynamic network statistics extraction ....................................................... 15 
1.5 Outline of work ................................................................................................. 17 
Chapter 2: Tracking urgent event diffusion on Twitter .............................................. 19 
2.1 Introduction ....................................................................................................... 19 
2.1.1 Motivation .................................................................................................. 19 
2.1.2 Data description  ........................................................................................ 19 
2.2 Event evolution tracking  .................................................................................. 22 
2.2.1 Global level tracking .................................................................................. 22 
2.2.2 Node level tracking .................................................................................... 25 
2.3 Event hashtag identification .............................................................................. 26 
2.3.1 Hashtag evolution comparison ................................................................... 26 
2.3.2 Hashtag clustering ...................................................................................... 27 
2.3.3 Event hashtag classification ....................................................................... 30 
2.4 Discussion ......................................................................................................... 31 
2.4.1 Summary .................................................................................................... 31 
2.4.2 Future work ................................................................................................ 32 
Chapter 3: Predicting user influence during urgent event diffusion ........................... 34 
3.1 Introduction ....................................................................................................... 34 
3.1.1 Motivation .................................................................................................. 34 
3.1.2 Data description ......................................................................................... 36 
3.2 Prediction task formulation ............................................................................... 41 
3.2.1 Task types .................................................................................................. 41 
3.2.2 Features ...................................................................................................... 42 
3.2.3 Training and testing sets ............................................................................ 42 
  
 v 
 
3.2.4 Task parameters ......................................................................................... 43 
3.2.5 Methods...................................................................................................... 48 
3.2.6 Variable importance ................................................................................... 48 
3.3 Prediction results ............................................................................................... 51 
3.3.1 Rank prediction .......................................................................................... 51 
3.3.2 Classification.............................................................................................. 60 
3.3.3 Direct prediction ........................................................................................ 64 
3.4 Discussion ......................................................................................................... 73 
3.4.1 Summary .................................................................................................... 71 
3.4.2 Future work ................................................................................................ 72 
Chapter 4: Predicting topic popularity during urgent event diffusion ........................ 73 
4.1 Introduction ....................................................................................................... 73 
4.1.1 Motivation .................................................................................................. 73 
4.1.2 LDA model implementation ...................................................................... 74 
4.2 Prediction on hashtag popularity ...................................................................... 77 
4.2.1 Response distribution ................................................................................. 77 
4.2.2 Experiment factors ..................................................................................... 81 
4.2.3 Rank prediction results .............................................................................. 83 
4.2.4 Direct prediction results ............................................................................. 90 
4.3 Prediction on latent topic popularity ................................................................. 94 
4.3.1 Features and response ................................................................................ 94 
4.3.2 Prediction results ...................................................................................... 101 
4.4 Discussion ....................................................................................................... 103 
4.4.1 Summary .................................................................................................. 103 
4.4.2 Future work .............................................................................................. 105 
Chapter 5:  Conclusion.............................................................................................. 106 
5.1 Discoveries and contributions ......................................................................... 106 
5.2 Limitations and expectations .......................................................................... 108 
Bibliography ............................................................................................................. 111 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
  
 vi 
 
List of Tables 
 
Table Page 
 
 
     1.1   Twitter engagement and network formation ...........................................................2 
 
     1.2   One example of retweet from OBL set .................................................................12 
 
     2.1   15-k user data summary ........................................................................................19 
 
     2.2   OBL data summary ...............................................................................................21 
 
     2.3   OBL retweet network summary ............................................................................22 
 
     2.4   Event hashtag classification  
             average AUC scores with 100 repetitions .............................................................31 
 
     3.1   Event data sets comprehensive summary .............................................................37 
 
     3.2   User response distribution over varying T  
             among various data sets (h=5d) ............................................................................46 
 
     3.3   KS statistics of past scores  
             between various training sets and Nepal set .........................................................66 
 
     3.4   KS statistics of past scores  
             between various training sets and Blizzard set .....................................................69 
 
     3.5   Prediction further investigation  
             on selected nodes using GLM ...............................................................................70 
 
     3.6   Missing nodes count  
             under various feature sets using GLM ..................................................................71 
 
     4.1   Average topic coherent scores distribution varying 
             number of topics with 100 repetitions for the whole Nepal set ............................76 
 
     4.2   Hashtag response distribution over varying T  
             among various data sets (h=5d) ............................................................................79 
 
     4.3   KS statistics of past scores for various training sets with  
             Nepal set (top) and Blizzard set (bottom) .............................................................92 
 
     4.4   Prediction further investigation  
             on selected hashtags using GLM ..........................................................................93 
  
 vii 
 
 
     4.5   Missing hashtags count  
             under various feature sets using GLM ..................................................................93 
 
     4.6   Selected topic probabilities comparison  
             for Nepal set with T=36h ......................................................................................97 
 
     4.7   Top 10 tokens for each topic  
             from Sandy set (top) and Nepal set (bottom) ........................................................98 
 
     4.8   Selected persistent topics  
             over various periods for Sandy and Nepal sets ...................................................100 
 
  
 viii 
 
List of Figures 
Figure Page 
 
   1.1 Information diffusion by retweets on backbone of F-F network .............................3 
 
   1.2 Global level and node level network statistics  
        from a simple network ...........................................................................................13 
 
   1.3 Dynamic network features extraction illustration ..................................................16 
 
   1.4 Two types of dynamic statistics series—segmented series (top)  
        and cumulative series (down) ................................................................................17 
 
   2.1 15k_user data sets F-F distribution comparison ....................................................20 
 
   2.2 Evolution plot for OBL and background sets ........................................................21 
 
   2.3 Global level network statistics evolution plots for OBL network .........................23 
 
   2.4 Global level network statistics evolution plots comparison for three networks ....24 
 
   2.5 Node level network statistics evolution plots for OBL network............................25 
 
   2.6 Evolution plots for event and non-event hashtag networks ...................................26 
 
   2.7 Hashtag tweets text cleaning and processing .........................................................28 
 
   2.8 Hashtag K-means clustering 2D principle components representation 
        (K=5, the upper one uses the network statistics,  
        the lower one uses the text features) .....................................................................28 
 
   2.9 Hashtag K-means clustering sum of squares ratio  
        varying by number of clusters  
        (the left one uses network features, the right one uses text features) ....................30 
 
   3.1 Example of retweet cascades from Sandy set ........................................................35 
 
   3.2 F-F distribution plots for event data sets (the upper two plots  
            are from the Nepal set with TH=100 and 500 respectively;  
            the lower two are plots from  
            Irene set and Sandy set with TH=500 respectively) ..............................................38 
 
   3.3 Evolution plots for event data sets 
        (from top to bottom, the plots are for  
        Irene, Sandy, Nepal and Blizzard set respectively) ...............................................39 
  
 ix 
 
 
   3.4 A general prediction task flowchart .......................................................................43 
 
   3.5 User scores distribution with various data sets with default parameters ...............44 
 
   3.6 User scores percentile and ratio of increase distribution  
        for Nepal set with default parameters ....................................................................45 
 
   3.7 Top 10k user components analysis  
            over varying T values (h=5d) for  
            Irene set, Sandy set and Nepal set from top to bottom respectively ......................47 
 
   3.8 Rank predictions using GLM on Nepal set 
            varying methods and training sets ..........................................................................52 
 
   3.9 Rank prediction using GLM on Nepal set varying T, h and Δ ..............................54 
 
   3.10 Rank prediction using GLM on Nepal set varying the underlying networks ........55 
 
   3.11 Rank predictions on Nepal set method stability— 
            with methods GLM, RF and DL from top to bottom .............................................56 
 
   3.12 Rank predictions using GLM on Sandy set varying methods and testing sets ......57 
 
   3.13 Rank predictions using GLM on Blizzard set  
            varying methods and training sets ..........................................................................58 
 
   3.14 Rank prediction on Nepal set GLM variable importance ......................................59 
 
   3.15 Top quantile users binary classification comparison  
            for Nepal set—using classification and rank prediction ........................................60 
 
   3.16 Top quantile users binary classification using GLM 
            from rank prediction varying parameters ...............................................................61 
 
   3.17 Top quantile users binary classification  
            from rank prediction PR curves—1%, 5% and 10% from top to bottom ..............63 
 
   3.18 Rapidly increasing users binary classification  
            from rank prediction PR curves—1%, 5% and 10% from top to bottom ..............63 
 
   3.19 Direct predictions on Nepal set varying methods and T ........................................64 
 
   3.20 Direct prediction using GLM on Nepal set varying training sets ..........................65 
 
   3.21 Direct predictions on Sandy 4.5 set and Blizzard set varying methods .................67 
  
 x 
 
 
   3.22 Direct prediction using GLM on Blizzard set varying training sets ......................68 
 
   3.23 Direct prediction on Nepal set GLM variable importance .....................................69 
 
   4.1 LDA model illustrations ........................................................................................75 
 
   4.2 Hashtag response distribution with various data sets with default parameters ......78 
 
   4.3 Hashtag scores percentile and ratio of increase distribution  
            for Nepal set with default parameters ....................................................................79 
 
   4.4 Top 1k hashtag components analysis  
            over varying T values (h=5d) for Sandy set, Nepal set and  
            Blizzard set from top to bottom respectively .........................................................80 
 
   4.5 Nepal hashtag rank prediction results varying by methods, training sets and T ....83 
 
   4.6 Nepal hashtag rank prediction results by varying additional factors .....................85 
 
   4.7 Blizzard hashtag rank prediction results by varying methods and training sets ....86 
 
   4.8 Nepal hashtag rank prediction GLM method stability ...........................................87 
 
   4.9 Nepal hashtag classification results .......................................................................88 
 
   4.10 Nepal hashtag rank prediction GLM method variable importance ........................89 
    
   4.11 Nepal hashtag direct prediction results varying by methods, training sets and T ..90 
 
   4.12 Blizzard hashtag direct prediction results by varying methods and training sets ..91 
 
   4.13 Nepal hashtag direct predictions GLM method variable importance ....................92 
 
   4.14 LDA topic response comparisons across various data sets ....................................95 
 
   4.15 LDA expected versus true topic response comparison  
            across various data sets (the data sets are Sandy, Nepal and  
            Blizzard from top to bottom respectively) .............................................................96 
 
   4.16 LDA topic popularity predictions for Nepal set varying T and methods ............101 
 
   4.17 LDA topic popularity predictions for Blizzard set  
            varying T, methods and training sets ...................................................................102  
  
 xi 
 
List of Abbreviations 
 
                      
                                     
                                       
                                        
          
Abbreviations Full name 
B Baseline features 
BSD Baseline+static network+dynamic network features 
CDF Cumulative distribution function 
COO Co-occurrence 
CV Cross validation 
DL Deep learning 
F-F network Following follower network 
GLM Generalized linear model 
LDA Latent Dirichlet Allocation 
LSCC Largest strongly connected component 
LWCC Largest weakly connected component 
RF  Random forest 
RMSE  Root mean square error 
RT Retweet  
SD Static network+dynamic network features 
  
  
 1 
 
Chapter 1: Introduction 
 
1.1 Background 
1.1.1 Motivation 
In recent years, the emergence of popular social media platforms, such as Facebook 
and Twitter, has largely changed the way people communicate and interact with each 
other. These online social networks, not only provide global platforms for users 
around the world to share information and express opinions, but also generate 
massive amounts of unstructured data resulting from the human-to-human 
interactions. These huge amounts of data are still far from been fully utilized, which, 
at present, creates great challenges for people or organizations to use the data to make 
better and intelligent decisions (Evans, 2010). One crucial area still to be explored is 
the study of information diffusion (spreading) on these platforms, which has great 
business potential and marketing implications (Evans, 2010) (Chen, 2010) (Stieglitz, 
2013)  
 
Information diffusion can be considered a subtopic of the study of general diffusion, 
and it includes a broad range of research directions. It can be classified into different 
sub-categories depending on the research context—exploratory (Zhou, 2011) vs. 
predictive analysis (Kupavskii, 2012), parametric (with specific concrete diffusion 
models) (Galuba, 2010) vs. non-parametric (empirical) studies (Lerman, 2010),  
diffusion mechanism study (Romero, 2011) vs. user influence study (Bakshy, 2011) 
or topic popularity study (Ma, 2013) etc.  As to the different platforms themselves—
whether they are large social media platforms or only restricted small social 
networks, it is not surprising to find that many research results are generalizable and 
can be directly applied to one another (Borge-Holthoefer, 2013). This is because 
although these platforms do possess diverse properties, they share much more in 
common—which can be summarized as general principles of information diffusion 
(more details in the next section).   
 
For my work, I concentrate on one small aspect of this large topic—information 
diffusion study during urgent events on the well-known social media platform—
Twitter. Here, urgent events are defined as any large event that results in rapid and 
large-scale diffusion. More specifically it may include the following cases:  
 
1. A large emergency or crisis—such as natural disasters, terrorist attack, or any other 
types of events that will cause large-scale public concern.  
 
  
 2 
 
2. General world-wide breaking news that attracts great public interest.  
 
3. Other types of explosive information that can spread over with comparable scale 
and speed, such as diffusion of news about notable brands.  
 
 
 Although in my study I mainly use the data from the first category, I believe the 
results should hold consistently for other types of urgent events. This is because as 
stated above, all these urgent events share common diffusion patterns—vast 
spreading speed and large spreading scales regardless of the types of urgent event. I 
hypothesize that all the tools I have developed can be easily migrated.  
 
However those special properties of urgent events diffusion will add significant 
difficulty on the predictive analysis of the data, since an early period prediction will 
be necessary for stake holders to respond in time and to make proper decisions based 
on the prediction results. This is the main emphasis of my work, which includes 
prediction on user influence and topic popularity during the urgent diffusion.  
1.1.2 Why retweets on Twitter  
Being a microblogging service with over 300 million monthly active users, Twitter is 
commonly used to propagate information using short text messages. Though the 
conversations occurred online, they can provide deep insights in how generally 
people behave and offer enough information to understand human behavior across a 
variety of fields (Bollen, 2011) (Borondo, 2012). Especially for urgent events or 
crises, Twitter has become a powerful medium that propagates news globally at a 
high rate. Therefore, I use information diffusion on Twitter as my study case to give 
practical implications and to develop reasonable decisions or strategies. 
 
Engagement Functionality Network 
Follow others Build friendship Following-follower 
network 
Create new tweets Start new conversation No network directly 
formed 
Mention others in tweets Involve in conversation Mention network 
Retweet others’ tweets Spread information Retweet network 
 
Table 1.1 Twitter engagement and network formation 
 
Table 1.1 gives different types of engagement that a user can participate in on 
Twitter. Posting tweets is the content generating step, but how much the content will 
be exposed will depend on the following-follower network (F-F network in short for 
simplicity) and the privacy settings (a user can control whether his or her tweets can 
be seen by the general public). This is why study of the F-F network has been the 
mainstream research area (Kwak, 2010) (Galuba, 2010): it more or less controls the 
diffusion potentials of new content, in most situations mentioning and retweeting can 
  
 3 
 
only occur on the backbone of the F-F network (Figure 1.1). But this can also be the 
reason why study of F-F network alone is not enough—it just provides the backbone 
for the diffusion not the actual diffusion pathways. Moreover, research shows that 
using number of followers as the user influence measure could be misleading (Cha, 
2010) , where large divergence has been observed between user influences measured 
number of followers and number of retweets, and the latter measure is usually more 
meaningful in practical applications.   
 
 
 
Figure 1.1 Information diffusion by retweets on backbone of F-F network (Morales, 2014) 
 
 
As to retweeting, it is a typical Twitter behavior where users can share tweets that 
they believe are interesting or important to their followers or general public (Table 
1.1). Under the context when real life events occur, this behavior can serve to spread 
important information through the whole Twitter network (Kwak, 2010).  
 
However, retweeting is not the only way to diffuse important information on 
Twitter—other engagement behavior such as tweets with links or hashtags and 
mentions are also popular alternatives for retweeting (Galuba, 2010) (Bakshy, 2011) 
(Rattanaritnont, 2012).  But compared to mentions, URLs and hashtag tweets, 
retweets have the following distinct properties:  
 
1. Retweets can be considered as the minimum pathways for information diffusion. 
Perhaps not all informative tweets are retweeted by the users, but people choose to 
  
 4 
 
retweet usually when they believe the focal tweet is more important and informative. 
This argument seems to be how some tweets gain a high number of retweets.  Thus 
retweets should highlight tweets which are crucial in information spreading. Using 
retweets is a simple and straightforward way to capture the underlying diffusion 
pathways in the lower bound sense.  
    
2. Retweets can spread information regardless of the origin—they are not restricted 
by the forms of the tweets. Tweets that include URLs, hashtags or simply plain text 
are all able to deliver useful information as long as they are generating retweets. 
Though from my data I observe people do use more URLs and hashtags during urgent 
diffusion, there are still many examples of informative tweets—which got a lot of 
retweets and do not include URLs or hashtags: for example, tweets like 
"HURRICANE WATCH for CT shoreline!!!! Just issued!" occur in the hurricane 
data sets. On the other hand, hashtags or URLs can often be the breeding ground for 
all types of spam messages (Grier, 2010) (Thomas, 2011)  
 
3. Another advantage of using retweets compared to tweets with URLs or hashtags is 
that retweets can provide solidly tractable pathways for information diffusion models 
(Zhou, 2011) (Hong, 2011) (Kupavskii, 2012). While for URLs and hashtags tweets, 
an actual diffusion model that defines the information pathways is needed to do the 
analysis—the diffusion  models defined on chronological adoption of URLs for 
example (Bakshy, 2011). Also as mentioned in the same literature, there are situations 
where chronological adoption of URLs or hashtags does not imply influence at all--
when users could just cite them independently, while in contrast retweets with URLs 
or hashtags can be considered as stronger indicators of influence than these merely 
chronological influence assignments. Moreover, the users' detailed following-
follower structure is often necessary for these influence models definition, which is 
generally not available especially for large number of users.    
 
4. Compared to the other alternative diffusion pathways—mentions, retweets are 
considered as a subset of them in the Twitter context. Thus mentions still possess 
those advantages over URLs and hashtags, but the problem with mentions is that they 
also include casual mutual conversations among users, which cannot be accounted as 
really informative tweets.  
 
Because of these reasons, in my study I focus on retweets for the information 
diffusion study. I understand that they may not reflect the actual diffusion pathways, 
i.e., since users can communicate outside the Twitter platform. Actually I am seldom 
fully exposed to the ground truth information diffusion pathways unless I am keep 
tracking of all users and all of their communication, which is impossible for big 
urgent event diffusion on Twitter—a world-wide information exchange platform. 
Therefore I believe using retweets is the currently best available choice to keep track 
of and study the information diffusion on Twitter.  
  
 5 
 
1.2 Related work 
1.2.1 Information diffusion study  
Studies concerning information diffusion in general can be divided into descriptive 
analysis and predictive analysis, and information diffusion modeling usually takes 
both aspects into consideration. My work also contains both descriptive and 
predictive analysis during urgent events diffusion.  
 
In theory, information generally propagates in the form of information cascades 
through various networks. According to Bikhchandani's defination, the propagation 
results from actions of individuals to follow behavior of preceding individuals 
without regarding to their own information (Bikhchandani, 1992). Different types of 
information cascades occur under different networks (Zhou, 2011) (Borge-
Holthoefer, 2013) (Cheng, 2014). As to Twitter particularly, the information cascades 
are usually the retweet cascades or retweet trees (Kwak, 2010) (Zhou, 2011), though 
there are also more complex definitions (Wu, 2011) (Hui, 2012) (Taxidou, 2014).  
 
The descriptive studies of various networks--blog graph (Gruhl, 2004), Facebook 
photo sharing graph (Cheng, 2014), Twitter retweet graph (Borondo, 2012) and even 
more general social networks (Borge-Holthoefer, 2013) have shown that information 
cascades share common properties: one property is the broad but very skewed size 
distribution, which usually resembles the power law distribution; the other one is the 
relative shallow depth and tree-like topology. With these resemblances under various 
contexts, the methods can usually be used interchangeably regardless of the 
underlying networks and diffusion models. Previous research explored information 
diffusion using various models: an epidemic modeling approach is the classic way to 
study the general diffusion process and includes diffusions other than information 
diffusion (Gruhl, 2004) (Khelil, 2002); there are also other nonparametric influence 
models such as the threshold model (Karimi, 2013) and the linear influence model 
(Yang J. &., 2010). However, many of these diffusion models require knowledge of 
the underlying affinity network—in the Twitter case it would be the Following-
Follower (F-F) network, which provides the backbone for potential diffusion to occur 
and is not available in the data sets used for my prediction study (more details in data 
description). Yang (Yang J. &., 2010) built an influence model without a prior 
knowledge of the F-F network and was able to make good predictions about 
influence. But his approach does not fit well into my prediction task because i) He 
studied diffusion of various hashtags instead of diffusion of a specific event;  ii) He 
used data sets that span a long period of time but I only have limited data information 
for the prediction since it is based on an urgent event diffusion.   
 
Despite of the model differences, all the studies more or less use the temporal 
properties of diffusion—e.g. the dynamics of the associated information cascades. 
Given the context of my problem, a very relevant work on information diffusion 
study is from Rogers (Rogers, 2014). She did the diffusion study through exploring 
properties of the retweet chains—e.g. dynamics and topology. Moreover she used the 
  
 6 
 
same 15k_user data sets (more details later) as I do for the diffusion tracking study. 
But she did not perform any prediction analysis and she did not use the underlying 
retweet network properties to characterize the diffusion process, which should have 
both high explanatory power and high predictive power. I attempt to utilize all 
available network properties (more details in later section) to reveal elaborate 
dynamics of the diffusion that are not easily exposed using previous approaches. Thus 
the diffusion study in this dissertation can be considered as study of temporal 
networks (Holme, 2012), and I adopt the approach "representing temporal data as a 
static graph" mentioned in the paper.  
 
1.2.2 Prediction during urgent events diffusion  
As to urgent events, crises or emergencies, Palen and colleagues developed 
comprehensive studies (Hughes, 2009) (Vieweg, 2010). But the majority of his work 
focuses on descriptive analysis—information extraction, social impact and user 
behavior etc. Due to the high spread rate and large spread scale of information during 
urgent events, prediction is a challenging task, but at the same time it also 
demonstrates great application potentials—e.g. providing recommendations to stake 
holders who are interested in the urgent diffusion process. Though little previous 
work exists on urgent event diffusion prediction, I can still draw useful lessons from 
previous related work, and then make proper accommodations. As noted in the first 
section, the prediction tasks include user influence prediction and topic popularity 
prediction. However, based on both previous literatures and my discoveries they are 
very closely related from each other (especially when the topics are referred to as 
hashtags in Twitter), many of the models and frameworks can actually be used 
interchangeably.   
 
User influence characterization and prediction has always been one of the central 
topics in the field of general diffusion study—not just restricted to information 
diffusion. Strict definitions of diffusion context and influence are the first things to 
consider in developing any study on user influence. Since there is always a certain 
underlying network that can have the diffusion procedure run through, and nodes 
within the network will represent the actual users, thus various network centrality 
scores from that network can be directly used as the user influence measure.    
 
Since the definition and quantification of user influence is contextually subjective, I 
have to settle a ground truth measure for my prediction study. I use the size of 
information cascades—the retweet cascades to be the golden-rule measure, which is 
considered as a standard empirical influence metric for information seed users 
(Kleinberg, 1999) (Kupavskii, 2012). My fundamental question is which centrality 
metrics have good prediction power for this influence measure. Various classical 
network centrality measures (degree, PageRank, HITS scores etc.) (Kwak, 2010) 
(Cha, 2010) (Galuba, 2010) (Rattanaritnont, 2012) , or additional derived measures 
(Benzi, 2013) (Laflin, 2013) (Mantzaris, 2013) could be used as the potential 
predictors. However, there is no comprehensive study to predict node influence using 
those different network centralities features derived from the underlying diffusion 
  
 7 
 
network—the retweet network in my case (more details in next section). I believe 
these various centrality scores derived from the diffusion network should be closely 
correlated with users' retweet cascades size and thus serve as potential effective 
predictors.  
 
Previous research has shown prediction concerning information cascades. For this 
work, I just pick several examples for illustration and comparison purposes.  Bakshy 
(Bakshy, 2011) created an overall prediction on average size of cascades formed by 
root users using a decision tree, but the prediction performance is relatively poor due 
to the highly skew distribution of cascade size, high variability among users behavior, 
according to the author. Noticing this fact, I do not plan to make general prediction 
for all nodes, but only focus on "important" ones. This is a practical consideration 
since the majority of nodes involved in the network will be inactive in the future and I 
only need to keep track of those really influential ones. My prediction framework 
starts with this framework as a foundation, and then implements new network 
features in addition to those have been adopted in the paper. Kupavskii (Kupavskii, 
2012) did a thorough search of possible features and obtained decent results on 
retweet cascades prediction, which provides a broad list of potential features to be 
used. But what he was trying to predict are cascades size formed by specific tweets, 
not by users, thus what he is actually doing is predicting tweet influence instead of 
user influence. Thus, some features he employed—especially the tweet content 
features, are not usable in my case.  
 
Cheng (Cheng, 2014) avoided direct prediction about the cascade size and 
innovatively switch the problem to examine whether each cascade will grow over the 
medium size or double its current size. His work provides insights by discovering 
significant features in controlling growth of cascades, which is also enlightening in 
identifying potential influential users who will double their scores in the future. But 
my main goal is not on the predictability of cascade growth and how the effects of 
various features change with evolution of cascades. It is less relevant for me to know 
the detailed growth procedure or whether a cascade will double its size in the future, 
since the majority of cascades turn out to be small, I only care about nodes that are 
going to be "important" in the near future—top tiers nodes in the influence scores 
rank. Also I am studying Twitter retweet network instead of the Facebook photo re-
sharing network, thus some features are not applicable to my study. Hofman and 
colleagues (Martin, 2016) (Hofman, 2017) had a thorough discussion on the 
predictability of the problem. He mentioned that there was unavoidable inherent 
variability within the prediction problem even perfect information was given. He 
suggested unifying the standard for the prediction and evaluation. He also stressed 
that prediction and interpretation were both important and should be viewed as 
complements to each other. While his work does have profound implications, it was 
of less concern to my goal, which was to find the best available solution for the 
prediction problem given the limited information.  
 
Simmie (Simmie, 2014) applied a hidden Markov model to rank user influence, 
where he combined information from users' social context--following-follower 
  
 8 
 
network information and direct tweets observation together. However, his model is 
not suitable for my problem because it requires full social link structure of all users—
which is not available in my data sets. And it is challenging to collect this information 
in urgent diffusion case even if I acquire the ability to do so—the diffusion spreads 
rapidly and the corresponding users graph grows exponentially. One way out would 
be to narrow the scope only to a much smaller group of users, but the problem is how 
to select those users and collect their full social links with only limited information 
given due to the urgent diffusion reality, let alone the fact that the user influence can 
vary much over time—meaning users selected before may not represent the true 
influential users in the population. Moreover, estimation or selection of parameters 
involved in the hidden Markov model requires a large amount of data to obtain 
meaningful results, which, again, is not feasible in my study. Finally, as the author 
noted, the scalability would be an issue. Work done by Galuba (Galuba, 2010) 
illustrates the classical parametric modeling approach using popular epidemic models 
characterize cascades and making prediction about next hop probability. But again it 
required information from all users' follower structure. Significant work has been 
done on predicting the tweet popularity—number of retweets a tweet can get (Hong, 
2011) (Jenders, 2013) (Petrovic, 2011). Although these studies have profound 
pioneering implications, they are not about predicting influential "nodes". And they 
put much value on content features, which are features derived from single tweets and 
are difficult to aggregate onto the user level, thus makes them not feasible for user the 
prediction.  
 
Topic prediction is another area that is well researched. The topic is closely related to 
the content of the diffusion and refers to be a relative stable and consistent collection 
of relevant content (Lambrecht, 1996) (e.g. tweets in Twitter), thus topic study is 
usually involved in the field where users are spreading collections of content or 
information that can be characterized and properly clustered.  Therefore under the 
context of information diffusion, Twitter is the ideal platform to perform topic 
analysis with the actual text in the tweets serving as the content to be grouped.  
 
In the Twitter context, one straightforward topic definition is simply using the 
hashtag (see https://support.twitter.com/articles/49309 for hashtag usage 
recommendation from Twitter): which should include the keywords or summary of 
the corresponding tweets, thus can be used to categorize tweets and describe a topic 
or theme. There have been quite a few successful studies about predicting popularity 
of hashtags. One very well cited work is from Tsur (Tsur, 2012), where the author 
predicted weekly hashtags volume using LASSO linear regression. The author 
adopted a mixture of wide range of features including hashtag content, hashtag tweets 
content, user F-F network features and some basic temporal features.  While many of 
these features are readily available in my data, some others are hard to obtain 
considering the early time prediction demands for my task. Moreover, the prediction 
uses training set lasting a long period of time (several months) on a coarse time 
granularity—weekly volume, which clearly is not suitable in my study.  
 
  
 9 
 
Another well-known research is done by Ma (Ma, 2013). He performed a thorough 
feature exploration for newly emerging hashtags popularity prediction. He modified 
the problem to a multi-class classification task to avoid direct prediction and make a 
comprehensive comparison among a wide range of feature sets for their predicting 
effectiveness. He did doing the prediction on daily basis, which is a finer time 
granularity and similar to my task. He also creatively adopted some context features 
derived from users mention network to improve the prediction, while in my work, I 
use similar ones from the retweet network. Thus his work is much more relevant to 
my purpose, but still does not precisely fit my situation. Firstly, the time frames and 
scales of the data are different.  The data the author used are restricted to Singapore 
based users and have a time span of several months; while the data I have access to 
are generally world-wide tweets about specific urgent events and usually only proper 
about 1-2 weeks. With the long-term and relative homogenous (same set of users) 
data in hand, instead of a strict time based prediction—which is necessary for my 
implementation, the author simply performs a 10 fold cross-validation (CV) over 
hashtags occurring at different time stamps. This research explores the predictive 
power of wide range types of features on newly emerging hashtags prediction; but it 
is not in good alignment with what I aim to achieve: a time-based early-stage 
prediction with only limited information available within a short period of time. Also 
the author defined the number of users adopting a given hashtag as the popularity 
measure, but I decide to use the number of retweets a hashtag obtained as the target 
response to characterize the hashtag ability of spreading information under the urgent 
information diffusion context. This was also due to the differences between the goals 
of the prediction task. Moreover, granted the distinctions on data and prediction task, 
many features adopted by the author became either unnecessary or unavailable—e.g. 
the border user definition and the corresponding exposed vectors were less relevant 
given retweets as the target response. There were also other features not available for 
other reasons—using information from the full data set (the clarify scores) or 
requiring extra effort (hashtag lexical features).  
 
As to more general and abstract topics, while considerable literature focuses on 
various Twitter topic modeling or detection techniques (Cataldi, 2010) (Hong L. a., 
2010) (Mathioudakis, 2010) (Ramage, 2010) (Godin, 2013), fewer studies are seen on 
predicting the continuous popularity of these topics. Potential reasons might be the 
prediction task is not as intriguing as the topic modeling or other popular prediction 
problems—prediction of tweets popularity for example. For the actual topic definition, 
one state-of-art topic modeling to apply is Latent Dirichlet Allocation (LDA) (Blei, 
2003). Many topic modeling studies simply adopt it or other variants of LDA models, 
including the ones mentioned above. However, LDA may not be the most suitable 
topic modeling method in my case. Yang (Yang, 2014) has a very good discussion 
about large-scale and high precision topic modeling in Twitter. Finding the best topic 
models is not the focus of my study. I use the LDA model simply as an extension 
study of the hashtag prediction in order to see how the prediction will vary between 
straightforward hashtag defined topics and more abstract model defined topics. Thus,  
in this study, I just use LDA to model the latent or general topics and then attempt to 
  
 10 
 
predict the popularity of the generated topics as weighted aggregation of documents 
(to be defined) popularity.  
 
In short, although my proposed prediction tasks, the focus of my discussion in this 
study, share things in common with previous studies, these tasks still exhibit unique 
properties and thus require novel solutions. 
1.3 Data sets 
There are two types of data sets involved in my study: 15k_user data sets and event 
data sets.  
1.3.1 15k_user data set 
All of the data are collected through Twitter API (for more details see 
https://dev.twitter.com/overview/api), where users can specify various conditions 
(user status, key words, timestamps, locations etc.) to obtain the corresponding tweets 
information. The analyses are performed either with offline or online—streaming 
mode with different API options. Once the conditions for the queries are settled, the 
API server will responds with the query request—for offline mode it will return the 
batch data, for online mode it continuously will update with streaming data.  
 
The data obtained contain all basic information for each tweet such as: status_id 
(unique identifier for each tweet), timestamp, text content, user_id (unique identifier 
for each user), user_name, basic social profiles of the user, links and hashtags 
information of the tweet etc. For my study, I have collected two types of data sets—
the 15k_user data and events data sets.  
 
The 15k_user data sets collect all tweets, followers and friends from a selected group 
of 15k users (Swaroop, 2014) with a period lasting about 2-3 months. The main 
motivation for the collection of 15k_user data sets is to find representative subsets of 
the whole Twitter universe—including all users, tweets and the associated network 
structure. Although collection of Twitter universe over a fixed period of time is not 
infeasible, but it does require additional cost from the Twitter API and often is not 
necessary for many applications. According to Swaroop (Swaroop, 2014), the 
following protocol is used to select the 15k users:  
1. Find the first active user and add it to the active_user_list 
2. Obtain followers of the active users, add them into the user_list 
3. Iterate user from the user_list and determine if it is an active user, if yes repeat step 
1 and 2.  
4. Repeat the above user picking steps until the size of the active_user_list reaches 
15k.  
 
In the above steps, an active user is defined to satisfy: i) Having at least one retweet 
within his or her last 100 tweets; and ii) Having the tweet frequency not less than 1 
tweet per day within his or her last 100 tweets. After the 15k active users have been 
  
 11 
 
selected, all of their activity (tweets), followers and friends are also obtained within a 
period of 2-3 months.  
 
Two sets of 15k_user data are obtained for my analysis—15k_2011 and 15k_2012, 
which are collected from different months in 2011 and 2012 respectively, but the 
users are shared between the two sets. Since the collection only specifies users and 
does not put restriction on the tweets content, the data include everything the users 
talked about—news, real time events, normal chats, or any other conversations. This 
is in contrast to the event data set, which only includes tweets about a specific event.  
 
The 15k_user data is mainly used for the study in Chapter 2, where I describe the 
generation of a subset of 15k_2011 set—the OBL data set. This data set is about the 
real big event—death of Osama Bin Laden and is used as the sample set for the 
diffusion tracking study.  The OBL set is extracted from the 15k_2011 set using 
regular expression with some key terms derived from his name—Osama, bin laden 
etc. More details of the data are described in later sections.  
1.3.2 Event data set  
The event data sets have similar structure and data fields as the 15k_user data; but 
now the tweets are confined by selected key terms instead of selected users. These 
event data sets are used for the prediction study described in Chapter 3 and Chapter 4. 
With certain conditions specified—keys words, location or other restrictions by the 
end users, the Twitter streaming API is expected to return a collection of tweets 
fitting into those conditions. But according to the Twitter API document, the volume 
of the returning collection cannot exceed the volume of a small fraction of total public 
tweets stream at the same time. This is saying that if the volume of the target tweets 
collection is larger than the volume of a certain fraction of the total public tweets 
stream at the same time, only a sample of the target collection will be returned; 
otherwise, the full collection should be returned. Even under the condition that only a 
sample of target collection is returned, the sample itself usually already constitutes a 
huge amount of data considering the full volume of the public tweets stream, thus the 
effect of sample bias introduced in this way is limited. But Twitter also provides 
service to remove this volume cap with additional financial cost if that is desired.  
 
Once an urgent event is identified, the streaming collection can be started using key 
terms specifying that event. Following this, 4 event data sets are collected, each of 
which is concerning a crisis event: Hurricane Irene in 2011, Hurricane Sandy in 2012, 
Nepal earthquake in 2015 and Jonas blizzard in 2016.  Simple key words are used to 
track the tweets—for example the following terms are used to grab the Irene data set: 
#irene, irene, #hurricane, hurricane, #hurricaneirene, where the prefix "#" denotes 
hashtags in Twitter.  This simple approach will inevitably catch irrelevant tweets and 
does not include all tweets talking about the events, but I believe the level of noise is 
low and the volume of the collection is large enough so that these bias factors do not 
affect the final results. I will come back to discuss more details on the data sets in 
later sections.   
 
  
 12 
 
1.4 Network analysis 
1.4.1 Retweet network  
As mentioned in the previous section, retweets can provide valuable insights for 
information diffusion study on Twitter. Thus the network formed by retweets can be 
considered as the approximate to the underlying diffusion pathways. Moreover, the 
time stamps information associated with the retweets can be used to investigate the 
temporal properties of the diffusion. In contrast to the other often studied network—
the F-F network, the retweet network not only better represent the diffusion pathways, 
but can also provide rich dynamic information about them. The definition of the 
retweet network is straightforward: if user A retweet user B, then I consider there is a 
directed edge formed between user A and user B (with the timestamp of the retweet 
as the edge attribute), thus a directed network is formed.   
 
Table 1.2 shows one example of retweet with several common data fields. Generally 
a retweet can be captured by either the retweet indicator or using the text regular 
expression "RT @username". However as seen in this example the retweet indicator 
does not work and cases like this are not in a small number, thus I decide to use the 
regular expression to capture all the retweets. In the above example I will form a 
directed edge from user GLB62 to user keithurbahn.   
 
From now on, network will refer to the retweet network by default unless otherwise 
specified.   
 
 
Date Status_ID User_I
D 
User_na
me 
Timesta
mp 
Text Retwe
et 
indicat
or 
5/1/20
11 
22:27 
6487865544423
4200 
149444
71 
GLB62 1304303
251 
WOW 
RT@keithurb
ahn: So I'm 
told by a 
reputable 
person they 
have killed 
Osama Bin 
Laden. Hot 
damn. 
0 
 
                         Table 1.2 One example of retweet from OBL set 
 
  
 13 
 
1.4.2 Related network statistics 
Given the retweet network, I have chosen several important network statistics for the 
study. Since the network is directed, I choose network statistics that are properly 
defined for directed network.  
 
 
 
Figure 1.2 Global level and node level network statistics from a simple network 
 
 
Figure 1.2 gives a picture illustration of the network statistics involved in my study. A 
simple directed network with 5 nodes and 6 edges (each arrow in the graph is a 
directed edge) is used to explain various network statistics. In summary the network 
statistics studied can be classified into two categories: global level statistics and node 
  
 14 
 
level statistics. Global level statistics refer to statistics defined on the global or the 
whole network level and node level statistics refer to statistics defined on the node or 
user level.  
 
More specifically, global level statistics include the followings:   
1. Average degree: since an edge is directed, each link can be considered as an 
inbound link or an outbound link depending on which node is considered as the 
reference; but the average degree over all the nodes will be the same no matter 
inbound links or outbound links.  
 
2. Density: it is a measure for how complete a network is connected—the number 
fraction of connected edges over the maximum possible edges.   
 
3. SCC (WCC), LSCC (LWCC) and reciprocity: SCC and WCC are strongly 
connected components and weakly connected components respectively. For directed 
network a strong link exists between two users if two users can be connected in both 
directions, while it is a weak link if the connection only exists in one direction. All 
nodes that are strongly (weakly) connected from each other will form a strongly 
(weakly) connected component. LSCC (LWCC) is largest strongly (weakly) 
connected component, which is the fraction of size of largest strongly (weakly) 
connected component over the total size (number of nodes). Reciprocity is simply the 
counting fraction of edges that are reversible (bi-directional).   
 
4. Global transitivity (clustering coefficient) and diameter: global transitivity counts 
the fraction of connected triangles over all connected triples, here I only consider 
undirected connections for simplicity and isolated nodes are not counted. The 
diameter is the longest path length (length of shortest path) between two nodes 
existing in the graph, also only count undirected connections for simplicity.  
 
For node level statistics:  
1. Indegree and ourdegree: simply counts of inbound links and outbound links for 
each node.  
 
2. Authority and hub scores, eigenvector and pagerank: these are all centrality 
measures calculated based on adjacency matrix of the graph, which can be considered 
as "deeper and extended" measures instead of one-step measures of indegree and 
outdegree.  
i) Authority and hub scores are computed recursively based on each other, which 
characterize the value of a node by inbound links and outbound links respectively 
(Kleinberg, 1999).   
ii) Eigenvector centrality is derived from the principle eigenvector of adjacency 
matrix, and again the direction of edges is ignored in the calculation due to sparsity 
problems of the graphs.  
iii) Pagerank is a popular centrality measure to quantify importance of web pages, 
what makes it different from the rest three is that it has an external parameters to be 
  
 15 
 
determined—the damping factor. This factor is used to characterize the decaying 
influence over long-ranged links, which is usually set to be 0.85 for convention.       
  
3. Local transitivity (clustering coefficient): similar to the global transitivity, it counts 
the fraction of triangles over connected triples from a given node. Isolated nodes are 
set to be zeroes and direction is also not considered.  
 
4. Betweenness and closeness centrality: these two measures are related to shortest 
paths among nodes. Closeness centrality is proportional to inverse of the average path 
length (length of shortest path) between the given node and the rest ones. 
Betweenness centrality simply counts the number of shortest paths through a given 
node.   
 
All the above statistics are computed using the R package—igraph. Firstly the graph 
is constructed from an R data frame formed by the edge-list. Then different network 
statistics are computed using different built-in functions included in the igraph 
packages, further implementation details are illustrated in the manual of the package.      
 
1.4.3 Dynamic network statistics extraction 
In order to study the dynamic properties of the diffusion process, instead of using just 
the aggregate (static) version of all these statistics, I also need to take statistics from 
snapshots of temporal networks based on the timestamps. The idea is straightforward, 
and Figure 1.3 demonstrates the statistics extraction steps from a simple network. 
 
In Figure 1.3, I assume to have the network below within time period [0, 𝑇], and the 
edges are formed within each sub-period: e.g. user d retweet user c and user e retweet 
user c within sub-period [
1
3
𝑇,
2
3
𝑇]. Then within each sub-period, a sub-network is 
formed and the corresponding statistics are extracted, I will get one set of statistics 
within each given time interval and in the sections of this dissertation, I just refer 
these statistics as dynamic network statistics, while the statistics derived from the 
whole period [0, 𝑇] (the full network) will be called static statistics—implying they 
do not contain the temporal information of the network.   
 
 
 
  
 16 
 
 
Figure 1.3 Dynamic network features extraction illustration 
 
 
The network truncation time 𝑇 defines the full network I would like to study. With 
the additional parameter—the time window size (denoted as Δ), which is 
1
3
𝑇 in my 
example, I can control how transient or persistent temporal information I want to 
know from those dynamic statistics.  
 
The example in Figure 1.3 just shows one way of defining the dynamic statistics, 
which I call it segmented dynamic statistics. The other way to extracting the dynamic 
statistics is to define the cumulative networks over time (Figure 1.4) and I can obtain 
the cumulative dynamic statistics.  
 
  
 17 
 
                        
 
Figure 1.4 Two types of dynamic statistics series—segmented series (top) and cumulative 
series (down) 
 
The potential problem with the cumulative series is that the statistics series are 
correlated (future values depend on past values) and it is not good for predictive 
analysis—the correlated features will cause problems especially for linear models. 
Thus I will stick to the segmented series from now on unless otherwise specified. But 
there are still situations when I need to use the cumulative series: one is when the 
segmented sub-graphs are very sparse and cause the segmented statistics less 
informative or even not valid; the other may just be I do want to incorporate the 
cumulative effects into my observation.  
1.5 Outline of work 
In this dissertation I want to demonstrate how static and dynamic network properties 
of the retweet network reveal valuable insights of urgent event diffusion on Twitter, 
the outline of the work is listed as follows: 
 
 In Chapter 2 I focus on how I use these network statistics to perform real time 
tracking of urgent event diffusion and how they can shed lights on various temporal 
patterns of the diffusion process. Moreover, I make a simple attempt to compare 
whether these statistics can differ among different types of popular topics on 
Twitter—trends (Mathioudakis, 2010), which may not be real life events. All the 
analysis done in this chapter used the 15k_user data since only these data include 
trends of various types and is not restricted to real life urgent events.  
 
In Chapter 3, I describe how I aim to spend my effort on one important problem in the 
field of information diffusion study—the user influence prediction. I take the size of 
  
 18 
 
retweet cascades as a measure of influence and established an early time prediction 
framework using a "cross prediction" mode to make effective prediction on the user 
influence within only limited amount of prior information given. I also study how 
various experiment factors affect the prediction performance and I reach relatively 
consistent results. In this study, I use the event data sets since 15k_user data only 
include a small subset of the whole diffusion of the event.  
 
In Chapter 4, I move to the prediction of popular topic during urgent event diffusion.  
The same data sets are used and similar prediction framework as the user case was 
applied. I adopt two types of topics to perform my prediction tasks: self-defined 
topics, i.e.,—hashtags, model defined topics, i.e., —latent topics. I observed similar 
results with the topic prediction as to the user case. 
  
Chapter 5 is a comprehensive conclusion of the whole study and gives promising 
pathways for future extension.  
 
 
 
  
 19 
 
Chapter 2: Tracking urgent event diffusion on Twitter 
 
2.1 Introduction 
2.1.1 Motivation 
The retweet network is suitable to explore the temporal patterns of information 
diffusion on Twitter. Given all the information provided by the Twitter API, the 
retweet network can serve as a real time monitor of information diffusion on 
Twitter—not just for diffusion of events, but for diffusion of any popular topics—
which are called trends on Twitter. By viewing the event diffusion process as 
expansion of the retweet network, the diffusion can be tracked and analyzed with 
retweet network analysis. In this Chapter, I show that the dynamic network statistics 
derived from the retweet network can reveal detailed dynamics of urgent event 
diffusion. Moreover, these statistics can also be used to differentiate hashtags of 
different types—e.g. hashtags related to real life events or hashtags related more 
general topics on Twitter.  
2.1.2 Data description 
For the diffusion tracking study, I mainly use the OBL data set from the 15k_user 
data as my sample data set. Similar results are derived from the event sets.  
 
Data 
set 
No.twe
ets 
No. 
user
s 
No.retweet
s 
No. 
mentions 
No.tweets 
with URLs 
No.tweets 
with 
Hashtags 
15k_20
11 
109792
80 
146
23 
2426257(0.
221) 
6879875(0.
627) 
3510811(0.
320) 
2154061(0.
196) 
15k_20
12 
785058
3 
120
43 
1806645(0.
230) 
4577257(0.
583) 
3487325(0.
444) 
2327301(0.
296) 
 
                                       Table 2.1 15-k user data summary 
 
Table 2.1 gives a summary for the two 15k_user data sets I have analyzed, the time 
frames are from 25 Apr 2011 13:24:57 GMT to 17 Aug 2011 01:23:22 GMT for the 
2011 set and from 07 Sep 2012 17:00:04 GMT to 11 Dec 2012 11:56:23 GMT for the 
2012 set. The numbers in the parentheses represent the corresponding fractions over 
the total number of tweets. Not all users are active within the periods. While fractions 
of retweets and mentions remain stable, the fractions of URLs and hashtags have 
increased—implying users are getting more used to links and hashtags.  
  
 20 
 
 
Figure 2.1 shows the overlay comparison of F-F distribution between the two 
15k_user sets, only users (11966) from the intersection set of 15k_2011 users and 
15k_2012 users are shown in the figure. Most users do not change their behavior 
much, though some users shift a little bit. The number of tweets has generally 
increased. For both 2011 and 2012 sets there is an unusual turning slope, due to the 
Twitter restriction on the number of friends a given user can follow—the limit of the 
friends depends on the number of followers a user has. From the plot it seems that 
this is simply a linear relationship and it is possible to infer the approximate linear 
equation from the data.  Also a user is likely to post more tweets with higher number 
of followers, that could be because that posting more tweets and having more 
followers could have a reinforcement effect on each other.   
 
           
 
                        Figure 2.1 15k_user data sets F-F distribution comparison 
 
Table 2.2 is a comparison table for the OBL data and the background set—which is 
simply a mixture of everything (all tweets) from the 15k_user data but with the same 
time frame as the OBL set: from 02 May 2011 02:27:10 GMT to 04 May 2011 
02:23:51 GMT.   
 
 
  
 21 
 
Data set No.tweets No.retweets No. mentions No.tweets 
with URLs 
No.tweets 
with Hashtags 
Background 470438 114206(0.243) 288058(0.612) 174262(0.370) 120052(0.255) 
OBL 28743 12815(0.446) 16338(0.568) 14076(0.490) 7666(0.267) 
                                  
                                             Table 2.2 OBL data summary  
 
The OBL set, which is the representative for the event data set, has significantly 
higher fractions of retweets and URLs compared to the background—implying more 
retweets and links occur during urgent event diffusion compared to the normal-level 
Twitter activity. While the components of hashtags and mentions do not differ much, 
showing these two types of tweets are not well distinguishable between urgent event 
diffusion and normal activity.  
 
 
                               
                  Figure 2.2 Evolution plot for OBL and background sets  
 
Figure 2.2 is the evolution plot for OBL and background sets, and the time interval 
length for the series is 60 min. It clearly shows how the diffusion dynamics of an 
event differ from the normal-level activity. All curves from the background set have 
clear periodical patterns, while ones from the OBL set slowly decay.  All types of 
tweets follow the similar trend for both sets, only differing by their scales.  
 
 
Statistics Background 
network 
OBL network OBL_15k 
network 
No. nodes 61963 10099 754 
No. edges  114206 12815 787 
Avg. indegree 1.843 1.269 1.044 
No. SCC 61397 10088 743 
No. WCC 2118 1110 146 
LSCC 0.00749 0.000495 0.00663 
  
 22 
 
LWCC 0.873 0.711 0.561 
Transitivity 0.00513 0.00223 0.0193 
Reciprocity 0.00687 0.00141 0.0234 
Diameter 29 8 7 
 
                                                Table 2.3 OBL retweet network summary 
 
Table 2.3 shows the retweet networks comparison, where the OBL_15k network 
refers to the network only including edges between the 15k users. The three networks 
are ordered in decreasing scopes, and also form a decreasing subset series—the latter 
ones are subsets of preceding ones. Besides the differences on the scales of the 
network, the OBL_15k network is better connected as can be shown by the 
transitivity and reciprocity. But the networks are not generally well connected in the 
strong sense and do not have much clustering patterns (number of triangular links). 
By looking at the retweet networks formed by other subsets similar results are 
observed; thus this can be considered as general properties of the retweet network: (1) 
few mutual connections (users have been retweeted by each other) and clusters, and 
(2) mainly composed of unidirectional links. All these values are even lower in the 
event network—the OBL network, which means this property is reinforced during 
event diffusion.  
 
2.2 Event evolution tracking 
2.2.1 Global level tracking 
In the next two sections I will describe the various network statistics that are used to 
track the diffusion of the event over time, the default network used will be the OBL 
network unless otherwise specified.  
 
Figure 2.3 shows the evolution plots of various global statistics of the OBL network, 
the titles of the subplots are the corresponding time window sizes. Both the number of 
nodes and number of edges are decaying gradually--implying the diffusion is 
gradually ceasing. The curves become smoother with a larger time window; local 
patterns average out as some small peaks in the plots have disappeared with 
increasing time window size.  The peaks at 8AM of May 3rd in the LSCC, LWCC 
and reciprocity plots demonstrate there are some mutually connected links at that 
time, but the number of the links is not large given the values of the statistics. Except 
for some small peaks, the reciprocity remains zero, showing barely any edges connect 
both directions. In the big picture, it shows the network is weakly connected 
relatively well but not strongly connected. This can also be seen from the differences 
of the trends between the LSCC and LWCC curve: while the LSCC curve remains 
nearly flat except few bidirectional edges, the LWCC curve exhibits similar patterns 
as the total number of nodes in the long run. This indicates the largest weakly 
connected component follows nearly the same pace with the number of nodes, which 
  
 23 
 
can also be viewed as the sum of size of all components. To put this in another way, it 
shows that the diffusion starts out with large-size weakly connected components—
many users are connected by the retweet paths, then gradually dies out with fragments 
of small size weakly connected components—much fewer users continue their 
discussion in smaller groups; only few users have retweeted each other in the whole 
procedure—meaning the flow of the information mainly goes in one direction. In 
summary, by controlling the time window size and the statistics types, I can reveal 
both local and global dynamic patterns of the diffusion to a specified level of 
resolution.  
 
 
 
 
 
  
 24 
 
 
 
       Figure 2.3 Global level network statistics evolution plots for OBL network   
 
 
         
Figure 2.4 Global level network statistics evolution plots comparison for three networks   
 
Figure 2.3 has illustrated the comparison varying the statistics and the time window 
size, Figure 2.4 will demonstrate the comparison from different scopes of networks. 
Apart from the obvious differences on the scale, clearly both the strongly connected 
and clustering patterns stand out when the scope of the networks is decreasing. This is 
a natural result from the fact that only retweets originated from the 15k users are 
included in the data—meaning the mutually connected links can only come from the 
15k users. This result also discloses a flaw in the 15k_user data—it naturally carries a 
bias when studying the retweet network by excluding all retweets not coming from 
the 15k users. Thus this diffusion study will inevitably be biased, but similar results 
are likely to manifest by observing the patterns within the 15k network, which is free 
of bias. Moreover, even if the results are incorrectly specified due to the underlying 
bias, the methodology still proves its value and that is the main idea I want to convey 
in this section.  
 
  
 25 
 
2.2.2 Node level tracking 
 
The previous section mainly discuss how I use various global level network statistics 
to track detailed dynamic patterns of diffusion, in this section I will extend similar 
discussion to node level statistics.  
 
While the global level statistics can tell explicit information about the global patterns, 
it is natural to believe the node level statistics can also be informative for node level 
tracking.  
 
 
       
                        
         Figure 2.5 Node level network statistics evolution plots for OBL network   
 
Figure 2.5 shows the top 10 node level statistics evolution plots ranked by 4 different 
centrality measures, the upper 4 plots use the segmented series and the lower 4 plots 
use the cumulative series for comparison (for the differences between the two series, 
please check the dynamic network statistics section). The top 10 nodes are different 
with different centrality measures, although authority score and indegree, hub score 
  
 26 
 
and outdegree are closely related from each other. The most valuable information 
obtained from the figure is that the dynamics of important nodes can be monitored, 
where the importance is defined by various centrality scores. The segmented series 
plot emphasizes the nodes importance within each selected time window, while in the 
cumulative plot I can see how the cumulative importance ranking is changing over 
time. Considering the confusion from plotting too many lines, only 10 nodes are 
shown for the node level plot, and these are the top nodes that are needed to be 
studied. But this methodology provides the flexibility where nodes can be targeted 
and monitored as decided. All nodes can be monitored at the same time, or target top 
nodes ranked by a specific centrality measure.   
 
2.3 Event hashtag identification 
2.3.1 Hashtag evolution comparison 
I have shown that these different network statistics can be used to shed light on the 
dynamics of urgent event diffusion.  Now I would like to further investigate whether 
a real time event will exhibit distinct network properties to general popular topics on 
Twitter—trends or not. I think of this as the extension of the diffusion tracking study 
not only since they are closely related, but also because I have the full activity of the 
15k users on general trends (not just ugent events).      
 
                 
Figure 2.6 Evolution plots for event and non-event hashtag networks 
 
In Figure 2.6 I select two hashtags of different types and then compare the retweet 
network properties derived from the two hashtags. "TeamFollowBack" is a popular 
meme on Twitter, using this meme, interested users will adopt the hashtags, follow 
each other and communicate each other. Hashtag "royalwedding" corresponds to the 
British royal wedding in 2011. The first hashtag can stand for a large group of trends 
on Twitter which are not related to life outside Twitter and I call this a "meme 
hashtag", while the second hashtag serves as an indication of an event happening 
  
 27 
 
outside Twitter and I call this an "event hashtag". Clearly both the hashtag volume 
(number of tweets with a given hashtag) and the number of edges (the hashtag 
retweets volume) show the differences between these two types of hashtags—the 
meme hashtag does not have a clear trend and remains relatively stable over time 
while the event hashtag follows a sharp burst-decay trend. Other plots also show the 
differences: the LWCC and LSCC of the event hashtag network have much larger 
variations than the ones from the other network.  But the meme hashtag network has 
more clustering and mutual links since it includes more mutual communication; this 
is in contrast to the event hashtag network, which mainly contains simple one-
directional links. Similar results also happened in other event related hashtags.  
 
The above comparison raises the conjecture that hashtags related to real life events 
might be distinguished from other types of hashtags using the network statistics. 
Although I do not have many event related hashtags in the whole 15k_user data, this 
is a potential future research direction.  
2.3.2 Hashtag clustering  
Following the conjecture from the previous section, I make an attempt to cluster a 
group of selected hashtags using network statistics. The clustering results are 
compared to the results obtained using the hashtag tweets text, which is considered as 
the ground truth here. The selected hashtags are from both 15k_user data sets:  
15k_2011 set: 
'royalwedding','Osama','OBL','BinLaden','Tornado','Joplin','MemorialDay','Te
amFollowback','FF','NowPlaying','YHP','SocialMedia','MADNESS','quote','SHOUTO
UT','500aday','fb‘ 
 
15k_2012 set: 
'election','election2012','iphone5','debate2012','blackfriday','hurricanesandy','h
alloween','hurricane','facebook','dope','twitter','business','hiphop','travel','jobs','music
','marketing','tech','hr' 
 
  
 28 
 
 
                            Figure 2.7 Hashtag tweets text cleaning and processing 
 
Here I intentionally choose a balanced mixture of event related hashtags (boldfaced) 
vs. non-event ones to see how they are different from each other. Here I have tried my 
best to find hashtags related to real events. The same list of hashtags will also be used 
in the binary classification task for the next section. The network features used for 
both tasks are: average indegree', 'density', 'LSCC', 'LWCC', 'WCC ratio' and 
'transitivity'. The WCC ratio is the ratio of number of weakly connected components 
over total number of nodes. To accommodate the variations among time spans of 
different hashtags, I choose the uniform time span of 6 days for all hashtags, where 
starting times are the times when the hashtags are first adopted. Also the time window 
size is chosen to be 1d to obtain various dynamic network statistics, both the static 
statistics and the dynamic ones are combined to perform the clustering.  
 
 
 
Remove special 
strings  and 
punctuation 
Perform 
tokenization, 
lowercasing and 
stop words 
removal  
Remove remaining 
tokens with non-
alphanumeric 
characters   
Perform token 
stemming and 
remove tweets 
with token 
number less than 
three 
  
 29 
 
 
 
   Figure 2.8 Hashtag K-means clustering 2D principle components representation (K=5, the 
upper one uses the network statistics, the lower one uses the text features) 
 
I adopt the simple but powerful K-means clustering method (Hartigan, 1979), which 
is an iterative clustering method trying to minimize within-cluster distance given the 
clusters number.. Figure 2.7 shows the full text processing flowchart for hashtag 
tweets text, which are pretty standard text cleaning steps. Here the text for each 
hashtag refers to the text of all types of tweets (retweets or mentions etc.) from a 
given hashtag. The obtained tokens after the cleaning process are further filtered by 
selecting the top 10k tokens weighted by TF-IDF (term frequency inverse document 
frequency) (Blei, 2003) , which is a popular weighting strategy for document related 
analysis such as topic modeling.  All the text cleaning steps are done using Python 
library NLTK. The special strings to be removed, i.e. mainly URLs, are defined using 
regular expressions. The NLTK ordinary word tokenizer was used and the stop words 
list from NLTK is modified a bit by adding letter "t" to all negation stop words, doing 
so is because I will perform punctuation removal before stop words removal. The 
stemmer used is the Snowball stemmer. The same process will be used for topic 
prediction task in Chapter 4.  
 
 
Figure 2.8 demonstrates the clustering results (K=5) with a 2D principle components 
representation, which is using the first two components of principle component 
analysis (PCA) to visualize the clustering results. First I look at the results from text 
features (the lower plot), some closely related pairs are close from each other, which 
is really making sense: SocialMedia, twitter and facebook; election, election 2012 and 
debate2012; Osama, OBL and BinLaden; hurricane, hurricanesandy, Joplin and 
Tornado etc. While for clustering results using network statistics, they are more 
concentrated. This is mainly due to the difference between the dimensionality of the 
data space: the total number of network features is only 42, and the number of text 
features is 10k. In spite of the great dimensionality difference, some meaningful pairs 
can still be obtained: Osama, OBL and BinLaden; Joplin and Tornado; hurricane and 
hurricanesandy; election, election 2012 and debate2012. There are still some 
  
 30 
 
discrepancies, especially to those non-event hashtags, but it is clearly demonstrating 
that the network statistics do have the ability to distinguish hashtags to some extent.  
 
 
    
 
 Figure 2.9 Hashtag K-means clustering sum of squares ratio varying by number of clusters 
(the left one uses network features, the right one uses text features) 
 
Usually the performance of the clustering is measured by some external ground truth 
evidences, but since I do not have them here, I use the internal metric instead—the 
sum of squares ratio. The ratio is defined as the fraction of sum of squares among 
clusters over the total sum of squares, thus higher ratio will imply higher fraction of 
between-cluster variation—meaning better clustering performance. Figure 2.9 shows 
the sum of squares ratio changes over number of clusters from K=2 to K=10. While 
the ratio becomes saturated after K=5 for clusters from network features, it is 
continuously increasing for clusters from text features. Again this is mainly due to the 
high dimensionality of the text features, but it is also indicating that the hashtags are 
difficult to cluster generally--there has been much noise in the data.  
 
Therefore I have observed that using network statistics can indeed provide some 
meaning results, however due to the noise of the overall data, it seems hard to obtain 
a good clustering even using the text features based on the internal metric—sum of 
squares ratio. In the future work, I will explore using subject matter expertise to 
provide an external ground truth. This will create the ability to validate whether using 
text features only is a good choice or not. In this case, I conjecture that the network 
features may provide additional improvement on the clustering and a combination of 
network features and text features could offer better results.  
2.3.3 Event hashtag classification 
To verify how network features can be used to distinguish event related hashtags 
versus non-event ones, I performed a simple binary classification task in addition to 
K-means clustering. The same set of hashtags and features are used in this task. The 
two classes are specified as event ones (boldfaced) and non-event ones. The AUC 
(area under curve) score is used as the performance metric since this is a normal 
balanced class classification, and AUC works well in that context. For a classification 
task with imbalanced classes, AUC is no longer a good score and other alternatives 
should be used (more discussion on this in later sections). The classification method 
used is deep learning (deep neural network), which is a state-of-art nonlinear method, 
  
 31 
 
the details of the method and corresponding parameters setting will be mentioned in 
later sections. The reason I am using this method here is due to the high 
dimensionality of text features, the ordinary linear model cannot handle this problem 
well due to singularity issues; another alternative state-of-art nonlinear method—
random forest was examined, but the results are not presented due to computational 
costs.  
 
Features  min 1st Qu median mean 3rd Qu max 
Network 0.35 0.75 0.85 0.83 0.91 1 
Text  0.3 0.85 0.95 0.89 1 1 
Combined 0.55 0.85 0.95 0.92 1 1 
 
       Table 2.4 Event hashtag classification average AUC scores with 100 repetitions 
 
I propose exploring the task using the standard binary classification settings:  
1. Split the data into 3: 1 ratio of training vs. testing sets with the same class labels 
ratio as the original data.  
 
2. Train the model on the training set, then apply the trained model on the testing set, 
and measure the AUC score.   
 
3. Repeat the random splitting in step 1 by n=100 times and all the steps above, 
collect all the AUC scores.  
 
Table 2.4 shows the results from the classification. The performance of the 
classification highly depends on the training sets splitting, this is mainly due to the 
sparsity and labeling of the data—I need more well labeled data to better train my 
model. In spite of the large deviation in the results, there is still a slight improvement 
with the addition of network features, and using network features alone can already 
give comparable results. Considering the dimensionality difference between the two 
types of features, I believe the network features are indeed able to identify event 
hashtags well. However, due to the limited amount of data I have, I am not able to 
further validate this hypothesis.  
2.4 Discussion 
2.4.1 Summary 
This Chapter mainly discusses how I utilize the 15k_user data to study the event 
diffusion on Twitter. In the event diffusion tracking section, I use various evolution 
plots to demonstrate how different network statistics can be used to reveal detailed 
diffusion patterns of an event. There are many parameters that can be finely tuned in 
order to deliver the most accurate information I want. The time window size is the 
one that controls the level of precision: a smaller time window size will result in 
sparser networks and more local dynamic patterns, while a larger time window size 
  
 32 
 
will average out small local variations and provides more stable trends. And the 
choice of the time window size should be based on the full time span of the 
data/event/network—a shorter time window size for more transient events and a 
longer time window size for more long-lasting ones. The type of network statistics to 
monitor is also an important factor to consider. For general purpose diffusion 
tracking, the ordinary ones would suffice: number of nodes and edges, average 
degree, or any other content related tweets statistics—fractions of retweets, URLs and 
hashtags etc. For more specialized patterns, such as connectedness, reciprocity, 
clustering or other path related information, I should resort to the corresponding 
specific statistics to handle. Also if I am more interested in how a single or a specific 
group of users behave during the diffusion instead of the global diffusion patterns, I 
should monitor the node level statistics—the type of centrality measure to monitor 
will subject to the requirements of the task: i.e. betweenness centrality for the 
"bridging" functionality of a node and PageRank for a general comprehensive 
measure of the node importance. Sometimes the scope of the underlying network is 
another factor to manipulate, especially when different levels of networks are already 
well defined. Just like the comparison I have shown in the Figure 2.4, different 
choices of networks can either enlighten or blur specific diffusion patterns, deciding 
which networks to choose will depend on the actual task to accomplish.  
 
I have also shown that network statistics can not only be used to keep track of the 
event diffusion, but can also be used to distinguish event related hashtags versus non-
event ones. First I used a figure to illustrate that an event hashtag can be visually 
differed from a non-event one using the network statistics evolution plots. Then I 
further investigated this hypothesis by performing K-means clustering and binary 
classification on selected hashtags from the 15k_user data sets. The results are not 
perfect, but they do demonstrate the network features have the ability to distinguish 
the event hashtags to some extent. However, due to data limitations, I will not be able 
to further verify this conclusion.  
 
In summary, the network statistics can not only provide much insight to the dynamics 
of the diffusion, but can also reveal distinct properties of event related hashtags. But 
more data are needed to further validate the latter conclusion.   
2.4.2 Future work 
As notified previously, one main direction for future work would be to validate the 
obtained results for event hashtag identification. I need well defined classes or 
clusters of hashtags and accurate corresponding labels on each hashtag. Moreover, I 
need a large amount of data like this to drive more persuasive conclusion. Finding 
such data is not an easy task, much time and effort are needed, but the potential 
impact will also be great if I can show the network statistics can indeed generally 
distinguish hashtags of different types.  
 
Another meaningful extension would be trying to build a real-time monitoring system 
to track diffusion process using various network statistics, on both a global and node 
level. Unlike the previous work described, this will be a task requiring intense 
  
 33 
 
engineering effort, but it is equally or even more impactful with practical 
implementation purposes.   
  
  
 34 
 
Chapter 3: Predicting user influence during urgent event 
diffusion 
 
3.1 Introduction 
3.1.1 Motivation 
The definition of a user's influence can vary in different contexts, and in my study I 
confine influence to be the ability to "spread information" during the urgent event 
diffusion on Twitter. Thus I choose the total sizes of retweet cascades generated by a 
given user to be the measure of the user influence.  A retweet cascade can be simply 
considered as the full tree-like structure generated from the original tweet, where each 
edge is also labeled with a timestamp. Figure 3.1 shows some concrete examples of 
retweet cascades from the Sandy set. The majority of cascades will appear just like 
the upper one in the figure with centric star shapes; occasionally, there will be 
cascades with more complicated structures like ones in the lower part of the figure. 
Another part to clarify is that since I am measuring the user influence and a user can 
produce several popular tweets with different number of retweets, some weights are 
needed to aggregate these sizes of retweet cascades to generate a single score for that 
user. There are no well-established conclusions for this choice of weights from 
previous studies, therefore I will use the simplest method—the total size of all retweet 
cascades from a given user as the influence measure for that user. In this way, users 
with one highly popular cascade will be considered equally influential as ones with 
multiple median level cascades.    
 
The chosen influence measure is only well defined for users who have posted some 
original tweets and at least received some retweets. I will refer to these users as seed 
users—implying they are the seeds of the diffusion. However, there are definitely 
many other users that also take part in the diffusion process, and some of them are 
also playing important roles—such as users who actively retweeting others. But for 
the purpose of this work, I will concentrate on the seed users. Thus I clarify here that 
the user influence here will literally imply the influence power resulting from the 
"seeding behavior", not something else.  
 
Moreover, based on the literature review, I have discovered the following limitations 
on the influential node prediction:  
1. Since the prediction problem is time sensitive, the authors will usually use data 
spanning a long period to train the model (Bakshy, 2011) (Galuba, 2010)  (Kupavskii, 
2012). However, this is not always feasible since the information given to me for 
training is often limited, especially for the urgent diffusion case—high diffusion rate, 
large diffusion scale and users will not be aware of the event in advance. In this 
situation the value of the prediction will largely depend on whether I can make decent 
  
 35 
 
prediction when only having restricted information, which is often the case in the 
Twitter data due to various restrictions of Twitter API. 
 
2. The network features derived from the actual diffusion networks are seldom used 
in prediction, often only the number of retweets are used; while features from users' 
social context--number of followers, number of friends or pagerank of F-F networks 
etc. have been used or suggested frequently (Bakshy, 2011) (Kupavskii, 2012) 
(Petrovic, 2011) (Hong, 2011) (Simmie, 2014). The structural and dynamic 
information contained in the diffusion network--the retweet network in my case, 
could potentially help the prediction.   
 
 
 
                         Figure 3.1 Example of retweet cascades from Sandy set 
 
3. There is no a comprehensive comparison among different prediction methods, 
many authors have adopted decision tree based method—decision tree, random forest 
or gradient boosted tree.  It is generally considered as a better approach than the 
classic GLM (generalized linear models) method (Bakshy, 2011) (Kupavskii, 2012), 
  
 36 
 
but may not be the case everywhere;  and the other state-of-art machine learning 
method that I will employ—deep learning has not yet been applied to this field.       
 
In this Chapter, I attempt to work through the above limitations. I propose to select a 
specific number of top nodes based on given information, extract both baseline and 
network features (more details later) of these nodes, then implement three state-of-art 
machine learning methods to predict future influence scores for those nodes. By 
observing similar behavior among similar type of data sets(which will be discussed 
later), I create a feasible way to overcome the limited information by using past data 
sets as the training set to train the model instead of choosing both training and testing 
sets from the same data set. I will show this approach indeed produces decent results 
and that my methodology best suits the conditions of the Twitter streaming API, 
where I can collect real time streaming tweets with proper structural format about 
current major events using specific key words. All the features I have used in my 
models are directly included in those streaming tweets, and do not require any 
additional resources. I will show that addition of network statistics, especially 
dynamic network statistics can consistently produce significant improvement over 
baseline features on the prediction tasks with various experiment settings. Moreover, 
I also discover that the linear model consistently outperform the nonlinear ones in 
various settings, which could imply existence of strong linear relationship yet barely 
nonlinear relationship between the selected features and the response on the log scale 
(I perform log transformation on features and response before prediction).  
 
3.1.2 Data description 
Table 3.1 shows the comprehensive summary for the event data sets. The numbers 
within the parentheses are corresponding fractions.  The retweet cascade depth is 
simply the length of the retweet chain, or the depth of the corresponding retweet tree. 
The four data sets differ on the scales, and the fractions of retweets and hashtags are 
increasing over time—representing the evolution of user habits and the increase in the 
Twitter use. Comparing the most two recent sets—Nepal and Blizzard, users seem to 
prefer adopting mentions (including retweets) and hashtags to spread information. 
Meanwhile less usage of links is observed in Blizzard set and this does not follow the 
trend in the rest three data sets where the fraction of links increased over time. The 
underlying reason for this is unknown—it could possibly just be normal variations 
among data sets. It is also observed that over time the retweet cascades are becoming 
"shallower"--the fraction of cascades with depth greater than 1 is dropping drastically. 
This perhaps also reflects changes in the user behavior—users are retweeting the 
source tweets directly instead of following the chains. The fraction of seed users, who 
are direct contributors of popular tweets, also decreased significantly. Both of these 
observations—"shallower" cascades and fewer seed users, seem to together imply that 
the general distribution of the retweets becomes more concentrated around a few seed 
users; considering this in concert with the trend in the increasing fraction of retweets, 
the chosen influence measure—total retweet cascades of seed users, becomes a more 
notable measure for user influence during the event diffusion.  
 
  
 37 
 
Figure 3.2 includes the F-F distribution plots for several event sets under different 
conditions.  All the plots are restricted to be within one week from the starting date of 
the data collection for equal comparison. Despite of this, there are still too many users 
to be shown in one plot, the users are further truncated by their influence scores—the 
total size of retweet cascades. The threshold of the scores I will consider is labeled as 
TH.  
 
 
Data set Irene Sandy Nepal Blizzard 
No. tweets 3178604 16529593 7310929 1716234 
No. users 1487056 5588298 2229702 941488 
No. 
retweets 
965158(0.304) 8410769(0.509) 3905264(0.534) 1101108(0.642) 
No. 
retweet 
cascades 
with 
depth>1 
71684(0.074) 408919(0.049) 27716(0.007) 14644(0.008) 
No. 
mentions 
1874213(0.590) 9272919(0.561) 4381043(0.599) 1226905(0.715) 
No. tweets 
with URLs 
395835(0.124) 5394189(0.326) 4562842(0.624) 643483(0.375) 
No. tweets 
with 
hashtags 
102954(0.032) 7393790(0.447) 2959370(0.405) 1369829(0.798) 
No. seed 
users 
218298(0.147) 981626(0.176) 184863(0.083) 62611(0.067) 
Time 
frame 
26 Aug 2011 
14:00:00 GMT ----
12 Sep 2011 
23:31:26 GMT 
25 Oct 2012 
04:00:00 GMT ---
06 Nov 2012 
04:59:59 GMT 
27 Apr 2015 
02:00:02 GMT -- 
22 May 2015 
00:24:45 GMT 
22 Jan 2016 
16:09:42 GMT – 
31 Jan 2016 
05:23:00 GMT 
 
                                          Table 3.1 Event data sets comprehensive summary 
 
  
 38 
 
The upper two plots are from the Nepal set with TH=100 and 500 respectively, and 
the lower two plots come from the Irene and Sandy set with TH=500. The rough 
shape of the slopes is similar to those displayed in Chapter 2. Also users with a higher 
number of followers constitute a higher fraction with increasing TH, which is natural 
since the higher number of followers implies a higher potential of being retweeted.   
 
 
 
 
 
Figure 3.2 F-F distribution plots for event data sets (the upper two plots are from the Nepal 
set with TH=100 and 500 respectively; the lower two are plots from Irene set and Sandy set 
with TH=500 respectively) 
 
Comparing the three plots with TH=500, the differences on the data scale also are 
also reflected here:  the Sandy set has the densest plot and the Irene set has the fewest 
users in the plot. Also I find the users seem to be a little concentrated round the band 
with the number of friends between 100 and 1000, meaning higher number of friends 
does not necessarily lead to more retweets.  
 
  
 39 
 
Figure 3.3 is a collection of evolution plots for each of the four event data sets. 
Except for Sandy set, the other three sets more or less start with the peak period, this 
is inevitable for urgent event diffusion since the data collection only begins after the 
start of the event and it is not possible to anticipate the occurrence of the event before 
it actually happens.  
 
 
  
 40 
 
 
 
 
Figure 3.3 Evolution plots for event data sets (from top to bottom, the plots are for Irene, 
Sandy, Nepal and Blizzard set respectively) 
 
The plots show different components of tweets are evolving with similar a pace with 
the total number of tweets. Across data sets from earlier timestamps to later 
timestamps, hashtags and URLs are adopted more frequently; while mentions and 
retweets also have a small increase but remain relatively stable. All these 
observations coincide with the data summary table.  The Sandy set seems a little 
special with peak occurs several days after the data collection. After investigation, the 
peak time seems to correspond to the landing time of Hurricane Sandy onto the 
eastern shore of US. Since Hurricane Sandy started at Caribbean Sea instead of US 
mainland, thus the discussion began earlier than it actually landed onto US. To 
identify how this delaying peak time will affect my prediction, I created another data 
set called Sandy 4.5 for direct comparison, which is simply defined as a truncated 
version of Sandy set by starting 4.5 days later than the starting time of data collection.  
 
  
 41 
 
3.2 Prediction task formulation 
3.2.1 Task types  
Generally speaking, I have formulated three prediction tasks: rank prediction, 
classification and direct prediction, they are distinguished by types of the response 
variables. Direct prediction simply means predicting the users' raw scores—total size 
of retweet cascades. Rank prediction only cares about predicting the ranking of the 
users instead of their actual scores. Classification will aim to identify a specific class 
of users. There are two types of classification tasks performed in my study:  top 
quantile users classification—e.g. binary classification on top 5% users by the scores; 
and rapidly increasing users classification (only users with response exceeding a 
certain threshold are selected)—e.g. binary classification on users who have scores 
increasing greater than 5 times but also with scores greater than 10.    
 
The rank prediction task will be my main focus, due to the following considerations:  
1. Due to the limited information dilemma, I adopt the "cross prediction" mode—
using data sets corresponding to past events as training data to predict future events 
sets. However, different data sets have different scales. Thus it is a better choice to 
standardize (subtract the mean and divide by the standard deviation) the data sets 
before performing the prediction, which means the prediction scores will be in 
normalized forms and cannot recover to the raw cascades scores. Under this 
condition, what I obtain from the models will be the normalized scores, which implies 
that I am actually performing a rank prediction and not a direct quantity prediction.  
 
2. Since standardization has been performed, the results obtained from rank 
prediction will usually be more reliable and stable than direct prediction; this could 
still be the case even when no heterogeneity exists among the data sets.  
 
3. It will be shown that the classification task can actually be accomplished together 
with the rank prediction task.  
 
As to the performance measure, I will use Spearman rank correlation for rank 
prediction (the results are similar if using Kendall Tau correlation, another frequently 
used metric for ranks). For classification, I will use the area under the precision-recall 
curve (PRAUC), which proves to be an informative measure for highly imbalanced 
classification tasks. As to the direct prediction, I simply use the root mean square 
error (RMSE). The rank correlation is computed using the R base function "cor" and 
rank of ties is defined as the average. The PRAUC is computed using the R package 
PRROC, so does the plot of the PR curve 
 
Even though I have put much emphasis on rank prediction, I will display results from 
all three types of prediction for comparison. And which task to perform in practice is 
subject to what kind of problem needs to be solved.    
 
  
 42 
 
3.2.2 Features 
The features used in all three tasks will be the same, they are all user-specific features 
and include two categories—baseline features and network features, and network 
features are consist of static network features and dynamic network features:  
1. The baseline features are features often used in previous prediction studies yet also 
available in my data, they are:  
'Total of tweets', 'Number of followers', 'Number of friends',  'Number of tweets', 
'Number of mentions', 'Number of tweets with URLs', 'Number of tweets with 
hashtags', and 'Past scores'. Past scores are users influence scores within the feature 
collection period, while the prediction responses are influence scores within the 
prediction period (more details mentioned later). The total tweets are the total number 
of tweets posted by the given user since creation of the account; and number of tweets 
only count the tweets posted from the beginning of data collection, this is the same 
for other features from various specific types of tweets.  
 
2. The network features are constructed using the following network statistics:  
'Indegree', 'Outdegree', 'Pagerank', 'Authority score', 'Hub score', Eigenvector 
centrality', 'Closeness centrality' and 'Local transitivity' (local clustering coefficient). 
(i) The static network features are aggregate statistics of the whole network within the 
feature collection period.  
(ii) The dynamic network features are ones from subnetworks within certain time 
interval. Compared to the static features, dynamic features contain rich dynamic 
information and they could provide additional improvement for the prediction.  
Here I am not including betweenness centrality due to its extremely high computation 
cost and limited benefit.  
 
The features mentioned above are almost all of the features that can be derived from 
the data; some other commonly used features, such as network features derived from 
the F-F networks, are not available in my data.    
 
3.2.3 Training and testing sets 
As mentioned before, training sets will be data sets concerning to past events 
compared to the selected testing set. For example, majority of my analysis will use 
Nepal set as the testing set, and the default training set will be Irene set combined 
with Sandy set; results from Irene set alone or Sandy set alone as training sets are 
listed for comparison. Predictions with testing set Sandy or Blizzard are performed to 
validate the conclusion, and the default training sets are Irene and Irene+Sandy 
respectively. It is worthwhile to note the standardization for rank prediction should be 
performed within each data set separately if multiple training sets are involved, this is 
still due to the scale differences among data sets.  
 
  
 43 
 
 
 
                                Figure 3.4 A general prediction task flowchart  
 
What I would like to obtain from these settings is how the choice of training sets and 
testing sets will affect the prediction results. It turns out they are one of the most 
deciding factors on the prediction performance. Especially when the testing set is 
usually fixed in practical applications, choice of training sets from a list of potential 
candidates will be extremely important, and I will develop further discussion on this 
in later sections.   
 
3.2.4 Task parameters 
In my prediction study, the following parameters are used:  
 
T: The feature collection period length, from period [0, T] I can construct the whole 
network and extract both static and dynamic network features. 
h: The prediction period, indicating the length of period for future user score 
prediction. Thus the responses for my experiments will be cascades size within period 
[T,T+h] for nodes under consideration. 
Δ: Time window size for dynamic network features extraction, use segment statistics 
series to reduce features correlations.  
k: Size of top nodes (ranked by past scores) to monitor, the upper bound is 10k in my 
case. 
 
The general flowchart for the prediction task is shown in Figure 3.4. The default 
values for the parameters are: T=1d, h=5d, Δ=6h and k=10000; I will stick to these 
values unless special notes are informed.    
 
Tweets 
collection from  
various data 
sets 
Extract user 
features from 
historical 
tweets by time 
T  
Train the 
prediction 
models from  
previous data 
sets 
Predict in test 
set user 
response in 
future time  
[T, T+h] 
  
 44 
 
Figure 3.5 shows the response distribution among various data sets with the default 
parameter settings. The first one is the normal log-log distribution plot and the second 
one is the CDF (cumulative distribution function) plot of standardized score.  From 
the log-log plot I can clearly see the distribution fits the power law distribution well. 
The normalized CDF plot demonstrates using normalization (standardization) can 
indeed reduce heterogeneity among data sets; although the distribution differences are 
still obvious after normalization—especially for nodes with lower scores, but this 
may not be as serious as it appears since I care more about top-ranked nodes, where 
the distributions are much closer from each other.  
 
 
 
 
Figure 3.5 User scores distribution with various data sets with default parameters 
  
 45 
 
  
 Figure 3.6 User scores percentile and ratio of increase distribution for Nepal set with default 
parameters 
           
Figure 3.6 shows the response distribution for Nepal set in another aspect. The y-axis 
is simply the quantile (percentile) of the response; the x-axis is the ratio of increase of 
the influence scores, which is defined as the ratio of response score (cascades size 
during [T, T+h]) over past scores (cascades size during [0,T]). This ratio is the key 
factor to perform the classification on rapidly increasing users. The actual response 
size is illustrated by size of the point. And the red horizontal line represents the cutoff 
value to truncate rapidly increasing users—set to be 10 as the default, which 
corresponds to about 0.6 in quantile.  Thus for the top quantile classification task, my 
main focus is on top bands of the plot; while for rapidly increasing classification task, 
my main focus is on nodes staying in the top right corner.  
 
Table 3.2 shows how the nodes response distribution changes over T among various 
sets, notice only nodes who have shown up in the retweet network (either retweet 
others or being retweeted at least once) are counted. The columns from top to down 
are: total nodes count, nodes response 75% quantile, nodes response 95% quantile, 
nodes count with response >=10, nodes count with response >=100, numbers in the 
parentheses are corresponding counting fractions within top 10000 nodes (ranked by 
past scores), which can be used to illustrate how representative the top 10k nodes are. 
The values in the first row are just the varying T values and h is always equal to the 
default value 5d. I can see from the table if I do not restrict the number of nodes to the 
top 10k, the response distribution will be more skewed. This implies that the majority 
of involved nodes are completely inactive in the next 5 days, they may just retweet 
others' tweets but they do not contribute the content for retweeting. Also the top 10k 
nodes ranked by their past scores can constitute majority of important nodes, although 
there are always missing ones since the base number of nodes is too large. Thus 
choosing the top 10k nodes would suit the task. On one hand, extracting features from 
all nodes would be unnecessary and waste time; on the other hand, even given 
sufficient computational power, it will do no good for the prediction since it is simply 
adding noise to the data by adding data points with null values. In conclusion, all the 
prediction tasks in this study will be confined to the top 10k nodes ranked by past 
  
 46 
 
scores—e.g. the rank or the quantile will refer the rank or quantile within the top 10k 
nodes.  
 
 
         Table 3.2 User response distribution over varying T among various data sets (h=5d) 
 1d 2d 3d 4d 5d 
Irene 
Total 235870 445808 581597 615176 630557 
75% 0 0 0 0 0 
95% 3 1 0 0 0 
>=10 
3047(0.52
9) 
1324(0.65
4) 
538(0.766) 342(0.795) 232(0.810) 
>=10
0 
230(0.813) 128(0.891) 55(0.909) 34(0.882) 24(0.917) 
Sandy 
Total 74232 170010 291564 716825 1963399 
75% 0 0 0 0 0 
95% 13 10 8 5 2 
>=10 
4494(0.62
0) 
9060(0.44
3) 
13307(0.35
5) 
21045(0.28
0) 
23765(0.27
5) 
>=10
0 
952(0.808) 
1603(0.75
0) 
2071(0.704) 2895(0.684) 3065(0.727) 
Nepal 
Total 427359 652866 797268 899826 979524 
75% 0 0 0 0 0 
95% 1 0 0 0 0 
>=10 
6482(0.66
6) 
6341(0.62
5) 
5897(0.632) 5374(0.631) 4979(0.636) 
>=10
0 
1130(0.91
1) 
1078(0.86
9) 
1008(0.856) 921(0.847) 872(0.841) 
Blizzar
d 
Total 214508 497547 602503 647983 672856 
75% 0 0 0 0 0 
95% 0 0 0 0 0 
>=10 
2384(0.77
8) 
1874(0.82
0) 
1140(0.814) 753(0.841) 476(0.840) 
>=10
0 
461(0.887) 272(0.941) 137(0.905) 88(0.898) 58(0.914) 
  
 47 
 
 
 
 
 
Figure 3.7 Top 10k user components analysis over varying T values (h=5d) for Irene set, 
Sandy set and Nepal set from top to bottom respectively 
 
  
 48 
 
The other information I can obtain from the table is the overall trend of activity by 
looking at the counts of top nodes. For example, the number of nodes with response 
greater than 10 has decreased from around 3k to only about 200, this is indicating the 
diffusion is gradually dying out. A similar trend is observed in Nepal set and a reverse 
trend is observed in Sandy set, which coincides with what see from the evolution 
plots.  
 
With the nodes selection, if performing a rank prediction the rank will simply be 
based on the selected nodes. This is inevitable unless no nodes selection is performed, 
which is not suggested due to previous discussion. There could be complicated trade-
off considerations on what is the best k value to choose, but this is out of scope of my 
discussion and I will just stick to k=10k in my study.   
 
Figure 3.7 demonstrates the other aspect of dynamics of nodes response 
distribution—what the real top 10k nodes by response consist of.  The three plots 
correspond to Irene set, Sandy set and Nepal set from top to bottom respectively. The 
purple numbers on top of plots are total nodes count by the given T values. There are 
three types of nodes involved in the plots—new nodes, old nodes and rising nodes. 
There are defined as follows:  
1. Old nodes: these are nodes belonging to the top 10k nodes in the past (within 
period [0, T]) 
2. Rising nodes: these are nodes showing up in the past but are not among the top 10k 
nodes.   
3. New nodes: these are nodes not showing up in the past at all, they simply emerge 
within period [T, T+h].  
 
The y-axis in the plots displays the fractions of each type of nodes over top 10k nodes 
ranked by their responses. Fractions of different types of nodes can shed light on the 
underlying diffusion dynamics. In the Irene plot, new nodes and rising nodes are both 
staying above and remain stable, which implies: there are consistently new users 
joining the conversation and generating "impact"; there are not many old users 
continuously remain active (less than 20% and still decreasing over time), meaning 
the ranking is not stable. In the Sandy plot, the trend is clear: the new users are 
dropping over time, old and rising users are gradually catching up. If I refer to the 
evolution plot of Sandy set, I will find around 5 days from starting time the diffusion 
reaches peak hour, which just corresponds to the tails of curves. From there fraction 
of old nodes begin taking the lead—showing the ranking is becoming more stable 
from then on. Similar patterns are observed in the Nepal set, where old nodes are 
taking relatively steady fractions over time—implying stable ranking. And new nodes 
fractions are dropping—implying the diffusion are becoming saturated. Thus I can 
see rich information about the dynamics of the diffusion can be derived from the 
plots.  
3.2.5 Methods 
I adopt three state-of-art machine learning methods for my prediction tasks: LASSO 
linear regression (belongs to one type of generalized linear model, will simply call 
  
 49 
 
this GLM from now on), random forest and deep learning (deep neural network). 
While GLM consists of nearly all popular linear models of various types, the other 
two methods are typical non-linear models.  
min
𝛽∈𝑅𝑛
[
1
2𝑙
∑(𝑦𝑖 − 𝑥𝑖
𝑇 ∙ 𝛽)2 + 𝜆𝑃𝛼(𝛽)
𝑙
𝑖=1
] 
𝑃𝛼(𝛽) = (1 − 𝛼)
1
2
‖𝛽‖2 + 𝛼‖𝛽‖ 
                     
The above equations show the loss function of elastic net, which is a more general 
model than LASSO regression. It still has the same square loss function as the 
ordinary linear regression, but with addition of a penalized term to regularize 
magnitude of predictors to prevent overfitting (Babyak, 2004).  The penalized term 
for the elastic net model is the weighted sum of L1 norm and L2 norm of the features 
as above. When α=0 the model will become ridge regression and when α=1 it will 
become LASSO regression. The lower plot illustrates how LASSO regression works 
differently as opposed to ridge regression. In the vector space formed by the 
parameters, the optimal parameters to be estimated will be the intersection of the 
contour plot of the square loss function and the geometry object formed by the 
penalized term. This geometry object formed by L1 penalty term and L2 penalty term 
are hyper-cube and hyper-sphere in the parameter space respectively. Since the 
contour plot of the loss function is convex, the intersection with L1 penalty term will 
inevitably be at the vertices or edges of the hyper-cube. This will force some 
parameters values to be zeroes, and thus the features selection is achieved. Therefore 
LASSO regression is a linear model that can prevent over fitting and achieve variable 
selection, and is used as the deputy of linear model in my task.   
 
Random forest is an ensemble method of decision (regression) trees. Decision trees 
are a popular method used in a previous study of social media influence (Bakshy, 
2011) (Kupavskii, 2012): the basic idea is to perform recursive splitting of features 
values to grouping data points to various leaves based on certain criteria. The features 
and response values within each leaf will be used to decide the final output; the 
convention is majority votes for classification and simple average for real-valued 
prediction. While the random forest method will collect results from many decision 
trees and aggregate them together to produce the final output.  Each tree in a random 
forest is formed by splitting a group of randomly selected features instead of all 
features based on a random sample of full data, and these are where the term 
"random" come from. All these can help reduce correlation among trees so as to avoid 
overfitting. Generally speaking, random forest is a powerful nonparametric yet 
nonlinear model, but the drawback will be the poor model interpretability compared 
to linear models.  
 
Another state-of-art nonlinear machine learning method is deep learning or a deep 
neural network. Although there has been debate about how "deep" a neural network 
should be in order to call it deep learning, I will skip this discussion and simply 
follows the convention from 
  
 50 
 
Wikipedia(https://en.wikipedia.org/wiki/Deep_learning)–network with hidden layers 
greater or equal than two is considered as deep learning.   
 
For the deep neural network the features values are fed as input layer, the values from 
previous layer of neural network will experience similar types of nonlinear 
transformation to generate values for the next layer. The transformation will usually 
be to multiply by some weights and then applied to certain pre-defined activation 
function—e.g. the sigmoid function and the tanh function. The choice of activation 
function will generally be subject to the field of the application, but sometimes a 
different activation function is also needed for the output layer depending on the type 
of target response. Overfitting can be avoided either by setting a dropout ratio to 
randomly remove a certain fraction of values fed to the next layer or by adding 
penalized terms just like in GLM. Deep learning is a powerful machine learning 
method and is actually the only effective method for certain tasks (LeCun, 2015). 
This power could be explained by the Universal Approximation Theorem (Hornik, 
1991), which provides a theoretical foundation the mechanism of the method. 
Moreover, there has not been a comprehensive application of deep learning to social 
media user predictive analysis, thus I choose this approach in my study to see its 
performance for my tasks.    
 
For the implementation, I perform all my prediction analysis using R. The packages I 
have used include: igraph for network construction and features extraction, glmnet, 
glm2, party and h2o for the prediction. I adopt the default settings for all the three 
methods: 
1. For LASSO GLM, I use the 10-fold cross validation (CV) and refit the model with 
selected features that gives the least CV error. 
 
2. For random forest, I reduce the number of trees from 500 to 200 for runtime 
consideration. Each tree is trained on an independent bootstrap sample of the training 
data with a default number of 5 features randomly selected for each splitting. The 
final prediction outputs are the average or majority votes of all trees.  
 
3. For deep learning, I use 2 hidden layers of 200 neurons in each layer, with 
hyperbolic tangent activation function, quadratic loss function for real-valued 
prediction and cross entropy loss function for classification, with the input features 
dropout ratio as 0.5 within each hidden layer.  
 
A brief tuning parameters search was completed but the above default settings proved 
the best, thus I will adopt them in my experiments. More careful grid searches can be 
carried out to find optimum settings for each method for particular implementations.  
 
For rank prediction and classification, all the features and responses are first log 
transformed and then standardized (centered by means and scaled by standard 
deviations) within the nodes selected (10k in my case). For direct prediction, they are 
just log transformed and not standardized.  For the classification experiments, down-
  
 51 
 
sampling is performed to obtain balanced training set and average prediction scores 
are used to produce the PRAUC values.  
 
3.2.6 Variable importance              
 
Variable importance is another useful output of the prediction other than the predicted 
values. It reflects how effective a given predictor on the given response when the 
predictor is not well correlated with other predictors. Among the three methods I have 
used, only GLM provides a rigorous defined yet easily interpreted variable 
importance measure. This is one of the biggest advantages of the linear model. 
Although not perfect, there are some usable variable importance metrics for the rest 
two methods. The variable importance option provided for random forest is obtained 
by permuting features values and monitor how the prediction performance varies with 
it. If the values of more important features are permuted, the performance will suffer 
more serious degradation, thus a variable importance measure can be computed based 
on this. For deep learning, there is an approximate measure to aggregating weights of 
first few layers of the neural networks, this may work well for shallow neural 
networks, but for really "deep" ones there are still no proper candidates.  
 
 
3.3 Prediction results 
3.3.1 Rank prediction 
Now I show the results of rank prediction with various parameters settings, but as 
mentioned earlier, I will always stick to the default settings unless specific 
notifications. All the rank correlation scores are obtained from the average prediction 
scores out of n=10 (as default value) replicated runs of the experiment.    
 
Figure 3.8 shows the rank prediction results for Nepal set varying methods and 
training sets. In the first plot, the differences among methods are subtle, GLM 
performs slightly better.  The feature combination static+dynamic delivers the best 
results, compared to results with static only, I can tell dynamic features indeed help 
the prediction much. Also addition of network features to baseline ones indeed 
improvement the performance significantly. Since the two nonlinear methods fail to 
beat GLM, I will adopt GLM as the default method from now on since it is fast to run 
and straightforward to interpret. The next two plots show the results from GLM with 
varying number of nodes involved. The x-axis labels the number of top nodes 
included for rank correlation calculation, of which are ranked by past scores; the 
corresponding y axis is the rank correlation measured within that group of nodes. 
Therefore it is not surprising to see the results exhibit much more variations for fewer 
nodes and stabilize when the number is large enough. The network features, 
especially the dynamic ones, outperform the baseline ones much more with smaller 
  
 52 
 
nodes count than larger nodes count. I believe this is because the top nodes (top 10 or 
top 50), compared to lower ranked ones, have exhibited rich activity and thus have 
generated informative network features. Thus the functionality of the network 
statistics can be maximally explored; with more and more nodes involved—especially 
those much less active and lower ranked nodes, the predicting power of network 
statistics will decrease gradually since many of them will be almost inactive or even 
have null values. This is not the case for some of the baseline features, such as 
number of followers/friends and total of tweets, which will remain relatively constant 
regardless nodes activity, thus baseline features will generally provide the most stable 
(although not the best) results. Addition of network features consistent outperform 
baseline ones with changing number of nodes count, and for Nepal set even network 
features alone can provide much better results. Also the choice of training sets here 
does not matter much—they all produce comparable results. Irene+Sandy does a 
slightly better job and thus will be used as the default training set.   
 
 
  
 53 
 
 
 
  Figure 3.8 Rank predictions using GLM on Nepal set varying methods and training sets 
 
Figure 3.9 displays the rank prediction results of Nepal set varying T, h and Δ (all 
default settings are applied unless otherwise specified). The titles of the first four 
plots simply represent the four different values of Ts. The addition of network 
features can consistently improve the performance under various settings, though the 
extent of improvement is different depending on the conditions. The best on average 
feature combination across all settings still seems to be static+dynamic, showing the 
network features alone can have good predicting power, at least to the Nepal testing 
set. While I do not see significant performance differences with varying T and h, Δ 
with 6h indeed does obviously better than Δ with 3h. Considering the static feature 
combination as the case when Δ=24h, there seems to exist an optimum value for Δ to 
give the best performance. A possible explanation for this is: too large Δ cannot 
provide enough dynamic information while too small Δ value may incorporate many 
null feature points thus also degrade the performance due to overfitting. This 
optimum value will depend heavily on the training and testing sets and can be found 
with fine tuning carefully, though the current settings seem satisfactory.  
  
 54 
 
 
 
 
 
 
        Figure 3.9 Rank prediction using GLM on Nepal set varying T, h and Δ  
 
  
 55 
 
 
 
Figure 3.10 Rank prediction using GLM on Nepal set varying the underlying networks 
 
The above Figure 3.10 compares the rank prediction results using two network 
statistics from two different networks—retweet network and mention network. In 
Twitter the mention network is implied denoted with "@" and it is common that a 
user can mention multiple users in a single tweet, which is unlike retweets. And all 
retweets generally belong to mentions, so do all replies towards the tweets. Thus I can 
consider the mention network is including all sorts of user communication, but is not 
mainly diffusion-oriented.  Therefore retweet network features outperform mention 
ones significantly, though the mention network will be much better connected than 
the retweet network.  
 
I have already shown the network features can improve the prediction performance 
with respect to the baseline ones, but I would like to further investigate whether this 
improvement is stable or not.  Thus I carry out the stability analysis in Figure 3.11. 
The rank correlation is calculated from each single run of the experiment instead of 
the average scores from the above plots. Then a rough distribution of the rank 
correlation scores is simulated by repeating the experiments n=100 times. The mean 
and mean±sd values are computed for comparison. For GLM both static+dynamic 
and baseline+static+dynamic outperform the baseline statistically. Moreover, GLM 
stands out the rest two in terms of stability, which is not surprising since both the rest 
two methods have much "randomness" associated with them, especially for deep 
learning due to hidden layer values drop out.   
 
Figure 3.12 and Figure 3.13 show the rank prediction results for Sandy set and 
Blizzard set as the extension study.  The effect of the testing sets is clearly large, 
especially for the network features. The best feature combination is 
baseline+staic+dynamic and the improvement towards baseline is less. This implies 
testing sets will have considerable influence on the predicting power of network 
statistics, but less influence for the baseline ones since they remain stable over time. 
That might be the reason why many of the baseline features share popularity among 
  
 56 
 
previous related studies—they do provide robust results that are less affected by the 
data sets under study.  
 
 
 
  
 57 
 
Figure 3.11 Rank predictions on Nepal set method stability—with methods GLM, RF and 
DL from top to bottom 
 
For the prediction on the Sandy set, the Irene set is the default training set. GLM still 
provides the best results. The result on the Sandy 4.5 set is much better than the 
Sandy set. A possible explanation for this is that the diffusion has just broken out at 
the beginning time of Sandy 4.5, just like the case of Nepal set. This makes the 
predicting power of the features much better than ones from Sandy set, where the 
diffusion has not started.  
 
 
    
Figure 3.12 Rank predictions using GLM on Sandy set varying methods and testing sets 
 
As to the prediction on Blizzard set, the training set is Irene+Sandy for the first 
methods comparison plot. Again GLM still gives the best on average results. Like in 
the Nepal case, different training sets do not give different results. The feature 
combination baseline+static+dynamic consistently produces the best results over 
various settings, which validates the fact that network statistics can indeed improve 
  
 58 
 
the prediction but the extent of improvement will largely depend on the testing set 
and other various settings.  
 
 
 
Figure 3.13 Rank predictions using GLM on Blizzard set varying methods and training sets 
  
 59 
 
 
 
 
          Figure 3.14 Rank prediction on Nepal set GLM variable importance 
 
Figure 3.14 provides the variable importance measure for Nepal set rank prediction 
and the titles for each plot corresponds to various T values. The subscripts for the 
network features refer to the separate time windows of length Δ. All values are 
computed based on n=100 runs, where during each single run different features can 
be selected and different feature weights can be obtained. The upper histograms show 
the frequency of selection from the LASSO algorithm for features over 100 runs, and 
the lower bar charts show the distribution of feature weights for top 10 features 
ranked by the absolute values of their weights over 100 runs. Some baseline 
features—number of followers, number of tweets, number of tweets with URLs and 
past scores are always selected and exhibit high weights across different T values, 
indicating high predicting power consistently. But certain dynamic network features 
are also always chosen and have significant weights, sometimes even outperform the 
baseline ones. Moreover, I can see from the bar plots that all feature weights are 
robust over the 100 runs—again indicating the stability of the GLM method.  In 
  
 60 
 
summary, the network features not only help the prediction, but also are also of great 
importance in the prediction—not add-ons features that just improve the results.   
3.3.2 Classification 
 
 
 
Figure 3.15 Top quantile users binary classification comparison for Nepal set—using 
classification and rank prediction 
 
Figure 3.15 gives the results comparison for top quantile nodes binary classification 
using direct classification and rank prediction. The default method for rank prediction 
is GLM and the default quantile for prediction is 5%. The lower plot uses random 
forest for the classification since it is the best on average method for classification. It 
is interesting to find the rank prediction can also handle this task well, even better 
than the standard classification. Larger improvement is observed for method GLM, 
where the main reason may be the down-sampling in the classification since the 
quality of the CV operation in GLM depends heavily on the size of training set, and 
there is no training data loss due to down-sampling for rank prediction. Great 
improvement of rank prediction over classification with quantile 1% seems to support 
  
 61 
 
this hypothesis, consideration the sample size for 1% is only 400 for classification 
compared to 20k for rank prediction for each run.  
 
 
 
 
  
 62 
 
 
  Figure 3.16 Top quantile users binary classification using GLM from rank prediction 
varying parameters 
 
However, this down-sampling seems to be necessary for classification, the 
performance turns out even worse without it. The main reason that the rank prediction 
is well suitable for this task may simply be that it is a top rank nodes classification 
problem—meaning the positive class to be predicted is highly correlated with the 
rank. While for classification of user classes less correlated with rank, the results 
would be different.  Figure 3.16 lists results from various settings for the top quantile 
nodes binary classification using rank prediction. They are similar to results from the 
ordinary rank prediction, which is natural. The order of the feature combination 
remains stable over various settings. Combined with similar results from rank 
prediction, I conclude the improvement of network statistics is robust.    
 
 
  
 63 
 
 
 
  Figure 3.17 Top quantile users binary classification from rank prediction PR curves—1%, 
5% and 10% from top to bottom 
 
Figure 3.17 displays the actual precision-recall curves for three quantile values. The 
sharp rise for feature combination static is likely due to the first few top-ranked nodes 
by prediction scores do not include any true positive values. While the PRAUC value 
can be considered as an average measure of the performance, the PR curve gives the 
full dependency plot of precision and recall values. Again the improvement of 
network statistics is still robust across changing recall values.   
 
 
 
     Figure 3.18 Rapidly increasing users binary classification from rank prediction PR 
curves—1%, 5% and 10% from top to bottom 
 
Figure 3.18 shows the results for rapidly increasing nodes binary classification using 
standard classification (with RF) and rank prediction (with GLM). The x-axis stands 
for the percentage of increase, and y-axis uses the same PRAUC score as the top 
quantile nodes classification. The ordinary classification can simply be replaced by 
  
 64 
 
rank prediction since they are producing comparable results. Compared to the true 
fractions of these nodes, which can be considered as the approximate scores for the 
complete random model, the results are bad—only do twice better towards the 
random guess. The differences among various settings are not significant; this simply 
implies that my framework cannot handle this task well based on given data.  
 
3.3.3 Direct prediction 
 
 
 
 
      Figure 3.19 Direct predictions on Nepal set varying methods and T 
 
Figure 3.19 gives the results of direct prediction on Nepal set varying methods and T, 
recall the default training set used here is Irene+Sandy.  The GLM still performs the 
best, network features still improve the prediction consistently and dynamic network 
features still outperform the static ones. Considering the response distribution on the 
log scale, results with RMSE around 2 can be considered good.  However, after I 
  
 65 
 
perform some extension experiments on direct prediction, I find the results are not as 
stable as ones obtained from rank prediction.  
 
 
 
    Figure 3.20 Direct prediction using GLM on Nepal set varying training sets  
  
 66 
 
 
Set Irene Sandy Sandy 4.5 Irene+Sandy Irene+Sandy 4.5 
Log past scores 0.394 0.740 0.742 0.561 0.277 
 
      Table 3.3 KS statistics of past scores between various training sets and Nepal set  
 
Figure 3.20 lists the result of direct prediction on Nepal set by varying training sets. 
The second plot is the same as the first except results from Sandy 4.5 are removed for 
better visualization. Clearly the choice of training sets proves to be important for 
direct prediction, which is unlike the rank prediction case. The performance can go 
from pretty good to extremely bad if the training set is not properly chosen. The main 
reason lying behind should be the heterogeneity between the training and testing sets. 
However, there is no existing rigorous method for training set selection for my task 
here. Thus I try to propose some empirical criteria on my own. One potential 
candidate is using the Kolmogorov–Smirnov (KS) statistics, which is a commonly 
used metric to measure similarity between two probability distributions.  Here I 
simply calculate the KS statistics of past scores between each given training set and 
the testing set. The KS scores are computed using the R base function "ks.test". The 
reason to use past scores is that I believe past scores represent past activity of users 
and should be most correlated to their future response. I can definitely compute the 
KS statistics for each feature that I think are important and aggregate the results to 
give a single output. However, given that the variable importance is not given 
beforehand, it is difficult to determine which features to choose and how to aggregate 
the resulting scores from each feature. Thus I will just use past scores to compute KS 
statistics for now being.  
 
Comparing the prediction results in Figure 3.20 and KS statistics in Table 3.3, the KS 
statistics can indeed identify good training sets to some extent, though still far from 
perfect. Since the calculation of KS statistics depends on the sample size of the 
empirical distributions, I need to compare KS scores of single training set and double 
training sets separately. Following this, Table 3.3 tells Irene set is the best single set 
while Irene+Sandy 4.5 is the best double sets, which all fit the results well. However, 
the KS scores cannot tell whether Sandy or Sandy 4.5 is worse, thus my suggestion is 
just use the score to choose the best one (the one with significantly lower score) to 
use.   
 
  
 67 
 
 
 
Figure 3.21 Direct predictions on Sandy 4.5 set and Blizzard set varying methods 
 
Figure 3.21 gives the results for the direct prediction extension study on Sandy 4.5 set 
and Blizzard set varying methods. The reason to use Sandy 4.5 set simply is it returns 
better results for rank prediction than Sandy set. The leading position for GLM is still 
firm, though the order of feature combinations are not so stable compared to rank 
prediction.  
 
Figure 3.22 shows the results on Blizzard set using GLM method with varying 
training sets. Similar problems as Nepal set prediction are observed, and it becomes 
even worse. Also the performance of network statistics suffer most from wrong 
choice of training set, although they still provide some improvement for the best 
training set.  This is why the direct prediction is unstable.  
 
  
 68 
 
 
 
 
Figure 3.22 Direct prediction using GLM on Blizzard set varying training sets 
 
 
 
  
 69 
 
Set Irene Sandy 
4.5 
Nepal Irene+Sandy  Irene+Sandy 
4.5 
Irene+Nepal 
Log past 
scores 
0.380 0.900 0.590 0.102 0.444 0.444 
 
       Table 3.4 KS statistics of past scores between various training sets and Blizzard set  
 
Once again, the KS scores provide good suggestion on choosing a proper training set. 
The best single set and double set given by KS scores is Irene and Irene+Sandy 
respectively, both of them are indeed the ones that give the best on average results. 
Since the network statistics could potentially suffer much degradation due to 
improper choice of training sets, I use the more stable baseline feature combination 
for the direct prediction task when I cannot pick a best training set from KS scores or 
other related metrics.  
 
 
           Figure 3.23 Direct prediction on Nepal set GLM variable importance  
  
 70 
 
Figure 3.23 shows the variable importance of GLM method on direct prediction of 
Nepal set. The setting is exactly the same as the rank prediction case. Here I rank the 
frequency with increasing order in the upper selection frequency plot since most of 
the features are selected all the time. For the feature weights bar charts, there are 
more dynamic network features staying on this top10 rank than the rank prediction 
case, recall the features are ranked by the absolutely value of their weights.  
Therefore I believe the network statistics still have great potentials for the direct 
prediction task, but how well they can perform will heavily rely on a good choice of 
training set.  
 
 
User_name 
(rank/log(score+1)) 
Past score  Predicted 
by SD 
Predicted 
by B 
Predicted 
by BSD 
True  
ActualidadRT 13/7.84 4/9.11 17/8.69 7/9.65 7/8.60 
johnspatricc 32/7.33 9/8.68 66/7.89 36/8.35 6/8.60 
AP 8/8.67 15/8.38 4/8.24 11/9.95 3/8.82 
TurkKizilayi 238/5.83 181/6.26 516/4.90 301/5.93 2/8.83 
ksushma140 21/7.54 11/8.46 299/5.64 172/6.54 5/8.66 
 
               Table 3.5 Prediction further investigation on selected nodes using GLM  
 
In order to further investigate how the features are acting on various users, I selected 
some users as examples to see their individual prediction output. Table 3.5 shows the 
prediction results for 5 selected users. The short notation SD, B, and BSD refer to the 
feature combination static+dynamic, baseline and baseline+static+dynamic 
respectively. The values in the table are shown in the format of "rank/log(score+1)". 
For users "johnspatricc" and "ksushma140", the network features alone can make 
good prediction while the baseline features are off target. On the contrary, baseline 
features predict well on user "ksushma140" when the network features are off the 
target. And in both cases, combining baseline and network features will generate 
average outputs.  As results from various settings show, baseline+static+dynamic is 
indeed the most stable feature combination and nearly always outperforms the 
baseline ones. However, for user "TurkKizilayi", none of the feature combinations 
can do a decent prediction. The main reason is that this user is simply not active in the 
past if I look at the past scores, thus I cannot obtain informative features. Thus this 
brings one drawback of the framework—it does not well on users with poor activity, 
which is something hard to avoid for this type of urgent diffusion prediction unless 
additional information can be obtained from external resources.  
 
Thinking on the side of stake holders who want to find potential influential users, the 
actual scores or rank of the users might not be a major concern, where the goal is to 
find a list of potential candidates.  In that case, the main goal is to discover how many 
"real" top nodes are missed from each feature combination. Table 3.6 shows the 
missing count by the three feature combinations plus the missing count by the union 
of top nodes predicted by all three feature sets.   
 
  
 71 
 
 
 
 
 
Missing 
counts by 
rank 
Missed by 
SD 
Missed by B 
Missed by 
BSD 
Missed by 
all 
Count of 
the union 
Top 10 5 6 7 4 16 
Top 20 11 12 11 8 33 
Top 50 19 30 24 15 75 
Top 100 40 45 44 27 144 
 
               Table 3.6 Missing nodes count under various feature sets using GLM  
 
For instance, in the first row it shows among the real top 10 nodes ranked by the true 
response, only 5 are captured by top 10 nodes given by feature set static+dynamic and 
only 4 are captured by top 10 nodes given by feature set baseline etc.; if I union the 
top 10 nodes from three feature combinations together, I can capture 6 of real top 10 
nodes but the total number of nodes I am using will increase to 16.  Here I illustrate 
an alternative way to solve the problem, while whether using the union to achieve the 
maximum coverage is subject to the practical implementation considerations—e.g., 
what is the cost of targeting each user and what is the benefit if I correctly specify one 
more user.  
 
3.4 Discussion 
3.4.1 Summary 
In this Chapter I have described a comprehensive study that is across various settings 
on user influence prediction during urgent event diffusion.  The main findings are: 
  
1. The network features as defined, especially the dynamic network features can help 
the prediction and improve the prediction performance over baseline features across 
various settings consistently. The dynamic features outperform the static ones 
consistently—implying the additional dynamic information is helpful in the 
prediction. Similar results are shown by both the prediction results and the variable 
importance ranking. The extent of improvement largely depends on the testing set and 
other parameters.  
 
2. GLM proves to be the best prediction method considering various factors: 
prediction performance, interpretability, variable importance and runtime efficiency 
etc. The good performance of GLM implies that there could be strong log-linear 
dependency among the selected features and the response. In contrast to the 
  
 72 
 
conventional impression that the nonlinear models will usually outperform the linear 
model, the two nonlinear methods involved in my prediction task perform similarly or 
even slightly worse than the linear method. One direct reason for this could be that 
there is no significant nonlinear relationship between the selected features and the 
response.  
 
3. The rank prediction demonstrates the most stable and reliable result, while the 
result for direct prediction is not stable and highly depends on the choice of training 
sets and testing sets. Moreover, it is shown that both classification tasks—top quantile 
nodes and rapidly increasing nodes classification can be solved well using rank 
prediction solely.  
 
4. KS statistics can potentially serve as an empirical criterion to choose training set 
for direct prediction, which proves to be of extreme importance.  
 
5. Both baseline and network features do not predict well on nodes with poor activity, 
nor for the rapidly increasing nodes classification task. The main reason behind is 
similar—the features used are not informative, external resources or data is needed to 
handle this problem.  
 
In summary, I have established a novel early time prediction framework that utilizes 
all of the limited information that would be available in an urgent diffusion context 
and still achieves reasonable results. Moreover I have shown that the newly adopted 
features from the underlying diffusion network have consistently impressive 
predicting power on the future user influence.   
3.4.2 Future work 
One crucial future direction would be to extend this framework to other types of 
urgent events other than the natural disasters used in my work, which should output 
comparable results, at least in the Twitter context.  Another promising direction is to 
evolve the whole framework would be to other platforms—e.g. Facebook or 
Instagram, which again should not be difficult to do since these social media share 
much in common and as noted in the introduction part—many models and methods 
can be applied interchangeably among these platforms. Finally, provided with enough 
time and engineering effort, it would be possible to build a user friendly API to 
implement the framework that would interface with the Twitter API to provide real-
time online prediction for any user of the API.  
 
 
  
 73 
 
Chapter 4: Predict topic popularity during urgent event diffusion 
 
4.1 Introduction 
4.1.2 Motivation 
In the previous section, I discussed two well—cited studies on hashtag popularity 
prediction and the main reasons why they are not well fit for my prediction task.  In 
this Chapter, for the hashtag prediction task I describe the use of the number of 
retweets a hashtag can obtain in the prediction period as the prediction goal. 
Generally speaking, the framework for hashtag popularity prediction is similar to the 
user prediction case since they both belong to the urgent diffusion prediction 
category. Since the behavior of hashtags can be somehow considered as an 
aggregation of many users' behavior, there should be something more general in 
common between the two tasks. Thus I propose to make the general settings of the 
two tasks as similar as possible. I also adopt two sets of features—baseline and 
network features for the hashtag prediction, corresponding to ones used in the user 
case. Moreover, since I do not adopt any text content features for the user prediction, 
I also remove them from the general settings for hashtag prediction. But to compare 
with previous studies, I also show how the results can vary if all available text content 
features are included.     
 
As the topic prediction part, I will use the LDA model for topic modeling and propose 
prediction on generated latent topics. One important question left to determine is the 
definition of documents. Godin (Godin, 2013) uses single tweets as documents to 
train LDA models for hashtags recommendation towards general tweets not including 
hashtags. But Hong (Hong L. a., 2010) has pointed out that the effectiveness of topic 
models can be highly influenced by the length of the "documents", and for better 
topic modeling aggregation of short messages is recommended. Ma (Ma, 2013) 
considers hashtags as documents and uses topic probabilities derived from LDA 
model as text content features for hashtag popularity prediction, suggesting hashtags 
are potential candidates for LDA documents. Moreover Mehrotra (Mehrotra, 2013) 
points out pooling of tweets by hashtags can provide an improvement over LDA topic 
modeling of single tweets. Therefore in this task I decide to simply treat hashtags as 
documents and perform LDA topic modeling only on tweets containing hashtags, this 
is due to the similar reason I mentioned earlier in literature review part: I do not have 
the time and interest on specifying related hashtags for general tweets since it is not 
the main focus of my study. Further details on hashtags text cleaning and LDA 
modeling are mentioned in the next section.  
 
  
 74 
 
In this Chapter I will show there are a great similarity yet some slight differences 
between prediction of user influence and prediction of hashtag popularity. In both 
tasks I find addition of network statistics can have significant improvement on the 
prediction performance over baseline features, also the linear method outperforms the 
nonlinear ones consistently—possibly indicating strong log-linear existing between 
selected features and target response for both prediction tasks. The LDA topic 
prediction shows a slight different pattern since each topic is a weighted aggregation 
of all the involved hashtags.   
 
 
4.1.2 LDA model implementation 
Before implementing the LDA model, I need to preprocess the text first. The steps for 
the text preprocessing are exactly the same as in Chapter 2 for hashtags clustering and 
classification analysis (see Figure 2.7).  One extra step to do before the text cleaning 
is filtering. Here I adopt a simple filtering strategy: removing hashtags with unknown 
digits or including only one digit, by doing so many noisy and irrelevant hashtags will 
be removed. More involved and rigorous filtering may include spam hashtags 
detection (Stringhini, 2010) (Castillo, 2011) or hashtags semantic clustering (Costa, 
2013) (Ozdikis, 2012) (Vicient, 2014). However, these hashtags filtering and cleaning 
approaches often require well labeled data to train the models and validate the results, 
which are not available to me; moreover, my main goal is to explore the early time 
prediction framework for urgent diffusion on Twitter, and I believe hashtags filtering 
will not affect the performance much but will indeed matter for practical 
implementation where the quality of the underlying topics will be a big concern.  
 
After the simple filtering is performed, each hashtag is considered as a separate 
document and the text of the document is simply the aggregation of all tweets 
including that hashtag. The text is processed following the steps showing the Figure 
2.7. Then the top 10k tokens, across all documents, will be used with tokens 
occurring over 50% across all documents eliminated to remain more distinguishable 
tokens among the documents. Finally TF-IDF weighting is applied to improve the 
modeling performance.  
 
Generally speaking, the Latent Dirichlet Allocation (LDA) model, is a flexible topic 
model to extract latent topics from a collection of documents while each document is 
considered as a mixture of various topics with the probabilities provided by the 
model. This is where the flexibility comes from—each document can either be more 
concentrated on fewer topics or be more diverse on more topics. Another advantage 
of LDA is the generalizability to new documents, compared to the other popular topic 
model— probabilistic latent semantic analysis (pLSA), of which the size of 
parameters is increasing linearly with the number of training documents and causes 
serious overfitting problems.  
 
Figure 4.1 shows the plate notation for the ordinary LDA model. Here I assume there 
are M documents, K underlying topics and each document has the same number of N 
  
 75 
 
words for simplicity. Since the general framework is Bayesian, each parameter 
involved is a random variable. Among these variables only the word identification 
variable—w is known to me; thus it is in grey color and rest unknown variables are in 
white.  More specifically, the meaning of each variable is:  
α: the Dirichlet prior parameter for document-topic distribution 
θ: the multinomial topic distribution probabilities varying from each document 
z: the topic assignment for each word within each document 
β: the Dirichlet prior parameter for topic-vocabulary distribution  
φ: the multinomial vocabulary distribution probabilities varying from each topic 
w: the identification for each word within each document 
 
The model considers each document as a collection of words, and the following is the 
general data generating process:  
1. Generate the document-topic distribution θ from the prior α for each document 
2. Generate the topic-vocabulary distribution φ from the prior β for each topic 
3. For each word within each document, draw the topic assignment z from the 
distribution given by parameter θ, then draw the word w from the topic-vocabulary 
distribution φ with the topic assigned as z  
This data generating process explains the high flexibility and generalizability of the 
model, and it can be further adapted to different situations by various extensions and 
derivations.  
 
           
 
                                              Figure 4.1 LDA model illustrations  
(source: https://en.wikipedia.org/wiki/File:Smoothed_LDA.png, this file is licensed under the 
Creative Commons Attribution-Share Alike 3.0 Unported license) 
 
As to the model inference, since the target posterior distribution is intractable, 
approximate approaches are implemented. There are two state-of-art inference 
approaches for LDA model—collapsed Gibbs sampling (Porteous, 2008) and 
variational Bayes method (Blei, 2003). While the collapsed Gibbs sampling integrates 
out unimportant variables to improve convergence speed, variational method chooses 
a simplified distribution to approximate the target by minimizing the KL divergence. 
In my study, I implement the LDA model using the python library genism (see 
  
 76 
 
https://radimrehurek.com/gensim/ for more implementation details), which adopts an 
online version of variational Bayes method. The algorithm is better scalable to larger 
collection of documents and can update the model continuously. For my task, 
estimates of the document-topic probabilities are the only ones that I care about. In 
most results I will adopt the default settings unless otherwise specified.  
 
An important parameter to be fixed before I apply LDA modeling is the number of 
topics for the model. Ideally this value should be determined by some external 
evidence—human judgement, extra information about the documents or some 
supervised leaning tasks associated with the underlying topics etc. If no external 
resources are available, like in my case, I have to resort to some internal metrics. One 
of the intrinsic metrics often used for topic quality measurement is the topic coherent 
score. Topic coherent scores are generated to measure content consistency and 
coherence within each topic. There are various definitions for the coherent scores, in 
my study, I will just stick to one of them—the UMass score (Mimno, 2011). This 
score calculates the conditional probabilities of less frequent terms on more frequent 
terms where all terms are from top ranked words by frequency from each topic. 
Recalling that I have adopted the TF-IDF weighting before implementing the LDA 
model, thus the TF-IDF weighted version of corresponding probabilities are used for 
both model inference and coherent scores calculation.  
 
Number  5 8 10 15 20 
Mean  -341 -354 -364 -377 -387 
SD 22 15 21 14 11 
 
Table 4.1 Average topic coherent scores distribution varying number of topics with 100 
repetitions for the whole Nepal set 
 
Table 4.1 shows the mean and standard deviation of average coherence scores across 
topics after n=100 repetitions under various number of topics for the whole Nepal set. 
More specifically, under a given number of topics, top ranked (top 20) terms from 
each topic of the whole Nepal set are selected to compute a score and then the 
average score from all topics is obtained; the process is repeated by 100 times, the 
corresponding mean and standard deviation of the mean scores are listed in the table. 
Simply judging by the mean value of the average score, the coherent score increases 
as the number of topics decreases. However, the standard deviation is large so that it 
can just fill the differences, implying the results are unstable. One possible 
explanation for this unstability could just be that the topics derived from the model 
are not table. Moreover, this is only a relatively static calculation, the actual number 
for the topics may change over time and the content of the documents (hashtags) can 
also drift. Since the main goal of my task is to check how the prediction will vary 
between the appearing topics—hashtags and the abstract topics, I believe the actual 
quality of the topic modeling is of less concern. Out of this consideration, I simply 
choose an intermediate number 10 as the number of topics for my task and I believe 
the obtained conclusion is well generalizable to other cases with different number of 
topic.   
  
 77 
 
 
 
4.2 Prediction on hashtag popularity 
4.2.1 Response distribution 
For the hashtag prediction task in this section, the setting and the framework are 
similar to the user case. So I simply repeat the same analysis done in the previous 
Chapter to see how these two prediction tasks will vary from each other. The same 
sets of data are used except for Irene set, which is excluded since it does not have 
many hashtags involved. Figure 4.2 shows the response distribution of various data 
sets under the default setting: T=1d, h=5d and k=10k. However, I notice that for 
Nepal and Blizzard sets the total number of hashtags that has any retweets within 
period T is only about 7000-8000, thus the actual number of hashtags involved in 
prediction is not 10k here. The distribution patterns are similar to what I have 
observed for the user case: power law distribution and comparable normalized 
response distribution. Figure 4.3 shows the percentile and ratio of increase 
distribution for hashtags from Nepal set, the distribution is more extreme (the 
differences among the points are much larger) compared to the user case: the red 
threshold line raises about 20% in the quantile value (with the same threshold 
value=10) and the magnitude increases by about 2 for top quantile points.    
 
  
 78 
 
 
 
   Figure 4.2 Hashtag response distribution with various data sets with default parameters 
 
This "the rich get richer" is similar to the famous "Matthew effect" (Merton, 1968). 
The differences between the distribution of user score and hashtag score demonstrates 
this Matthew effect is much stronger for hashtags than users, which I believe is a 
natural result from the fact that the number of hashtags is much less than the number 
of users involved.  
 
  
 79 
 
 
        Figure 4.3 Hashtag scores percentile and ratio of increase distribution for Nepal set with 
default parameters 
 
Table 4.2 is the response distribution summary from various data sets with varying T. 
Here the Sandy set refers to the Sandy 4.5 set and I will use this reference for all 
prediction tasks in this Chapter.  
 
 1d 2d 3d 4d 5d 
Sandy 
Total 62755 80394 87838 93743 99759 
75% 1 0 0 0 0 
95% 18 9 6 4 2 
>=10 4793(0.729) 3828(0.728) 3440(0.744) 2816(0.783) 2030(0.833) 
>=100 879(0.914) 692(0.899) 607(0.904) 469(0.932) 299(0.960) 
Nepal 
Total 7800 11861 14457 16573 18309 
75% 8 3 2 1 1 
95% 110 51 35 25 19 
>=10 1795(1.000) 1739(0.969) 1670(0.947) 1575(0.928) 1460(0.927) 
>=100 424(1.000) 368(0.986) 334(0.991) 306(0.987) 267(0.978) 
Blizzard 
Total 7812 16018 19267 20991 21943 
75% 6 1 0 0 0 
95% 75 15 7 4 2 
>=10 1524(1.000) 1145(0.948) 719(0.940) 484(0.942) 308(0.954) 
>=100 312(1.000) 166(0.964) 97(0.969) 64(0.984) 38(1.000) 
 
     Table 4.2 Hashtag response distribution over varying T among various data sets (h=5d) 
 
  
 80 
 
 
 
 Figure 4.4 Top 1k hashtag components analysis over varying T values (h=5d) for Sandy set, 
Nepal set and Blizzard set from top to bottom respectively  
 
 
  
 81 
 
All the specifications of Table 4.2 follow exactly the same meanings as Table 3.2. As 
stated before, the total number of hashtags is much less than the number of users in 
the same settings, especially for Nepal set and Blizzard set. This makes the 
distribution more concentrated: top quantile values are larger and there are larger 
fractions of "active" hashtags (hashtags with response greater than 10 and 100) within 
top 10k hashtags (recalling the numbers in parentheses are the corresponding 
fractions within the top 10k hashtags ranked by the response). Thus top 10k hashtags 
better represent the population of "active" hashtags than the user case. Also the 
number of "active" hashtags is dropping over time: it implies the gradual decaying of 
the diffusion and coincides with the observation for the user case.  
 
Similar to Figure 3.7, Figure 4.4 shows the components evolution of the hashtags 
over time. But since I have much less number of total hashtags, I choose a smaller 
threshold for hashtags components observation—top 1k hashtags as opposed to top 
10k users for the user case. Here I follow the same definitions for the three types: old 
ones refer to hashtags staying in the top 1k rank within past period (period [0,T]), 
rising ones refer to hashtags existing in the past and staying below the top 1k rank, 
and new ones refer to hashtags not existing in the past. All fractions will correspond 
to the fractions over top 1k hashtags by response during period [T,T+h]. Recall the 
Sandy set here is actually the Sandy 4.5 set, thus it is not surprising to see the 
fractions for all three types of hashtags remain steady over time since the diffusion 
turn to be steady during that period for Sandy. For the Nepal and Blizzard set, the 
fraction of new hashtags follows a gradually decreasing trend which means the 
diffusion process has turned from eruption state to steady state. While the old 
hashtags takes the lead at most times for all three sets—showing relative higher 
stability in the ranking compared to the user case, the behavior of rising ones is 
slightly different—showing differences of hashtag dynamics among three data sets.     
 
To summarize, compared to the user case, similar yet more concentrated distributions 
exist for hashtags, and relatively consistent dynamic patterns from evolutions of both 
response and node components persist. All these similarities seem to demonstrate the 
same underlying truth about the urgent diffusion regardless of the diffusion media 
(users or hashtags) I have chosen. Thus consistent prediction results should also be 
obtained for the hashtag case and I will show it in next sections.     
 
4.2.2 Experiment factors 
The basic experiment settings will be the same as the user case. I will set up a similar 
series of experiments compare these two prediction tasks on every aspect. Thus I will 
only mention the parts that will be different. Since now the prediction is based on 
hashtags, the extracted features will be different from the user ones. I am still going to 
have two general types of hashtag features: baseline features and network features. 
Apart from these two general types, there are also other features for extensive studies.  
 
The general baseline features will include:  
  
 82 
 
'Number of tweets', 'Number of retweets', 'Number of mentions', 'Number of tweets 
with URLs', 'Number of users', 'Average user tweets', 'Average user followers' and 
'Average user friends'.  
All the above features are derived from tweets containing specific hashtags.  
And the general network features (both the static and dynamic ones) include: 
'Average indegree', 'Density', 'LSCC', 'LWCC' and 'WCC'  
As opposed to the node level features for the user prediction, these network features 
are global level ones derived from the retweet networks formed from each hashtag.  
 
Other additional features include network features from the co-occurrence (COO) 
network and text content features from hashtag tweets text.  The COO network is 
defined to capture the co-occurrence relationship among hashtags. For tweets 
including multiple hashtags, I form pairwise undirected edges between all hashtags 
showing up in the same tweets. When users put multiple hashtags in the same tweets, 
it is actually indicating those hashtags are somehow related from other, at least 
through the content of the tweets. Therefore the motivation to setup this network is to 
characterize the relationship among various hashtags. I would like to know if features 
derived from this network will have any impact on predicting the hashtag popularity. 
The network features with both static and dynamic versions include:  
'Degree', 'Pagerank', 'Eigenvector centrality', 'Closeness centrality', 'Authority score', 
'Hub score' and 'Local transitivity'.  
All these are node level features defined on the undirected COO network.  The text 
content features are derived from the tweets text—including the hashtag text itself. 
These features are derived following recommendations from the previous work (Tsur, 
2012) (Ma, 2013) and availability to me, which contain the following:  
'Hashtag character length', 'Hashtag digits indicator', 'Number of tokens', 'Average 
polarity', 'Clarity score'. The first two are simply features from the hashtag text. The 
other four are features from the text of hashtags. The polarity score is computed to 
measure how extreme a given tweet is. The score ranges from -1 to 1: -1 for 
extremely negative content, 1 for extremely positive content and 0 for neutral content. 
Average polarity is the average polarity score for each tweet from a given hashtag. 
Since I am doing log(x+1) transformation, I add 1 to all polarity scores to avoid 
getting invalid values. Clarity score is calculated in the same way as (Ma, 2013), 
which is the KL divergence between the token distribution within hashtags and the 
distribution within text of all hashtag tweets. This score serves as a quantitative 
measure to detect the distinction between document text and the background text—
thus the "clarity" of the document. These additional features are adopted for an 
extension study to identify whether they can improve the prediction in addition to the 
above two general types.  
  
 83 
 
4.2.3 Rank prediction results 
 
 
 Figure 4.5 Nepal hashtag rank prediction results varying by methods, training sets and T 
  
 84 
 
Figure 4.5 shows the results for Nepal hashtag rank prediction varying by methods, 
training sets and T, with exactly the same setting as the user case for direct 
comparison. Default values are assumed for prediction parameters not mentioned 
explicitly. All the patterns observed in the figure are similar to the user case:   
1. GLM still does the best overall job and the differences among methods are subtle 
when number of nodes involved for the rank prediction is large.  
2. The retweet network features improve the prediction when added to baseline ones.  
3. Dynamic features not only do steadily better than static ones, but also outperform 
baseline ones for themselves alone when the number of nodes involved is not large.  
4. Consistent results are observed by varying T values. 
Since Sandy 4.5 set outperforms Sandy set evidently, I will use Sandy 4.5 set as the 
default training set.  
  
Figure 4.6 lists the results from several additional comparisons. Here RT_network 
features refer to both static and dynamic network features from retweet network, and 
COO_network features refer to corresponding ones from the COO network. Unlike 
the slight improvement by using retweet features in the user case, here the 
performance of the network features derived from retweets and mentions are almost 
the same. I think the possible reasons are: the node level network features are used for 
user prediction while global level features are used for hashtag prediction, and the 
variability among node level features are much higher than global level ones, thus 
node level features seem to able to exhibit the distinction between retweets and 
mentions more obviously. The second plot shows the results using COO network 
features. Unlike the retweet network features, the COO network features alone cannot 
provide good results. The retweet network features have prominent prediction power 
for smaller number of top hashtags, when the improvement from the baseline ones is 
also greater. When the number of hashtags involved is large, where many of them are 
almost inactive (Figure 4.3), the performance of network features degrades and less 
improvement is observed. Though adding COO features to baseline ones can obtain 
intermediate improvement, using them alone has the worst performance as always. 
Moreover, the feature combination with all features only barely beats the combination 
of basline+retweet features when the number of hashtags is large, but has much worse 
performance when the number is small. Therefore the COO features are much less 
effective than retweet features, thus I do not recommend extracting them for the 
prediction since extra amount of network computing is involved and it leads to low 
benefit cost ratio. However, when computing resource is not a big concern, they can 
still be added to provide an alternative solution. As to the text content features, to my 
surprise, themselves alone are even giving much worse results than the COO features 
(about 0.1-0.2 on the rank correlation score, not shown in the plot to provide better 
resolution). What's more, the baseline+content feature combination has no 
improvement over baseline, and so does the combination of all features over 
baseline+retweet combination. This simply implies the content features have nearly 
no predicting power at all, at least to this rank prediction task.  
 
  
 85 
 
 
 
 
  Figure 4.6 Nepal hashtag rank prediction results by varying additional factors 
 
  
 86 
 
This observation is in contrast to (Ma, 2013), where some of the content features are 
considered to be effective. I speculate the main reason lies in the context of the 
prediction task: some derived content features use information from a long term 
collection of data set with a mixture of all types of hashtags, while my task is 
focusing on prediction of hashtags restricted to limited time period and a specific 
event. Regardless of the true underlying reason, I have shown that the retweet 
network features turn out be much more effective than content features in this special 
urgent diffusion prediction task. This may have profound implications: content-based 
features, while could contain rich information and be highly effective potential 
predictors, cannot fully utilize their potentials and perform well on a prediction task 
restricted with limited information; while diffusion based network features, on the 
other hand,  can still stably provide decent performance under this difficult condition.  
 
 
 
 
 Figure 4.7 Blizzard hashtag rank prediction results by varying methods and training sets 
 
  
 87 
 
Figure 4.7 shows the results for Blizzard set hashtag rank prediction for extension 
study and the default Sandy 4.5 training set is used in the first plot. Like in the user 
case, the results turn to be a little noisier, but these should be reasonable considering 
the high variability nature of the prediction task. Compared to the other two methods, 
GLM is still the best choice if I take its various advantages into consideration. The 
network features still improve the performance overall, just not to the extent of the 
Nepal case. As to the training sets, Nepal and Sandy 4.5 seem to perform slightly 
better than Sandy set, but they are close on average. This is in consistent with what I 
have observed for the user rank prediction—the differences among various training 
sets are not significant.  
 
 
 
     Figure 4.8 Nepal hashtag rank prediction GLM method stability  
 
Same as Figure 3.11, Figure 4.8 is the stability plot for GLM method for the hashtag 
rank prediction task (n=100). The result is even more stable than the user case, which 
could be simply because the hashtags have more stable rank. The other two methods 
still have much worse stability like in the user case and thus are not shown here to 
save space. Therefore I can say GLM is the best method for both user and hashtag 
prediction tasks in terms both performance and stability, which could possibly 
suggest a strong yet stable log-linear dependency between the selected features and 
the target response.  
 
The results for the classification tasks are in Figure 4.9. For the top quantile 
classification task the rank prediction has comparable performance as formal 
classification (same setting as in the user prediction). This is basically the same 
conclusion I have drawn for the user prediction. However, for the rapidly increasing 
classification task, rank prediction significantly outperforms formal classification, 
which is different from the user case.  Though not being satisfactory based on the 
PRAUC score, the performance of the rank prediction is nearly 8 times against the 
random guess across different increasing percentages, compared to only about 2 times 
in the user case. This improvement over user prediction may also result from the 
more concentrated hashtag response distribution. Recalling the previous quantile 
  
 88 
 
analysis, the same response threshold bar—10 for the rapidly increasing classification 
corresponds to about 80% quantile in the hashtag case and about 60% quantile in the 
user case respectively. Therefore the positive data points in the hashtag prediction 
case contain higher fraction of top-tier points by response than that of the user 
prediction case, which could lead to better alignment of the classification with rank 
prediction in the hashtag case over the user case.  
 
 
 
                Figure 4.9 Nepal hashtag classification results  
 
Figure 4.10 shows the variable importance ranking for the GLM method. Same to the 
user case, I am using the baseline+static+dynamic feature combination. But this time 
I reverse the order in the frequency plot since most of the features involved have been 
selected every time (n=100), more than the user case, which also explains the better 
method stability for the hashtag prediction. As to the top 10 variable magnitude rank, 
similar patterns as the user case are observed: there are constantly important baseline 
features like number of tweets and URLs, and also some varying dynamic network 
features such as the WCC values within different time windows, where the fractions 
  
 89 
 
of the selected baseline features and network features are nearly 1:1. Interestingly, for 
dynamic network features derived from the same network statistics—WCC, they can 
even have different signs in the weights; moreover they are also significant regardless 
of the signs. This again illustrates the necessity of adopting the dynamic network 
features: they can show varying effects under varying time intervals and each of them 
may turn out be significant as well; while simply using static features will not only 
hide these dynamic patterns but degrade the performance as well.   
 
   
 
 
        Figure 4.10 Nepal hashtag rank prediction GLM method variable importance 
 
  
 90 
 
4.2.4 Direct prediction results 
 
 
Figure 4.11 Nepal hashtag direct prediction results varying by methods, training sets and T 
  
 91 
 
 
 
 
Figure 4.12 Blizzard hashtag direct prediction results by varying methods and training sets 
 
Following similar settings as the user case, Figure 4.11 and Figure 4.12 show the 
direct prediction results for Nepal set and Blizzard set respectively. Compared to the 
user case, hashtag direct prediction exhibits both similarities and distinctions. In both 
cases, on average across all settings GLM has the best performance and addition of 
network features make improvement on prediction over baseline features.  Also the 
variations among methods and training sets are larger than the rank prediction. But 
unlike to the user prediction case, the variations among training sets are much smaller 
for hashtag prediction, which demonstrates more stable prediction. A possible 
explanation is that the heterogeneity among hashtag features and response across 
various data sets is much smaller compared to that in the user case. Although the 
selection of training sets is of less concern than the user case, prominent improvement, 
which is even larger than improvement from good features selection, can still be 
obtained if a proper choice is made.  
 
  
 92 
 
Set Sandy Sandy 4.5 
Log scores (top 3000) 0.722 0.746 
 
Set Sandy Sandy 4.5 Nepal 
Log scores (top 3000) 0.643 0.810 0.140 
 
Table 4.3 KS statistics of past scores for various training sets with Nepal set (top) and 
Blizzard set (bottom) 
 
 
 
 
        Figure 4.13 Nepal hashtag direct predictions GLM method variable importance  
 
In Table 4.3 the KS statistics from past scores for training sets selection are presented. 
Since now different sets have different number of data points (see Table 4.2), the 
comparison cannot be made directly with different sample sizes. Here I only use the 
top 3000 data points for the KS statistics calculation. For the Nepal set prediction, the 
two values obtained are close, thus I cannot make a good decision on which training 
set is better. For the Blizzard set prediction, though I still cannot distinguish whether 
  
 93 
 
Sandy or Sandy 4.5 is better, I can confirm with high confidence Nepal set should be 
the best choice. And the result indeed fits this well. Thus I believe the KS statistics 
with past scores can still be a good empirical criterion for training sets selection, 
given the distinction among the statistical values is large.  
 
Figure 4.13 displays the variable importance for direct prediction task for Nepal set 
using GLM method. Generally the plots look similar to Figure 4.10: most features are 
selected by the model all the time and top ranked features consist of both baseline and 
network features. One thing slightly different is that I observe higher fraction of 
dynamic network features in the top 10 ranked features by weights, as far as to 8-9 
out of 10. This demonstrates the dynamic network features are playing more 
important roles for the direct prediction task. And greater improvement effect can be 
observed for hashtags staying in the top rank, when these features are really taking 
meaningful values instead of many nulls.  
 
 
Hashtag 
(rank/log(score+1)) 
Past score  Predicted 
by SD 
Predicted 
by B 
Predicted 
by BSD 
True  
rebuild 13/7.80 458/3.14 15/6.68 95/5.17 1017/3.30 
supportnepal 138/5.29 56/4.75 77/5.06 43/5.65 13/8.72 
earthquakeagain 50/6.31 37/5.53 66/5.12 52/5.43 15/8.49 
helpnepalchildren 5970/0.69 7322/-
1.29 
4637/0.17 5887/0.15 9/9.31 
3news 4926/0.69 7032/-
1.29 
6421/-
0.22 
6961/0.006 18/8.34 
 
        Table 4.4 Prediction further investigation on selected hashtags using GLM 
 
 
Missing 
counts by 
rank 
Missed by 
SD 
Missed by B 
Missed by 
BSD 
Missed by 
all 
Count of 
the union 
Top 10 3 3 3 3 12 
Top 20 7 8 7 7 24 
Top 50 21 23 22 17 66 
Top 100 39 44 40 31 139 
 
        Table 4.5 Missing hashtags count under various feature sets using GLM  
 
Following the same ideas as Table 3.5 and Table 3.6, I further investigate the effect of 
features on prediction based on several individual hashtags. Tracing back SD 
represents feature combination static+dynamic, B represents baseline and BSD 
represents baseline+static+dynamic; the values in the table are rank and log(score+1) 
respectively. In Table 4.4, the hashtag 'rebuild' is over ranked by B but pretty well 
  
 94 
 
predicted by SD if I look at the direct prediction score vs. the true score. It is 
interesting to see here for this hashtag with log past score 7.8, the feature combination 
SD only predicts 3.14 compared to 6.68 provided by B, which clearly indicates there 
is unique information included in those network features so that they are suggesting a 
different result. For hashtags 'supportnepal' and 'earthquakeagain' the results are very 
similar for all features combinations and past scores, but they are different from the 
true values. This illustrates that for hashtags with intermediate past activity but with 
high future activity, none of the features may be able to make accurate prediction. 
The situation is much worse for hashtags 'helpnepalchildren' and '3news', where they 
showed barely activity in the past but high activity in the future, and none of the 
features can make even reasonable prediction. This is exactly the same as I have 
noticed in the user case: I cannot make good prediction out of nothing. But generally 
speaking, similar to the user case, the BSD combination will provide the most reliable 
and stable results since it can make use of the advantages of both network and 
baseline features. Table 4.5 serves the same purpose as Table 3.6 to provide a 
suggestion for maximum top hashtags inclusion. The improvement by union all 
results seems to be less, but it still depends on the actual cost function to determine 
whether it is worth to do or not.  
 
 
4.3 Prediction on latent topic popularity 
4.3.1 Features and response  
Following the same settings as mentioned earlier, I consider hashtags as documents to 
build the LDA topic model with number of topics to be 10. More specifically, I am 
using all hashtags (after content filtering) within period [0,T] to construct the LDA 
model, and then use the model to infer the topic probabilities for each hashtag 
involved. The weighted aggregation (the weights are the corresponding probabilities) 
of features and response of hashtags will be used as features and response for each 
topic. And since the content of documents (hashtags) will drift and change over time 
(will be shown later), the topics formed from them will also prosper and decay 
relatively fast. Thus I will set the default prediction period h to be 2d instead of 5d. 
The weights of features are given by the LDA model trained within period [0,T], 
while the weights for the response are inferred using the trained model on hashtags 
text within period [T,T+h]. For consistency consideration, I need to maintain the 
same set of hashtags for derivation of features and response. There may be newly 
invented hashtags in period [T,T+h] that have good alignment with existing topics, 
but I have to exclude them from the prediction task since they do not preserve 
features in the training period. The potential bias introduced by this exclusion is 
inevitable in this prediction task setting. Moreover there are other factors to be 
considered like documents drifting and noisy hashtags filtering, I will only select the 
top 10k hashtags by retweets during [0,T] for features and response extraction, which 
is the same thing as I have done for the user and hashtag prediction tasks. Thus the 
response—the topic popularity will be the weighted aggregation of top 10k hashtags 
  
 95 
 
response with weights inferred by the existing LDA model upon the text of the same 
10k hashtags during period [T,T+h].  
 
As to topic features, they also include the baseline features and network features, with 
the corresponding weights are from the LDA model during period [0,T]. The baseline 
features will simply come from the hashtag baseline features plus one additional 
'Average polarity', which I believe should be a useful feature for topics. For baseline 
features that are defined as summations: 'Number of tweets', 'Number of retweets', 
'Number of mentions', 'Number of tweets with URLs', 'Number of users', I use the 
weighted aggregation of hashtags ones to form the topic ones. For others that are 
defined as means: 'Average user tweets', 'Average user followers', 'Average user 
friends' and 'Average polarity', I use the weighted average of hashtag values as topic 
ones. But for network features I cannot do the same since network features are 
defined on one whole network and weighed aggregation of network statistics from 
various networks have no practical meanings. In this case I can construct a weighted 
network for each topic, with the edges coming from all edges of top 10k hashtags and 
the weights are corresponding probabilities from each hashtag for the same topic. 
Since there are not many well defined network statistics for weighted directed 
networks, I introduce two network statistics for this weighted network: 'Average 
indegree' and 'Median outdegree'. Other parameters not mentioned will simply adopt 
the default settings as previous sections.  
 
 
     
                 
            Figure 4.14 LDA topic response comparisons across various data sets  
 
Figure 4.14 lists the topics popularity (response with h=2d) over various T values 
across three data sets. Here the topic indices are simply ranked by increasing order of 
the topic response for better alignment comparison. The default topic indices are 
ranked by the topic coherent scores, which are the ones used in Figure 4.15. From the 
plots both within the sample range and cross samples (data sets) range of the topic 
response are about 2 on the natural logarithm scale. 
 
  
 96 
 
 
 
Figure 4.15 LDA expected versus true topic response comparison across various data sets 
(the data sets are Sandy, Nepal and Blizzard from top to bottom respectively) 
 
  
 97 
 
Since both features and response are weighted aggregation of hashtags values, they 
tend to have more homogenous within sample and cross samples distribution (only 
the distribution of the response is shown here). This is different from both user 
prediction and hashtag prediction, where the distribution is heterogeneous in either 
aspect. Thus for the topic prediction task I can simply adopt the direct prediction 
approach, where the unfavorable effect of heterogeneity is reduced significantly.   
 
Topic_5 Topic_9 Top 10 hashtags 
tokens 
probs_ex
pected 
probs
_true 
token
s 
probs_ex
pected 
probs
_true 
name 
volu
me 
'earthq
uak', 
'help', 
'quak', 
'aid', 
'toll', 
'death'
, 
'relief', 
'effort', 
'peopl', 
'victim' 
0.0537 0.279 
'mem
ber', 
'quak
ehit', 
'com', 
'click', 
'hat', 
'churc
h', 
'coord
in', 
'mort
o', 
'preac
her', 
'retur
n' 
0.341 0.0299 nepal 
166
411 
0.0282 0.367 0.544 0.0220 
nepalearthquak
e 
593
48 
0.0485 0.373 0.541 0.0216 earthquake 
236
44 
0.0362 
      
0.406 
0.549 0.0202 nepalquake 
201
05 
0.0460 0.287 0.540 0.0322 
nepalquakerelie
f 
116
10 
0.0411 0.145 0.361 0.103 helpnepal 
114
50 
0.0671 0.579 0.633 0.0686 
helpnepalchildre
n 
104
38 
0.0405 0.237 0.245 0.0240 prayfornepal 
670
9 
0.0322 0.0315 0.719 0.0279 supportnepal 
556
9 
0.0449 0.389 0.545 0.0466 
msghelpearthqu
akevictims 
521
5 
 
Table 4.6 Selected topic probabilities comparison for Nepal set with T=36h 
 
Figure 4.15 shows the response comparison using two types of weights (probabilities): 
the expected weights—ones indicated by the training model during period [0,T] and 
the true weights—ones derived from the text of same top 10k hashtags during 
[T,T+h]. The motivation behind is to investigate the document content changing over 
time. From top to bottom the sets are Sandy, Nepal and Blizzard respectively, and the 
title of each subplot corresponds to the T value. For majority of the topics across 
various T values and data sets, the expected values are close to the true values, which 
indicates the content of the corresponding documents is relatively stable at least to a 
short term future. But there are a few cases when the discrepancy is big—seen in the 
48h plot for Sandy set, the 36h and 48h plot for Nepal set. These cases imply the 
underlying hashtags (at least some major ones) are experiencing drastic changes from 
present to the near future since all topics are from existing models and they will 
  
 98 
 
remain relatively static.  Two topics that have exhibited the largest discrepancy in the 
Nepal 36h plot are further studied as examples to investigate the document drifting 
effect, and the results are listed in Table 4.6.  
 
Index Top 10 tokens 
1 'que', 'por', 'del', 'con', 'sandi', 'los', 'york', 'las', 'para', 'huracan' 
2 
'sandi', 'school', 'hurrican', 'tomorrow', 'power', 'day', 'get', 'class', 'cancel', 
'like' 
3 
'birthday', 'chill', 'candl', 'tormenta', 'cooki', 'social', 'cheek', 'cuba', 'cours', 
'tras' 
4 'blew', 'reason', 'dog', 'sandi', 'gym', 'roof', 'wors', 'dear', 'hate', 'calm' 
5 'sandi', 'hurrican', 'bitch', 'safe', 'east', 'coast', 'everyon', 'hope', 'fuck', 'stay' 
6 
'help', 'hurrican', 'sandi', 'donat', 'obama', 'victim', 'relief', 'romney', 'amp', 
'via' 
7 'hurrican', 'sandi', 'close', 'due', 'amp', 'power', 'the', 'new', 'storm', 'wind' 
8 'sandi', 'walk', 'someon', 'run', 'bore', 'alright', 'gas', 'drink', 'hurrican', 'water' 
9 
'anyth', 'spongebob', 'sandi', 'badai', 'hair', 'swim', 'twerk', 'whore', 'dan', 
'hoy' 
10 'sandi', 'hurrican', 'like', 'get', 'the', 'power', 'eat', 'came', 'room', 'are' 
 
 
Index Top 10 tokens 
1 
'yoga', 'des', 'earthquak', 'dog', 'south', 'auction', 'rescu', 'wow', 'everest', 
'surviv' 
2 'voor', 'per', 'van', 'een', 'met', 'uit', 'aan', 'het', 'aardbev', 'che' 
3 'todo', 'mit', 'ein', 'den', 'bruce', 'les', 'til', 'anim', 'pour', 'ist' 
4 'der', 'die', 'und', 'erdbeben', 'nach', 'das', 'aus', 'von', 'club', 'auf' 
5 
'rais', 'fund', 'earthquak', 'fundrais', 'relief', 'help', 'donat', 'money', 'today', 
'support' 
6 'terremoto', 'que', 'los', 'por', 'del', 'para', 'con', 'las', 'muerto', 'ayuda' 
7 'earthquak', 'quak', 'toll', 'death', 'hit', 'miss', 'helicopt', 'anoth', 'rescu', 'dead' 
8 
'gempa', 'korban', 'indonesia', 'untuk', 'wni', 'ronaldo', 'bantuan', 'tim', 
'cristiano', 'dari' 
9 
'help', 'donat', 'earthquak', 'peopl', 'pray', 'support', 'pleas', 'need', 'prayer', 
'victim' 
10 'earthquak', 'relief', 'help', 'effort', 'aid', 'via', 'disast', 'donat', 'quak', 'proud' 
 
Table 4.7 Top 10 tokens for each topic from Sandy set (top) and Nepal set (bottom) 
 
In Table 4.6 the probabilities for top 10 hashtags by retweet volume are shown to 
reflect the content variations of the topics. Clearly topic 5 is underestimated while 
topic 9 is over estimated. This suggests the content of underlying hashtags is shifting 
from topic 9 to topic 5, where I can distinguish their differences by the top 10 tokens. 
From this situation I can perceive the differences between the topics defined simply 
  
 99 
 
by hashtags and by latent topic models. While the former are annotated by a variety 
of people who believe the content should be relevant to a concentric but relatively 
vague idea, the later are solely content based and judged by computers using the 
distribution of tokens. The ideal topic model should be able to reach a consensus with 
human belief to some extent, with the extent can also be finely tuned.  
 
To further explore the topics formed by the models, I list the top ranked tokens for 
each topic formed by the full data sets in Table 4.7. There many of the topics indeed 
delivering concordant information, such as topic 2,5,6,7 for Sandy and topic 5,7,9,10 
for Nepal. Though there are also topics formed by languages other than English and 
thus cannot be understood well, this can barely be avoided since I do not have a well 
labeled training set to perform the language classification task. Generally speaking, 
the LDA model is really generating meaningful topics, although the best number of 
topics is still a concern here—larger number of topics can either split more general 
topics into smaller but also more concentric ones or capture some hidden interesting 
small topics, while smaller number of topics can aggregate fragmented yet less 
meaningful topics into consistent and more meaningful ones. But as mentioned earlier, 
the main goal of my study is not trying to build good topic models, but to investigate 
and compare how the prediction task will vary from the two topic definitions—
hashtags and latent LDA topics. Thus the quality of the topics is of less concern and 
should not affect my final conclusion.  
 
With the number of topics fixed, there will be drastic topic drifting over time due to 
content variation of the underlying hashtags. Some formed topics may only exist 
within a short period of time but some others may persist in long term. In Table 4.8, I 
select some persistent yet meaningful topics to see what kind of topics survives 
longer periods. The first two topics are from the Sandy set and the last one is from the 
Nepal set, the boldfaced tokens are ones that are shared through different periods. 
Here I choose the most persistent yet meaningful topics by measuring the fractions of 
overlapping tokens within the top 10 tokens within different periods since topic 
indices tell nothing about their relationship, this criterion can be modified subject to 
personal preference and I am simply using it for illustrative purposes. Again the 
indices of the topics correspond to the coherent ranking, thus they can serve as a rank 
of topic quality to some extent. The first Sandy topic expresses a clear theme: 
'Hurricane Sandy is approaching east coast and hope everyone stay safe'. This one 
should be the most long-lasting yet consistent topic observed for Sandy set, which 
should be reasonable considering the particular feature of events related to natural 
disasters. The second Sandy topic appears less straightforward but I can still infer the 
central idea is about school class cancellation due to the hurricane, where the 
discussion is likely to be initiated and prevailed among students.   
 
The Nepal topics seem to less persistent and the best one I have found only has 4 
tokens in common through all periods. But the idea conveyed is clear: all about relief, 
donation and aid due to deaths during the earthquake. There are other less persistent 
but maybe more centric topics which are also worth to study—about more specific 
issues such as certain rescue actions or church prayer activities. Therefore the topic 
  
 100 
 
model is achieving meaningful performance, and my main goal is to predict the future 
popularity of these topics.  
 
Collection 
period 
Index Top 10 tokens 
12h 9 
'safe', 'sandi', 'hurrican', 'everyon', 'stay', 'coast', 'east', 
'hope', 'power', 'pray' 
24h 5 
'safe', 'sandi', 'hurrican', 'everyon', 'coast', 'east', 'hope', 
'stay', 'peopl', 'pray' 
36h 7 
'safe', 'hurrican', 'sandi', 'east', 'coast', 'everyon', 'stay', 
'hope', 'peopl', 'pray' 
48h 8 
'hurrican', 'sandi', 'east', 'safe', 'coast', 'everyon', 'stay', 
'hope', 'new', 'affect' 
All 5 
'sandi', 'hurrican', 'bitch', 'safe', 'east', 'coast', 'everyon', 
'hope', 'fuck', 'stay' 
 
Collection 
period 
Index Top 10 tokens 
12h 10 
'school', 'sandi', 'tomorrow', 'cancel', 'class', 'hurrican', 
'day', 'get', 'power', 'work' 
24h 8 
'sandi', 'hurrican', 'school', 'get', 'tomorrow', 'sleep', 'like', 
'fuck', 'bitch', 'cancel' 
36h 10 
'sandi', 'hurrican', 'school', 'tomorrow', 'get', 'fuck', 'bitch', 
'day', 'got', 'class' 
48h 10 
'school', 'tomorrow', 'class', 'sandi', 'cancel', 'hurrican', 
'day', 'get', 'thank', 'power' 
All 2 
'sandi', 'school', 'hurrican', 'tomorrow', 'power', 'day', 'get', 
'class', 'cancel', 'like' 
 
Collection 
period 
Index Top 10 tokens 
12h 1 
'earthquak', 'effort', 'toll', 'call', 'quak', 'peopl', 'death', 
'suppli', 'help', 'aid' 
24h 4 
'earthquak', 'help', 'relief', 'toll', 'donat', 'quak', 'peopl', 
'death', 'effort', 'rise' 
36h 5 
'earthquak', 'help', 'quak', 'aid', 'toll', 'death', 'relief', 
'effort', 'peopl', 'victim' 
48h 8 
'earthquak', 'help', 'toll', 'quak', 'death', 'relief', 'aid', 
'donat', 'peopl', 'effort' 
All 10 
'earthquak', 'relief', 'help', 'effort', 'aid', 'via', 'disast', 
'donat', 'quak', 'proud' 
 
   Table 4.8 Selected persistent topics over various periods for Sandy and Nepal sets 
  
 101 
 
While the true future popularity of previous topics is hard to find, I use the weighted 
aggregation of hashtag popularity as the approximation, and it is the best available 
approach given the data and information in hand.  
 
4.3.2 Prediction results 
The direct prediction is expected to lead to relatively good results and it indeed does. 
Figure 4.16 shows the direct prediction results for Nepal set varying T and methods. 
The RMSE is below 1 on the log scale for the best setting, considering the 
corresponding magnitude for the response is about 10~12, I believe the results are 
satisfactory.  For the prediction methods, the GLM no longer performs the best since 
the prediction results are highly unstable. The main reason should be the singular 
fitting problems that occur occasionally. Moreover, warning messages are received to 
remind there are too few samples for the CV—the minimal requirement is 3 for each 
fold. But even the number of folds changes from the default value 10 into 3, the 
singular fitting issue is still there. There are two driving factors for this: the sample 
sparsity and the homogenous features distribution, where the latter one could be more 
fatal to the linear model. Given more samples, this issue may be remedied to some 
extent by CV and feature selection, but it is not feasible in this study due to the setting 
of my problem. Thus GLM may not be a good choice for this prediction task.  
 
 
Figure 4.16 LDA topic popularity predictions for Nepal set varying T and methods 
 
In contrast, the other two nonlinear methods perform relatively stable: with DL 
performs better during the first two periods and RF performs better during the last 
two. To achieve the best stability, I suggest the approach using the average scores of 
RF and DL methods, which gives more stable results staying below a RMSE of 1 
consistently.  As to the feature sets, generally I do not see any significant 
improvement over baseline features. Given the great computational cost (mainly for 
the weighted graph construction), network features are not recommended for this task 
since the baseline ones already provide satisfactory results. 
  
 102 
 
 
 
Figure 4.17 LDA topic popularity predictions for Blizzard set varying T, methods and 
training sets 
 
Figure 4.17 shows the topic prediction results for Blizzard set with GLM method 
removed. The results are even better than the Nepal case and RF+DL average also 
gives the most stable performance with RMSE around 0.5. For this case, the network 
features demonstrate some improvement over baseline ones, especially for the DL 
approach. However, as I mentioned earlier, it is still not necessary since the overall 
performance of all feature sets is already good.  The second plot shows the results for 
the RF+DL average approach with varying training sets. The Sandy set performs best 
on average while Nepal set does the worst on average. Although the general 
performance is still good, the variability among methods and training sets are much 
more than the feature sets, which seems to be natural when it comes to direct 
prediction instead of rank prediction. Thus, choice of training sets proves to be much 
more important than choice of features in this case. The empirical measure for 
training sets selection I have used before—the KS statistics may not be a suitable one 
  
 103 
 
here since the sample size is too small to make the comparison results powerful 
enough. Then a proper alternative would be to simply use the combination of the two 
sets: as shown in Figure 4.17, the combination can give relatively stable results even 
though no further information about which one to choose is provided.  
4.4 Discussion 
4.4.1 Summary 
For the hashtag prediction part, I intentionally follow nearly the same settings as the 
user prediction to identify their similarities and differences. It is easy to observe that 
they should share much in common from each other since both the user activity and 
hashtag activity will largely depend on single tweet popularity (number of retweets it 
obtains); especially for tweets with extremely high popularity, the corresponding 
users or hashtags (if there is any) will also be attached with large scores as well. More 
specifically, I have found the following common aspects:  
1. The response distribution for both the users and hashtags followed well with the 
power law distribution, demonstrating that it is a general phenomenon for information 
diffusion on Twitter regardless which level to look at: tweets (looking at the retweets 
distribution of each tweet), users or topics.   
 
2. The network features, especially the dynamic ones, significantly improve over 
baseline ones across different prediction tasks for both cases. To view this more 
generally, the additional dynamic information included in those features is indeed 
helpful in predicting the near future behavior, for both users and hashtags.  
 
3. For both cases the GLM method not only gives the best on average performance, 
but also possesses other advantages over the other two non-linear methods: such as 
stability, run time efficiency and well defined variable importance etc. Similar to the 
user case, there is also a strong log-linear dependency between the selected features 
and the response. The fact that no better performance is observed from the two 
nonlinear methods towards to the linear method again shows there is no significant 
nonlinear relationship between the selected features and the response. More generally, 
this can imply there is strong log-linear dependency (while no significant nonlinear 
dependency on the log scale) between past activity and near term future activity, 
where the activity could be quantified by various metrics (number of retweets, 
number of mentions, URLs etc.) based on different levels (tweets, users, topics etc.).  
 
4. Given the heterogeneity among the data sets, it is not surprising to see the rank 
prediction turns out to be more stable than direct prediction in both situations. And 
the rank prediction results can be directly used for top-tier users or hashtags 
classification as well. As to the direct prediction, the choice of training sets is a 
crucial issue to be considered in both cases, and the KS statistics proves to serve as a 
good empirical measure for both tasks.  
 
  
 104 
 
5. Since both the baseline and network features used in the prediction tasks rely 
heavily on past behavior, they will be ineffective for samples with low past activity, 
regardless the samples are users or hashtags.    
 
Provided with the above common points, there are also distinctions between the two 
scenarios:  
1. The "Matthew effect" for hashtags is much stronger than for that of users. This can 
be seen directly from the quantile distribution plots, where the distribution is more 
extreme for hashtags than users, and their maximum values are differed by about 2 
orders of magnitude.  
 
2. Perhaps due to the more extreme distribution, the prediction results exhibit better 
stability than the user case. For the rank prediction, on one hand the GLM method 
stability plots illustrate better stability for hashtag prediction directly; on the other 
hand, a larger fraction of variables are selected by the GLM model in hashtag 
prediction than user prediction, which can also be evidence of better stability. For the 
direct prediction, the distinction is more obvious. Wrong choice of training data can 
lead to totally absurd results in the user case, while still remain relatively reasonable 
ones in the hashtag case. A possible explanation for this might be: because of the 
more extreme distribution for all data sets, the distortion due to training and testing 
sets heterogeneity is somehow reduced—the distributions become more similar from 
each other in the log scale.  
 
3. Much better results exist in the rapidly increasing classification task for hashtags. 
The underlying reason may also be the more extreme distribution, where more rapidly 
increasing samples belong to top-tier hashtags and exhibit higher activity, thus can 
lead to much improved results with network features.   
 
For the latent topic prediction, I consider it as a weighted version of hashtag 
prediction: the features and response are all weighted aggregation of hashtag ones. As 
to the topics, the LDA model is doing a decent job to pick up some concentric and 
persistent topics, although the quality of topics still have much room for 
improvement. The way to quantify the topic popularity may be biased and 
inappropriate, but it is the currently best approach available to me given the 
information I have. Due to the weighted aggregation, the distribution of features and 
response become much more homogenous. This makes standardization unnecessary 
and thus I can just go ahead with the direct prediction tasks. The followings are the 
main discovery for the latent topic prediction: 
1. Generally speaking the prediction results are good in terms of RMSE measure, 
which should not be surprising given the good alignment of topic response in Figure 
4.14. The GLM method no longer performs well due to unstable results led by the 
singular fitting problem. Increasing the sample size could fix the problem but it is not 
feasible in my setting. The other two nonlinear methods are stable and using the 
average scores of RF and DL methods is the most stable approach.   
 
  
 105 
 
2. The prediction improvement of network features over baseline ones is limited. It is 
likely because these network features do not have good differential power among the 
topics, which is a result of the homogenous distribution. Consideration the 
computation cost of weighed graph construction and network features extraction, I do 
not recommend using these features for this task since the performance of baseline 
features is already good, unless better accuracy is highly preferred and rich 
computation resources are granted.  
 
3. The performance variations due to methods and training sets are much larger than 
choice of features, which is similar to my observation for the user and hashtag direct 
prediction tasks. But I do not think KS statistics is a good choice for training sets 
selection for this task due to the small sample size and limited testing power. I 
suggest using the combination sets for better stability purpose if no other information 
about the better training set is provided.   
4.4.2 Future work 
Similar to the user prediction tasks, I would like to extend the analysis to other types 
of events and platforms. Combined with the diffusion tracking and user prediction 
analysis, this topic prediction task could also be one function for the more general-
purposed event diffusion study user API granted with enough time and engineering 
effort.  Given more time for investigation to the latent topic part, I would like to 
construct an online framework that can fulfill and concatenate the following tasks 
well: topic formation or detection, topic quality identification, topic evolution monitor 
and topic popularity prediction.    
 
 
 
 
 
 
 
 
 
 
 
  
 106 
 
 
 
Chapter 5:  Conclusion 
 
 
5.1 Discoveries and contributions 
For urgent event diffusion through Twitter, my primary assumption in this study is 
that the diffusion of urgent events mainly relies on retweeting behavior. There are 
definitely other effective ways for information diffusion to occur: such as posting new 
relevant tweets, adopting hashtags or URLs, and mentions. But as discussed before, 
retweets demonstrate several advantages over the other measures: they not only 
provide solidly tractable path for the diffusion (compared to normal tweets), but also 
are stronger indictors of information diffusion (compared to mentions) and include 
more general forms of underlying tweets (compared to tweets with hashtags or 
URLs). Therefore they are the best candidate for the study of information diffusion. 
Based on this, the study of urgent event diffusion can be considered as study of the 
underlying retweet network, and it is the main reason why I emphasize much on 
network statistics in my study.  
 
In this dissertation I accomplish a comprehensive study on urgent event diffusion 
through Twitter using the retweet network statistics. My main findings are:  
1. With proper choice of category and time window size, the network statistics, 
especially the dynamic ones can reveal detailed global level and node level diffusion 
patterns. The results from hashtag K-means clustering and event hashtag binary 
classification also show these statistics can be helpful to distinguish different types of 
hashtags.    
 
2. For both the user and hashtags prediction tasks, the network statistics especially the 
dynamic ones can provide significant improvement over baseline features. The extent 
of improvement will depend on the experiment settings—including prediction types, 
choice of training and testing sets, values of prediction parameters etc. The 
effectiveness of these statistics indicates there is strong past-to-future behavior 
dependency for both user and hashtag.   
 
3. In both the user and hashtag prediction tasks similar patterns are observed for 
response distribution and prediction results. Moreover, GLM method outperforms the 
other two nonlinear methods and proves to be the best prediction method for both 
tasks on several aspects—results, reliability, interpretability etc. All these may imply 
  
 107 
 
there is strong linear dependency and no significant nonlinear dependency between 
selected features and response on the log scale, and this dependency may persist 
regardless of the scopes of prediction—whether on tweet, user or hashtag level.   
 
4. For LDA topic popularity prediction the improvement of network statistics is not 
significant. I believe the main reason is that the distribution of LDA topic popularity 
is much more homogenous than the user and hashtags cases. The GLM method does 
not perform best either, which could be due to the inherent drawback of a linear 
model in the case of limited sample size. The most reliable approach for this task will 
be to use the average scores of the other two nonlinear methods with a combination of 
training sets.  
 
With the above discoveries, my work demonstrates the following contributions:  
1. I have developed a comprehensive study of urgent event diffusion on Twitter 
including information diffusion tracking, user influence and topic popularity 
prediction. Though there have been studies about urgent events, they do not generally 
cover the scope that I do. I believe my work can serve as a foundation for further 
research on urgent event study, which can have large potential social and business 
impact.   
 
2. I constructed a new early-time prediction (recommendation) framework for both 
user influence and topic popularity. Given the nature of urgent event, prediction in the 
early stage of diffusion is a must for useful implementation in practice. I managed to 
solve the limited information problem, which is faced by every early time prediction 
approach, by using past event data as training sets for future event data sets 
prediction. The framework not only produces satisfactory yet consistent results, but is 
well scalable as well. It best utilizes the information provided by Twitter streaming 
API and can easily adapt to the "online" prediction mode which maintains continuous 
input and output streams. Moreover, diffusion tracking or monitoring can be 
embedded as an add-on function for exploratory analysis. Therefore, with proper 
additional engineering effort, my framework can provide both diffusion tracking (e.g. 
summary statistics and visualization) and diffusion prediction (e.g. user prediction 
and topic prediction) in an online mode by taking streaming input and generating 
streaming output.   
 
3. I developed innovative network statistics derived from the retweet network for both 
the monitoring and prediction study. These statistics, especially the dynamic network 
statistics including rich temporal information about the diffusion, have demonstrated 
great value in revealing detailed diffusion patterns, predicting user influence and 
predicting topic popularity. This shows the temporal properties underlying diffusion 
network can provide much insight on the diffusion study, which has not been stressed 
in previous study.  
 
4. I obtained robust prediction results from various experiment settings. The 
improvement from the network statistics is consistent and GLM method gives the best 
on average performance. This reveals common behavior patterns during the urgent 
  
 108 
 
event diffusion among different levels of entities—tweet, user and topic, which 
should be a natural but still meaningful conclusion.  
 
All in all, the urgent event diffusion exhibits distinct properties over traditional 
information diffusion study; therefore the study of urgent event diffusion also 
requires specific methodology and framework. This is a relatively new research area 
that only draws limited attention but could have a great potential impact and deserve 
further exploration. In my dissertation I have established a comprehensive and 
insightful framework to describe urgent event diffusion on Twitter. I believe my work 
can be a basis or a good reference for future related study in this area. 
 
 
      
5.2 Limitations and expectations 
Given the above discoveries and contributions, there are also following limitations 
that need to be resolved for future work:  
1. More data sets are needed to further consolidate all my conclusions. For the 
hashtag clustering and classification study in Chapter 2, the sample size is less than 
50 and this prevents me from reaching a reliable result. Moreover, a well-defined 
external label is needed to provide a solid validation for hashtag clustering and 
classification. For the prediction part, I need more data sets from categories of urgent 
events other than natural disasters—such as breaking news or other explosive events 
to further validate my findings, although I believe consistent results will be obtained. 
To go one step further, I will also need data from other platforms—such as Facebook, 
Instagram or LinkedIn to see if my conclusion is generalizable to use cases other than 
Twitter.     
 
2. Although I have shown that the network statistics derived from the retweet network 
are effective in revealing detailed diffusion patterns and predicting future influence 
(users or topics), there still lacks theoretical foundation to explain these phenomena. 
A theoretical or empirical guide is needed to figure out the optimal experiment 
settings in advance—e.g. the optimal time window size for the tracking study, the 
optimal starting time for the feature collection period (Sandy set vs. Sandy 4.5 set),  
and the optimal top users or hashtags included (the k value) in the prediction study. 
This is crucial in my framework since there are many different experimental settings 
that have to be determined beforehand, especially for the prediction task. In my 
analysis I choose a series of "standard" settings by exploratory analysis of data and a 
simple grid search, but for practical implementation it will be much more efficient if 
there is certain criterion that can be used to make the choice automatically. More 
specifically, the need for a generally applicable standard is urgent and necessary for 
the following two tasks: selection of training set and selection of feature collection 
period starting time. In my study I use KS statistics as the empirical measure for the 
former task, although it proves to be relatively effective, it still cannot be a reliable 
and qualified standard. For the later task, the currently available approach is to match 
  
 109 
 
the evolution patterns of training and testing sets, which is far from satisfactory and 
cannot be used in real-time prediction. Moreover, I have found that different feature 
sets (baseline features and network features) are good at predicting users or hashtags 
of different types, thus it would be beneficial to figure out what individual properties 
(past activity or habit etc.) are leading to the differences. Then I can incorporate these 
factors into the prediction task to achieve individual automatic feature selection for 
the best prediction performance.   
 
3. My prediction framework only gives meaningful prediction results for data points 
that exhibit certain activity during the feature collection period. This is natural since 
no prediction framework can provide meaningful output given null input. Also my 
framework is not able to effectively capture data points that experience a rapid 
response increase, this is because none of the included features can do this well. 
Generally speaking, limited sources of data should be the one to blame for all the 
above shortcomings. Given useful external sources, with a slight modification my 
framework can easily combine features from those sources and overcome the 
limitations.   
 
4. Currently my framework adopts a wide range of programming libraries across 
several open source languages. For efficient practical implementation consideration, 
more engineering effort is needed to merge all of these pipelines into one 
homogeneous package. Given more time and effort, I would like to write a composite 
package to achieve all the functionality and build a simple user interface for easy 
implementation. Ideally, my framework should be able to perform both the tracking 
and prediction tasks with an "online" mode—taking continuous data input streams, 
processing them with use specified commands, then generating continuous output 
streams.     
 
In conclusion, the main direction for the future work would be the cross validation 
and practical implementation of my whole monitoring and prediction framework. 
Ideally my framework should be able to help the stake holders who are interested in 
the diffusion process to either find interesting diffusion patterns or obtain a 
recommended list of important users or topics, meanwhile with all results 
continuously updated by feeding new data. While devoting myself to this goal, I 
would also like to warmly welcome any other interested researchers to join with me 
to accomplish this project together.    
 
 
 
 
 
 
 
 
 
 
  
 110 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
  
 111 
 
Bibliography 
 
Babyak, M. A. (2004). hat you see may not be what you get: a brief, nontechnical 
introduction to overfitting in regression-type models. Psychosomatic 
medicine, 411-421. 
Bakshy, E. a. (2011). Everyone's an influencer: quantifying influence on twitter. In 
Proceedings of the fourth ACM international conference on Web search and 
data mining, (pp. 65-74). 
Benzi, M. a. (2013). Total communicability as a centrality measure . Journal of 
Complex Networks , 124-149. 
Bikhchandani, S. a. (1992). A theory of fads, fashion, custom, and cultural change as 
informational cascades. Journal of political Economy , 992-1026. 
Blei, D. M. (2003). Latent dirichlet allocation. Journal of machine Learning research, 
993-1022. 
Bollen, J. a. (2011). Twitter mood predicts the stock market. Journal of 
Computational Science, 1-8. 
Borge-Holthoefer, J. a.-B. (2013). Cascading behaviour in complex socio-technical 
networks. Journal of Complex Networks, 3-24. 
Borondo, J. a. (2012). Characterizing and modeling an electoral campaign in the 
context of Twitter: 2011 Spanish Presidential election as a case study. Chaos: 
an interdisciplinary journal of nonlinear science, 22. 
Castillo, C. M. (2011). Information credibility on twitter. In Proceedings of the 20th 
international conference on World wide web, (pp. 675-684). 
Cataldi, M. D. (2010). Emerging topic detection on twitter based on temporal and 
social terms evaluation. In Proceedings of the Tenth International Workshop 
on Multimedia Data Mining, (p. 4). 
Cha, M. H. (2010). Measuring user influence in twitter: The million follower fallacy. 
ICWSM, (pp. 10-17). 
Chen, W. W. (2010). Scalable influence maximization for prevalent viral marketing 
in large-scale social networks. In Proceedings of the 16th ACM SIGKDD 
international conference on Knowledge discovery and data mining, (pp. 1029-
1038). 
Cheng, J. a. (2014). Can cascades be predicted? Proceedings of the 23rd international 
conference on World wide web, (pp. 925-936). 
Costa, J. S. (2013). Defining semantic meta-hashtags for twitter classification. In 
International Conference on Adaptive and Natural Computing Algorithms. In 
International Conference on Adaptive and Natural Computing Algorithms, 
(pp. 226-235). 
Evans, D. (2010). Social media marketing: the next generation of business 
engagement. John Wiley & Sons. 
Galuba, W. a. (2010). Outtweeting the twitterers-predicting information cascades in 
microblogs. Proceedings of the 3rd conference on Online social networks.  
Godin, F. S. (2013). Using topic models for twitter hashtag recommendation. In 
Proceedings of the 22nd International Conference on World Wide Web, (pp. 
593-596). 
  
 112 
 
Grier, C. T. (2010). @ spam: the underground on 140 characters or less. In 
Proceedings of the 17th ACM conference on Computer and communications 
security, (pp. 27-37). 
Gruhl, D. G.-N. (2004). Information diffusion through blogspace. In Proceedings of 
the 13th international conference on World Wide Web, (pp. 491-501). 
Hong, L. a. (2010). Empirical study of topic modeling in twitter. In Proceedings of 
the first workshop on social media analytics, (pp. 80-88). 
Hong, L. a. (2011). Predicting popular messages in twitter. Proceedings of the 20th 
international conference companion on World wide web, (pp. 57-58). 
Hui, C. a.-I. (2012). Information cascades in social media in response to a crisis: a 
preliminary model and a case study. Proceedings of the 21st international 
conference companion on World Wide Web , (pp. 653-656). 
Jenders, M. a. (2013). Analyzing and predicting viral tweets. Proceedings of the 22nd 
international conference on World Wide Web companion, (pp. 657-664). 
Kleinberg, J. M. (1999). Authoritative sources in a hyperlinked environment. Journal 
of the ACM, 604-632. 
Kupavskii, A. a. (2012). Prediction of retweet cascade size over time. Proceedings of 
the 21st ACM international conference on Information and knowledge 
management, (pp. 2335-2338). 
Kwak, H. L. (2010). What is Twitter, a social network or a news media? In 
Proceedings of the 19th international conference on World wide web. ACM., 
(pp. 591-600). 
Laflin, P. a. (2013). Discovering and validating influence in a dynamic online social 
network. Social Network Analysis and Mining, 1311-1323. 
Lambrecht, K. (1996). Information structure and sentence form: Topic, focus, and the 
mental representations of discourse referents. Cambridge university press. 
LeCun, Y. B. (2015). Deep learning. Nature, 436-444. 
Lerman, K. a. (2010). Information contagion: An empirical study of the spread of 
news on Digg and Twitter social networks. ICWSM , (pp. 90-97). 
Livne, A. a. (2011). The Party Is Over Here: Structure and Content in the 2010 
Election. ICWSM, (pp. 17-21). 
Ma, Z. a. (2013). On predicting the popularity of newly emerging hashtags in twitter. 
Journal of the American Society for Information Science and Technology, 
1399-1410. 
Mantzaris, A. V. (2013). Dynamic network centrality summarizes learning in the 
human brain . Journal of Complex Networks, 83-92. 
Mathioudakis, M. a. (2010). Twittermonitor: trend detection over the twitter stream. 
In Proceedings of the 2010 ACM SIGMOD International Conference on 
Management of data, (pp. 1155-1158). 
Mehrotra, R. S. (2013). Improving lda topic models for microblogs via tweet pooling 
and automatic labeling. In Proceedings of the 36th international ACM SIGIR 
conference on Research and development in information retrieval, (pp. 889-
892). 
Merton, R. K. (1968). The Matthew effect in science. Science, 56-63. 
  
 113 
 
Mimno, D. W. (2011). Optimizing semantic coherence in topic models. In 
Proceedings of the Conference on Empirical Methods in Natural Language 
Processing, (pp. 262-272). 
Morales, A. J. (2014). Efficiency of human activity on information spreading on 
Twitter. Social Networks, 1-11. 
Ozdikis, O. S. (2012). Semantic expansion of hashtags for enhanced event detection 
in Twitter. In Proceedings of the 1st international workshop on online social 
systems.  
Petrovic, S. a. (2011). RT to Win! Predicting Message Propagation in Twitter . 
ICWSM.  
Porteous, I. N. (2008). Fast collapsed gibbs sampling for latent dirichlet allocation. In 
Proceedings of the 14th ACM SIGKDD international conference on 
Knowledge discovery and data mining, (pp. 569-577). 
Ramage, D. D. (2010). Characterizing microblogs with topic models. ICWSM, (p. 1). 
Rattanaritnont, G. a. (2012). Analyzing patterns of information cascades based on 
users' influence and posting behaviors. Proceedings of the 2nd Temporal Web 
Analytics Workshop , (pp. 1-8). 
Romero, D. M. (2011). Differences in the mechanics of information diffusion across 
topics: idioms, political hashtags, and complex contagion on twitter. 
Proceedings of the 20th international conference on World wide web, (pp. 
695-704). 
Simmie, D. V. (2014). Ranking twitter influence by combining network centrality and 
influence observables in an evolutionary model. Journal of Complex 
Networks, 495-517. 
Stieglitz, S. &.-X. (2013). Emotions and information diffusion in social media—
sentiment of microblogs and sharing behavior. Journal of Management 
Information Systems, 217-248. 
Stringhini, G. K. (2010). Detecting spammers on social networks. In Proceedings of 
the 26th annual computer security applications conference , (pp. 1-9). 
Swaroop, P. J. (2014). Influence in Microblogs: Impact of User Behavior on Diffusion 
and Engagement.  
Taxidou, I. a. (2014). Online analysis of information diffusion in twitter. Proceedings 
of the companion publication of the 23rd international conference on World 
wide web companion, (pp. 1313-1318). 
Thomas, K. G. (2011). Suspended accounts in retrospect: an analysis of twitter spam. 
In Proceedings of the 2011 ACM SIGCOMM conference on Internet 
measurement conference, (pp. 243-258). 
Tsur, O. a. (2012). What's in a hashtag?: content based prediction of the spread of 
ideas in microblogging communities. In Proceedings of the fifth ACM 
international conference on Web search and data mining, (pp. 643-652). 
Vicient, C. a. (2014). Unsupervised semantic clustering of Twitter hashtags. In 
Proceedings of the Twenty-first European Conference on Artificial 
Intelligence, (pp. 1119-1120). 
Wu, S. a. (2011). Who says what to whom on twitter. Proceedings of the 20th 
international conference on World wide web, (pp. 705-714). 
  
 114 
 
Yang, S. H. (2014). Large-scale high-precision topic modeling on twitter. In 
Proceedings of the 20th ACM SIGKDD international conference on 
Knowledge discovery and data mining, (pp. 1907-1916). 
Zhou, Z. a. (2011). Information resonance on Twitter: watching Iran. Proceedings of 
the first workshop on social media analytics, (pp. 123-131).