ABSTRACT

Title of Dissertation: COLLECTIVE ENTITY RESOLUTION IN RELATIONAL DATA
Indrajit Bhattacharya, Doctor of Philosophy, 2006
Dissertation directed by: Dr. Lise Getoor, Department of Computer Science

Many databases contain imprecise references to real-world entities. For example, a social-network database records names of people. But different people can go by the same name, and there may be different observed names referring to the same person. The goal of entity resolution is to determine the mapping from database references to discovered real-world entities.

Traditional entity resolution approaches consider approximate matches between attributes of individual references, but this does not always work well. In many domains, such as social networks and academic circles, the underlying entities exhibit strong ties to each other, and as a result, their references often co-occur in the data. In this dissertation, I focus on the use of such co-occurrence relationships for jointly resolving entities. I refer to this problem as "collective entity resolution". First, I propose a relational clustering algorithm for iteratively discovering entities by clustering references, taking into account the clusters of co-occurring references. Next, I propose a probabilistic generative model for collective resolution that finds hidden group structures among the entities and uses the latent groups as evidence for entity resolution. One of my contributions is an efficient unsupervised inference algorithm for this model using Gibbs Sampling techniques that discovers the most likely number of entities. Both of these approaches improve performance over attribute-only baselines on multiple real-world and synthetic datasets. I also perform a theoretical analysis of how the structural properties of the data affect collective entity resolution and verify the predicted trends experimentally. In addition, I motivate the problem of query-time entity resolution. I propose an adaptive algorithm that uses collective resolution for answering queries by recursively exploring and resolving related references. This enables resolution at query-time, while preserving the performance benefits of collective resolution. Finally, as an application of entity resolution in the domain of natural language processing, I study the sense disambiguation problem and propose models for collective sense disambiguation using multiple languages that outperform other unsupervised approaches.

COLLECTIVE ENTITY RESOLUTION IN RELATIONAL DATA

by Indrajit Bhattacharya

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2006

Advisory Committee:
Dr. Lise Getoor, Chair/Advisor
Dr. Carol Espy-Wilson, Dean's Representative
Dr. Amol Deshpande
Dr. Philip Resnik
Dr. Marie desJardins

© Copyright by Indrajit Bhattacharya 2006

Dedication

To my parents.

Acknowledgments

First and foremost, I would like to sincerely thank my advisor Lise Getoor for her help and support throughout my PhD experience. She gave me the opportunity to work on research problems that are relevant, challenging and interesting. But, more importantly, she introduced me to the world of research and has been a tutor in all its different aspects, from picking a problem to writing a paper. I have also learnt from her the importance of hard work and perseverance in the making of a successful researcher.
My pursuit of a PhD has not always been smooth sailing, and I would not have made it through if it had not been for her patience and her support. The door to her office has been open for me whenever I needed to talk.

I would like to thank my other committee members, Philip Resnik, Amol Deshpande, Marie desJardins and Carol Espy-Wilson, for taking the time to review my dissertation and for their help and suggestions for improving it. Philip and Amol, in particular, have always been available for advice. I am thankful for their active help during the job-hunting process and for counseling regarding career options in general. I am also thankful for the opportunity to work with Yoshua Bengio during the early years of my PhD.

For the KDD Entity Resolution Challenge in the summer of '05 and for the research on query-based entity resolution that it inspired, I worked in collaboration with my friend and fellow graduate student Louis Licamele. It has been a pleasure working with him. I could always count on his optimism when things looked bleak.

Among faculty members in the department who have not been directly associated with my dissertation, I will always remember and be thankful for the many hours that Bobby Bhattacharjee spent with me discussing research, career options, and sometimes cricket to take a break from all those other things in life. I have also benefited immensely from my interactions with Hanan Samet and Amitabh Varshney. Among graduate students I have collaborated with, Srinivasan Parthasarathy introduced me to many interesting problems outside the domain of my dissertation. He was the main inspiration for our work on peer-to-peer systems, for which we also had Srinivas Kashyap in our team.

I have been fortunate to be a part of an excellent research group. I would like to sincerely thank all the LINQS members (Rezarta Islamaj, Prithviraj Sen, Mustafa Bilgic, Louis Licamele, Galileo Namata, Vivek Sehgal, Wontaek Tseo, John Park, Hyunmo Kang and Elena Zheleva) for providing a wonderful working atmosphere, from discussing research to reflecting on life over a cup of coffee. I would like to specially thank the fellow residents of 3228 A.V. Williams Building (Rezarta, Mustafa, Louis, Galileo, Vivek and also Nargess Memarsadeghi). It would have been lonely in there without them.

I have come to know several excellent people in the department during the course of my PhD. I will always be thankful to Rajiv Gandhi for being a mentor during the early years. Among others, Gutemberg Guerra-Filho, Vijay Gopalakrishnan, Srinivasan Parthasarathy, Arun Vasan, Arunesh Mishra and Gaurav Aggarwal have always been good friends. When I needed company in A.V. Williams during weekends, I could count on Kaushik Mitra and Sandeep Manocha.

Several staff members in the department have been exceptionally helpful. Brenda Chick and Brad Plecs were always available for ready assistance, and Fatima Bangura and Felicia Chelliah will always be among the people I remember from my PhD experience.

As I have learnt over the last six years, there is much to the PhD experience beyond research, and the quality of life outside had a direct impact on the quality of research happening within the walls of the department. I have been fortunate to have several wonderful room-mates. Amit Roy-Chowdhury has been a proverbial friend, philosopher and guide. I owe a lot to Ayush Gupta and Supratik Datta, who have been my room-mates for the longest time.
Beyond room-mates, Kaushik Chakraborty, Sharmistha Acharya and Suddhasattwa Ghosh have been friends through the highs and the lows.

Finally, there are those who can never be thanked enough, and it is useless to try. It would have been impossible to be through it all without the support of my parents.

Table of Contents

List of Tables
List of Figures

1 Introduction
  1.1 Data Integration and Entity Resolution
  1.2 Collective Entity Resolution Using Relationships
  1.3 Collective Relational Clustering
  1.4 Probabilistic Model for Collective Entity Resolution
  1.5 Entity Resolution for Queries
  1.6 Applying Entity Resolution for Word Sense Disambiguation
  1.7 Terminology
  1.8 Specific Contributions and Organization of the Dissertation

2 Relational Clustering for Collective Entity Resolution
  2.1 Motivating Example for Entity Resolution Using Relationships
  2.2 Entity Resolution Using Relationships: Problem Formulation
  2.3 Entity Resolution Approaches
    2.3.1 Attribute-based Entity Resolution
    2.3.2 Naive Relational Entity Resolution
    2.3.3 Collective Relational Entity Resolution
  2.4 Neighborhood Similarity Measures for Collective Resolution
    2.4.1 Common Neighbors
    2.4.2 Jaccard Coefficient
    2.4.3 Adamic/Adar Similarity
    2.4.4 Adar Similarity with Ambiguity Estimate
    2.4.5 Higher-Order Neighborhoods
    2.4.6 Negative Constraints From Relationships
  2.5 Relational Clustering Algorithm
    2.5.1 Blocking to Find Potential Resolution Candidates
    2.5.2 Relational Bootstrapping
    2.5.3 Merging Clusters and Updating Similarities
    2.5.4 Complexity Analysis
  2.6 Experimental Evaluation
    2.6.1 Evaluation on Bibliographic Data
      2.6.1.1 Datasets
      2.6.1.2 Evaluation
      2.6.1.3 Experimental Details
      2.6.1.4 Results
      2.6.1.5 Execution Time
    2.6.2 Experiments on Synthetic Data
  2.7 Conclusion

3 A Latent Dirichlet Model for Unsupervised Entity Resolution
  3.1 A Motivating Example
  3.2 LDA Model for Authors
  3.3 LDA Model for Author Resolution
  3.4 Inference using Gibbs Sampling
  3.5 Modeling Author Attributes
  3.6 Noise Model
  3.7 Determining Number of Entities
    3.7.1 Basic Inference With Gibbs Sampling
    3.7.2 Relation to the Dirichlet Process
    3.7.3 Block Assignment for Entity Resolution
  3.8 Determining Model Parameters
    3.8.1 Number of Groups
    3.8.2 Hyper-parameters
    3.8.3 Noise Model Parameters
  3.9 Algorithm Refinements
    3.9.1 Bootstrapping Author Labels
    3.9.2 Group Evidence for Author Self Loops
  3.10 Experimental Evaluation
    3.10.1 Results on Citation Data
    3.10.2 Properties of Collaborative Graphs
    3.10.3 Comparison With Collective Relational Clustering
  3.11 Conclusions

4 Entity Resolution for Queries
  4.1 Motivation for Entity Resolution Queries
  4.2 Entity Resolution Queries: Formulation
  4.3 Performance Dependencies in Relational Clustering
    4.3.1 Performance Analysis of Attribute-based Resolution
    4.3.2 Characterizing Relations
  4.4 Two-Stage Query Processing
  4.5 Adaptive Query Expansion
  4.6 Experimental Results
    4.6.1 Experiments on Real Data
    4.6.2 Experiments using Synthetic Data
  4.7 Conclusions

5 Word Sense Disambiguation Using Bilingual Probabilistic Models
  5.1 Word Sense Disambiguation: Introduction and Related Work
  5.2 Probabilistic Models for Parallel Corpora
    5.2.1 Notation
    5.2.2 The Sense Model
    5.2.3 The Concept Model
  5.3 Constructing the Senses and Concepts
    5.3.1 Building the Sense Model
    5.3.2 Building the Concept Model
  5.4 Learning the Model Parameters
    5.4.1 EM for the Sense Model
    5.4.2 EM for the Concept Model
    5.4.3 Initialization of Model Probabilities
  5.5 Experimental Evaluation
    5.5.1 Evaluation with Senseval Data
    5.5.2 Semantic Grouping of Spanish Senses
  5.6 Model Analysis

6 Related Work
  6.1 Approximate Matching
  6.2 Theoretical Bounds for Cleaning
  6.3 Efficiency Issues
  6.4 Probabilistic Models for Entity Resolution
  6.5 Non-probabilistic Relational Approaches
  6.6 Group and Topic Modeling
  6.7 Queries
  6.8 Data Cleaning Tools
  6.9 Application Domains
  6.10 Evaluation Metrics

7 Conclusions and Future Directions

A Synthetic Data Generation

Bibliography

List of Tables

2.1 Performance of different algorithms on real datasets
2.2 Performance of neighborhood similarity measures on real datasets
2.3 Execution times of different algorithms
3.1 Performance of baselines using SoftTF-IDF
3.2 Performance of LDA-ER on real datasets
3.3 Performance of LDA-ER over varying number of groups
3.4 Comparison of LDA-ER with relational clustering
4.1 Resolution accuracy for queries using different algorithms
4.2 Resolution accuracy using different adaptive expansion strategies
4.3 Comparison between unconstrained and adaptive expansion
5.1 Comparison with a baseline WSD model
5.2 Examples of Spanish concepts and senses

List of Figures

2.1 References in the bibliographic example
2.2 Example of (a) reference graph and (b) entity graph
2.3 Abstract representation of (a) reference graph and (b) entity graph
2.4 The relational clustering algorithm
2.5 Resolution performance over varying $\alpha$ on real datasets
2.6 Comparison of different relational similarity measures
2.7 Execution time of CR with increasing references
2.8 Comparison of performance on synthetically generated data
3.1 Illustration of author entities and collaboration groups
3.2 Plate representation for (a) author model and (b) reference model
3.3 Comparison of LDA-ER with baselines on synthetic data
4.1 Illustration of (a) identifying relation and (b) ambiguous relation
4.2 Construction of relevant set for a query
4.3 Growth of relevant set size and query processing time
4.4 Effect of identifying and ambiguous relations
4.5 Effect of using increasing levels of co-occurrence
5.1 Graphical representations of Sense and Concept models
5.2 Examples of Sense and Concept Models
5.3 Comparison with Senseval2 WSD Systems
A.1 The synthetic data generation algorithm
Chapter 1

Introduction

1.1 Data Integration and Entity Resolution

The phenomenal expansion of the world wide web, improved acquisition technology and the increasing affordability of storage media have all contributed to an explosive growth in the volume of publicly accessible data in digital form. This immense volume of data, which was unimaginable a decade ago, brings with it new possibilities for automated reasoning and knowledge discovery. As an illustration of how large volumes of data can help, there have been recent news reports of doctors using web search technology for diagnosing complex symptoms in patients [84]. In this study, doctors searched using patient symptoms without knowing the right diagnosis and then selected the most relevant diagnosis from the top three results. This returned the correct diagnosis for more than 50% of the patients studied.

One of the biggest hurdles for analyzing available data is that information is most often dispersed over several data sources. It is often possible to make use of all the available data most effectively if it is acquired (or indexed) and integrated into a central repository. There are several information repositories, such as CiteSeer for computer science publications, that automatically acquire data from different information sources using improved crawling technologies. However, integration of the acquired data into a consistent and coherent form is a more challenging problem. In the medical diagnosis example, the doctors had to manually find the most relevant match for each query, but completely manual curation is impossible in all but the smallest databases. As a result, alongside automated acquisition, there has been an increasing dependence on automated techniques for integrating the data and for maintaining quality of information. While we have seen a surge in research interest in this area over the last decade, the problems are expectedly quite daunting. Since accuracy is critical in many applications, there is a need for further research in this area.

Entity resolution is an important component of data integration that comes up frequently. In many databases, records refer to real-world entities, and as such databases grow, there can be many different records that refer to the same entity. For example, a social network database can have different records with names "J. Doe", "Jonathan Doe" and "Jon Doe" that refer to the same person. In the absence of keys such as social security numbers, this duplication issue [52, 78] leads to several different problems, such as redundant records, data inconsistencies, incorrectness of computed statistics, and many others. From a knowledge discovery perspective, mining data that has unresolved duplicates is very likely to yield patterns that are both inaccurate and incomplete. This issue also comes up when integrating data from different heterogeneous sources without shared keys and sometimes different schemas as well [28]. Broadly, I call such database records references to real-world entities, and the entity resolution problem involves (a) finding the underlying entities in the domain and (b) tagging the references in the database with the entities to which they correspond. Most often, these two problems cannot be solved independently and need to be addressed at the same time.

In addition to databases, entity resolution is a common problem that comes in different guises (and is given different names) in other computer science domains.
Examples include computer vision, where we need to figure out when features in two different images refer to the same underlying object (the correspondence problem); natural language processing, where we would like to determine which noun phrases refer to the same underlying entity (co-reference resolution); and data fusion or conflation in the geospatial domain, where spatial entities from different maps or images need to be matched. The problem goes by different names even inside the database community. Deduplication [52, 78] refers to the problem of determining which records or tuples within the same database or relational table correspond to the same real-world object. For data integration, approximate joins [28] are used for consolidating information from multiple sources.

Entity resolution is a difficult problem and cannot be solved using exact matches on tuple attributes, due to two main reasons that we discuss in greater detail in Section 2.1. First, there is the identification problem, when different representations arising from recording errors or abbreviations refer to the same entity. In the earlier example, figuring out that "Jon Doe" and "Jonathan Doe" are the same person is an instance of this problem. Failure in identification affects recall. The second issue is disambiguation. It is possible that two records with name "J. Doe", the same address and the same age refer to two brothers and not to the same person. This affects precision.

As I discuss in more detail in the chapter on related work (Chapter 6), the study of entity resolution goes back a long way to Newcombe et al. [83], who introduced the record linkage problem, and Fellegi and Sunter [44], who formalized it. Early approaches to entity resolution prescribed fuzzy attribute matches, and many sophisticated techniques have since been developed. However, approaches based solely on attributes still cannot satisfactorily deal with the problem of false attribute matches and, in general, it may be hard to improve precision and recall at the same time using just attributes of individual records. The problem is compounded by the lack of readily available training data. Preparing ground truth by manually tagging pairs of references as matches or non-matches can be very painstaking, and typically, labeled samples are sparse for most domains, if available at all. To get around this issue, the focus has often been on developing unsupervised approaches for resolving entities.

1.2 Collective Entity Resolution Using Relationships

In many domains, there may be additional information that we can use for resolving entities. The underlying entities often exhibit strong relational ties to certain other entities. For instance, in social networks, people interact frequently with their close friends, while in academic circles, researchers collaborate frequently with their close associates. When such ties exist between entities, co-occurrences between the references to these entities can be observed in the data. Names of colleagues co-occur as author names in publications, and names of friends are often found to co-occur in emails. In the social network example, we may have records showing that "J. Doe" communicates most with "D. Smith" while "Jon Doe" has frequent communications with "Don Smith". The goal is to make use of such relationships between references to improve entity resolution performance.

One way is to use the attributes of related records as well when computing fuzzy matches. To compare the "J. Doe" and "Jon Doe"
references, we may also compare the names of their associates. While this may lead to an improvement, it will not always work. For example, we do not want to merge two person records simply because their friends have similar names. The correct evidence to use for the "J. Doe" and "Jon Doe" references is whether their best friends are in fact the same entity. This implies that in order to decide about one reference, we need to know about the entities for some other references. However, the problem is that we do not know the entities for these related records either. So how can we use these relations then? I use the idea of collective entity resolution, where the entities for related references need to be determined jointly rather than independently.

Collective entity resolution using relations is a challenging problem for two main reasons. First, this problem is an instance of collective clustering, where cluster membership for any reference depends on the cluster memberships of all related references. Collective classification has been studied extensively in the machine learning literature in recent years [25, 85, 62, 48, 64, 99], but little work has been done on collective clustering using relationships. In order for distance or similarity based approaches to work for this problem, the relationships need to be incorporated into the similarity/distance measure. On the other hand, for model-based clustering to be used, we need to come up with representations for relationships between the underlying entities or clusters. Aside from the modeling issues, the other huge challenge is in terms of computation. The different clusters or entities cannot be determined independently; instead, the space of joint cluster assignments for related references needs to be explored. In contrast to attribute-based resolution, the database cannot be cleaned with a single-pass approach anymore. It is necessary to resort to iterative approaches, where each resolution that we make potentially provides new evidence for determining the entities for other related references. Efficiency in handling these dependencies is a key concern in designing algorithms for this problem. The challenges for collective clustering are daunting, but there is the promise that resolution accuracy can be significantly improved over traditional techniques.

In this dissertation, I first address the problem of collective entity resolution using relationships and propose two novel approaches for solving it. The first is a greedy agglomerative clustering approach called collective relational clustering [8, 7, 9, 11, 10], and the second is a probabilistic generative model which I call LDA-ER [12, 10].

1.3 Collective Relational Clustering

In essence, collective relational clustering is a hierarchical clustering algorithm, where I start from an initial cluster assignment and then proceed by merging pairs of clusters. This is similar to greedy agglomerative clustering, with a key difference: the similarity measure for cluster pairs accounts for relationships between different references, and as a result, each merge operation affects the similarities of related cluster pairs. In the social network example, suppose "J. Doe" is known to be friends with "D. Smith" and "K. Zabrinsky", while "Jon Doe" has "Don Smith" and "Kate Zabrinsky" as friends.
Then, if the relational clustering algorithm assigns the two Zabrinsky references to the same cluster in some iteration, that increases the similarity for the two "Smith"s and for the two "Doe"s, since they are now found to be friends with the same person. The merge operations continue until the similarity between the closest cluster pair drops below some threshold. The cluster similarity measure combines the attribute similarity of the clusters with their relational similarity. The relational similarity is determined by the neighborhood of each cluster, and I explore different ways of measuring the shared neighborhood between clusters. In parallel, I address the computational challenge arising from the dependencies by efficiently finding the most similar cluster pair and updating similarities in each iteration using novel indexing mechanisms.

The relational clustering algorithm has many attractive features. First, it is simple to understand, and the incremental evidence is easy to interpret. It is easily customizable for specific domains by picking the best attribute similarity measures for that domain. Due to the monotonicity of the process, where clusters always merge, and also to the efficient implementation, it is quite fast. At the same time, it works very well in practice. However, it has a few shortcomings as well. First, as with traditional agglomerative clustering approaches, a similarity threshold needs to be specified as a termination condition. Secondly, clusters are only merged in this algorithm. Two clusters, once merged, can never split back to account for evidence that might be found in some later iteration. This speeds up the algorithm but also restricts the space of joint cluster assignments that it can explore. Also, different merge sequences can lead to different resolution results. The second approach that I propose is designed to deal with these specific issues.

1.4 Probabilistic Model for Collective Entity Resolution

The second approach that I propose for collective entity resolution using relationships is a non-parametric probabilistic model. It is a probabilistic generative model that describes how references for related entities might co-occur in the data. It represents relationships between underlying entities using the novel idea of groups of entities. References to entities that are members of the same group are more likely to co-occur in the data. In the social network example, this captures the notion of circles of friends. If the person entities "Jonathan Doe", "Donald Smith" and "Katherine Zabrinsky" are members of the same group of friends, then they are expected to participate in conversations more frequently. This is similar to the idea of Latent Dirichlet Allocation (LDA) that is used for topic mixtures in document modeling, hence the name LDA-ER. The critical difference from LDA is that the entities are not observed directly. Instead, only the references to entities, for example the names of the people, are observed in the data. The entities need to be determined using the additional group evidence that is discovered from the co-occurrences.

The next challenge for the LDA-ER model is the design of tractable inference algorithms. As in the LDA model, exactly inferring the groups and entities for the observed references is intractable for LDA-ER. I propose an approximate inference strategy based on Gibbs Sampling that infers the most likely group and entity for each reference.
Additionally, it automatically discovers the most likely number of entities for the references without requiring any user-specified threshold. This overcomes the first shortcoming of the relational clustering approach, namely that of threshold selection. In order to address the second issue (that of clusters only being allowed to merge), I propose a sampling strategy for inference, where a reference may be assigned to a new entity at any iteration of the algorithm depending on the current evidence. I also improve the computational complexity of the inference algorithm by proposing an improved merge-split sampling strategy, where entities make random decisions to merge or split depending on the available evidence. In addition to non-parametric resolution of the references, as an interesting by-product, the model returns hidden group structures among the underlying entities that provide additional structural insight about the domain. However, these benefits of LDA-ER come at a price: added computational overhead. Since it does not take the greedy route of merging the closest clusters, but instead looks at all the entities, each iteration is more expensive. Additionally, it is not possible to set a worst-case upper bound for the number of iterations, as with a greedy approach.

In summary, both of the proposed approaches for collective entity resolution have their advantages and disadvantages. For domains where quick results are necessary and termination thresholds can somehow be determined, the relational clustering approach should be preferable. In other cases, where specifying a termination threshold or an approximate number of entities a priori is difficult, LDA-ER should be more useful. Also, in domains such as collaboration and social networks, where discovery of hidden group structures is relevant, LDA-ER may be the preferred approach.

1.5 Entity Resolution for Queries

The resolution approaches that I propose in the first part of this dissertation are collective in nature: they resolve the references for an entire database as a whole. This works well for offline cleaning, but is not very useful for processing queries. Users query the web and different online databases every day, and expect to get answers that are entity resolved, either directly or indirectly. For example, we may query the CiteSeer database of computer science publications looking for books by "S. Russell". This query would be easy to answer if all author names in CiteSeer were correctly mapped to their entities. But, unfortunately, this is not the case. Going by CiteSeer records, Stuart Russell and Peter Norvig have written more than 100 different books together [86]. Alternatively, in our social network example, we may be searching different social network communities for a person named "Jon Doe". In this case, each online community may individually have records that are clean. But query results that return records from all of them together may have unresolved entities. Additionally, in both cases, it is not sufficient to simply return records that match the query name, "S. Russell" or "Jon Doe", exactly. We need to retrieve records with similar names as well, but, more importantly, partition the records that are returned according to their entities.

In this thesis, I motivate the problem of query-time entity resolution and apply collective resolution techniques to the problem of answering queries [13]. The biggest issue with this approach is the dependency structure of collective resolution.
In order to reason about the query records, it is necessary to reason about their related records, which in turn require reasoning about their related records, and so on. I first formally analyze how the accuracies for resolving different entities depend on each other and on different structural characteristics of the data as a result of collective resolution. Then I propose a two-stage strategy for localizing collective resolution. First, the relevant records necessary for answering the query are extracted by a recursive expansion process, and then collective resolution is performed on the extracted records only. Using formal analysis, I show that the recursive expansion process can be terminated at reasonably small depths for accurately answering any query; the returns fall off exponentially as neighbors that are further away are considered. However, the problem with this unconstrained expansion process is that it may return too many records, even at small depths, that are impossible to resolve at query time. I address this issue using an adaptive strategy that only considers the most informative of the related records for answering any query. This significantly reduces the number of records that need to be investigated at query time but, most importantly, does not compromise the resolution accuracy for the query. In summary, the adaptive expansion strategy enables query-time resolution while preserving the performance benefits of collective resolution.

1.6 Applying Entity Resolution for Word Sense Disambiguation

As an application of entity resolution in the domain of natural language processing, I consider the problem of word sense disambiguation and investigate how collective entity resolution using relationships can be useful for this problem [14]. Words in natural language documents are often ambiguous in terms of their senses. The identification and disambiguation issues that I mentioned in the context of entity resolution are well studied in the area of linguistics as well. Identification is necessary for the problem of synonymy, where different words can be used to refer to the same sense. On the other hand, we also have polysemous words that can correspond to multiple senses. Sense disambiguation deals with this second aspect of the problem. Consider the word "bank" in English. According to the WordNet sense hierarchy, this word has 10 possible senses, financial institution, shore and reserve/stockpile being the three most common ones. Given two different occurrences of the word "bank" in a natural language corpus, we need to decide whether they refer to the same sense or to different senses. This version of the problem is usually referred to as sense discrimination [95]. For the sense disambiguation problem, sense definitions are used from available sense hierarchies such as WordNet, and then each occurrence of an ambiguous word needs to be tagged with one of its possible senses.

Traditional approaches to sense disambiguation make use of the context around a word. For example, the occurrence of "breeze", "sand" or "water" is very likely to suggest the shore sense of bank, and "transaction" would suggest the financial institution sense. This is very similar to using the attributes of a specific reference to determine the corresponding entity. More recent approaches make use of co-occurrence relationships in the form of translations. It is known that translations can help disambiguate senses [22, 33, 34, 55, 91]. For example, when "bank"
is translated in Spanish as "orilla", it is most likely to mean shore. Following Diab and Resnik [38], I make use of parallel corpora where the same document is available in multiple languages, for example in English and French as in the Canadian Hansards. Then aligned translation threads spanning documents in multiple languages serve as co-occurrence relations that we can use for resolving senses. As for entity resolution, I explore the problem of collective sense disambiguation, where senses are resolved for multiple languages simultaneously.

Despite the striking similarities with the entity resolution problem, the word sense disambiguation problem has certain interesting features that set it apart. First, we can make use of the sense definitions available for English words from the WordNet hierarchy. Effectively, these provide us with the domain entities for English words, and we do not need to discover them. While this simplifies part of the problem, certain other aspects make it more challenging. Each translation thread spans words from multiple languages, each of which has its own defined senses. In essence, we can imagine this as an instance of multi-type entity resolution, where the co-occurrence relations connect entities of multiple types, each of which can be ambiguous. The other interesting aspect is the availability of the WordNet ontology for English (and more recently some other languages as well). On one hand, this provides sense definitions for words. On the other, it opens up possibilities for resolution approaches that can make use of the information-rich hierarchy, for instance in defining similarity measures between senses [89]. I propose two different probabilistic generative models for collective sense disambiguation from translations. The approach that I propose makes use of the WordNet structure to resolve senses in English and additionally to construct a semantic hierarchy for any secondary language for which translations are available.

1.7 Terminology

Before moving on to the main chapters of the dissertation, I review the terminology that I have established in this introductory chapter and will be using in the rest of the dissertation:

Entity: An entity is a real-world object, such as a person, place, organization, event, etc. that is easily recognized by a human being. Entities can also be abstract, such as a sense in the context of linguistics. The entities may be known for some domains. In others, they need to be discovered.

Reference: A reference or a record is an observation or a mention of an entity, such as names of persons or places. A tuple in a census database is an example of a reference. In many cases, references need to be extracted from textual documents. The mapping from references to entities is often uncertain.

Attribute: An attribute is an observed property of an individual reference, for example the recorded name, address or phone number of a person reference in a social network database. Attributes of a reference are mostly derived from corresponding attributes of the underlying entities. However, in many applications we do not have attributes which serve as identifiers for entities, and this leads to uncertainty in the mapping from references to entities.

Relationship: When multiple references are observed together, or in the same context, that forms a co-occurrence relationship between those references.
For example, we have names of different people occurring in the same email, or names of researchers occurring as co-author names in publications. When viewing the data as a graph, we alternatively use the term hyper-edge to refer to a co-occurrence relationship that connects many reference nodes. Co-occurrences usually happen as a result of ties or links between the underlying entities, such as friendships between people. I sometimes use the term relationship to refer to these ties between entities as well. But, unless explicitly mentioned, relationship will be used to mean a co-occurrence relationship between multiple references.

Group: A group is a collection of entities that have close ties between themselves. For example, we can have a group of friends or a group of colleagues who interact frequently. Entities can belong to multiple groups at the same time. Groups are only observed indirectly through co-occurrences that mostly happen between references to entities that belong to the same group. The observed co-occurrence relations in the data provide evidence for discovering the group structures among the entities, and the group evidence in turn helps in improved resolution of the references.

1.8 Specific Contributions and Organization of the Dissertation

The specific contributions of this dissertation are as follows:

1. In this dissertation, I define the problem of collective entity resolution using relationships between references. I introduce two different approaches for unsupervised collective entity resolution. The relational clustering algorithm combines attributes with relationships in a novel way to measure similarities between clusters. The probabilistic LDA-ER model uses group structures among underlying entities to resolve references. I propose novel inference algorithms for this model using sampling approaches.

2. I perform extensive experiments on multiple real and synthetic datasets and compare against various baselines to demonstrate that collective entity resolution significantly improves performance over traditional approaches that make use of attributes of references. Using synthetically generated data, I also explore structural and other properties of datasets to investigate characteristics that favor or adversely affect collective resolution.

3. I motivate the problem of query-time entity resolution, where entities are resolved on the fly for answering queries over unresolved databases. I formally analyze the dependencies arising from collective resolution and show the validity of a limited-depth recursive expansion process for answering queries. I propose adaptive algorithms that identify the most informative related references for resolving queries collectively. This enables query-time resolution while preserving the performance benefits of collective resolution.

4. As an application of entity resolution in the domain of computational linguistics, I investigate the problem of word sense disambiguation in natural language documents. Using aligned translation threads from parallel texts, I focus on collective sense resolution in multiple languages. I propose two probabilistic models for word sense disambiguation using translations that outperform existing unsupervised approaches for this problem.

There is a large body of related work on entity resolution, as I discuss in Chapter 6. But this dissertation stands out in more ways than one.
Though the entity resolution problem has been around for many years, my relational clustering approach [8] is one of the first to make use of relationships for collective or joint resolution. Since then, the use of relationships and even collective solutions for this problem have gained in popularity, and both probabilistic and non-probabilistic approaches have been proposed by other researchers. Therefore, it is important to appreciate the contributions and the novelty of this dissertation in the light of this related research. The probabilistic model that I propose is one of the very few generative models for noisy and uncertain co-occurrence relations, and unlike most other models, my learning algorithm is completely unsupervised. LDA-ER is also unique in that it uses a group variable to model relationships between entities, thereby avoiding expensive pair-wise relationship variables. The other approach, based on relational clustering, is unique in that it poses collective relational entity resolution as a distance/similarity-based clustering problem. The problem of collective clustering has previously received little attention in the literature, and my proposed approach of using similarity measures that accommodate collective decisions is one of the first solutions to be proposed. Unlike most other work on entity resolution, efficiency is a key concern for all my approaches. This finally culminates in the formulation of the query-time entity resolution problem. Looking beyond entity resolution, clustering at query time in the presence of relationships has not been studied in the literature, to the best of my knowledge. Motivating this problem and proposing a working solution for it also counts as a significant contribution of this dissertation.

The rest of the dissertation is organized as follows. The next two chapters, Chapter 2 and Chapter 3, discuss the two approaches to collective entity resolution. In Chapter 2, I first motivate the entity resolution problem using a bibliographic example and formulate the problem. I also discuss different approaches based on attributes and relationships that may be used to address the entity resolution problem before going into the details of the relational clustering algorithm. Next, in Chapter 3, I describe and evaluate the probabilistic approach to collective entity resolution. Then I move on to the problem of entity resolution for queries in Chapter 4, where I first motivate the problem and then discuss, analyze and evaluate algorithms for query-time entity resolution. Chapter 5 discusses the word sense disambiguation problem. I review related work in entity resolution in Chapter 6 and then finally discuss potential future directions and conclude in Chapter 7.

Chapter 2

Relational Clustering for Collective Entity Resolution

In this chapter, I propose the first solution to the collective entity resolution problem, which is based on a novel unsupervised relational clustering algorithm. Before describing the proposed approach, I first present a more realistic motivating example for entity resolution using the relations between references in Section 2.1 and formalize the relational entity resolution problem in Section 2.2. I explore and compare different approaches for entity resolution and formulate collective relational entity resolution as a clustering problem in Section 2.3. I propose novel relational similarity measures for collective relational clustering in Section 2.4. I discuss the clustering algorithm in further detail in Section 2.5.
In Section 2.6, I describe experimental results using the different similarity measures on multiple real-world datasets. I also present detailed experiments on synthetically generated data to identify data characteristics that indicate when collective resolution should be favored over the more naive approaches, and finally conclude in Section 2.7.

2.1 Motivating Example for Entity Resolution Using Relationships

As a motivating example, I consider the problem of resolving the authors in a database of academic publications similar to DBLP, CiteSeer or PubMed. Consider the following set of four papers, which I will use as a running example:

1. W. Wang, C. Chen, A. Ansari, "A mouse immunity model"
2. W. Wang, A. Ansari, "A better mouse immunity model"
3. L. Li, C. Chen, W. Wang, "Measuring protein-bound fluxetine"
4. W. W. Wang, A. Ansari, "Autoimmunity in biliary cirrhosis"

[Figure 2.1: The references in the different papers in the bibliographic example. References to different entities are shaded differently.]

Now imagine that we would like to find out, given these four papers, which of these author names refer to the same author entities. This involves determining whether paper 1 and paper 2 are written by the same author named Wang, or whether they are different authors. We need to make similar decisions about the Wang from paper 3 and the Wang from paper 4, and all pairwise combinations. We need to answer similar questions about the other author names, Ansari and Chen, as well.

In this example, it turns out there are six underlying author entities: Wang1 and Wang2, Chen1 and Chen2, Ansari and Li. The three references with the name "A. Ansari" correspond to author Ansari and the reference with name "L. Li" to author Li. However, the two references with name "C. Chen" map to two different authors Chen1 and Chen2. Similarly, the four references with name "W. Wang" or "W. W. Wang" map to two different authors. The "Wang" references from the first, second, and fourth papers correspond to author Wang1, while that from the third paper maps to a different author Wang2. This is shown pictorially in Figure 2.1, where references which correspond to the same authors are shaded identically.

There are two different subproblems that are of interest in solving the entity resolution problem. One is figuring out, for any author entity, the set of different name references which may be used to refer to the author. I refer to this as the identification problem. For example, for a real-world entity with the name "Wei Wei Wang", her name may come up as "Wei Wang", "Wei W. Wang", "W. W. Wang", "Wang, W. W." and so on. There may also be errors in the data entry process, so that the name may be incorrectly recorded as "W. Wong" or "We Wang", etc.
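To see why exact string matching fails on such variants, here is a quick illustration (mine, not code from the dissertation) that scores the hypothetical name variants above with a generic edit-based ratio from Python's standard library; the experiments later in this chapter use stronger name measures such as SoftTF-IDF, so difflib is only a stand-in.

```python
# Illustrative only: difflib's ratio stands in for the name similarity
# measures (e.g., SoftTF-IDF) used in the experiments later in this chapter.
from difflib import SequenceMatcher
from itertools import combinations

variants = ["Wei Wei Wang", "Wei Wang", "W. W. Wang", "Wang, W. W.", "W. Wong"]

for a, b in combinations(variants, 2):
    score = SequenceMatcher(None, a, b).ratio()
    print(f"{a!r:>14} vs {b!r:>14} -> {score:.2f}")
```

Variants of the same name can score well below 1.0, while the misspelling "W. Wong" can score close to a genuine variant, which is exactly why thresholds on attribute similarity alone are fragile.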
In addition to the reconciliation of different-looking names which refer to the same underlying entity, a second aspect of the entity resolution problem is distinguishing references that have very similar, and sometimes exactly the same, name and yet refer to different underlying entities. I refer to this as the disambiguation problem. An example of this is determining that the "W. Wang" of paper 1 is distinct from the "W. Wang" of paper 3. The extent of the disambiguation problem depends on the domain. The problem can be exacerbated by the use of abbreviations; many databases (for example PubMed) store only abbreviated names.

[Figure 2.2: (a) The reference graph and (b) the entity graph for the author resolution example.]

The aim is to make use of the relationships that hold among the observed references to resolve them better, and to solve both the identification and disambiguation problems at the same time. As in the social network example, we can represent the relationships as a graph where the vertices represent the author references and the hyper-edges represent the co-authorship relations that hold between them in the dataset. Figure 2.2(a) shows the reference graph for the bibliographic example, where the nodes are the references and the hyper-edges in the graph indicate which references co-occur. Given this graph representation for the data, the goal is to take the hyper-edges into account to better partition the references into entities. Now, in addition to the similarity of the attributes of the references, I consider their relationships as well. In terms of the graph representation, two references that have similar attributes are more likely to be the same entity if their hyper-edges connect to the same entities as well. To see how this can help, observe in Figure 2.2(a) that the Wang references in papers 1, 2 and 4 collaborate with "Ansari"s who correspond to the same author. This makes it more likely that they are the same entity. In contrast, the "Wang" from paper 3 collaborates with different authors, which suggests that it does not refer to the same person as the other cases.

But it seems that we are stuck with a "chicken-and-egg" problem. The identity of a reference depends on those of its collaborators, and the identity of the collaborators depends on the identity of the reference itself. So where do we begin? Intuitively, we start with the resolutions that we are most confident about. For instance, two references with the name "A. Ansari" are more likely to be the same because "Ansari" is an uncommon name, in contrast to references with common names such as "Chen", "Li" or "Wang". This then provides additional evidence for merging other references. In the example, after consolidating the "Ansari"s, the "Wang" references from papers 1, 2 and 4 have a common co-author, which provides evidence for consolidating them. The entity resolution algorithm incrementally constructs the entity graph by considering as evidence the entity relationships that it has already discovered in earlier iterations. Figure 2.2(b) shows the resulting entity graph for the example after all the references have been correctly resolved.

2.2 Entity Resolution Using Relationships: Problem Formulation

In this section, I describe the notation I use for describing the relational entity resolution problem.
In the entity resolution problem, we are given a set of references $R = \{r_i\}$, where each reference $r$ has attributes $r.A_1, r.A_2, \ldots, r.A_k$. The references correspond to some set of unknown entities $E = \{e_i\}$. I introduce the notation $r.E$ to refer to the entity to which reference $r$ corresponds. The problem is to recover the hidden set of entities $E = \{e_i\}$ and the entity labels $r.E$ for individual references given the observed attributes of the references. In addition to the attributes, I assume that the references are not observed independently, but that they co-occur. I describe the co-occurrence with a set of hyper-edges $H = \{h_i\}$. Each hyper-edge $h$ may have attributes as well, which I denote $h.A_1, h.A_2, \ldots, h.A_l$, and I use $h.R$ to denote the set of references that it connects. A reference $r$ can belong to zero or more hyper-edges, and I use $r.H$ to denote the set of hyper-edges in which $r$ participates. In this dissertation, I only discuss entity resolution when each reference is associated with zero or one hyper-edge, but in other domains it is possible for multiple hyper-edges to share references. For example, if we have paper, author and venue references, then a paper reference may be connected to multiple author references and also to a venue reference.

[Figure 2.3: (a) A more abstract representation of the reference graph for the author resolution example; the r's are references and the h's are hyper-edges. (b) An abstract representation of the entity graph for the author resolution example; the nodes are the entities, the sets of references they correspond to are listed, and the h's are hyper-edges.]

Let us now illustrate how the running example is represented in this notation. Figure 2.3(a) shows the references and hyper-edges. Each observed author name corresponds to a reference, so there are ten references $r_1$ through $r_{10}$. In this case, the names are the only attributes of the references, so, for example, $r_1.A$ is "W. Wang", $r_2.A$ is "A. Ansari" and $r_3.A$ is "C. Chen". The set of true entities $E$ is {Ansari, Wang1, Wang2, Chen1, Chen2, Li}, as shown in Figure 2.3(b). References $r_1$, $r_5$ and $r_9$ correspond to Wang1, so that $r_1.E = r_5.E = r_9.E = $ Wang1. Similarly, $r_2.E = r_4.E = r_{10}.E = $ Ansari, $r_3.E = $ Chen1, and so on. There are also the hyper-edges $H = \{h_1, h_2, h_3, h_4\}$, one for each paper. The attributes of the hyper-edges in this domain are the paper titles; for example, $h_1.A_1$ = "A Mouse Immunity Model". The references $r_1$ through $r_3$ are associated with hyper-edge $h_1$, since they are the observed author references in the first paper. This is represented as $h_1.R = \{r_1, r_2, r_3\}$. Also, this is the only hyper-edge in which each of these references participates, so $r_1.H = r_2.H = r_3.H = \{h_1\}$. I similarly represent the hyper-edge associations of the other references.

2.3 Entity Resolution Approaches

In this section, I compare and contrast existing entity resolution approaches. I distinguish between attribute-based, naive relational and collective relational entity resolution. While the attribute-based approach considers only the attributes of the references to be matched, the naive relational approach considers attribute similarities for related references as well. In contrast, the collective relational approach resolves related references jointly. I consider each approach in detail one by one.
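Before turning to the individual approaches, the following minimal sketch (my own illustration, not code from the dissertation) encodes the running example in the notation of Section 2.2 as plain Python dictionaries. The assignment of the paper-3 references $r_6$ and $r_8$ to Li and Chen2 is one consistent choice; the text above fixes the rest of the labeling.

```python
references = {  # r.A: the observed name attribute of each reference
    "r1": "W. Wang",    "r2": "A. Ansari", "r3": "C. Chen",
    "r4": "A. Ansari",  "r5": "W. Wang",
    "r6": "L. Li",      "r7": "W. Wang",   "r8": "C. Chen",
    "r9": "W. W. Wang", "r10": "A. Ansari",
}

hyperedges = {  # h.R: the set of references each hyper-edge (paper) connects
    "h1": {"r1", "r2", "r3"},
    "h2": {"r4", "r5"},
    "h3": {"r6", "r7", "r8"},
    "h4": {"r9", "r10"},
}

def edges_of(r):
    """r.H: the set of hyper-edges in which reference r participates."""
    return {h for h, refs in hyperedges.items() if r in refs}

# The hidden mapping r.E that entity resolution must recover:
true_entities = {
    "Wang1": {"r1", "r5", "r9"}, "Wang2": {"r7"}, "Ansari": {"r2", "r4", "r10"},
    "Chen1": {"r3"}, "Chen2": {"r8"}, "Li": {"r6"},
}

print(edges_of("r1"))  # {'h1'}
```

The three approaches that follow can all be read as different ways of recovering true_entities from references and hyperedges.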
2.3.1 Attribute-based Entity Resolution

This is the traditional approach [44, 30], where a similarity simA(ri, rj) is computed for each pair of references ri, rj based on their attributes, and only those pairs with similarity above some threshold are considered co-referent. I use the abbreviation A to refer to the attribute-based approach. Additionally, transitive closure may be taken over the pair-wise decisions; I denote this approach as A*. Several sophisticated similarity measures have been developed for names, and popular TF-IDF schemes may be used for other textual attributes like keywords. The measure that works best for each attribute can be used. Finally, a weighted combination of the similarities over the different attributes for each reference can be taken as the combined attribute similarity between two references.

In our example, the approach A may allow us to decide that the "W. Wang" references (r1, r5) are co-referent. I may also decide using A that "W. Wang" and "W. W. Wang" (r1, r9) are co-referent, but not as confidently. However, as I have already discussed, attributes are often insufficient for entity resolution, particularly for the disambiguation aspect of the problem. In our example, A is almost certain to mark the two "W. Wang" references (r1, r7) as co-referent, which is incorrect.

2.3.2 Naive Relational Entity Resolution

The simplest way to use relationships to resolve entities is to treat related references as additional attributes for matching. For instance, to determine if two author references in two different papers are co-referent, we can compare the names of their co-authors. In our running example, the naive relational decision about the references "W. Wang" and "W. W. Wang" would consider that both have co-authors with the name "A. Ansari". I refer to this approach as NR. As before, transitive closure can be taken over the pair-wise decisions for NR; I refer to the transitive closure as NR*.

A similar idea has been used in the context of matching in dimensional hierarchies [3]. I generalize the idea for unordered relationships and define the naive relational similarity simNR(hi, hj) between two hyper-edges hi and hj as the best pair-wise attribute match between their references. Since the references in any hyper-edge are not ordered, each reference r ∈ hi can be matched to any reference r' ∈ hj. So for each reference r ∈ hi, I find the best match to hj:

simA(r, hj) = max_{r' ∈ hj} simA(r, r')

For symmetry, I also compute the best match to hyper-edge hi for each reference in hj, and then take the average over all of the references in the two hyper-edges to get simNR(hi, hj). I then use this similarity measure between two hyper-edges to find the naive relational similarity simNR(ri, rj) between two references ri and rj by matching their hyper-edges. When each reference belongs to just one hyper-edge, simNR(ri, rj) can be computed simply as simNR(ri.H, rj.H). Otherwise, I need to make pair-wise comparisons between their hyper-edges. Finally, I take a simple linear combination of the attribute match simA(ri, rj) and the naive relational match simNR(ri, rj) to get the combined similarity for two references ri and rj:

sim(ri, rj) = (1 − α) · simA(ri, rj) + α · simNR(ri, rj),  0 ≤ α ≤ 1   (2.1)

While the naive relational approach improves significantly on the attribute-based approach, it can be misled in domains where most names are frequent and hyper-edges are dense.
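Before turning to the failure case, here is a minimal sketch of how Eq. (2.1) might be computed, assuming the data structures from the earlier sketch and some caller-supplied attribute measure sim_attr (e.g., a name-similarity function); none of these names come from the dissertation's implementation.

```python
def sim_nr_edges(hi, hj, sim_attr):
    """Naive relational similarity between two hyper-edges: for each
    reference, take its best attribute match in the other edge, then
    average over all references in both edges (the symmetric version)."""
    scores = [max(sim_attr(r, s) for s in hj.refs) for r in hi.refs]
    scores += [max(sim_attr(s, r) for r in hi.refs) for s in hj.refs]
    return sum(scores) / len(scores)

def sim_combined(ri, rj, sim_attr, alpha=0.5):
    """Eq. (2.1): blend attribute and naive relational similarity.
    Assumes each reference has zero or one hyper-edge; references
    without one fall back to attribute similarity alone."""
    if not ri.hyperedges or not rj.hyperedges:
        return sim_attr(ri, rj)
    rel = sim_nr_edges(ri.hyperedges[0], rj.hyperedges[0], sim_attr)
    return (1 - alpha) * sim_attr(ri, rj) + alpha * rel
```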
In our example, the two "W. Wang" references r1 and r7 are not co-referent, though they have co-authors with the matching name "C. Chen". Since I only match the strings, naive relational similarity returns a high match value. This may incorrectly lead to the decision that r1 and r7 are co-referent.

2.3.3 Collective Relational Entity Resolution

The problem with the naive relational approach is that it does not reason about the identities of the related references. For the two "Wang" references in the earlier example, the two "C. Chen" co-authors match regardless of whether they refer to Chen1 or Chen2. The correct evidence to use here is that the "Chen"s are not co-referent. In such a setting, in order to resolve the "W. Wang" references, it is necessary to resolve the "C. Chen" references as well, and not just consider their name similarity. This is the goal of collective relational entity resolution (CR), where resolutions are not made independently; instead, one resolution decision affects other resolutions via hyper-edges. I now motivate entity resolution as a clustering problem and propose a relational clustering algorithm for collective relational entity resolution.

Given any similarity measure between pairs of references, entity resolution can be posed as a clustering problem where the goal is to cluster the references so that only those that correspond to the same entity are assigned to the same cluster. A greedy agglomerative clustering algorithm is often used, where at any stage of the process, the current set C = {ci} of entity clusters reflects the current belief about the mapping of the references to entities. I use r.C to denote the current cluster label for a reference; references that have the same cluster label correspond to the same entity. So far, I have discussed similarity measures for attributes; for a clustering algorithm, I need to define similarities between clusters of references.

The goal is to use clustering for collective entity resolution. I now look at how we can define similarity measures between clusters for this purpose. I define the similarity of two clusters ci and cj as:

sim(ci, cj) = (1 − α) · simA(ci, cj) + α · simR(ci, cj),  0 ≤ α ≤ 1   (2.2)

where simA() is the similarity of the attributes and simR() is the relational similarity between the references in the two entity clusters. On analyzing Eq. (2.2), we can see that it reduces to attribute-based similarity for α = 0. Also, the relational aspect of the similarity measure distinguishes it from the naive relational similarity measure of Eq. (2.1). While naive relational similarity measures the attribute similarity of the related references, here I consider the labels of the related clusters that represent entities. This similarity is dynamic in nature, which is one of the most important and interesting aspects of the collective approach. In contrast to attribute-based and naive relational resolution, where the similarity between two references is fixed, for collective resolution the similarity of two references depends on the current cluster labels of the related references and therefore changes as the labels are updated. In our example, the similarity of the two references "W. Wang" and "W. W. Wang" increases once the Ansari references are given the same cluster label.

As I have mentioned earlier, similarity measures for attributes have been studied in great detail. The focus here is on measuring the relational similarity between two clusters of references. The references in each cluster c are connected to other references via hyper-edges.
For collective entity resolution, relational similarity considers the cluster labels of all these connected references. Recall that each reference r is associated with one or more hyper-edges in H. Therefore, the set of hyper-edges c.H that I need to consider for an entity cluster c is defined as

c.H = ∪_{r ∈ R ∧ r.C = c} {h | h ∈ H ∧ r ∈ h.R}

These hyper-edges connect c to other clusters. The relational similarity for two clusters c1 and c2 needs to compare this connectivity pattern to other clusters for c1 and c2. For any cluster c, the set of other clusters to which c is connected via its hyper-edge set c.H forms the neighborhood Nbr(c) of cluster c:

Nbr(c) = ∪_{h ∈ c.H, r ∈ h.R} {cj | cj = r.C}

This defines the neighborhood as a set of related clusters, but the neighborhood can also be defined as a bag or multi-set, in which the multiplicity of the different neighboring clusters is preserved. I will use NbrB(ci) to denote the bag of neighboring clusters. In our example in Figure 2.3(b), the neighborhood of the cluster for Wang1 consists of the clusters for Ansari and Chen1; alternatively, it is the bag of clusters {Ansari, Ansari, Ansari, Chen1}. Note that I do not constrain the definition of the neighborhood of a cluster to exclude the cluster itself. In Section 2.4.6, I discuss how such constraints can be handled when required by the domain. For the relational similarity between two clusters, I look for commonness in their neighborhoods. This can be done in many different ways, as I explore in the following section.

2.4 Neighborhood Similarity Measures for Collective Resolution

We have seen how the neighborhood of a cluster of references can be represented as a set (or alternatively as a multi-set) of cluster labels, and that we can compute the relational similarity between two clusters by considering the similarity of their neighborhoods. Many different metrics have been proposed and evaluated in the literature for measuring commonness between sets; for example, Liben-Nowell and Kleinberg [68] study their use for prediction tasks in social networks. Here I adapt and modify some of these measures and study their applicability for entity resolution.

2.4.1 Common Neighbors

This is the simplest approach for measuring commonness between sets; it counts the number of elements that occur in both. For two clusters ci and cj, their common neighbor score is defined as

CommonNbrScore(ci, cj) = (1/K) |Nbr(ci) ∩ Nbr(cj)|   (2.3)

where K is a large enough constant such that the measure is less than 1 for all pairs of clusters. For two references "John Smith" and "J. Smith", where attribute similarity is not very informative, this score measures the overlap in their connected entities. The greater the number of common entities, the higher the possibility that the two references refer to the same entity as well.

This definition ignores the frequency of connectivity to a neighbor. Suppose "John Smith" has collaborated with the entity "J. Brown" several times, while "J. Smith" has done so only once. To investigate if this information is relevant for entity resolution, I also define a common neighbor score with frequencies, which takes into account multiple occurrences of common clusters in the neighborhoods:

CommonNbrScore+Fr(ci, cj) = (1/K') |NbrB(ci) ∩ NbrB(cj)|   (2.4)
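Assuming clusters are objects with a refs list and that each reference carries its current cluster in a cluster field (illustrative names, not from the dissertation), the neighborhoods and the scores of Eqs. (2.3) and (2.4) might be sketched as follows.

```python
from collections import Counter

def neighbors(cluster, as_bag=False):
    """Nbr(c) / NbrB(c): the cluster labels of all references that share
    a hyper-edge with some reference in this cluster (the cluster itself
    is not excluded, as in the text)."""
    labels = [r2.cluster for r in cluster.refs
                         for h in r.hyperedges
                         for r2 in h.refs]
    return Counter(labels) if as_bag else set(labels)

def common_nbr_score(ci, cj, K=100.0):
    # Eq. (2.3): size of the shared neighborhood, scaled by a constant K
    return len(neighbors(ci) & neighbors(cj)) / K

def common_nbr_score_fr(ci, cj, K=100.0):
    # Eq. (2.4): multiset intersection (min of counts) keeps frequencies
    shared = neighbors(ci, as_bag=True) & neighbors(cj, as_bag=True)
    return sum(shared.values()) / K
```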
2.4.2 Jaccard Coefficient

The main shortcoming of the common neighbor score is the normalizing constant K, which is the same over all pairs of clusters. Consider the situation where we have two "John Smith" clusters, c1 and c2, both of which have the same number of neighbors in common with the "J. Smith" cluster c3. Then they are equally similar to c3 in terms of the common neighbor score. Suppose that all of c1's neighbors are shared with c3, while c2 has a very large neighborhood and only a small fraction of it is shared with c3. When entities have large neighborhoods, finding shared neighbors by chance becomes more likely. In this case, we may want the similarity between c1 and c3 to be greater than the similarity between c2 and c3. We can get around this issue by taking into account the sizes of the neighborhoods. This gives us the Jaccard coefficient for two clusters:

JaccardCoeff(ci, cj) = |Nbr(ci) ∩ Nbr(cj)| / |Nbr(ci) ∪ Nbr(cj)|   (2.5)

As before, we may consider neighbor counts to define the Jaccard coefficient with frequencies, JaccardCoeff+Fr(ci, cj), by using NbrB(ci) and NbrB(cj) in the definition.

2.4.3 Adamic/Adar Similarity

Both the common neighbor measure and the Jaccard coefficient consider all cluster labels in the neighborhood as equally important and significant for determining co-reference. However, this is not always desirable. If a cluster is frequently linked with many different clusters, then its presence in a shared neighborhood is not as significant as that of a cluster which is less frequent. This is similar to the idea behind "inverse document frequency" in the commonly used TF-IDF scheme in information retrieval. Adamic and Adar [1] use this idea for predicting friendship from web-page features. They propose a similarity measure between two web-pages X and Y that individually considers the significance of each element that they share and assigns weights accordingly. This has come to be called the Adamic/Adar score:

similarity(X, Y) = Σ_{shared feature z} 1 / log(frequency(z))

Liben-Nowell and Kleinberg [68] adapted this idea for the task of link prediction in social networks considering node neighborhoods, where they used the size of a node's neighborhood for measuring frequency or commonness. I generalize this idea to propose a class of Adamic/Adar measures for entity resolution. If the "uniqueness" of a cluster label c (or a shared feature, in general) is denoted as u(c), then I define the Adar similarity score of two clusters ci and cj as

Adar(ci, cj) = Σ_{c ∈ Nbr(ci) ∩ Nbr(cj)} u(c) / Σ_{c ∈ Nbr(ci) ∪ Nbr(cj)} u(c)   (2.6)

where the denominator normalizes the score. Now the Jaccard coefficient can be viewed as a special case of the Adar score in which all nodes are equally unique. Also, observe that without the normalization, Eq. (2.6) reduces to the similarity score of Liben-Nowell and Kleinberg [68] for

u(c) = 1 / log(|Nbr(c)|)   (2.7)

I refer to the Adar score that uses this definition of uniqueness as the AdarNbr score. As before, I evaluate two versions: AdarNbr, which considers the set of neighbors, and AdarNbr+Fr, which takes into account the multiplicity of the neighbors.

2.4.4 Adar Similarity with Ambiguity Estimate

While using the neighborhood size of a cluster to measure its uniqueness has been shown to work well in link prediction applications, it may not be appropriate for entity resolution. For entity resolution applications, we do not directly know the neighbors of each entity from the data. The true neighborhood size for any entity cluster is known only after the entity graph has been correctly reconstructed. So using the neighborhood size as a measure of uniqueness at any intermediate stage of the resolution algorithm is incorrect, and is an overestimate of the actual neighborhood size.
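Before turning to the alternative, the measures of Eqs. (2.5)-(2.7) can be sketched under the same assumed structures, with a pluggable uniqueness function u; setting u(c) = 1 recovers the Jaccard coefficient. The guard in u_nbr for tiny neighborhoods is my own hypothetical choice, not from the dissertation.

```python
import math

def adar_score(ci, cj, u):
    """Eq. (2.6): Adar similarity with a pluggable uniqueness function u.
    Reuses neighbors() from the earlier sketch."""
    ni, nj = neighbors(ci), neighbors(cj)
    total = sum(u(c) for c in ni | nj)
    if total == 0:
        return 0.0
    return sum(u(c) for c in ni & nj) / total

def u_nbr(c):
    # Eq. (2.7): uniqueness from neighborhood size (the AdarNbr variant);
    # the max(..., 2) guard avoids log(1) = 0 for singleton neighborhoods
    return 1.0 / math.log(max(len(neighbors(c)), 2))

def jaccard(ci, cj):
    # Eq. (2.5) as the special case where all clusters are equally unique
    return adar_score(ci, cj, u=lambda c: 1.0)
```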
As an alternative, we can use a definition of uniqueness which incorporates a notion of the ambiguity of the names found in the shared neighborhood. To understand what this means, consider two references with the name "A. Aho". Since "Aho" can be considered an uncommon name, they are very likely to be the same person. In contrast, two other references with a common name such as "L. Li" are less likely to be the same person. So I define the ambiguity Amb(r.Name) of a reference name as the probability that multiple entities share that particular name. Intuitively, clusters which share neighbors with uncommon names are more likely to refer to the same entity and should be considered more similar. I define the uniqueness of a cluster c as inversely proportional to the average ambiguity of its references:

u(c) = 1 / Avg_{r ∈ c}(Amb(r.Name))   (2.8)

In general, this approach is not specific to names and can be used with any attribute of the references. I refer to an Adar similarity score which uses this definition of uniqueness as AdarName when applied to the set of neighbors, and as AdarName+Fr when applied to the bag of neighbors.

The AdarName measure is defined in terms of the ambiguity of a reference's name. There are a number of ways to estimate the ambiguity of a name. One scheme that works quite well in our domains is to estimate the probability that two randomly picked references with Name = n correspond to different entities. For a reference attribute A1, denoted R.A1, a naive estimate for the ambiguity of a value of the attribute is

Amb(r.A1) = |σ_{R.A1 = r.A1}(R)| / |R|

where |σ_{R.A1 = r.A1}(R)| denotes the number of references with value r.A1 for A1. This estimate is clearly not good, since the number of references with a certain attribute value does not always match the number of different entity labels for that attribute. We can do much better if we have an additional attribute A2. Given A2, the ambiguity for a value of A1 can be estimated as

Amb(r.A1 | r.A2) = |π_{R.A2}(σ_{R.A1 = r.A1}(R))| / |R|

where |π_{R.A2}(σ_{R.A1 = r.A1}(R))| is the number of distinct values observed for A2 in references with R.A1 = r.A1. For example, we can estimate the ambiguity of a last name by counting the number of different first names observed for it. This provides a better estimate of the ambiguity of any value of an attribute A1 when A2 is not correlated with A1. When multiple such uncorrelated attributes Ai are available for references, this approach can be generalized to obtain better ambiguity estimates.
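A sketch of this two-attribute ambiguity estimate and the resulting uniqueness of Eq. (2.8), assuming names in "F. Last" form as in the running example; a real system would need proper name parsing.

```python
from collections import defaultdict

def ambiguity_estimates(references):
    """Estimate Amb(A1 | A2) with A1 = last name, A2 = first initial:
    the more distinct first initials observed for a last name, the
    more entities are likely to share it."""
    initials = defaultdict(set)
    for r in references:
        last, initial = r.name.split()[-1], r.name[0]
        initials[last].add(initial)
    n = len(references)
    return {last: len(firsts) / n for last, firsts in initials.items()}

def u_name(cluster, amb):
    # Eq. (2.8): uniqueness as the inverse of the average ambiguity of
    # the names of the references in the cluster (the AdarName variant)
    avg = sum(amb[r.name.split()[-1]] for r in cluster.refs) / len(cluster.refs)
    return 1.0 / avg
```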
2.4.5 Higher-Order Neighborhoods

Analysis of the commonness of neighborhoods can be viewed as an investigation of paths of length two between two clusters. I also investigate whether higher-order neighborhoods play a role in detecting co-reference. In addition to the neighborhood similarity measures described, I also evaluate measures which take into account collaboration paths of length three. As the clusters change, it becomes computationally infeasible to recompute all paths between all cluster pairs. Instead, I calculate the second-order neighborhood Nbr2(c) for a cluster c by recursively taking the set union (alternatively, the multi-set union) of the neighborhoods of all neighboring clusters: Nbr2(c) = ∪_{c' ∈ Nbr(c)} Nbr(c'). For paths of length three to be present between two clusters ci and cj, there must be intersections between Nbr(ci) and Nbr2(cj), or vice versa. Then, to find the similarity over paths of length 3 or less for ci and cj, I take the average of the similarities over length-2 paths and length-3 paths:

Path3Sim(ci, cj) = (1/3) [Jaccard(Nbr(ci), Nbr(cj)) + Jaccard(Nbr2(ci), Nbr(cj)) + Jaccard(Nbr(ci), Nbr2(cj))]   (2.9)

2.4.6 Negative Constraints From Relationships

The common relational structure I have considered so far can be seen as positive evidence for inferring that two author references refer to the same underlying author entity. Additionally, there may be negative constraints for entity resolution arising from relationships. For example, in many relational domains, two references appearing in the same hyper-edge cannot refer to the same entity. As a real bibliographic example, consider a paper with co-authors "M. Faloutsos", "P. Faloutsos" and "C. Faloutsos". Despite the similarity of the uncommon last name, in reality these references correspond to distinct author entities. So, for bibliographic domains, we can add a constraint that two references which co-occur cannot refer to the same entity. In domains other than citation data, there may be different relational constraints. In general, we can have a set of negative relational constraints that clusters need to satisfy. I take these into account by setting the similarity between two clusters in Eq. (2.2) to zero if merging them violates any of the relational constraints.

2.5 Relational Clustering Algorithm

Given the similarity measure for a pair of reference clusters, I use a greedy agglomerative clustering algorithm that finds the closest cluster pair at each step and merges them. High-level pseudo-code for the algorithm is provided in Figure 2.4. In this section, I discuss several important implementation and performance issues regarding relational clustering algorithms for entity resolution.

1. Find similar references using blocking
2. Initialize clusters using bootstrapping
3. For clusters ci, cj such that similar(ci, cj)
4.   Insert ⟨sim(ci, cj), ci, cj⟩ into priority queue
5. While priority queue not empty
6.   Extract ⟨sim(ci, cj), ci, cj⟩ from queue
7.   If sim(ci, cj) less than threshold, then stop
8.   Merge ci and cj to new cluster cij
9.   Remove entries for ci and cj from queue
10.  For each cluster ck such that similar(cij, ck)
11.    Insert ⟨sim(cij, ck), cij, ck⟩ into queue
12.  For each cluster cn neighbor of cij
13.    For ck such that similar(ck, cn)
14.      Update sim(ck, cn) in queue

Figure 2.4: High-level description of the relational clustering algorithm
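The following sketch fleshes out the loop of Figure 2.4 with simplified bookkeeping: stale queue entries are skipped lazily rather than removed (line 9), and new similarities are recomputed against all active clusters rather than only the similar/neighbor lists of lines 10-14. The merge method and the initial clusters produced by blocking and bootstrapping are assumed, and sim stands for the combined measure of Eq. (2.2).

```python
import heapq
from itertools import combinations, count

def relational_clustering(clusters, sim, threshold):
    active = set(clusters)
    tie = count()  # tiebreaker so the heap never compares cluster objects
    heap = [(-sim(ci, cj), next(tie), ci, cj)
            for ci, cj in combinations(clusters, 2)]
    heapq.heapify(heap)
    while heap:
        neg, _, ci, cj = heapq.heappop(heap)
        if ci not in active or cj not in active:
            continue                  # stale entry for an already-merged cluster
        if -neg < threshold:
            break                     # best pair fell below threshold (line 7)
        cij = ci.merge(cj)            # merge the closest pair (line 8)
        active -= {ci, cj}
        active.add(cij)
        for ck in active - {cij}:     # refresh similarities involving cij;
            heapq.heappush(heap, (-sim(cij, ck), next(tie), cij, ck))
    return active
```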
2.5.1 Blocking to Find Potential Resolution Candidates

Unless the datasets are small, it is impractical to consider all possible pairs as potential candidates for merging. Apart from the scaling issue, most pairs checked by an O(n²) approach will be rejected, since usually only about 1% of all pairs are true matches. Blocking techniques [52, 79, 72] are usually employed to rule out pairs which are certain to be non-matches. The goal is to separate references into possibly overlapping buckets, and only pairs of references within each bucket are considered as potential matches. The relational clustering algorithm uses the blocking method as a black box, and any method that can quickly identify potential matches while minimizing false negatives can be used. I use a variant of an algorithm proposed by McCallum et al. [72], which I briefly describe below.

The algorithm makes a single pass over the list of references and assigns them to buckets using an attribute similarity measure. Each bucket has a representative reference that is the most similar to all references currently in the bucket. To assign any reference, it is compared to the representative for each bucket. It is assigned to all buckets for which the similarity is above a threshold. If no similar bucket is found, a new bucket is created for this reference. A naive implementation yields an O(n(b + f)) algorithm for n references and b buckets, when a reference is assigned to at most f buckets. This can be improved by maintaining an inverted index over buckets. For example, when dealing with names, for each character I maintain the list of buckets storing last names starting with that character. Then the buckets can be looked up in constant time for each reference, leading to an O(nf) algorithm.
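A sketch of the single-pass bucketing, simplified so that the first reference in a bucket serves as its representative (the dissertation's variant maintains the most central one) and with the inverted index omitted:

```python
def assign_buckets(references, sim_attr, threshold=0.8):
    """Single pass: compare each reference to one representative per
    bucket; join every bucket whose representative is similar enough,
    or start a new bucket if none is."""
    buckets = []                       # list of (representative, members)
    for r in references:
        placed = False
        for rep, members in buckets:
            if sim_attr(r, rep) >= threshold:
                members.append(r)
                placed = True
        if not placed:
            buckets.append((r, [r]))   # r becomes the new bucket's representative
    return buckets
```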
2.5.2 Relational Bootstrapping

Each iteration of the relational clustering algorithm makes use of clustering decisions made in previous iterations. This is achieved by measuring the shared neighborhood for similar clusters, as explained in Subsection 2.3.3. But if we begin with each reference in a distinct cluster, then initially there are no shared neighbors for references that belong to different hyper-edges. So the initial iterations of the algorithm have no relational evidence to depend on. As a result, the relational component of the similarity between clusters would be zero, and merges would occur based on attribute similarity alone. Many such initial merges can be inaccurate, particularly for references with ambiguous attribute values. To get around this, we need to bootstrap the clustering algorithm so that each reference is not assigned to a distinct cluster. Specifically, if we are confident that some reference pair is co-referent, then they should be assigned to the same initial cluster. However, precision is crucial for the bootstrap process, since the algorithm cannot undo any of these initial merge operations. Observe that this bootstrapping is not necessary for approaches that are not collective: for such approaches, the decision for any reference pair is the same irrespective of the decisions for other pairs, so bootstrapping would not have any effect on subsequent decisions. In this subsection, I describe the bootstrapping scheme for relational clustering that makes use of the hyper-edges for improved bootstrap performance. The basic idea is very similar to the naive relational approach described in Subsection 2.3.2, with the difference that I use exact matches instead of similarity for attributes. To determine if any two references should be assigned to the same initial cluster, I check if their attributes match exactly. For references with ambiguous attributes, I also check if the attributes of their related references match. I now discuss this in greater detail.

The bootstrap scheme goes over each reference pair that is potentially co-referent (as determined by blocking) and determines if it is a bootstrap candidate. First, consider the simple bootstrap scheme that looks only at the attributes of two references. It determines which attribute values are ambiguous and which are not using a data-based ambiguity estimate, as described in Subsection 2.4.4. References with ambiguous attribute values are assigned to distinct clusters. Any reference pair whose attribute values match and are not ambiguous is considered to be a bootstrap candidate.

The problem with this simple approach is that it assigns all references with ambiguous attributes to distinct clusters, leading to poor recall in datasets with high ambiguity. When hyper-edges are available, they can be used as evidence for bootstrapping ambiguous references. A pair of ambiguous references forms a bootstrap candidate if their hyper-edges match. Two hyper-edges h1 and h2 are said to have a k-exact-match if there are at least k pairs of references (ri, rj), ri ∈ h1.R, rj ∈ h2.R, with exactly matching attributes, i.e., ri.A = rj.A. Two references r1 and r2 are bootstrap candidates if any pair of their hyper-edges has a k-exact-match. As a bibliographic example, two references with the name "W. Wang" will not be merged during bootstrapping on the basis of the name alone. However, if the first Wang reference has co-authors "A. Ansari" and "C. Chen", and the second Wang reference has co-author "A. Ansari", then they have a 1-exact-match and, depending on the threshold k, they would be merged.

The value of k for the hyper-edge test depends on the ambiguity of the domain. A higher value of k should be used for domains with high ambiguity. Also, when matching hyper-edges in high-ambiguity domains, references with ambiguous attributes are not considered for matches. For example, "C. Chen" may not be considered for a co-author match, since it is a common name. Other attributes of the references, and also of the hyper-edges, when available, can be used to further constrain bootstrap candidates: two references are considered only if these other attributes do not conflict. In the bibliographic domain, author references from two different papers can be merged only if their languages and correspondence addresses match.

After the bootstrap candidates are identified, the initial clusters are created using the union-find approach, so that any two references that are bootstrap candidates are assigned to the same initial cluster. In addition to improving the accuracy of the relational clustering algorithm, bootstrapping reduces execution time by significantly lowering the initial number of clusters, without requiring the most similar cluster pairs to be found or expensive similarity computations to be performed.
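The union-find step might look as follows, where candidate_pairs holds the bootstrap candidates identified by the exact-match tests above (names and structures are the same assumed ones used in the earlier sketches).

```python
def bootstrap_clusters(candidate_pairs, references):
    """Union-find over bootstrap candidate pairs: every pair judged
    co-referent by the exact-match tests lands in the same initial cluster."""
    parent = {r.rid: r.rid for r in references}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving keeps trees shallow
            x = parent[x]
        return x

    for ri, rj in candidate_pairs:
        parent[find(ri.rid)] = find(rj.rid)  # union the two components

    clusters = {}
    for r in references:
        clusters.setdefault(find(r.rid), []).append(r)
    return list(clusters.values())
```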
2.5.3 Merging Clusters and Updating Similarities

Once the similar clusters have been identified and bootstrapping has been performed, the algorithm iteratively merges the most similar cluster pair and updates similarities until the similarity drops below some specified threshold. This is shown in lines 5-14 of Figure 2.4. The similarity update steps for related clusters in lines 12-14 are the key steps that distinguish collective relational clustering from a traditional agglomerative clustering algorithm. In order to perform the update steps efficiently, indexes need to be maintained for each cluster. In this section, I describe the data structure that I maintain for this purpose. In addition to its list of references, I maintain three additional lists with each cluster. First, I maintain the list of similar clusters for each cluster. The second list keeps track of all neighboring clusters. Finally, I keep track of all the queue entries that involve this cluster. For a cluster that has a single reference r, the similar clusters are those that contain references in the same bucket as r after blocking, and the neighbors of this cluster are the clusters containing references that share a hyper-edge with r. Then, as two clusters merge to form a new cluster, all of these lists can be constructed locally for the new cluster from those of its parents. All of the update operations in lines 9-14 can be performed efficiently using these lists. For example, updates for related clusters are done by first accessing the neighbor list and then traversing the similar list for each of them.

2.5.4 Complexity Analysis

Now that I have described each component of the relational clustering algorithm, let us analyze its time complexity. First, I look at how the number of similarity computations required in lines 3-4 of Figure 2.4 is reduced by the blocking method. I consider the worst-case scenario where the bootstrapping approach does not reduce the number of clusters at all. We need to compare every pair of references within each bucket. Suppose we have n references that are assigned to b buckets, with each reference being assigned to at most f buckets. Then, using an optimistic estimate, we have nf/b references in each bucket, leading to O((nf/b)²) comparisons per bucket and a total of O(n²f²/b) comparisons. In all of this discussion, I assume that the number of buckets is proportional to the number of references, i.e., that b is O(n). Additionally, assuming that f is a small constant independent of n, we have O(n) computations.

Now, let us look at the time taken by each iteration of the algorithm. To analyze how many update/insert operations are required, I assume that for each bucket that is affected by a merge operation, all of the O((nf/b)²) computations need to be redone. Then we need to find out how many buckets may be affected by a merge operation. I say that two buckets are connected if any hyper-edge connects two references in the two buckets. Then, if any bucket is connected to k other buckets, each merge operation leads to O(k(nf/b)²) update/insert operations. This is still only O(k) operations when f is a constant independent of n and b is O(n). Using a binary-heap implementation for the priority queue, the extract-max and each insert and update operation take O(log q) time, where q is the number of entries in the queue. So the total cost of each iteration of the algorithm is O(k log q).

Next, I count the total number of iterations that the algorithm may require. In the worst case, the algorithm may have to exhaust the priority queue before the similarity falls below the threshold. So we need to consider the number of merge operations that are required to exhaust a queue that has q entries. If the merge tree is perfectly balanced, then the size of each cluster is doubled by each merge operation, and as few as O(log q) merges are required. However, in the worst case, the merge tree may be q deep, requiring as many as O(q) merges. With each merge operation requiring O(k log q) time, the total cost of the iterative process is O(qk log q). Finally, in order to put a bound on the initial size q of the priority queue, I again consider the worst-case scenario where bootstrapping does not reduce the number of initial clusters. This results in O(n²f²/b) entries in the queue, as shown earlier. Since this is again O(n), the total cost of the algorithm can be bounded by O(nk log n). The one cost that I have not considered so far is that of bootstrapping. We can analyze the bootstrapping by considering it as a sequence of cluster merge operations that do not require any updates or inserts to the priority queue. Then the worst-case analysis of the number of iterations accounts for the bootstrapping as well.
To see how this compares against the attribute and naive relational baselines, observe that they need to make a decision for each pair of references in a bucket. This leads to a worst-case analysis of O(n) using the same assumptions as before. However, each similarity computation is more expensive for the naive relational approach, since it requires a pair-wise match to be computed between two hyper-edges.

2.6 Experimental Evaluation

I evaluated the proposed relational entity resolution algorithm on several real-world and synthetic datasets. I begin with a description of the experiments on real bibliographic datasets.

2.6.1 Evaluation on Bibliographic Data

The real-world datasets describe publications in several different scientific research areas. As in our running example, the goal is to use the co-author relationships in the papers to help discover the underlying author entities in the domain and map the author references to the discovered author entities. I first describe the datasets in more detail, and then describe the evaluation and results.

2.6.1.1 Datasets

CiteSeer: The CiteSeer dataset contains 1,504 machine learning documents with 2,892 author references to 1,165 author entities. For this dataset, the only attribute information available is the author name. The full last name is always given; in some cases the author's full first name and middle name are given, and at other times only the initials are given. The dataset was originally created by Giles et al. [49], and the version which I use includes the author entity ground truth provided by Aron Culotta and Andrew McCallum, University of Massachusetts, Amherst.

arXiv: The arXiv dataset describes high energy physics publications. It was originally used in KDD Cup 2003 (http://www.cs.cornell.edu/projects/kddcup/index.html). It contains 29,555 papers with 58,515 references to 9,200 authors. The attribute information available for this dataset is also just the author name, with the same variations in form as described above. The author entity ground truth for this dataset was provided by David Jensen, University of Massachusetts, Amherst.

BioBase: The third dataset, describing biology publications, is the Elsevier BioBase dataset (http://help.sciencedirect.com/robo/projects/sdhelp/about_biobase.htm), which was used in a recent IBM KDD-Challenge competition. It was created by selecting all Elsevier publications on "Immunology and Infectious Diseases" between the years 1998 and 2001. It contains 156,156 publications with 831,991 author references. Unlike arXiv and CiteSeer, which have complete as well as initialed author names, in BioBase all of the first and middle names are abbreviated. However, the BioBase dataset has other attributes which I use for resolution, including keywords, topic classification, language, country of correspondence and affiliation of the corresponding author. There is a wide variety in the data, with 20 languages, 136 countries, 1,282 topic classifications and 7,798 keywords. Entity labels are available only for the top 100 author names with the highest number of references. I evaluate entity resolution performance for BioBase over the 10,595 references that have these 100 names, although the collective resolution algorithm requires resolving many of the other references as well.

Ground truth was determined for all of these datasets by the owners using a combination of automatic and manual strategies.
The process is not completely free from errors, and I had to perform additional cleaning for some CiteSeer and arXiv references in the course of the experiments. For BioBase, 97% of the labels are estimated to be correct.

Despite the common underlying domain, these datasets vary in a number of important ways. The most important difference is in the inherent uncertainty in the name references. I introduce two measures, which I refer to as ambiguity (corresponding to the disambiguation aspect of resolution) and dispersion (corresponding to the identification aspect), to quantify the uncertainty in the data. I consider a name (last name and first initial) to be ambiguous if multiple entities share that name. In the CiteSeer dataset, only 3 out of 1,185 names are ambiguous, and the average number of entities per ambiguous name is 2.33 (the maximum is 3). For arXiv, 374 of the 8,737 names are ambiguous, and the average number of entities for these ambiguous names is 2.41 (the maximum is 11). For BioBase, the ambiguity is much higher: 84 of the 100 names are ambiguous. The number of entities for each name ranges from 1 to 100, with an average of 32. I introduce dispersion as another measure of the inherent difficulty of the entity resolution problem for a domain. The dispersion for an entity is the number of distinct observed names for that entity. For CiteSeer, 202 out of the 1,164 entities have multiple recorded names; the average and maximum dispersion are 1.24 and 8, respectively. In contrast, 3,083 out of 8,967 entities for arXiv are dispersed over multiple names, with an average dispersion of 1.44 and a maximum of 10. Since I do not have complete ground truth for the BioBase dataset, its dispersion cannot be directly measured. Apart from the level of uncertainty, BioBase differs significantly from the other two datasets in terms of its hyper-edge structure. For BioBase, the number of author references per publication ranges from 1 to 100, with the average being 5.3. In comparison, the averages are 1.92 and 1.98 respectively for CiteSeer and arXiv, the range being 1 to 10 for both.

2.6.1.2 Evaluation

I compare attribute-based entity resolution (A), naive relational entity resolution (NR) that uses the attributes of related references, and the proposed collective relational entity resolution (CR). For the first two algorithms, I also consider variants which perform transitive closure over the pair-wise decisions (A* and NR*). In order to measure the performance of the algorithms, I consider the correctness of the pair-wise co-reference decisions over all references. I evaluate the pair-wise decisions using the F1 measure, which is the harmonic mean of precision and recall. For a fair comparison, I consider the best F1 for each of these algorithms over all possible thresholds for determining matches.
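A sketch of this pairwise evaluation, assuming ground-truth labels in r.entity and a mapping predicted from reference ids to cluster labels (both illustrative conventions from the earlier sketches, not the dissertation's actual harness):

```python
from itertools import combinations

def pairwise_f1(references, predicted):
    """Precision, recall and F1 over all pairs of references: a pair is
    a true positive when it is placed in the same cluster and the two
    references indeed share an entity."""
    tp = fp = fn = 0
    for ri, rj in combinations(references, 2):
        same_true = ri.entity == rj.entity
        same_pred = predicted[ri.rid] == predicted[rj.rid]
        if same_pred and same_true:
            tp += 1
        elif same_pred:
            fp += 1
        elif same_true:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```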
2.6.1.3 Experimental Details

For comparing attributes, which is required for all of the algorithms, I use the Soft TF-IDF similarity for names [30, 16], since it has been shown to perform well for name-based entity resolution. Essentially, Soft TF-IDF augments the TF-IDF similarity for matching token sets with approximate token matching using a secondary string similarity measure. Jaro-Winkler is reported to be the best secondary similarity measure for Soft TF-IDF, but for completeness I also experiment with the Jaro and Scaled Levenstein measures. Scaled Levenstein belongs to the edit-distance family of similarity measures; it assigns unit cost to each edit operation and normalizes the result. Jaro and Jaro-Winkler do not belong to the edit-distance family. They measure the number and order of common characters between strings. Jaro-Winkler is a variant of Jaro that also considers the longest common prefix [30]. Both are well suited for short strings such as personal names. In the case of BioBase, where I had other multi-valued attributes to make use of besides names, I used TF-IDF similarity for them.

Since for CiteSeer and arXiv it is infeasible to consider all pairs as potential duplicates, blocking is employed to extract the potential matches. This approach retains 99% of the true duplicates for both CiteSeer and arXiv. I use bootstrapping for the relational clustering algorithm (CR) on all three datasets: the bootstrap variant for low-ambiguity domains with k = 1 for CiteSeer and arXiv, and the variant for high-ambiguity domains with k = 2 for BioBase.

Table 2.1: Performance of different algorithms on the CiteSeer, arXiv and BioBase datasets. I report the mean and the standard deviation (within parentheses) of the F1 scores obtained using Scaled Levenstein, Jaro and Jaro-Winkler as the secondary similarity measure within Soft TF-IDF.

                    CiteSeer        arXiv           BioBase
A                   0.980 (0.001)   0.974 (0.002)   0.568 (0)
A*                  0.990 (0.001)   0.967 (0.003)   0.559 (0)
NR                  0.981 (0.006)   0.975 (0.016)   0.710 (0)
NR*                 0.991 (0.002)   0.972 (0.039)   0.753 (0)
Bootstrap H-Amb     0.217 (0)       0.119 (0)       0.452 (0)
Bootstrap L-Amb     0.942 (0)       0.977 (0)       0.317 (0)
CR                  0.995 (0)       0.985 (0)       0.819 (0)

Recall that the relational clustering algorithm (CR) and the naive relational approach (NR and NR*) both use a combination weight α. I measure the performance of both algorithms at 20 different values of α between 0 and 1 and report the best performance for each of them over this range. For estimating the ambiguity of references for AdarName, I use last names with first initial as the secondary attribute. This resulted in very good estimates of ambiguity: the ambiguity estimate for a name is strongly correlated (correlation coefficient 0.8) with the number of entities for that name.

2.6.1.4 Results

Table 2.1 gives an overview of the F1 results of the various algorithms on the three datasets. Recall that the collective relational clustering uses bootstrapping to initialize the clusters. In addition to the three entity resolution approaches that I have discussed, I also include for comparison the two bootstrapping approaches: the one for low-ambiguity domains (Bootstrap L-Amb), which is used by CR for CiteSeer and arXiv, and the one for high-ambiguity data (Bootstrap H-Amb), which is employed for BioBase. For CR, Table 2.1 records the performance for the best neighborhood similarity measure, which is Jaccard for CiteSeer and arXiv, and AdarName for BioBase. As mentioned earlier, there are several possible choices for the secondary string metric used with the Soft TF-IDF similarity for comparing names; the results in the table are averages over three choices (Scaled Levenstein, Jaro and Jaro-Winkler), with the standard deviation shown in parentheses.

First, note that the standard deviation in Table 2.1 measures the sensitivity of entity resolution performance to the similarity measure used for names. We can see that the results are not very sensitive to the choice of secondary string metric. In fact, for collective relational entity resolution (CR) the choice is irrelevant, and for the BioBase dataset, in which I have additional features besides the names, the choice is also irrelevant.
For the cases in which there were some small differences, Scaled Levenstein was most often, but not always, the best.

Second, looking at the first line in Table 2.1, note the differences in the performance of attribute-based entity resolution (A) across the three datasets. The attribute-based algorithm performs remarkably well for the CiteSeer database, and its performance on the arXiv dataset is also respectable. This is in keeping with our earlier observation about the "hardness" of the datasets in terms of ambiguity and dispersion. The CiteSeer dataset has very little ambiguity, and arXiv is only moderately more ambiguous. When datasets are not ambiguous, all dispersed entities can be successfully identified simply by lowering the discrimination threshold for determining duplicates. This increases recall without generating false positives. However, this is not possible when there is significant ambiguity in the data, as we see in the case of BioBase. The performance of the bootstrapping algorithms highlights the same trend. For CiteSeer and arXiv, the low-ambiguity version (Bootstrap L-Amb) performs almost as well as the attribute baseline. In a higher-ambiguity dataset such as BioBase, it makes many incorrect matches. The high-ambiguity bootstrap strategy (Bootstrap H-Amb), which is more cautious for ambiguous references, performs poorly for CiteSeer and arXiv due to low recall, but improves performance over Bootstrap L-Amb for BioBase by increasing precision.

Next, observe that the naive relational entity resolution algorithm (NR), which uses the attributes of related references in its similarity calculations, improves performance over A only marginally for CiteSeer and arXiv, while the improvement is quite significant in the case of BioBase. This suggests that while the attributes of related entries can help disambiguation in domains with high ambiguity, there may not be much improvement for less ambiguous domains. The table also shows that the effect of transitive closure on entity resolution performance varies over the datasets. While it improves performance for both A and NR for CiteSeer and arXiv, in the case of BioBase it helps NR but not A. A possible explanation is that transitive closure helps in domains with low ambiguity, but it may result in false identifications in higher-ambiguity domains.

Finally, note that across all three datasets, the collective relational entity resolution algorithm (CR) performs the best. The gains in the less ambiguous domains are more modest, while in the most ambiguous domain the gain is quite significant.

Figure 2.5: Entity resolution performance at different values of α for (a) CiteSeer, (b) arXiv and (c) BioBase.

In addition, the performance improvement of CR over NR highlights the importance of considering the identities of related references rather than just their attributes. Also, since the performance is insensitive to the choice of attribute similarity used, overall CR is more robust than A and NR. Recall that CR, NR and NR* involve a weighting parameter α for combining attribute and relational similarity. As mentioned earlier, the numbers in Table 2.1 record the best performance over different values of α for each of these algorithms.
The best performance is not always achieved at the same value of α for different datasets, or for the 100 different reference names in BioBase. In Figure 2.5 we see how the performance of the different algorithms changes over different values of α for the three datasets. For BioBase, I plot the average performance over all 100 reference names at each value of α. As a reference, I also show the performances of A and A*, which do not depend on α. We can make two interesting observations from the plots. First, the relational clustering algorithm CR consistently outperforms the naive relational baselines (NR and NR*) for all values of α on all three datasets. Second, for CiteSeer and arXiv, the naive relational approach outperforms the attribute-only baseline only marginally for small values of α, and then its performance drops significantly at higher values. It is more stable for BioBase, but its performance still drops below the attribute-only baseline for high values of α. The performance of CR is significantly more stable over varying α for all three datasets. This is another validation of the usefulness of resolving related references instead of merely considering their similarities.

Now I explore CR in more depth by comparing the performance of the algorithm using the different graph-based similarity measures. Table 2.2 shows the performance of collective relational entity resolution with the different proposed measures on the three datasets. There is little difference in performance on the CiteSeer and arXiv datasets. The simplest measure, Common, correctly retrieves almost all duplicates in CiteSeer. Recall that due to the blocking approach, 100% recall, and therefore an F1 score of 1.0, is not attainable for these two datasets.

Table 2.2: Performance (F1) of collective relational entity resolution using different neighborhood similarity measures on the three bibliographic datasets.

              CiteSeer   arXiv   BioBase
Common        0.994      0.984   0.814
Common+Fr     0.994      0.984   0.816
Jaccard       0.994      0.985   0.818
Jaccard+Fr    0.995      0.985   0.818
AdarNbr       0.994      0.984   0.815
AdarNbr+Fr    0.994      0.984   0.816
AdarName      0.994      0.985   0.819
AdarName+Fr   0.995      0.984   0.817
Path3Sim      0.994      0.984   0.812

There is a bit more of an impact on the BioBase results. The numbers do not provide enough evidence to validate the use of frequencies (+Fr) for comparing neighborhoods: it improves performance in some cases and affects it adversely in others. So in the following discussion, I concentrate on the basic similarity measures, where the cluster neighborhood is treated as a set rather than as a multi-set. We can make four observations:

- Jaccard similarity improves performance over Common neighbors. Recall that the difference between the two is in the normalization. This shows the importance of considering the size of the common neighborhood as a fraction of the entire neighborhood.

- AdarNbr performs worse than Jaccard. Recall that Adar similarity considers the importance or uniqueness of each cluster in the shared neighborhood. I pointed out that the "connectedness" of a shared neighbor is not a reliable indicator in our case, since the graph is consolidated over iterations and new hyper-edges are added to each cluster. This is validated by the drop in performance as we move from Jaccard to AdarNbr.
- AdarName performs the best of all the graph-based similarity measures. Recall that AdarName attempts to capture the "uniqueness" of a cluster of references, and this, combined with Adar similarity, works the best of all the neighborhood similarity measures on BioBase.

- Path3Sim has the lowest performance of all the graph-based measures. Recall that Path3Sim explores second-order neighborhoods for detecting co-reference. This suggests that in dense collaboration graphs with many ambiguous entities, where distinct entities with similar attributes have common higher-order neighbors, going beyond the immediate neighborhood can hurt entity resolution performance.

The numbers in Table 2.2 show the average performance of the different measures over all 100 instances in BioBase. However, it is not the case that performance is affected for every instance by changing the similarity measure. For example, performance changes in only 22 of the 100 instances when using Jaccard similarity instead of AdarName similarity, as compared to 80 instances for Jaccard compared to the baseline attribute-based similarity measure. In Figure 2.6, I compare the other measures with Jaccard similarity by measuring the average change in F1 over only the affected instances. We see the same trends as discussed above, but the differences between the measures become more pronounced.

Figure 2.6: Comparison of different relational similarity measures against Jaccard over only the affected instances in BioBase in each case.

2.6.1.5 Execution Time

As we have seen, the use of collective relational entity resolution improves entity resolution performance over attribute-based baselines. However, it is more expensive computationally. Table 2.3 records the execution times in CPU seconds of the baseline algorithms and CR on the three datasets. All execution times are reported on a Dell Precision 870 server with a 3.2GHz Intel Xeon processor and 3GB of memory.

Table 2.3: Execution times of different algorithms in CPU seconds

      CiteSeer   arXiv   BioBase
A     0.1        11.3    3.9
NR    0.1        11.5    19.1
CR    2.7        299.0   45.6

Let us first consider the execution times for CiteSeer and arXiv. As expected, CR takes more time than the baselines, but it is still quite fast. It takes less than 3 seconds for the 2,892 references in CiteSeer and less than 5 minutes for the 58,515 references in arXiv. This is around 9 times as long as the baseline for CiteSeer and 17 times for arXiv. Recall that the complexity of the neighborhood similarity computation is linear in the average connectivity between similar names. The average number of neighbors per entity is 2.15 for CiteSeer and 4.5 for arXiv, so this difference in the degree of relational connectivity explains the difference in execution times between the two datasets. Also, the only available attribute for these two datasets is the author name, and the average number of authors per publication is very small (1.9) for both. So very little extra computation is needed for the naive relational approach over the attribute baseline.

Now let us consider BioBase. The time recorded for BioBase in Table 2.3 is not for cleaning the entire dataset. Rather, it is the average time for collectively resolving references with each of the 100 labeled names. I picked each of the 100 names in the BioBase dataset and extracted all the references relevant for resolving references with that name collectively. The time recorded for BioBase in Table 2.3 is the average time taken by the different algorithms to resolve these "relevant references" for each name. The "relevant references" for each name are found iteratively by including all references that are reachable from the "base references"
that have this name in k steps. The average number of relevant references for each of the 100 instances is 5,510. Table 2.3 shows that the difference in execution time between CR and the baselines is smaller for BioBase. One reason for this is that BioBase has many attributes in addition to the author name, which the attribute-only baseline also needs to take into account. Also, the average number of authors per publication is 5.3 for BioBase, as compared to 1.9 for the other two datasets. This makes the naive relational approach significantly more expensive relative to the attribute-only baseline.

Figure 2.7: Execution time of CR with increasing number of references.

I also used this iterative setup to explore how the collective relational entity resolution algorithm scales with an increasing number of references. I created 4 datasets of varying sizes from arXiv and BioBase. Figure 2.7 shows that CR scales linearly with an increasing number of references in the dataset. Recall that the complexity of CR is O(nk log n) for n input references, where k represents the degree of connectivity among the references.

2.6.2 Experiments on Synthetic Data

As we saw in the previous section, the benefit of using collective relational entity resolution varied across the different datasets. I attribute this performance difference to differences in the structural properties of the datasets, such as the fraction of references that are ambiguous, the number of neighbors per entity, etc. To better understand how these different structural characteristics affect the performance of collective relational entity resolution, I also experiment with synthetically generated data. I explain the synthetic data generation process in Appendix A. Using this generator, I can control the different structural characteristics, such as the average number of references in each hyper-edge, the percentage of ambiguous references in the data and the density of relationships among the underlying entities. As I explain in Appendix A, the synthetic data generator is not tailored solely to bibliographic data, but can model general relationships between entities, as in social network data or email data.

Figure 2.8: Performance of different entity resolution algorithms on data synthetically generated by varying different structural parameters such as (a) the average size of hyper-edges, (b) the percentage of ambiguous references and (c) the average number of neighbors per entity. Standard deviations are shown using error bars.

I performed three sets of experiments on synthetically generated data. In all of the experiments I consider the average performance over 200 different runs. In the first experiment, I studied the effect of the number of references in each hyper-edge. The objective of this experiment is two-fold. Consider a collaboration graph where an entity e has many neighbors. If hyper-edges are small in size, then two hyper-edges involving references r1 and r2 corresponding to e may not have any other entities in common. Then it is not possible to identify r1 and r2 even using relational similarity.
Secondly, in ambiguous domains, a single shared neighbor may not be enough to distinguish between two entities. In both cases, collective resolution is expected to benefit from larger hyper-edge sizes. In each run of this experiment, I first created an entity graph by adding 100 entities and 200 binary relationships. Then I created different reference datasets, each with 500 hyper-edges. I varied pc, which led to different numbers of references in the edges. Figure 2.8(a) shows the performance of the different entity resolution algorithms on these datasets. We see that while the performance of the attribute baselines (A and A*) does not change, the performance of CR improves with an increasing number of references per hyper-edge. Interestingly, the performance of the naive relational approach (NR*) degrades with an increasing number of references. This demonstrates the importance of resolving related names instead of considering only their attribute similarities.

In the second experiment, I varied the number of ambiguous references in the data. Collective resolution is particularly useful for ambiguous references. It may be possible to address the problem of identifying dispersed references for any entity by using a smaller similarity threshold with the attribute-only baseline. In contrast, disambiguation cannot be done using attributes, but is often possible using relationships. So I expected the collective resolution approach to show larger improvements over the baseline for more ambiguous data. I created five sets of datasets, each with 100 entities, but with different ambiguous attribute probabilities pa. Then I added 200 binary relations between these entities and generated 500 hyper-edges with an average of 2 references per hyper-edge. Figure 2.8(b) compares entity resolution performance for the different algorithms on these datasets. As expected, the performance of all algorithms drops with an increasing percentage of ambiguous references. However, the performance drop for CR is significantly slower than those for the attribute and naive relational baselines, since the entity relationships help to make the algorithm more robust. As a result, the gap between CR and the baselines increases as the percentage of ambiguous references in the data increases.

In the final experiment, I explored the impact of varying the number of relationships between the underlying entities. In the extreme situation where there are no relationships between entities, clearly no improvement can be obtained using collective resolution. At the other extreme, when all entities are connected to each other, there is no pattern in the relationships that collective resolution can exploit. The objective of this experiment was to explore how increased connectivity among entities affects collective resolution. I first created a set of 100 entities. Then I created different entity graph structures by adding different numbers of relations between the entities. As before, I generated 500 hyper-edges (with an average of 2 references per hyper-edge) from each of these different entity graphs and compared the performance of the different algorithms on the resulting datasets. The results are shown in Figure 2.8(c). First note that, as expected, the performance of the attribute baselines (A and A*) does not change significantly, since they do not depend on the relationships. The naive relational approaches (NR and NR*) degrade in performance with larger neighborhood sizes, again highlighting the importance of resolving related references.
The performance of CR increases initially as the number of relationships increases. However, it peaks when the average number of neighbors per entity is around 2 and then starts falling off. In fact, it falls below the attribute baseline when the neighborhood size increases to 8. This is an interesting result: it shows that an increasing number of relationships does not always help collective entity resolution. As more relationships are added between entities, the relationship patterns become less informative and may actually hurt performance. In this experiment, the probability of ambiguous attributes p_a was 0.3. We observe the same trend for other values of p_a; the only change is the position of the peak, which occurs earlier as p_a is increased.

2.7 Conclusion

In summary, the relational clustering algorithm for collective entity resolution using relations is a promising approach that improves resolution performance over attribute-based and naive relational baselines. It has several nice properties: it is simple to understand, and the incremental evidence in each iteration is easy to interpret. It can be customized for different domains by using the attribute similarity measure that works best. It is also fast, owing to an efficient implementation; it can resolve databases with 60,000 references in less than 5 minutes. All of these features make it an attractive tool for domains that require fast resolution with interpretable evidence.

On the other hand, this approach has certain limitations as well. For the first limitation, recall that the similarity measure in Eqn. 2.2 involves a weighting parameter α for combining attribute and relational similarity. It is not clear how the optimal value of α should be chosen in each case and, for most of the comparisons, I consider the best F1 score over all values of α. Figure 2.5 shows the performance for a fixed value of α, in contrast to a different optimal value for each case. It demonstrates that there are significant performance improvements using CR for any value of α over its entire range. Recall that the similarity measure uses only attributes when α = 0 and only relations when α = 1. For CiteSeer and arXiv, performance does not vary significantly with α. Since BioBase has much higher ambiguity in terms of attributes (many references have exactly the same name, and therefore mostly the same countries, and all papers are from the same area), resolution performance improves with increasing α.

Secondly, as with any clustering algorithm, determining the termination threshold is an issue. Note that this comes up for all of the baselines as well, and here I report the best accuracy over all thresholds. This is an area of ongoing research. One other issue with the agglomerative clustering approach is that it is greedy in selecting clusters to merge, and the final solution can differ for a different merge sequence. Also, this approach is monotonic, in that clusters only merge: two clusters, once merged, cannot split later to account for new evidence that might be discovered. I address these issues with the probabilistic generative model, which I describe in the next chapter.
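To summarize the mechanics behind the limitations just discussed, here is a minimal sketch of α-weighted similarity in the spirit of Eqn. 2.2, driving a greedy agglomerative loop with a termination threshold. All names are hypothetical, each cluster is assumed to be a set of reference ids, and (unlike CR proper) relational similarities are not refreshed as neighboring clusters merge, so this shows the skeleton only.

```python
import heapq

def combined_sim(c1, c2, alpha, attr_sim, rel_sim):
    # Eqn. 2.2 shape: alpha = 0 uses only attributes, alpha = 1 only relations.
    return (1 - alpha) * attr_sim(c1, c2) + alpha * rel_sim(c1, c2)

def greedy_resolve(clusters, similarity, threshold):
    """Greedy agglomerative resolution: repeatedly merge the most similar
    pair of clusters until the best similarity drops below the threshold.
    Merges are never undone (the monotonicity limitation noted above)."""
    heap = []
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            heapq.heappush(heap, (-similarity(clusters[i], clusters[j]), i, j))
    active = set(range(len(clusters)))
    while heap:
        neg, i, j = heapq.heappop(heap)
        if -neg < threshold:
            break                  # termination threshold reached
        if i not in active or j not in active:
            continue               # stale pair; a side was merged away
        merged = clusters[i] | clusters[j]   # union of reference sets
        active -= {i, j}
        k = len(clusters)
        clusters.append(merged)
        for m in active:
            heapq.heappush(heap, (-similarity(clusters[m], merged), m, k))
        active.add(k)
    return [clusters[i] for i in sorted(active)]
```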
Chapter 3

A Latent Dirichlet Model for Unsupervised Entity Resolution

In this chapter, I introduce a non-parametric probabilistic approach for solving the collective entity resolution problem that addresses some of the limitations of the relational clustering approach. This chapter is organized as follows. First, I present a motivating example in Section 3.1. In Section 3.2, I adapt the LDA model for document authors and extend it for entity resolution in Section 3.3. The sampling framework for inference is presented in Section 3.4. In Section 3.5 and Section 3.6, I describe how entity attributes are modeled. Section 3.7 describes a novel algorithm for determining the number of entities, and in Section 3.8 and Section 3.9 I explore parameter choices and algorithmic improvements. Finally, I present experimental results on real and synthetic data in Section 3.10 and conclude in Section 3.11.

3.1 A Motivating Example

In this section, I introduce a concrete bibliographic example to explain the entity resolution problem for authors and motivate the proposed approach. Consider as an example six real paper citations P1 through P6 from CiteSeer:

P1: "JOSTLE: Partitioning of Unstructured Meshes for Massively Parallel Machines", C. Walshaw, M. Cross, M. G. Everett, S. Johnson

P2: "Partitioning & Mapping of Unstructured Meshes to Parallel Machine Topologies", C. Walshaw, M. Cross, M. G. Everett, S. Johnson, K. McManus

P3: "Dynamic Mesh Partitioning: A Unified Optimisation and Load-Balancing Algorithm", C. Walshaw, M. Cross, M. G. Everett

P4: "Code Generation for Machines with Multiregister Operations", Alfred V. Aho, Stephen C. Johnson, Jeffrey D. Ullman

P5: "Deterministic Parsing of Ambiguous Grammars", A. V. Aho, S. C. Johnson, J. D. Ullman

P6: "Compilers: Principles, Techniques, and Tools", A. Aho, R. Sethi, J. Ullman

Each of the 6 papers has its own author references. For instance, the first paper P1 has four references: 'C. Walshaw', 'M. Cross', 'M. G. Everett' and 'S. Johnson'. In all, we have 21 references in the 6 papers. The goal is to find out how many different author entities these references correspond to and which reference maps to which entity. Ground truth tells us that all of the Aho's map to the same author entity, as do the Everett's and the Ullman's. The interesting case here is that of Johnson. The four Johnson references correspond to two Johnson entities: those in papers P4 and P5 correspond to Stephen C. Johnson from Bell Labs, while those in papers P1 and P2 map to Steve P. Johnson from the University of Greenwich, London. However, going by just the names of the references, it is not clear why 'Stephen C. Johnson' is not 'S. Johnson' when 'Alfred V. Aho' is the same as 'A. Aho'.

Figure 3.1: Author entities in two different collaboration groups and two generated papers. The ovals are the entities belonging to groups shown as encapsulating rectangles. Dotted rectangles represent papers with author references shown as smaller solid rectangles. Each paper is generated by the group above it. [Group G1 contains Alfred V. Aho, Stephen C. Johnson, Jeffrey D. Ullman and Ravi Sethi, generating paper P5; group G2 contains Chris Walshaw, Mark Cross, Martin Everett, Kevin McManus and Steve P. Johnson, generating paper P1.]

The goal will be to make use of the collaboration relationships to make these contrasting inferences simultaneously. We would like to be able to infer from the collaborations that there are two different collaboration groups in this example, and that authors are more likely to publish with other authors from the same group. As illustrated in Fig. 3.1, the first group G1 has Aho, Ullman and Sethi as member authors. The other group G2 has Walshaw, Cross, Everett and McManus.
Stephen C. Johnson is associated with the first collaboration group, while the S. Johnson from papers P1 and P2 is a different person, since he is associated with the second collaboration group.

In order to make these inferences, the model introduces an entity label and a group label for each reference, both of which are hidden and need to be inferred. The inference procedure is collective in that the inferences cannot be made independently for each reference; their relationships to other references need to be considered as well. Also, the group and entity labels are inter-dependent. The entity labels for the two Johnson's depend on their group labels, as we just saw, and the group labels depend on the entity labels in turn. Sethi from paper P6 and Johnson from paper P5 belong to the same group since they are tied together by the identical entity labels for the Aho's and Ullman's in the two papers. These two hidden variables are the key distinctions of the model in comparison to some other recent ones that have been proposed. Most other approaches introduce a decision variable for each potential duplicate pair to infer whether or not the pair corresponds to the same entity, while I introduce two variables for each reference in the data. As data sizes grow, I believe that this distinction has a significant impact.

It is interesting to note the role of papers P3 and P6 in this collective inference for the Johnson's, though neither of them contains a Johnson reference. They help to reinforce our belief that there are two distinct, tightly knit groups or communities whose member authors collaborate strongly with each other. Observe that frequent collaborations between Walshaw and Aho, or Everett and Ullman, for example, would have the opposite effect. Then we would think there is one collaboration group, as opposed to two, and therefore all Johnson's would be more likely to be the same author.

Not surprisingly, inferring the entity labels exactly turns out to be intractable. For this model, I propose an effective Gibbs sampling approach for approximate inference. Also, one critical aspect of the inference procedure is discovering the likely number of entity labels, since the actual entities are hidden from us. I show how the number of entities can be inferred as well. Though I use the bibliographic domain of papers and authors, the model is applicable in a straightforward manner to other domains where noisy references to person entities are observed together. Examples include names of people traveling together on the same flight, names appearing together in the same email, or groups of people attending the same meeting. Furthermore, this approach can be generalized to model other resolution problems. A very similar model may be used for word sense resolution in natural language documents, where the references are word occurrences and the senses are the entities to be resolved.

3.2 LDA Model for Authors

The idea of groups has been used for probabilistic modeling of natural language documents. Most commonly, documents are viewed as bags of words, and the probabilistic approaches aim to represent the documents as mixtures over underlying 'topics', which can be imagined as probabilistic groups of semantically related words. This is very similar to the idea of modeling groups of underlying entities, allowing entities that belong to the same group to co-occur in the data. The Latent Dirichlet Allocation (LDA) model was proposed by Blei et al. [21] as a mixture model for textual documents.
In essence, it is a three-level model where each document (or co-occurrence relation, in our terminology) is modeled as a mixture over topics (or groups, in our discussion), and then each word for that document is generated in two steps, by first sampling a topic and then a word from that topic. Though performing exact inference is not tractable in this model, it is a popular approach for modeling topic groups in the machine learning literature. Many authors have built on it since [93, 71], and efficient strategies for approximate inference have also been proposed [21, 51, 77].

Figure 3.2: Plate representation for (a) the group mixture model for authors and (b) the group mixture model for author resolution from ambiguous references. Observed variables are shaded.

In this section, I adapt the LDA model to a group mixture model for author entities. In this chapter, I slightly modify the notation to suit the context. Entities are authors here, so I use the symbol a instead of e to denote author entities. Also, since each document acts as a link between its co-author references, I use the symbol d instead of h for (hyper-)link labels. I start with the simple case where there is no ambiguity in the author references. In the next section, I expand the model to handle ambiguous author references and propose inference algorithms suited to the new model.

Consider a collection of D documents and a set of A authors corresponding to the authors of the documents. We have a set of R author references, {a_1, ..., a_R}. Each document can have multiple authors and, for now, I assume the authors of each document are observed. For an author reference a_i, I use d_i to denote the document in which it occurs. Further, I introduce the notion of collaborative author groups: groups of authors that tend to co-author together. I assume that there are T different groups. Each author reference a_i has an associated group label z_i that takes values from 1 through T. The probabilistic model is given in plate notation in Figure 3.2(a). The probability distribution over authors for each group is represented as a multinomial with parameters \Phi_j, so the probability P(a = i | z = j) of the ith author in the database being chosen for the jth group is \Phi_{ji}. We have T different multinomials, one for each group. Each paper d is modeled as a mixture over the T groups. The distribution used is again a multinomial, with parameters \theta_d, so the probability P_d(z = j) of the jth group being chosen for document d is \theta_{dj}. Each \theta_d is drawn from a Dirichlet distribution with hyperparameters \alpha; similarly, each \Phi_j is drawn from a Dirichlet distribution with hyperparameters \beta.
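To make this generative process concrete, the following is a minimal sketch of sampling co-author lists under the group mixture model just described. It assumes the unambiguous setting of this section, and all function and parameter names are illustrative rather than taken from the dissertation.

```python
import numpy as np

def generate_documents(D, A, T, alpha, beta, refs_per_doc=3, seed=0):
    """Sample author lists under the group mixture model of Sec. 3.2:
    Phi[j] is the per-group multinomial over authors (Dirichlet(beta)),
    theta_d the per-document mixture over groups (Dirichlet(alpha))."""
    rng = np.random.default_rng(seed)
    Phi = rng.dirichlet(np.full(A, beta), size=T)   # T x A matrix
    docs = []
    for _ in range(D):
        theta = rng.dirichlet(np.full(T, alpha))
        doc = []
        for _ in range(refs_per_doc):
            z = rng.choice(T, p=theta)    # group label for this reference
            a = rng.choice(A, p=Phi[z])   # author drawn from that group
            doc.append((z, a))
        docs.append(doc)
    return docs
```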
3.3 LDA Model for Author Resolution

In the previous section, I assumed that the author identity can be determined unambiguously from each author reference. However, when we are dealing with author names, this is typically not the case. The same author may be represented in a variety of ways: 'Alfred V. Aho', 'Alfred Aho', 'AV Aho', etc. There may be mistakes due to typos or extraction errors. Finally, two 'S. Johnson's may not refer to the same author entity; one may refer to 'Stephen C. Johnson' and another to 'Steve P. Johnson'. The result is that we are no longer sure of the mapping from the author reference to the author entity. We must resort to inference to identify the true author for each reference. To capture this, I associate an attribute v_a with each author a. In addition, I add an extra level to the model that probabilistically modifies the author attributes to generate the references r = {r_1, r_2, ..., r_R}. Each reference is generated by first sampling a group z and then an author entity a, as before. Then the author reference r is generated from a by modifying the attribute v_a according to a noise model N. I use a relatively sophisticated noise model that I explain in Section 3.6. The probability of generating an author reference r from a particular author entity is defined as P(r | v_a). The conditional probabilities for each reference are normalized to sum to 1 over all author entities. It is the reference r that is observed, while the entity a and the group label z are hidden variables. The LDA-ER model is represented in Figure 3.2(b).

Illustrating this in the context of the motivating example in Fig. 3.1, we have already seen how the three author entities are chosen for paper P1. The attributes v_a for the three authors are 'Alfred V. Aho', 'Stephen C. Johnson' and 'Jeffrey D. Ullman'. However, the complete and correct names do not always appear in papers or citations. In this case, the noise process modifies the attributes of the three selected entities to generate 'A. V. Aho', 'S. C. Johnson' and 'J. D. Ullman' as the three author references in the paper.

The probability of generating the attributes r of the set of references for a corpus, given parameters \alpha, \beta and V, can be expressed as

P(\mathbf{r} \mid \alpha, \beta, V) = \prod_d P(\mathbf{r}_d \mid \alpha, \beta, V)
  = \prod_d \sum_{\mathbf{a}_d} P(\mathbf{r}_d \mid \mathbf{a}_d, V) \, P(\mathbf{a}_d \mid \alpha, \beta)
  = \int P(\Phi \mid \beta) \prod_d \sum_{\mathbf{a}_d} P(\mathbf{r}_d \mid \mathbf{a}_d, V) \int P(\theta \mid \alpha) P(\mathbf{a}_d \mid \theta, \Phi) \, d\theta \, d\Phi
  = \int P(\Phi \mid \beta) \prod_d \int P(\theta \mid \alpha) \prod_{r \in d} \sum_a P(r \mid v_a) P(a \mid \theta, \Phi) \, d\theta \, d\Phi
  = \int P(\Phi \mid \beta) \prod_d \int P(\theta \mid \alpha) \prod_{r \in d} \sum_a P(r \mid v_a) \sum_j P(z = j \mid \theta) P(a \mid \Phi_j) \, d\theta \, d\Phi
  = \int P(\Phi \mid \beta) \prod_d \int P(\theta \mid \alpha) \prod_{r \in d} \sum_a P(r \mid v_a) \sum_j \prod_i (\theta_j \Phi_{ji})^{\delta_i(a)} \, d\theta \, d\Phi    (3.1)

where \delta_i(a) is 1 if a = i and 0 otherwise.

3.4 Inference using Gibbs Sampling

In general, the integral in Eq. (3.1) is intractable due to the coupling between \theta and \Phi. Different approximations have been proposed, including variational methods [21], Gibbs sampling [51] and expectation propagation [77]. I follow the approach proposed by Griffiths and Steyvers [51], where \theta and \Phi are not directly estimated as parameters. Instead, the posterior distribution P(z, a | r) is first constructed, and then \theta and \Phi are estimated from this posterior distribution. The joint probability can be derived from Eq. (3.1) as

P(\mathbf{z}, \mathbf{a}, \mathbf{r}) = P(\mathbf{z}) P(\mathbf{a} \mid \mathbf{z}) P(\mathbf{r} \mid \mathbf{a})    (3.2)

where

P(\mathbf{z}) = \left( \frac{\Gamma(T\alpha)}{\Gamma(\alpha)^T} \right)^{D} \prod_{d=1}^{D} \frac{\prod_t \Gamma(\alpha + C^{DT}_{dt})}{\Gamma(T\alpha + C^{DT}_{d\cdot})}    (3.3)

is the probability of the joint group assignment to all references,

P(\mathbf{a} \mid \mathbf{z}) = \left( \frac{\Gamma(A\beta)}{\Gamma(\beta)^A} \right)^{T} \prod_{t=1}^{T} \frac{\prod_a \Gamma(\beta + C^{AT}_{at})}{\Gamma(A\beta + C^{AT}_{\cdot t})}    (3.4)

is the conditional probability of the authors given the groups, and

P(\mathbf{r} \mid \mathbf{a}) = \prod_{i=1}^{R} P(r_i \mid v_{a_i})

is the conditional probability of the references given the authors. Here C^{DT}_{dt} is the number of times group t has been observed for the references in document d, and C^{DT}_{d\cdot} = \sum_t C^{DT}_{dt}. Similarly, C^{AT}_{at} is the number of times references to author a have been observed with group label t in all documents. I construct a Markov chain that converges to the posterior distribution P(z, a | r) and then draw samples from this Markov chain. Each state in the Markov chain is an assignment of a group label and an author label to all R references.
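Before turning to the sampler itself, note that the log of the joint distribution in Eqs. (3.2)-(3.4) can be evaluated directly from the two count matrices. The sketch below is illustrative only; names are hypothetical, and the noise term P(r | a) is abstracted as a precomputed log-likelihood.

```python
import numpy as np
from scipy.special import gammaln

def log_joint(C_DT, C_AT, alpha, beta, log_p_refs_given_authors):
    """log P(z, a, r) from Eqs. (3.2)-(3.4). C_DT[d, t] counts group t in
    document d; C_AT[a, t] counts author a with group label t; the last
    argument is sum_i log P(r_i | v_{a_i})."""
    D, T = C_DT.shape
    A = C_AT.shape[0]
    # log P(z): one Dirichlet-multinomial term per document, Eq. (3.3).
    log_pz = D * (gammaln(T * alpha) - T * gammaln(alpha))
    log_pz += (gammaln(alpha + C_DT).sum()
               - gammaln(T * alpha + C_DT.sum(axis=1)).sum())
    # log P(a | z): one Dirichlet-multinomial term per group, Eq. (3.4).
    log_pa = T * (gammaln(A * beta) - A * gammaln(beta))
    log_pa += (gammaln(beta + C_AT).sum()
               - gammaln(A * beta + C_AT.sum(axis=0)).sum())
    return log_pz + log_pa + log_p_refs_given_authors
```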
In the Gibbs sampling approach, the labels for each reference are sequentially sampled, conditioned on the current labels of all other references. By construction, this Markov chain converges to the target posterior distribution. However, I first need to define the full conditional distribution P(z_i = t, a_i = a | z_{-i}, a_{-i}, r), where z_{-i} is the set of all but the ith group label and a_{-i} the set of all but the ith author label. In words, this is the probability that the ith reference comes from the tth group and the ath author, given the current group and author assignments to all other references. I derive this full conditional distribution as

P(z_i = t, a_i = a \mid \mathbf{z}_{-i}, \mathbf{a}_{-i}, \mathbf{r}) \propto \frac{C^{DT}_{(-i)d_i t} + \alpha}{C^{DT}_{(-i)d_i \cdot} + T\alpha} \cdot \frac{C^{AT}_{(-i)at} + \beta}{C^{AT}_{(-i)\cdot t} + A\beta} \cdot P(r_i \mid v_a)

The factorization makes intuitive sense: the first term is the probability of group t in document d_i, the second is the probability of author a in group t, and the third is the probability of the author attribute v_a being corrupted into the ith reference attribute. Instead of sampling z_i and a_i as a block, they can be sampled separately:

P(z_i = t \mid \mathbf{z}_{-i}, \mathbf{a}, \mathbf{r}) \propto \frac{C^{DT}_{(-i)d_i t} + \alpha}{C^{DT}_{(-i)d_i \cdot} + T\alpha} \cdot \frac{C^{AT}_{(-i)a_i t} + \beta}{C^{AT}_{(-i)\cdot t} + A\beta}    (3.5)

P(a_i = a \mid \mathbf{z}, \mathbf{a}_{-i}, \mathbf{r}) \propto \frac{C^{AT}_{(-i)a t_i} + \beta}{C^{AT}_{(-i)\cdot t_i} + A\beta} \cdot P(r_i \mid v_a)    (3.6)

3.5 Modeling Author Attributes

In the previous section, I assumed that the author attribute values v_a are known. In general, the author attributes will not be known and will need to be inferred from the references. The conditional distribution for sampling groups z_i is not directly affected by the attributes. However, the attributes influence the assignment of author labels a_i, since a reference r_i is more likely to be assigned to an author with similar attributes. Conversely, any author attribute v_i depends on the references that have author label i. Incorporating a prior P(v) = \prod_{i=1}^{A} P(v_i) into the joint distribution in Eq. (3.2), I derive the conditional distribution for assigning a value v to v_i, given all author labels and references, as

P(v_i = v \mid \mathbf{a}, \mathbf{r}) \propto P(v) \prod_{j=1}^{R} P(r_j \mid v)^{\delta_i(a_j)}

Intuitively, v_i should be set to the most likely value that explains the generation of the references assigned to author i. For example, if multiple "J. S. Smith" and "John Smith" references have been assigned author label i, along with the reference "Jhon Smth", then the author attribute v_i is most likely to be "John S. Smith". The sampling algorithm now also samples the author attributes v_i iteratively, conditioned on the references and the current author assignments, along with sampling the group and entity labels for each reference. For 'free' authors, to which no references are currently assigned, I set the attribute to a special null value. In order to make the model prefer free authors over assigned authors, I assign this null value a higher prior probability than all other attribute values.

3.6 Noise Model

The different ways of distorting or modifying an author attribute into a reference in a document are captured by the noise model N. It handles first, middle and last names independently. The first name can be initialed with probability pFI, dropped with probability pFD, or retained as a whole with probability pFR, where pFI + pFD + pFR = 1. There are similar parameters pMI, pMD and pMR for the middle name. The probabilities of the first and middle initials being incorrect are pFIr and pMIr; these are expected to be lower than pR. Last names, and retained first or middle names, may be corrupted by characters being inserted, deleted or replaced with probabilities pI, pD and pR respectively. The minimum numbers of insertion (nI), deletion (nD) and replacement (nR) operations for mutating an author attribute v_a into a reference r are obtained using the edit distance for strings. The mutation probability is then

P(r \mid v_a) = p_I^{n_I} \, p_D^{n_D} \, p_R^{n_R}
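A minimal sketch of this last-name distortion likelihood follows. The edit-operation counts come from a standard Levenshtein-style dynamic program, and the parameter names mirror the text (pI, pD, pR); the handling of first and middle initials is simplified away here.

```python
def edit_op_counts(source, target):
    """Minimum-cost insertion/deletion/replacement counts (nI, nD, nR)
    for mutating `source` into `target`, by dynamic programming."""
    m, n = len(source), len(target)
    cost = [[None] * (n + 1) for _ in range(m + 1)]
    cost[0][0] = (0, 0, 0)
    for i in range(1, m + 1):
        cost[i][0] = (0, i, 0)            # delete all source characters
    for j in range(1, n + 1):
        cost[0][j] = (j, 0, 0)            # insert all target characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d = cost[i - 1][j - 1]
            if source[i - 1] != target[j - 1]:
                d = (d[0], d[1], d[2] + 1)                     # replace
            ins = cost[i][j - 1]
            dele = cost[i - 1][j]
            cost[i][j] = min(d,
                             (ins[0] + 1, ins[1], ins[2]),     # insert
                             (dele[0], dele[1] + 1, dele[2]),  # delete
                             key=sum)
    return cost[m][n]

def mutation_probability(v_a, r, p_ins, p_del, p_rep):
    """P(r | v_a) = pI^nI * pD^nD * pR^nR, as in Section 3.6."""
    n_ins, n_del, n_rep = edit_op_counts(v_a, r)
    return (p_ins ** n_ins) * (p_del ** n_del) * (p_rep ** n_rep)
```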
3.7 Determining the Number of Entities

In the development up until now, I have considered the number of authors A to be given, when in practice this needs to be estimated. One of the contributions of this work is an unsupervised method for determining the number of entities. I propose a novel approach that avoids searching explicitly over the possible numbers of author entities and instead adapts the number within the sampling framework.

3.7.1 Basic Inference With Gibbs Sampling

I first describe a novel but simple Gibbs sampling algorithm for iteratively sampling the values of the hidden group and entity labels for each reference, conditioned on the existing labels of all other references. Eq. (3.5), Eq. (3.6) and the attribute conditional from Section 3.5 form the basis of this algorithm. I first sample a group label for each reference according to Eq. (3.5). Next, I sample an entity label for each reference according to Eq. (3.6). The difference for the entities is that the number of entity labels is unknown and needs to be inferred by the algorithm, so I either choose an existing entity label or, alternatively, a hitherto unused one. For a new entity label, the observed occurrence count C^{AT}_{(-i)at} is 0, but the hyperparameter \beta ensures a non-zero probability of a new label being chosen. Also, the attribute v_a for a new entity is unknown, so I use a fixed value for the probability P(r_i | v_a) for a new entity a, which controls how frequently new entity labels are sampled. Once all the entity labels are sampled, in the third step the attribute values are sampled for each of the existing entities according to the conditional distribution from Section 3.5. The iterations continue until convergence. There is a connection between this flavor of Gibbs sampling inference for the number of entities and the Dirichlet process, which I describe in the next subsection.
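Before turning to that connection, here is a compressed sketch of one sweep of this basic sampler. The data structures and helper p_ref (which evaluates P(r_i | v_a) for entity index a, or the fixed value p_new for a brand-new entity) are hypothetical, and the attribute-resampling third step is omitted for brevity.

```python
import numpy as np

def gibbs_sweep(refs, doc_of, z, a, C_DT, C_AT, alpha, beta, p_ref, p_new, rng):
    """One sweep of the Sec. 3.7.1 sampler: for each reference, resample
    its group label by Eq. (3.5), then its entity label by Eq. (3.6),
    allowing a previously unused entity label."""
    T = C_DT.shape[1]
    for i in range(len(refs)):
        d = doc_of[i]
        # Remove reference i from the counts (the "all but the i-th" counts).
        C_DT[d, z[i]] -= 1
        C_AT[a[i], z[i]] -= 1
        A = C_AT.shape[0]
        # Eq. (3.5): resample the group label given the current entity a[i].
        p_z = ((C_DT[d, :] + alpha) / (C_DT[d, :].sum() + T * alpha) *
               (C_AT[a[i], :] + beta) / (C_AT.sum(axis=0) + A * beta))
        z[i] = rng.choice(T, p=p_z / p_z.sum())
        # Eq. (3.6): resample the entity label given the new group z[i].
        lik = np.array([p_ref(refs[i], e) for e in range(A)])
        w = (C_AT[:, z[i]] + beta) * lik
        w = np.append(w, beta * p_new)        # option: brand-new entity
        e = rng.choice(A + 1, p=w / w.sum())
        if e == A:                            # grow the count matrix
            C_AT = np.vstack([C_AT, np.zeros((1, T), dtype=C_AT.dtype)])
        a[i] = e
        C_DT[d, z[i]] += 1
        C_AT[e, z[i]] += 1
    return C_AT
```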
3.7.2 Relation to the Dirichlet Process

The Dirichlet process was introduced by Ferguson [45] and Antoniak [4] as a non-parametric statistical approach that allows the complexity of the model to grow with the size of the data. In the context of our application, we would like the number of entities to be inferred by the model rather than being a fixed parameter, and we would like the model to accommodate a greater number of entities as the number of references in the data grows. The Dirichlet process can be imagined as a distribution over discrete distributions, and it is used as follows for choosing the number of components in a mixture model. A distribution (or component) is first drawn from the Dirichlet process, the parameters are then sampled from this distribution, and finally the data is drawn using these parameters. Drawing a parallel with the application here, I can sample an entity first, choose the parameters (the attribute) for that entity, and then finally generate the reference using the entity parameters. When the Dirichlet process is integrated out, a clustering effect is observed in the conditional distribution for choosing the nth component given the n-1 previous component draws. The probability of choosing one of the existing components is proportional to the number of times it has been chosen in the previous n-1 draws, while a new component has a nonzero probability of being sampled. In particular, let G_0 be the baseline probability distribution over discrete components and \alpha be a scalar. Then, given the n-1 draws \theta_{1:n-1}, the distribution for the nth component is given by

\theta_n = \begin{cases} \theta_i & \text{with probability } \frac{n_i}{n-1+\alpha} \\ \theta \sim G_0 & \text{with probability } \frac{\alpha}{n-1+\alpha} \end{cases}

where n_i is the number of times \theta_i has occurred in \theta_{1:n-1}.

Exact inference is intractable in the Dirichlet process mixture model, but approximate inference techniques have been proposed [81, 19]. Of particular interest is the Gibbs sampling strategy proposed by Neal [81]. This algorithm iteratively samples the component label a_i for the ith data object r_i from the conditional distribution given the other labels:

P(a_i = k \mid \mathbf{r}, \mathbf{a}_{-i}, \alpha) \propto P(a_i = k \mid \mathbf{a}_{-i}, \alpha) \, P(r_i \mid \mathbf{r}_{-i}, \mathbf{a}_{-i}, a_i = k)    (3.7)

For an existing component k,

P(a_i = k \mid \mathbf{a}_{-i}, \alpha) = \frac{C^{A}_{(-i)k}}{N - 1 + \alpha}    (3.8)

where C^{A}_{(-i)k} is the number of previous assignments to the kth component, not counting the ith assignment. For a component k that has not been used before,

P(a_i = k \mid \mathbf{a}_{-i}, \alpha) = \frac{\alpha}{N - 1 + \alpha}    (3.9)

We may imagine LDA-ER as the Dirichlet process mixture model augmented with a group structure above it that enables it to capture relations between the components or entities. In LDA-ER, a group z_i = t is first sampled for the ith reference from the distribution over groups for the document, and then an entity is sampled from it. In the Dirichlet process, any previously existing entity may be chosen in this step depending on its prior counts. But in LDA-ER, the choice is controlled by the sampled group t: entities that have previously been associated with this sampled group are much more likely to be chosen. This distinction allows LDA-ER to model relations between entities. As in the Dirichlet process, a new entity may alternatively be selected in LDA-ER. However, this new entity now becomes associated with group t and may be chosen for future references from this group. This difference is clearly observable from the conditional distributions in Eq. (3.8) and Eq. (3.6). While the probability of choosing the kth entity in Eq. (3.8) depends on C^{A}_{(-i)k}, the number of previous occurrences of that entity, in Eq. (3.6) it depends on C^{AT}_{(-i)at}, the number of joint occurrences of group t and entity a. This coupling of the group and entity labels distinguishes the LDA-ER model from the Dirichlet process mixture model.
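The clustering effect of the integrated-out Dirichlet process (the conditionals in Eqs. (3.8) and (3.9)) is easy to simulate. The following self-contained sketch is illustrative only.

```python
import random

def crp_labels(N, alpha, seed=0):
    """Chinese-restaurant-style draws from an integrated-out Dirichlet
    process: the n-th item joins existing component k with probability
    n_k / (n - 1 + alpha) and opens a new component with probability
    alpha / (n - 1 + alpha)."""
    rng = random.Random(seed)
    counts, labels = [], []
    for n in range(1, N + 1):
        r = rng.uniform(0.0, n - 1 + alpha)
        acc = 0.0
        for k, c in enumerate(counts):
            acc += c
            if r < acc:
                break
        else:
            k = len(counts)          # open a new component
            counts.append(0)
        counts[k] += 1
        labels.append(k)
    return labels
```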
3.7.3 Block Assignment for Entity Resolution

As has been noted in the case of naive Gibbs sampling for inference in the Dirichlet process mixture model [19], iteratively estimating the group and entity label for each reference separately, as described in Sec. 3.7.1, can be prohibitively slow. I now describe a novel algorithm that overcomes this problem by reassigning the entity labels of a whole set of references at the same time. This achieves an agglomerative clustering effect on the references. Observe that, for any assignment of entity labels to references, each entity label defines a cluster: all references that have this entity label belong to this cluster. Sampling a new label for each reference separately is equivalent to an individual reference migrating from one cluster to another. Agglomerative clustering is significantly faster, since pairs of clusters merge into one. I achieve the same effect with the new sampling algorithm that I propose. In addition, I allow existing clusters to split. The conditional probabilities for these choices for any particular entity cluster, given the entity and group labels for all other references, are derived from the joint distribution in Eq. (3.2). As in traditional Gibbs sampling, these probabilities then form the transition probabilities in a Markov process.

I define a cluster by picking an author label j and considering the set s of reference indices that have j as their author label: s = {i | a_i = j}. I assign new author labels to all references indexed by cluster s simultaneously. In general, the number of possible author assignments to s is exponential in |s|, and it is virtually impossible to enumerate all these different probabilities for sampling. Instead, the algorithm restricts the space of candidates such that the cluster of references assigned to a particular author label may (a) merge with a cluster currently assigned to another author label, (b) stay unchanged, or (c) split, with a part assigned to a hitherto unassigned author label j'. Case (a) is similar to two author clusters merging, and the number of authors is effectively decreased by one. In case (c), an author cluster splits into two, and the number of authors is effectively increased by one. However, the number of possible partitions of s into j and j' is still 2^{|s|}. The simple but restricted solution that I use is splitting off the set that last merged into label j via option (a).

I first consider assigning a single author label to all of cluster s. The full conditional distribution I need to derive is P(a_s = i | z, a_{-s}, r), the probability that all the labels a_s in cluster s are set to i, conditioned on all references and group labels and all other author labels. Note that

P(a_s = i \mid \mathbf{z}, \mathbf{a}_{-s}, \mathbf{r}) = \frac{P(a_s = i, \mathbf{a}_{-s}, \mathbf{z}, \mathbf{r})}{P(\mathbf{a}_{-s}, \mathbf{z}, \mathbf{r})} = K \, P(a_s = i, \mathbf{a}_{-s}, \mathbf{z}, \mathbf{r})

where K is the same for all values of i. Note that all group labels z and the remaining author labels a_{-s} are fixed across all assignments i to a_s. The full conditional distribution can be factored so that it has a term for each document, group and reference. Looking at Eqn. (3.3), we can see that the document terms are identical for all i: the C^{DT}_{dt} counts are the same for all group labels t and documents d, since all group labels are held fixed. On the other hand, the reference terms matter only for those references indexed by s, since only they are assigned new author labels:

term_R = \prod_{j \in s} P(r_j \mid v_i)

The group term, however, is nontrivial and needs careful consideration. From Eqn. (3.4),

term_G = \prod_{t=1}^{T} \frac{\prod_a \Gamma(\beta + C^{AT}_{(s)at} + C^{AT}_{(-s)at})}{\Gamma(A\beta + C^{AT}_{(s)\cdot t} + C^{AT}_{(-s)\cdot t})}

where C^{AT}_{(s)at} is the number of times author a and group t have been jointly assigned to references in s, and C^{AT}_{(-s)at} is the number of such assignments outside s. Let z_s be the set of groups currently assigned to the references indexed by s. For groups t \notin z_s and authors a \neq i, the count C^{AT}_{(s)at} is 0 and the corresponding terms are independent of the assignment i. Therefore,

term_G = K \prod_{t \in z_s} \frac{\Gamma(\beta + C^{AT}_{(s)it} + C^{AT}_{(-s)it})}{\Gamma(A\beta + C^{AT}_{(s)\cdot t} + C^{AT}_{(-s)\cdot t})}
  = K' \prod_{t \in z_s} \frac{(\beta + C^{AT}_{(-s)it} + C^{AT}_{(s)it} - 1) \cdots (\beta + C^{AT}_{(-s)it})}{(A\beta + C^{AT}_{(-s)\cdot t} + C^{AT}_{(s)\cdot t} - 1) \cdots (A\beta + C^{AT}_{(-s)\cdot t})}

Denoting

T(t, i) = \prod_{n=1}^{C^{AT}_{(s)it}} \left( \beta + C^{AT}_{(-s)it} + C^{AT}_{(s)it} - n \right)    (3.10)

T(t, \cdot) = \prod_{n=1}^{C^{AT}_{(s)\cdot t}} \left( A\beta + C^{AT}_{(-s)\cdot t} + C^{AT}_{(s)\cdot t} - n \right)

the group term can be written as

term_G \propto \prod_{t \in z_s} \frac{T(t, i)}{T(t, \cdot)}

Putting everything together, the conditional distribution can be written as

P(a_s = i \mid \mathbf{z}, \mathbf{a}_{-s}, \mathbf{r}) \propto \prod_{t \in z_s} \frac{T(t, i)}{T(t, \cdot)} \prod_{j \in s} P(r_j \mid v_i)    (3.11)
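To ground Eq. (3.11), the following sketch scores one candidate label i for a whole cluster s using the T(t, i)/T(t, .) products. All names are hypothetical, the merge/split proposal machinery is omitted, and the count structures are assumed to be plain dictionaries.

```python
import math

def log_block_score(i, s, C_AT_in, C_AT_out, beta, A, log_p_ref):
    """Unnormalized log of Eq. (3.11) for assigning author label i to all
    of cluster s. C_AT_in[t]: number of references in s with group label
    t; C_AT_out[a][t]: author/group counts outside s (zero if absent);
    log_p_ref(j, i) = log P(r_j | v_i)."""
    score = 0.0
    for t, c_in in C_AT_in.items():
        out_i = C_AT_out.get(i, {}).get(t, 0)
        out_all = sum(row.get(t, 0) for row in C_AT_out.values())
        for n in range(1, c_in + 1):
            score += math.log(beta + out_i + c_in - n)        # T(t, i)
            score -= math.log(A * beta + out_all + c_in - n)  # T(t, .)
    # Reference likelihood term: prod_{j in s} P(r_j | v_i).
    score += sum(log_p_ref(j, i) for j in s)
    return score
```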
An Interpretation of Block Assignment: Here I show how the terms in this conditional probability can be rearranged so that the result makes intuitive sense. Let j be an index into cluster s and let t_j be the group label for that reference. Also, consider cluster s to be an ordered set and denote by s