ABSTRACT

Title of Dissertation: WORD SENSE DISAMBIGUATION WITHIN A MULTILINGUAL FRAMEWORK

Mona Talat Diab, Doctor of Philosophy, 2003

Dissertation directed by: Professor Philip Resnik, Department of Linguistics & UMIACS

Word Sense Disambiguation (WSD) is the process of resolving the meaning of a word unambiguously in a given natural language context. Within the scope of this thesis, it is the process of marking text with explicit sense labels. What constitutes a sense is a subject of great debate. An appealing perspective aims to define senses in terms of their multilingual correspondences, an idea explored by several researchers, Dyvik (1998), Ide (1999), Resnik & Yarowsky (1999), and Chugur, Gonzalo & Verdejo (2002), but to date it has not been given any practical demonstration. This thesis is an empirical validation of these ideas of characterizing word meaning using cross-linguistic correspondences. The idea is that a word meaning, or word sense, is quantifiable to the extent that it is uniquely translated in some language or set of languages. Consequently, we address the problem of WSD from a multilingual perspective; we expand the notion of context to encompass multilingual evidence. We devise a new approach to resolve word sense ambiguity in natural language, using a source of information that had never before been exploited on a large scale for WSD.

The core of the work presented builds on exploiting word correspondences across languages for sense distinction. In essence, it is a practical and functional implementation of a basic idea common to research interest in defining word meanings in cross-linguistic terms. We devise an algorithm, SALAAM (Sense Assignment Leveraging Alignment And Multilinguality), that empirically investigates the feasibility and the validity of utilizing translations for WSD. SALAAM is an unsupervised approach for word sense tagging of large amounts of text given a parallel corpus (texts in translation) and a sense inventory for one of the languages in the corpus. Using SALAAM, we obtain large amounts of sense-annotated data in both languages of the parallel corpus simultaneously. The quality of the tagging is rigorously evaluated for both languages of the corpora.

The automatic unsupervised tagged data produced by SALAAM is further utilized to bootstrap a supervised learning WSD system, in essence combining supervised and unsupervised approaches in an intelligent way to alleviate the resource acquisition bottleneck for supervised methods. Essentially, SALAAM is extended as an unsupervised approach for WSD within a learning framework; for many of the disambiguated words, SALAAM coupled with the machine learning system rivals the performance of a canonical supervised WSD system that relies on human-tagged data for training.

Realizing the fundamental role of similarity for SALAAM, we investigate different dimensions of semantic similarity as it applies to verbs, since they are relatively more complex than nouns, which are the focus of the previous evaluations. We design a human judgment experiment to obtain human ratings of verbs' semantic similarity. The obtained human ratings serve as a reference point for comparing different automated similarity measures that crucially rely on various sources of information. Finally, a cognitively salient model integrating human judgments in SALAAM is proposed as a means of improving its performance on sense disambiguation for verbs in particular and other word types in general.
WORD SENSE DISAMBIGUATION WITHIN A MULTILINGUAL FRAMEWORK

by

Mona Talat Diab

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland at College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2003

Advisory Committee:
Professor Philip Resnik, Chairperson/Advisor
Professor Bonnie Dorr
Professor Paul Pietroski
Professor Amy Weinberg
Professor David Yarowsky

© Copyright by Mona Talat Diab 2003

DEDICATION

In loving memory of my most beloved father, Dr. Talat Diab (June 13, 1941 - January 27, 2003), who lived a life of dignity, courage, wisdom, patience and above all love, and who will remain my personal hero and inspiration for ever. May God bless his soul, Amen.

ACKNOWLEDGEMENTS

This thesis would not have materialized without the help and support of many people. I believe I am very blessed to be surrounded by such needed encouragement. So here it goes: I would like to start with acknowledging my parents' and my brother's love, trust, guidance, support and confidence in me throughout the years of my studies and research; without them, I doubt that I would have achieved what I have today. I would like to thank Philip Resnik, my research advisor, for his understanding and his support through all the good times and especially through the bad times. He was always there with great advice and a listening ear. I am grateful to my thesis committee members for all their insightful comments and remarks about this research. I would also like to acknowledge the constant encouragement I received from Doug Oard and Mari Olsen. I would like to thank Peter Bock for grounding me in scientific methodology. I am very grateful to Julio Gonzalo and Irina Chugur for their help with the Spanish data. I would like to acknowledge the support afforded to me by Thierry Paquet from the University of Rouen during a very tough period of my life. Thanks a lot, clippers, for being there for me all the time; special thanks to Nizar Habash, Okan Kolak, David Zajic, Clara Cabezas, Grazia Lassner, Michael Nossal, Rebecca Hwa, Fazil Ayan, Dina Demner, Adam Lopez and last but not least to Laura Bright. I would also like to acknowledge the support of Louiqa Rachid; thanks for listening. I am eternally grateful to Mohamed Zahran, Kobi Snitz and Nizar Habash for help with thesis formatting and editing. Last but not least, my dearest circle of friends who were always supporting me and showering me with their love and support through all the good times and bad times, may God bless you all: Selda Kapan, Doaa Taha, Muna Yousef, Khaled El Gindy, Margaret Zaknoen, Mandy Chan, Amany El Anshasy, Dahliah Hawary, Tamer Nadeem, Ingy Bakir, Heba Zaghloul, Hela Zouari, Tamer Sharnouby, Svend White, Shabana Mir, Nabilah Haque, Mike Sanford, Mustafa Tikir, Burcu Ayan, Betul Attalay, Anuradha Shenoy and Terry Riopka.

Finally, this research was partly supported by National Science Foundation grant EIA0130422, DARPA/ITO Contract N66001-97-C-8540, DARPA/ITO Cooperative Agreement N660010028910, Department of Defense contract RD-02-5700, and ONR MURI Contract FCPO.810548265.

TABLE OF CONTENTS

List of Tables xiii
List of Figures xvii

1 Thesis Introduction 1
1.1 Introduction 1
1.2 Research contributions 6
1.3 Thesis Layout & Brief Overview of Chapters 7

2 Related Work 10
2.1 Introduction 10
2.2 Pre-SENSEVAL WSD: Historical Perspective 11
2.3 The SENSEVAL Era 13
2.4 Multilingual Approaches to WSD 17
2.4.1 Word Sense Disambiguation Using Statistical Methods: Brown et al. 18
2.4.2 Using Bilingual Materials to Develop Word Sense Disambiguation Methods: Gale et al. 22
2.4.3 Word Sense Disambiguation Using a Second Language Monolingual Corpus: Ido Dagan & Alon Itai 25
2.4.4 Resolving Translation Ambiguity Using Non-Parallel Bilingual Corpora: Kikui 32
2.4.5 Summary 34

3 Word Sense Tagging Using Parallel Corpora: SALAAM 36
3.1 Introduction 36
3.2 Motivation 38
3.3 Problem Statement 39
3.4 Relevant Background 40
3.4.1 A Translational Basis for Semantics. Helge Dyvik 41
3.4.2 Distinguishing Systems and Distinguishing Senses: New Evaluation Methods for Word Sense Disambiguation. Philip Resnik & David Yarowsky 42
3.4.3 Polysemy and Sense Proximity in the SENSEVAL-2 Test Suite. Irina Chugur, Julio Gonzalo, Felisa Verdejo 46
3.4.4 Cross-lingual Sense Discrimination: Can it work? Nancy Ide 49
3.4.5 Discussion 51
3.5 Hypothesis 52
3.5.1 General Hypothesis Statement 54
3.6 Method 55
3.6.1 General Method Description 55
3.6.2 Required Resources 57
3.6.3 Detailed Method Description 58
3.6.4 Evaluation Metrics 67
3.7 Evaluation 70
3.7.1 Materials 71
3.7.2 Tools 77
3.7.3 Sense Selection and Similarity Measure 78
3.7.4 Development and Testing Environment 80
3.7.5 Evaluation Measure 80
3.7.6 Evaluation Parameters 81
3.7.7 Evaluation Factors 82
3.7.8 Evaluation Conditions 83
3.7.9 Experimental Hypotheses 86
3.8 Results 88
3.8.1 Testing Hypothesis 1 88
3.8.2 Testing Hypothesis 2 93
3.8.3 Testing Hypothesis 3 94
3.8.4 Testing Hypothesis 4 97
3.8.5 Testing Hypothesis 5 98
3.8.6 Testing Hypothesis 6 100
3.8.7 Overall results 102
3.9 Discussion 104
3.9.1 Summary of the Results 104
3.9.2 Analysis of Results 105
3.9.3 Precision 107
3.9.4 Recall 109
3.9.5 Coverage 110
3.9.6 Complementarity with Other WSD Systems 111
3.9.7 Evaluation of Target Language Tagging 111
3.10 Summary 113

4 Extensions to SALAAM 114
4.1 Introduction 114
4.2 Using Human Translations: Naturally-Occurring Parallel Corpora 115
4.2.1 Introduction 115
4.2.2 Motivation 115
4.2.3 Hypothesis 116
4.2.4 Evaluation 117
4.2.5 Discussion 126
4.2.6 Summary 128
4.3 Target Language Tagging Evaluation 129
4.3.1 Introduction 129
4.3.2 Motivation 129
4.3.3 General Hypothesis 130
4.3.4 Required Resources 131
4.3.5 Projected Sense Tagging on Arabic Data 132
4.3.6 Projected Sense Tagging on Spanish Data 139
4.3.7 General Discussion 157
4.3.8 Summary 158
4.4 Feasibility of bootstrapping a WordNet style ontology for Arabic 159
4.4.1 Introduction 159
4.4.2 Evaluation 161
4.4.3 Levels of representation 165
4.4.4 Summary 168

5 Exploration into Bootstrapping Supervised WSD 169
5.1 Introduction 169
5.2 Motivation 172
5.3 Related Work 173
5.4 Empirical Layout 175
5.5 University of Maryland Supervised Sense Tagging system (UMSST) 176
5.6 Bootstrapping Evaluation 179
5.6.1 Test data 180
5.6.2 Hand-Tagged Training Data 181
5.6.3 Gold Standard 183
5.6.4 SALAAM Training Data Corpora 184
5.6.5 SALAAM-tagged Training Data Creation 185
5.6.6 Experimental Conditions 185
5.6.7 Evaluation Metric 188
5.6.8 General Experimental Hypothesis 188
5.7 Results 189
5.8 Discussion 191
5.8.1 Analysis of factors affecting PR 196
5.9 Combining factors 212
5.10 Summary 214

6 Facets of Similarity 230
6.1 Introduction 230
6.2 Motivation 232
6.3 Models of Verb Similarity 234
6.3.1 Class 1: Taxonomic Models 235
6.3.2 Class 2: Distributional Co-occurrence Model 239
6.3.3 Class 3: Semantic Structure Model 240
6.4 Human Judgment Experiment 244
6.4.1 Participants 244
6.4.2 Materials 245
6.4.3 Conditions 247
6.4.4 Procedure 248
6.4.5 Results 250
6.4.6 Discussion 251
6.5 Application to SALAAM 255
6.5.1 Integrating Human Ratings in SALAAM: A Cognitive Based Feasibility Study 256
6.6 Summary 266

7 Conclusions & Future Directions 268
7.1 Conclusions 268
7.2 Thesis Problems & Limitations 274
7.3 Research Contributions 275
7.4 Future Directions 277

Bibliography 292

LIST OF TABLES

2.1 Summary of Systems' Required Resources 34
2.2 Summary of Systems' Evaluation 35
3.1 Relative sizes of corpora used for evaluating SALAAM on SV2AW test set 75
3.2 SALAAM performance results on English source SV2AW test data in the default conditions 89
3.3 SALAAM performance with pre-alignment French target pseudo-translation merge 93
3.4 SALAAM performance in default condition vs. intralanguage post-alignment merge condition at MAX sense selection criterion 95
3.5 SALAAM performance in intralanguage post-alignment merge conditions with sense selection criterion MAX vs. THRESH 97
3.6 SALAAM performance for conditions 4 where evidence is obtained from monolingual intralanguage pseudo-translation merge vs. evidence obtained from interlanguage pseudo-translation intersection merge in condition 5 99
3.7 SALAAM performance for conditions 4 where evidence is obtained from monolingual intralanguage pseudo-translation merge vs. evidence obtained from interlanguage pseudo-translation union merge in condition 6 101
4.1 Relative sizes of the English side of corpora used in HT Evaluations 119
4.2 SALAAM Results on SV2AW for MT & HT parallel corpora independently 124
4.3 SALAAM results using both HT and MT for augmenting the test corpus 125
4.4 Accuracy results of projected tagging onto Arabic SV2AW data measured against English WN17pre sense definitions 138
4.5 Relative sizes of corpora used in projected Spanish tagging evaluation 143
4.6 Results in % for RBL, and the different evaluation conditions of test set SPSV2AW 150
5.1 Comparative results obtained by Mihalcea's bootstrapping system when training an instance-based learning supervised WSD system using both human tagged data and GenCor tagged data as training examples 176
5.2 Test items for SV2LS-test 216
5.3 Characteristics of hand-tagged training data for the SENSEVAL2 English Lexical Sample task 217
5.4 Current evaluation gold standard precision results obtained by UMSST-human 218
5.5 SALAAM-tagged training corpora sizes 219
5.6 Precision scores and PR of UMSST-SALAAM and UMSST-human on SV2LS-test 220
5.7 PRs of the best individual conditions using UMSST-SALAAM training data on the top 16 test nouns in Table 5.6: 1 is condition MT+HT+SV2LS-TR THRESH SP; 2 is condition HT+SV2LS-TR THRESH ML I; 3 is condition HT+SV2LS-TR THRESH ML U; 4 is condition MT+HT+SV2LS-TR MAX ML I; and 5 is condition HT+SV2LS-TR MAX ML I 221
5.8 Precision % scores obtained for SALAAM-SV2LS-TR and UMSST-SALAAM 222
5.9 List of test nouns with their corresponding number of senses and sense contexts in the SALAAM-tagged training data 223
5.10 Test nouns, the corresponding number of senses, perplexity values and PR 224
5.11 Test nouns with their corresponding Semantic Translation Entropy values and performance ratios, PR 225
5.12 Test nouns with their corresponding SDC values and PRs 226
5.13 Test noun items with the absolute difference between SALAAM-tagged perplexity and test data perplexity, PerpDiff, against the performance ratios, PR 227
5.14 Test nouns with SDC, manually grouped similar senses, and performance ratios, PR 228
5.15 Characteristics of the nouns stress and church 229
6.1 Aspectual features determining aspectual class for verbs 245
6.2 The final verb pairs used in the human judgments experiment 249
6.3 Comparing the different automated similarity measures to the two human conditions 251
6.4 Regression Coefficients for the automatic similarity measures 263
6.5 SALAAM performance results with different similarity measure conditions 264
7.1 Summary of Multilingual WSD Systems' Required Resources 271
7.2 Summary of Multilingual WSD Systems' Evaluation 271

LIST OF FIGURES

3.1 Common Senses Shared Between Polysemous Words 53
3.2 Flow chart demonstrating process flow in SALAAM method 58
3.3 A sample token alignment in a parallel corpus 59
3.4 Tokens aligned in a parallel corpus 60
3.5 Aligned token instances from target to source 62
3.6 Target word types and their corresponding source token sets 63
3.7 Source type sets for the target words RIVE and BANQUE 63
3.8 Sense Tagged Type Source Sets 66
3.9 Sense Tagged Token Source Sets 66
3.10 Projecting source inventory senses onto target language instances 68
3.11 An excerpt from the noun database of WN17pre 73
3.12 SALAAM performance precision & recall results in the default conditions plotted against state-of-the-art WSD systems on the same test set SV2AW 90
3.13 SALAAM F-Measure results in the default condition measured against state-of-the-art WST systems on test set SV2AW 92
3.14 SALAAM F-Measure results in the default condition measured against MAX intralanguage condition for the three languages: AR, FR, SP 96
3.15 SALAAM F-Measure results on test set SV2AW in the highest yielding conditions depicted against state-of-the-art WSD systems 103
4.1 An example of a transliterated Arabic sentence and its tokenization 134
4.2 WN17pre entries for evening 137
4.3 English WN17pre entries for care 162
4.4 Metonymic sense of tea in WN17pre 163
4.5 English WN17pre senses for ceiling 163
4.6 Homonymic sense for tower in WN17pre 164
4.7 WN17pre senses for experience 165
5.1 Sense distribution correlations across different nouns in the test data and hand-tagged training examples 182
5.2 Trend lines of the perplexity measure for test data and hand-tagged training data 183
5.3 Comparison between Mihalcea's results and SALAAM results on the same test set 192
5.4 Trend lines for the precision obtained by SALAAM-SV2LS-TR and UMSST-SALAAM 193
5.5 A plot of the distribution of the senses' contexts of bar and day 199
5.6 A plot of the SDC and performance ratio on the 29 nouns 204
5.7 A comparative view of the different perplexity measures in SALAAM-tagged training data and the test data for the 29 nouns 206
6.1 Random sample of verb source type sets yielded by SALAAM 267
7.1 SALAAM F-Measure results depicted against state-of-art WSD systems 270
7.2 Comparison between Mihalcea's results and SALAAM results on the same test set 273

Chapter 1

Thesis Introduction

1.1 Introduction

Ambiguity is an inherent characteristic of natural language, permeating its various levels of representation. From a human language processing perspective, ambiguity is not a severe problem. However, from a machine processing perspective, the story is quite different. Resolving ambiguity in natural language has been of central interest to researchers since the early 1950s. In particular, Word Sense Disambiguation (WSD) occupied center stage in the early work on Natural Language Processing (NLP). Bar-Hillel (1960) claimed that word sense ambiguity is the main impediment facing the field of Machine Translation (MT); in his famous treatise on MT, he describes the problem of WSD as insurmountable, a position that led to the abandonment of MT altogether in the late 60s [33].

Fortunately, we have come a long way from the 1960s. With the on-going surge in machinery allowing for the development of sophisticated techniques and algorithms, WSD is experiencing a revival of interest, especially with the belief that it has the potential of improving several central tasks in NLP. Owing to WSD's acknowledged significance in the field of computational linguistics, the community organized the first SENSEVAL, which took place three years ago. It was succeeded by SENSEVAL 2 in 2001. As the name indicates, SENSEVAL is a defined protocol for developing, testing and evaluating WSD systems. SENSEVAL provides the opportunity, for the first time, for researchers working in the area of WSD to investigate common material, share experiences and exchange ideas within a defined framework. An important contribution of SENSEVAL is the creation of standardized tests and tools for measuring systems' performance and comparing notes.

But what is WSD? WSD is the process of resolving the meaning of a word unambiguously in a given natural language context. It is the process of marking text with explicit sense labels. What constitutes a sense in natural language is a subject of vast debate, both in the areas of lexical semantics and computational linguistics. The study of word meaning is at the core of research in the field of lexical semantics. Researchers such as Cruse (1986), Pustejovsky (1995) and Levin (1990), among others, investigate word meaning within the same language, monolingually, with the goal of quantifying meaning dimensions. An alternative approach is to use cross-linguistic correspondences for characterizing word meanings in natural language. This idea is explored by several researchers, Dyvik (1998), Ide (1999), Resnik & Yarowsky (1999), and Chugur, Gonzalo & Verdejo (2002), but to date it has not been given any practical demonstration. This thesis is an empirical validation of this very notion of characterizing word meaning using cross-linguistic correspondences. The idea is that a word meaning or a word sense is quantifiable to the extent that it is uniquely translated in some language or set of languages.

To date, most large scale WSD methods have defined context for sense selection within a monolingual framework; the evidence for sense choice is typically from within the same language. In this thesis, we address the problem of WSD from a multilingual perspective; we expand the notion of context to encompass multilingual evidence.
We devise a new approach to resolve word sense ambiguity in natural language, using a source of information that had never before been exploited on a large scale for WSD.

The core of the work presented in this thesis builds on exploiting word correspondences across languages for sense distinction. In essence, it is a practical and functional implementation of the basic idea common to the research interest which defines word meanings in cross-linguistic terms. We devise an algorithm that empirically investigates the feasibility and the validity of utilizing translations for WSD. The algorithm presented is an unsupervised approach for word sense tagging (WST) of large amounts of text given a parallel corpus and a sense inventory for one of the languages in the corpus (we use WST and WSD interchangeably throughout the thesis). We refer to the presented algorithm as SALAAM, for Sense Assignment Leveraging Alignments and Multilinguality. A parallel corpus is defined as texts in translation comprising a source language and a target language. The approach is unsupervised inasmuch as it does not require sense-annotated data at the onset.

Availability of automated knowledge resources for different languages is a serious obstacle for the study of language within a computational framework. In a more globalized community, the need for knowledge resources in different languages is ever more pressing. Yet the distribution of tools and resources is asymmetric, with rich languages such as English possessing the lion's share. In this thesis, we address this issue with a new technique for leveraging the rich languages to help create resources for poor languages with minimal automated linguistic resources. Within the scope of this thesis, a language is defined as rich or poor depending on the amount of automated resources available for it.

Furthermore, we investigate the impact of bootstrapping supervised WSD systems with large amounts of noisy sense-annotated data produced by SALAAM's exploitation of multilingual evidence. Typically, in the area of bootstrapping supervised systems, researchers have relied on clean knowledge resources to create training examples. Given that cleanly tagged data is hard to come by for rich languages, let alone poor languages, this thesis explores to what extent bootstrapping off of noisy data is a feasible enterprise. In the process, we create a novel unsupervised learning technique for WSD that does not rely on the availability of manually tagged data, whose absence is a severe bottleneck for canonical supervised learning approaches addressing sense ambiguity.

Acknowledging the central role played by similarity in the field of WSD, we examine different facets of quantifiable semantic similarity. We are very interested in how such similarity measures compare against human similarity judgments. So far in the thesis, the focus is on nouns due to their immediate relevance to several application areas such as Information Retrieval and Information Extraction, but we realize the complexity of annotating verbs (none of the algorithms presented in this thesis has an inherent restriction on part of speech). Owing to the endemic complex multi-dimensional nature of verbs, they present themselves as alluring entities for exploring the various dimensions of semantic similarity; moreover, all WSD systems encounter problems when dealing with verbs. We posit that the crux of the problem lies primarily in the similarity measure at the core of the WSD system.
Most approaches to similarity within WSD are monolithic in essence, relying on one source of information. In this thesis, we empirically establish the merit of combining evidence from different complementary sources of information, based on a cognitively grounded functional study of verb similarity and how it relates to different automated similarity measures, thereby serving as the motivation for enhancing similarity measurement within the scope of WSD in general and SALAAM in particular.

1.2 Research contributions

This thesis contributes the following to the field of computational linguistics:

- The contribution of a novel, robust unsupervised approach to WSD which constitutes a significant departure from the traditional monolingual approaches. The approach is a validation of a sound linguistic assumption that meaning characterizations can be captured cross-linguistically. We contribute a novel multilingual perspective on the notion of context for addressing the problem of WSD. The context scope is no longer confined monolingually. (See Chapter 3 for a description of the basic method; see Section 4.2 of Chapter 4 for a discussion of the robustness of the approach.)

- The provision of a detailed description of an end-to-end, fully operational, modularly designed system for producing large amounts of good quality sense-annotated data in both source and target languages of a parallel corpus. (See Chapter 3, Section 3.6, as well as Chapter 4, Section 4.3.)

- The investigation of the quality of automatic sense annotations for a language with few computerized linguistic resources, such as Arabic. (See Chapter 4, Section 4.3.5.)

- The provision of an operational, end-to-end, robust automatic framework for testing the quality of projected automatic sense annotations for Spanish. (See Chapter 4, Section 4.3.6.)

- The examination of the feasibility of automatically bootstrapping a WordNet style ontology for Arabic via projected sense tags from English. (See Chapter 4, Section 4.4.)

- The investigation of the feasibility of bootstrapping WSD within a supervised paradigm using noisy data, based on the results obtained using the novel unsupervised method for WSD described in Chapter 3, and, simultaneously, the presentation of a novel unsupervised learning technique for WSD. (See Chapter 5.)

- The provision of a novel experimental design for attaining human judgments on semantic similarity for verb pairs using contextual and non-contextual data. The thesis compares the results obtained by several automated semantic similarity measures against the human similarity ratings. (See Chapter 6.)

- The utilization of insights derived from the human similarity judgment experiment to motivate an operational, cognitively based framework for utilizing similarity in a novel way for improving WSD results obtained for verbs. (See Chapter 6.)

1.3 Thesis Layout & Brief Overview of Chapters

The remainder of this thesis comprises 6 chapters, briefly described as follows:

- Chapter 2 briefly surveys earlier work in the field of WSD. We look at three different components: the history of the field of WSD (Pre-SENSEVAL), the SENSEVAL Era, and earlier related work on systems addressing the problem of WSD from within a multilingual framework, which are all, incidentally, pre-SENSEVAL.

- Chapter 3 presents the underlying hypothesis driving the research theme of the thesis.
In this chapter, we describe the relevant background which lends preliminary theoretical support to the developed approach. We then present an unsupervised method for word sense tagging (SALAAM) based on multilingual evidence. We describe the method and system in detail and present a rigorous evaluation of the approach against state-of-the-art WSD systems. SALAAM is evaluated using machine translated parallel corpora (pseudo-translated corpora). The evaluation is confined to the nouns in the source language only.

- Chapter 4 presents further evaluation of the robustness of SALAAM by extending the utilized corpora to naturally-occurring parallel corpora of non-overlapping genres. In this chapter, projected sense tagging on the translation language of a parallel corpus is evaluated. We investigate the quality of the projected sense tagging on both Arabic and Spanish. Furthermore, we explore the potential of automatically bootstrapping an ontology for Arabic.

- Chapter 5 investigates the feasibility of bootstrapping a supervised WSD system using noisy data as training examples. In this chapter, data is obtained from the SALAAM sense tagging system for the source language, English; such data is used to train a machine learning algorithm for WSD. Furthermore, in this chapter, we explore different factors affecting the bootstrapping performance by comparing the performance of the learning system when trained on SALAAM-tagged data against its performance when trained on manually tagged data.

- Chapter 6 presents a novel experimental design for obtaining verb semantic similarity judgments. We compare the results obtained by different automated similarity measures against human similarity judgments. We lay out a cognitively based framework for integrating different automated similarity measures in order to approximate human judgments in the similarity component of SALAAM.

- Chapter 7 concludes the thesis with overall observations and lessons learned. A close look at the limitations of the different proposals contributed in the thesis is rendered. We reiterate the contributions of this research to the field of computational linguistics. Finally, we conclude with a peek into the future with some suggested directions.

Chapter 2

Related Work

2.1 Introduction

In the literature, WSD/WST systems typically associate labels with discovered senses; the labels may be words from a different language, sense codes or definitions from an ontology, or artificial codes. WSD, in this view, is a classification problem where the sense labels are the classes to which the WSD/WST process assigns the discovered senses. (When WSD is used to refer to discovering senses without labelling them, it is known as Word Sense Discrimination. Word Sense Discrimination is a completely unsupervised approach that is not subject to label granularity restrictions as WSD/WST is. For the purposes of this thesis, we are not concerned with Word Sense Discrimination.)

2.2 Pre-SENSEVAL WSD: Historical Perspective

WSD within that framework has been a problem of central interest to computer scientists in general, and Artificial Intelligence practitioners in particular, since the early fifties (for an excellent survey, see the paper by Ide & Véronis 1998 [33]). In fact, with the inception of the field of computer science, the question of addressing ambiguity in language assumed center stage; after all, the goal was to create machines that understand language the way humans do.

In the early fifties, the effort was devoted to WSD within a Machine Translation (MT) framework.
The earliest approach was by Weaver in 1949 [80]. He argued the need for WSD in MT, as described in his Memorandum. He investigated the size of context needed to resolve the ambiguity of a word and concluded that there is no difference in disambiguation power between a context of two words and the context of the entire sentence. His observations were further confirmed by several other researchers in the 70s and 80s. A very important notion supported by researchers in the field was that the possible senses of a polysemous word are bound by the domain of the document the word appears in. Interestingly, over forty years later, Gale et al. [27], further emphasized by Yarowsky [85], use this same idea of a single sense per discourse in an axiomatic manner, guiding, in their view, the sense distribution of polysemous words in documents.

Most of the methods that tackled the problem of WSD afterwards were AI (Artificial Intelligence) based. In the 70s and 80s, the majority of the approaches were grounded in language understanding theories where the systems tried to model deep knowledge of linguistic theory, especially syntax and semantics. A wave of word representation techniques was developed in an attempt to capture the relevant facets of meaning in order to solve word ambiguity issues. It was during this period of time that Semantic Networks by Quillian [65], where each word is represented as a node in an interconnected web; Frames by Hayes [30], where words are represented as entities with their roles and their connections to other words in the sentence explicitly defined; and Preference Semantics by Wilks [81] appeared on the scene of WSD. These three methods formed the crux of symbolic approaches to WSD in that era. These techniques contrasted with more data driven solutions of the time. For example, Small et al. [77] developed intricate representations of words referred to as word experts. Such experts were complex in nature, but the approach in general constituted a departure from the rule dependent perspective to a more word oriented one. His methods were similar in spirit to those of Kelly & Stone [38], who also focused on word oriented approaches; but whereas Small's aim was broad natural language understanding, Kelly & Stone had the specific intent of word sense disambiguation from the onset.

However, such methods lost their appeal by the late 80s due to the intensive labour involved in the creation of the required intricate knowledge representations, which bound the number of words and senses that could undergo analysis and disambiguation. This coincided with a surge in machinery and the beginnings of the availability of Machine Readable Dictionaries (MRDs) and lexica. One of the first attempts to utilize such resources was Lesk [43]. He devised an algorithm that chooses the appropriate sense of a polysemous word by calculating the word overlap between the context sentence of the word in question and the word's definition in an MRD. Most of the algorithms that followed from then onwards were in tune with that spirit, more corpus based and knowledge based in nature, where the role of the surrounding monolingual context is paramount in providing the needed evidence for a word sense. To date, most existing algorithms are a variant on the Lesk algorithm in their view of context and resolution of the WSD problem in general.
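To make the overlap computation concrete, the following is a minimal sketch of a Lesk-style sense chooser. It illustrates the general technique rather than Lesk's original implementation; the whitespace tokenization, the optional stopword list, and the dictionary-as-mapping format are simplifying assumptions introduced for the illustration.

```python
def lesk_style_sense(word, context_sentence, mrd_definitions, stopwords=frozenset()):
    """Pick the sense of `word` whose dictionary definition shares the most
    words with the sentence the word occurs in (Lesk-style gloss overlap).

    mrd_definitions: hypothetical mapping from a sense label to its definition,
    e.g. {"bank_1": "a financial institution that accepts deposits ...",
          "bank_2": "sloping land beside a body of water ..."}
    """
    context = {t for t in context_sentence.lower().split() if t not in stopwords}
    best_sense, best_overlap = None, -1
    for sense, definition in mrd_definitions.items():
        gloss = {t for t in definition.lower().split() if t not in stopwords}
        overlap = len(context & gloss)  # number of shared words
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense
```

In this simple form the context is the single sentence containing the ambiguous word; later variants widen the context window and weight the overlapping words, but the selection principle is the same.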
It was in the 90s that the perspective on WSD shifted to it being regarded as an enabling technology with a lot of prospect for NLP applications, if it were to be resolved once and for all. Yet, with the abundance of algorithms, it became extremely difficult to assess their quality or results. Every algorithm was evaluated on a different test set and with different evaluation criteria and metrics. It became ever more difficult to establish a sense for the state-of-the-art in WSD.

2.3 The SENSEVAL Era

Driven by the lack of common standards for evaluation and the need to assess different systems' performance, while simultaneously getting a feel for the myriad of approaches to WSD, the computational linguistics community decided to create a standardized test bed, or yardstick, through which it can facilitate communication and collaboration among researchers in the field as well as establish a rigorous means through which the community can evaluate the state-of-the-art in the field of WSD [41].

This was the marking of a new era in WSD, the SENSEVAL era. As the name indicates, SENSEVAL is an evaluation framework à la the Information Retrieval type evaluation paradigms such as TREC and MUC, where the community decides and creates standardized data sets and test beds with well defined metrics for state-of-the-art assessment. The inception of SENSEVAL was triggered by a position paper by Resnik and Yarowsky presented at the Special Interest Group on Lexical Semantics (SIGLEX) workshop in DC in 1997. They outlined a set of proposals and discussed different problems faced by practitioners in the field and how they believe these impediments may be overcome [74]. In their article, Resnik & Yarowsky make a set of observations about state-of-the-art WSD by comparing advancements in the field with progress levels achieved by other enabling technologies in NLP. They offer several proposals with the goal of improving the evaluation criteria for automatic WSD and improving the process of acquiring training and testing materials as well as defining sense inventories. They devise an iterative protocol for dealing with major issues that hinder the creation of a standardized benchmark for comparing the performance of WSD systems. The results of their proposals are adopted in both the SENSEVAL 1 and SENSEVAL 2 exercises. The following is an outline of their proposals:

Evaluation criteria

The authors criticize exact match as an evaluation metric, which was the common practice until recently. The problem, they point out, is that exact match is a binary measure. It does not discriminate probabilistically between a tagger that claims ignorance for a sense assignment, for example, and a tagger that assigns an incorrect sense a lower probability; with exact match, both systems are equally penalized. Therefore, they propose an evaluation metric based on a measure of cross-entropy that credits a tagger partially based on the probability assigned to the correct tag. (See Chapter 3 for a more detailed discussion of this measure.)

Protocol for systematic evaluation of WSD systems and sense-tagged data acquisition

They develop a protocol for acquiring large amounts of tagged data systematically and incrementally. Their rationale is based on the stipulation that it is better to tag large amounts of examples for a small number of words, with clear guidelines and extensive in-depth analysis of the tagging task, rather than a small number of examples for a large number of words.
They justify their proposed iterative approach based on four main points:

1. The protocol combines an emphasis on broad coverage with the advantages of evaluating a limited set of words by choosing words that cover a wide range of frequencies, levels of ambiguity, etc.;
2. A small, predefined set of words is more tractable for the manual annotator, as s/he only needs to focus on one word at a time;
3. With a small number of words and a large number of examples, more attention can be dedicated to the specifications and guidelines of the manual annotation process, thereby reducing the number of possible problems;
4. The proposed protocol addresses the needs of both supervised and unsupervised systems.

Based on this proposal and further discussions within the community, the first SENSEVAL was conceived. The organizers of SENSEVAL coordinate the creation of different tasks for different languages. The exercise takes place within a specified window of time with hard deadlines. The ontologies used for the tagging are determined beforehand for each language. So far, there are two types of task for any given language: a Lexical Sample task and an All Words task; not all languages have both tasks. Typically, a Lexical Sample task is one where a number of polysemous words in a large corpus is chosen for tagging by the organizers of the task. The systems are required to tag instances of the chosen words in a specified corpus. The systems' performance is evaluated by the organizers. For instance, in the recent SENSEVAL 2 exercise, the English Lexical Sample task included 29 nouns of different levels of polysemy. The organizers of the task provide trial, training and test data to the participating systems at predefined time intervals. The trial data is for calibrating the participating systems in terms of format issues. The training data provides large amounts of annotated data where the predetermined words are tagged in context. The contexts are typically two to four lines in length. Finally, the participants receive the testing data a number of days before the submission of results to the organizers for evaluation. The All Words task follows the same time line as the Lexical Sample task, but in this case the WSD systems are required to tag all words in a specified corpus, i.e., all content words in running text. Typically no training data is provided for this task. SENSEVAL 3 is bound to take place in mid 2004 with even more languages.

2.4 Multilingual Approaches to WSD

In this section, we focus on previous WSD approaches that address the problem within a multilingual framework, since this is directly related to the topic of this thesis. For a general survey of approaches to WSD, we recommend the article by Ide and Véronis [33]. Several systems have addressed the problem of WSD/WST within a multilingual framework. They are all pre-SENSEVAL. In this section, we will describe four representative approaches that addressed the issue. They all exploit the observation that when a polysemous word in a language (L1) is translated into another language (L2), the senses of the polysemous word in L1 are often translated into distinct L2 words in different contexts. Accordingly, they use lexical translations as a source of sense distinction. This idea has been around since the early 1990s [7, 15, 27]. The four WSD/WST methods described below are statistical corpus-based methods; they differ mainly in their algorithmic approaches and their resource requirements.
The first two methods, Brown et al. (1991) and Gale et al. (1992), only require the availability of parallel corpora, that are sentence and token level aligned. They both use the translation target words as labels for the ambiguous words in the source lan- guage. Both the third and fourth presented approaches, Dagan & Itai (1994) and Kikui (1999), rely on bilingual comparable corpora and bilingual dictionaries. Moreover, the work by Dagan & Itai requires a parser for at least the source language. The first method and the last two methods explicitly aim at improving target word selection in a practical machine translation application environment. They all utilize monolingual context on the source language side and have some way of bridging cross linguisti- cally to the target language side. In the first two methods, the authors use token level alignments in a parallel corpus as a bridge between the two languages; in the latter two studies, the authors use bilingual lexicons. All four systems present results on a handful of data and unfortunately the evaluation metrics are different in each paper, rendering it difficult to compare performance across systems. 2.4.1 Word Sense Disambiguation Using Statistical Methods: Brown et al. One of the earliest WSD studies within a multilingual framework is research by Brown et al. [7]. They present a pilot study where they investigate the impact of adding a sense disambiguation component to a statistical machine translation system [6, 8]. 18 The goal in this study is functional, namely, the improvement of lexical generation for an ambiguous word in a machine translation application. The notion of a sense in the context of translation combines pragmatic uses of words across languages that are not necessarily ambiguous in a source language together with genuine ambiguity such as is the case of a word like bank in English. An example of pragmatic disam- biguation is demonstrated as follows: Choosing the correct translation for the word il in the French sentence il y a une probleme. is considered by the authors a sense disambiguation problem with the choice between translating the French word il as it or he in English. Nonetheless, the ideas that are proposed are very interesting and may be extended to resolving paradigmatic genuine sense ambiguity problems. They obtain an improvement of 37% in the quality of translations. Method Requirements The method requires a parallel corpus and part of speech taggers for both languages of the parallel corpus. Method Description The approach assumes the availability of a statistical machine translation system that creates alignments between words in the English-French parallel corpus. A set of most frequent words is extracted from both sides of the parallel corpus. 19 Each of the words is described in terms of a number of contextual informants. The words in the target language, French, have seven informant features: tense- of-current-word, word-to-left, word-to-right, first-noun-to-left, first-verb-to-left, first-verb-to-right, and first-noun-to-right. For English words, only two infor- mants are defined, first-word-to-left and two-words-to-left. Only two senses are allowed per word in either language. The WSD system makes a binary decision between the different informants and the translations of the word in question. The flip-flop algorithm is used [60] in conjunction with the splitting theorem [4] in a fashion similar to decision tree learning. 
The flip-flop algorithm asks binary questions of a set of English translations corresponding to a French word. It divides the translations into two classes. The splitting theorem helps in deciding the best informant (feature) based on mutual information in linear time. The best question about a potential informant is discovered; in turn, this question divides the French vocabulary set into two classes; the algorithm then uses the Splitting Theorem to divide the set of English translations into two sets that have maximum mutual information with the French sets. The process goes on, alternating between splitting the French vocabulary and the English translation sets. Since this is a binary process, the information gained is bounded by one bit; eventually, the algorithm converges on the English translation that has the maximal mutual information with the French word.

Evaluation

The method is evaluated on 100 randomly chosen English-French sentence pairs. The algorithm is incorporated in a statistical machine translation system; therefore, it is an indirect evaluation. The machine translation output is manually marked as acceptable or unacceptable by the authors. The use of sense disambiguation improves the results from 37% (without sense disambiguation) to 45% (with sense disambiguation).

Discussion

The authors acknowledge the limitations of the approach, stating that it is a pilot study. They also acknowledge that the approach is binary, which is not a realistic scenario. They point out that if the number of classes is unbounded, they expect the results to improve even further; instead of asking binary questions, the system would ask n-ary questions. In our view, however, there is no guarantee that the system will improve in performance if the number of senses is unbounded; the system is faced with a fan-out problem, with the level of noise increasing exponentially, especially if the alignments and part of speech taggers are not 100% accurate. Moreover, this approach is limited by the possible sense annotations, which are corpus specific; this creates problems when porting such an approach to different corpora, even within the same domain. Furthermore, the informants are instance specific tokens, i.e., they consider the actual token in the immediate context. One simple way of surmounting this criticism would be to use types instead of tokens as informants.

2.4.2 Using Bilingual Materials to Develop Word Sense Disambiguation Methods: Gale et al.

In addition to presenting a disambiguation approach, this study considers solving the training materials bottleneck for supervised systems. Gale et al. [27] make an explicit distinction between the sense disambiguation problem and the translation disambiguation problem. They provide a method within a multilingual framework for creating sense disambiguated materials, using translations as labels to annotate a set of polysemous words in the source language of a parallel corpus. They report results of 90% correctness for a set of six polysemous words where each word has two senses.

Method Requirements

A parallel corpus with sentence and token level alignments.

Method Description

This is a supervised approach to WSD. There are two phases: a Training Phase and a Testing Phase. Given a parallel corpus that is sentence and token aligned, the method starts by creating the training material and then tests new items using the coefficients obtained from the training phase. The training and test phases are performed as follows:
The training and test phases are performed as follows: 22 Training Phase: The sense of a polysemous word instance in context is identi- fied based on its translation ? its alignment ? to a target language token. Training Phase: The context score of an instance of a polysemous word is calculated based on a variation on Information Retrieval (IR) techniques, where the contexts are considered in lieu of documents, according to equation (2.1). The score is obtained by calculating the probability of a token appearing within a window of 50 tokens on the right and left of a polysemous instance.      (2.1)     This model ignores word order and collocation information. Local token proba- bilities are too sparse in general; therefore, they opt for a weighting scheme as a smoothing approach. The weights are a ratio of the local source token log like- lihood probabilities and the global log likelihood probabilities. Accordingly, a token that is very frequent in the entire corpus and is frequent in the local context is assigned a low weight value, while a token that is sparse in the entire corpus but frequent in the local context is assigned a high weight value. Testing Phase: Test instances of the polysemous words are identified Testing Phase: Test instances are scored using equation (2.1) Testing Phase: Test scores are compared with training scores and senses are selected based on context score proximity. 23 Evaluation The method is trained and tested on six polysemous nouns with two distinct senses each. The nouns are duty, drug, land, language, position, sentence. The six nouns are selected because their senses correspond to distinct words in French. The training set has 60 training examples per sense. The test set comprises 90 instances per sense per word. The results obtained manually range from 82% to 100% accuracy. In the process of their evaluation, the authors discuss and provide empirical justi- fication for the context window size chosen. They establish that contextual clues are measurable up to a 10,000 words out from the word of interest. They relate this fact to the nature of discourse structure. Moreover, they illustrate that, contrary to common belief that only  6 words are sufficiently useful.5 Up to  50 words is useful for a machine to make sense distinction decisions. Furthermore, they explore the quality and quantity of the training data on sense disambiguation performance. For the impact of quantity, they, surprisingly, show that with only three training examples, their system is able to achieve an accuracy of 75% and accuracy asymptotes as the number of examples increases. As for the question of quality, the authors systematically show the degradation effect on the accuracy of their system?s performance if the training data has errors. With 10% errors introduced in the training set, they obtain a 2% decrease in accuracy; with 30% errors in the training data, the precision decreases 14% only, which is still very robust in their view. 5In contrast to the studies by Weaver (1949) [80] 24 Discussion Even though this evaluation is done on a very small scale, with six words of two senses each, the study tackles issues that are of central concern to us throughout this thesis. This approach performs sense disambiguation using translation words as labels identifying the different senses of a word. The authors erred on the side of caution in this study by using only two homonymic senses per word; they did not look at other polysemic relations such as regular polysemy or metonymy. 
Homonymic senses are the most likely senses to translate into distinct words in other languages. With the current availability of token alignment software and bilingual parallel corpora, it would be interesting to explore how this method scales up when given large amounts of data and polysemous words.

2.4.3 Word Sense Disambiguation Using a Second Language Monolingual Corpus: Ido Dagan & Alon Itai

In this paper [15], Dagan & Itai present a novel approach for resolving lexical ambiguity in one language using statistical information from a monolingual corpus in another language. The method aims to solve the problem of translation word selection for machine translation applications. The method uses a parser for both languages and statistical information from the target language corpus to decide on the most appropriate translation for an ambiguous word in a source language. The approach is evaluated on two different source languages, German and Hebrew, with English as the target language for both. The authors report performance scores of 91% accuracy on Hebrew-English translations and 78% on German-English translations.

Method Requirements

The approach requires the availability of a bilingual lexicon, a parser for the source language corpus and one for the target language. In principle, there is no restriction on the type of corpora, yet preferably they should be of the same genre.

Method Description

Parsing the source language into syntactic tuples
In their implementation, the authors use Slot Grammars [53], a form of dependency parsing identifying verb-object, verb-subject, word-adjunct, and similar syntactic relations. There is no commitment in the paper to a specific parsing paradigm as long as syntactic tuples may be extracted from the parsed corpus.

Locating ambiguous words in the source language
An ambiguous source word is defined, in the context of this paper, as a word that has multiple translations in a bilingual lexicon and fits the syntactic frame of the specific source word instance in the source corpus. Given such a definition, many of the alternative source senses are pruned on syntactic grounds. The words are lemmatized before parsing to reduce sparseness.

Mapping source syntactic tuples to the target language
The method is straightforward: using the bilingual lexicon, the words in the source syntactic tuple are translated into the target language. The authors identify cross-linguistic syntactic divergences, where there is a mismatch between syntactic frames in the source and target languages, as a source of controllable noise. For instance, for some verbs in German, their objects translate into subjects when translated into English. For example, the German sentence Der Tisch gefällt mir is translated as I like the table; the subject Tisch in German becomes the direct object table in English, and the object mir in German becomes the subject I in English. Such divergences are dealt with by means of hand coded rules that target the class of verbs exhibiting this phenomenon.

Choosing the most appropriate translation tuple from the target language corpus
This phase involves several filtering steps. (1) The first step depends on the frequency of observing the translation tuple in the target corpus. This step weeds out implausible tuples that have not been seen in the target corpus. (2) The second step is addressed by a probabilistic model over the different possible target tuples t_1, ..., t_n and their frequencies in the target corpus. The set of candidate tuples is assumed to follow a multinomial distribution, where p_i is the probability that t_i is the correct translation of the source tuple. The authors use the maximum likelihood estimator to estimate the probability p_i of any given tuple t_i. The counts associated with the different candidates are sorted in descending order, and a threshold is set: the ratio of the estimated probability for a certain tuple to the estimated probabilities of the other tuples has to exceed the set threshold. The threshold is small when the frequency counts are very distant and large when the counts are close. This ratio is referred to as the odds ratio. The model entails three underlying assumptions: the events are mutually disjoint; a source language syntactic tuple can be translated into exactly one of the tuples t_1, ..., t_n; and every occurrence of a tuple t_i can be the translation of only one source language syntactic tuple. The authors then define a confidence interval for deciding on the quality of the data, i.e., whether the translation tuple is good enough to be chosen as a translation for the source tuple. The confidence interval depends on the counts of the target tuple in the target corpus and the odds ratio threshold. The threshold is higher when the counts involved are smaller, thereby creating a dynamic threshold that has the desirable effect of pruning cases where the data is not supportive enough. (3) The third step deals with situations where there are multiple ambiguities from multiple syntactic tuples in the same sentence. The authors devise a constraint propagation algorithm that takes the list of all source tuples and their possible alternative translation target tuples and eliminates the tuples that do not satisfy the threshold set with a prespecified confidence level.

Evaluation

The method is evaluated on a random set of examples. The examples consist of source Hebrew and source German paragraphs. In both cases the target language is English. The Hebrew examples are randomly picked from Foreign News sections in the Israeli press. The German paragraphs are picked from the German press. The corresponding English target text is picked from American news articles as well as the Hansard corpus of the Canadian Parliament. The choice of ambiguous words is simulated with a translator and a preliminary bilingual lexicon. For every source language word, the translator searches all possible translations in a bilingual dictionary; s/he eliminates those that do not fit the syntactic structure of the source instance. The translations in the bilingual dictionary are modified manually to be closer to what would be expected of a transfer translation lexicon. Once the ambiguous source words are located, the syntactic tuples are determined and mapped into English. Since they do not have a parser for the source languages, they manually translate the source language paragraphs into English; the translation is a very literal one, and the resulting manual literal English translations are parsed using the ESG parser, which identifies the relevant syntactic tuples in the source language through a simple mapping routine. This process results in 103 ambiguous Hebrew words and 54 ambiguous German words. The statistical English data is acquired from a 25 million word corpus that is filtered from a combination of The Washington Post, the Hansard corpus of the Canadian Parliament, and Associated Press news items. Only sentences of 25 words or less are used. This approach is referred to as Translation Word Selection (TWS).
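As a rough illustration of the tuple-selection step described above, the following sketch implements only the simplest reading of it: maximum likelihood estimates over candidate counts and a fixed odds-ratio threshold. The actual method derives a dynamic, count-dependent threshold from a confidence interval, and all names below are illustrative.

```python
def select_translation(tuple_counts, odds_threshold=2.0):
    # tuple_counts: candidate target-language tuple -> frequency in the target corpus.
    # Returns the most frequent candidate only if its estimated probability beats the
    # runner-up by the required odds ratio; otherwise returns None (abstain),
    # trading applicability for precision.
    if not tuple_counts:
        return None
    ranked = sorted(tuple_counts.items(), key=lambda kv: kv[1], reverse=True)
    best_tuple, best_count = ranked[0]
    runner_count = ranked[1][1] if len(ranked) > 1 else 0
    if runner_count == 0:
        return best_tuple  # unopposed candidate
    odds = best_count / runner_count  # corpus totals cancel out of the probability ratio
    return best_tuple if odds >= odds_threshold else None

print(select_translation({("sign", "contract"): 30, ("seal", "contract"): 6}))  # ('sign', 'contract')
print(select_translation({("sign", "contract"): 8, ("seal", "contract"): 6}))   # None (abstain)
```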
The baseline created is the most frequent translation target word. For convenience we will refer to it as (FB). The authors report results using two evaluation metrics: applicability and preci- sion. Applicability is a coverage measure, i.e. how many cases are attempted out of the possible cases; while precision is the typical metric of how many found items in those retrieved are correct instances. For Hebrew, TWS results are 91% precision and 68% applicability, while FB achieves 63% precision at the same applicability level as TWS. For German, the results are not as good with a precision of 78%, applicability of 50% for TWS, and precision of 56% for FB. The German results are lower due to the change in corpus genre from source test set to target language corpus genre. Further results are reported on applying the TWS approach with the parser on the source side alone, approximating the parser on the target side with collocational infor- mation collected from the target corpus. The results yielded are lower at 85% preci- sion and 64.3% applicability for Hebrew. These results are compared against an FB of 30 71.1% precision. Discussion This paper presents a very interesting approach to solving sense disambiguation for a specific target application, machine translation. The authors use linguistically mo- tivated models ? parsers ? in conjunction with statistical information in a hybrid manner, combining different sources of information. They approximate the lack of tools such as parsers and bilingual lexicons using manual resources, yet they present very rigorous simulations and evaluation criteria, which is very inspiring for the cur- rent thesis. This paper dates back to the early 1990s when parallel corpora were still an extremely expensive resource to obtain, supporting their strong argument against using them. Yet, parsers for many languages do not exist, and their approximation with the manual translation is feasible due to the limited scale of the evaluation, with only 130 paragraphs in total for both source languages, which is a severe impediment for applying this method on a large scale using such simulations. The article seems to dismiss the complexity of acquiring a bilingual lexicon. Building such a resource with an adequate level of coverage is not a trivial matter, especially if the lexicon is required to list syntactic subcategorization frames, which is a requirement for this approach to work. 31 2.4.4 Resolving Translation Ambiguity Using Non-Parallel Bilin- gual Corpora: Kikui This is a more recent approach to WSD within a multilingual framework [39]. This study presents an unsupervised approach for choosing an appropriate translation for a source language word to a target language, given a specific context. The method incorporates two different unsupervised modules: a Distributional Sense Clustering algorithm applied to the source language; and a Translation Disambiguation algorithm applied to the target language by linking the source sense clusters to their translation equivalents in the target language. The method is tested on an English to Japanese machine translation system with promising results. Method Requirements Large amounts of bilingual comparable corpora where the corpora are of the same domain and time frame. A bilingual dictionary Method Description Distributional Sense Clustering algorithm Both corpora are sense disambiguated using the distributional clustering ap- proach introduced by Schu?tze [75]. 
The method encodes ambiguous words as vector profiles where the different dimensions are the words that fall within an n-sized window from the ambiguous word in question. The contents of the vectors are the co-occurrence frequencies. Similar to Schütze, Kikui uses Singular Value Decomposition to reduce the dimensionality of the data. An agglomerative clustering algorithm is applied to the vectors to create the sense clusters.

Translation Disambiguation algorithm
The Distributional Sense Clustering algorithm is applied to the most frequent terms in a source language corpus, where source sense clusters are created. Using IR techniques, the source words are pruned such that only words with high tf-idf values are kept, thereby creating a source term list.6 The source term list is translated into the target language using a bilingual dictionary, resulting in translation candidates. The Distributional Sense Clustering algorithm is applied again to the target language. A cosine similarity measure is applied to the translation candidates and the resulting target language clusters, and those that have the highest similarity values are chosen as the target language translation.

6Term frequency (tf) weighted by inverse document frequency (idf).

Evaluation

The method is trained and tested on 1994 New York Times newspaper articles in English and 1994 Japanese Shinbon newspaper articles in Japanese. The gold standard is a set of manually corrected machine translation output. The method achieves an accuracy rate of 79.1% against the gold standard.

Discussion

The method as described aims at WSD using target language words as labels. Similar to the previous approaches, such a tagging technique is corpus specific; translation word instances are used as labels for the ambiguous source words. The method is language independent in the monolingual sense discrimination phase since the approach applies distributional clustering with no explicit language coding. Like all approaches that depend on bilingual dictionaries as a necessary bridging component, the method is limited by the coverage of the bilingual dictionary with respect to the corpus terms. It would be interesting to explore how this method scales up to corpora of different genres, or of mixed genres. The method is trained and tested on the same limited domain corpora. Accordingly, polysemous words in such corpora tend to have a very high bias toward specific senses.

2.4.5 Summary

System         Method   Corpus Type   Inventory       # Label   Linguistic Tools
Brown et al.   Sup.     Parallel      --              2 words   Tokenizers
Gale et al.    Sup.     Parallel      --              2 words   Tokenizers
Dagan & Itai   Unsup.   Comparable    Biling. Dict.   words     Tokenizers & Parsers
Kikui          Unsup.   Comparable    Biling. Dict.   words     Tokenizers

Table 2.1: Summary of Systems' Required Resources

In Table 2.1, we summarize the resources required by each of the systems described in this section. The first two methods require token aligned parallel corpora; the tag set (label) size is two senses, where a sense is a target translation word. The second two methods utilize comparable corpora and require bilingual dictionaries. The tag set size is not limited; translation words are used as sense labels. The Dagan & Itai method requires parsers for both languages involved in the exercise.

System         Language          Metric           Size                        Gold Standard   Performance
Brown et al.   En-Fr             improv.          100 inst.                   No              8% improv.
Gale et al.    En-Fr             acc.             6 words, 140 inst./word     No              90% acc.
Dagan & Itai   Heb-En, Ger-En    prec., applic.   103 Heb., 54 Ger. inst.     No              91% prec., 63% applic.
Kikui          En-Jap            acc.             120 inst.                   Yes             79.1% acc.

Table 2.2: Summary of Systems' Evaluation

In Table 2.2, we give an overview of the four different evaluations we describe and discuss above. We characterize them in terms of the Language the approach is tested on, the Metric utilized, the Size of the data, the presence of a Gold Standard, and Performance. We note that all systems use accuracy except for the Dagan & Itai method. The first two systems do not define a Gold Standard. It is difficult to draw any conclusions on their respective performance since the different methods use different data sets and evaluation metrics.

Chapter 3

Word Sense Tagging Using Parallel Corpora: SALAAM

3.1 Introduction

Many researchers in the field of computational linguistics, and specifically in the area of Word Sense Disambiguation (WSD), have exploited the observation that ambiguous words in one language translate into different words in a second language. In this chapter, we present a novel unsupervised method of Word Sense Tagging (WST) that builds on this very observation. It relies on texts in translation with the aim of resolving sense ambiguity in natural language; the approach described here is a multilingual unsupervised sense tagging approach that we will refer to as SALAAM, which stands for Sense Assignment Leveraging Alignments and Multilinguality. SALAAM exploits the translator's knowledge of language and of the context of ambiguous words to sense-annotate large amounts of text for two languages simultaneously. As mentioned in the introduction to this thesis (Chapter 1), WSD is believed to be an important enabling building block for potentially improving the performance of many NLP applications.

Within the area of data-driven WST, there are two main approaches with some hybrids: unsupervised methods [1, 49, 69, 83, 85] and supervised methods [10, 59, 84]. Supervised methods traditionally yield better performance results in WST [40]. The main difference between supervised and unsupervised methods lies in the need by the former for sense-annotated data for training. Supervised methods are highly tuned to the training corpus type. This tuning helps in producing reliable results, but it is a double-edged sword since it significantly affects the portability of supervised systems to different corpus genres. Typically, supervised methods require large amounts of good quality data to produce good results. Unfortunately, for nearly all languages, large amounts of sense-annotated data do not exist. On the other hand, unsupervised methods have the advantage of making minimal assumptions about the data; they do not need sense-annotated data as a prerequisite. Comparatively speaking, unsupervised methods are less tuned to the corpus domain, which significantly impacts the quality of the tagging. If an unsupervised method achieves close to supervised methods' performance without relying on sense-annotated data from the outset, then it is a significant contribution to the field.

The method we describe in this chapter, which is at the core of the following chapters as well, is an unsupervised method for word sense tagging of a source and target language in a parallel corpus using the sense inventory of the source language. The method relies on the availability of large amounts of text in translation. It assumes the availability of asymmetric resources for the two languages. Throughout this thesis, the source side of the parallel corpus is defined as the side that possesses the required knowledge resources.
This chapter is laid out as follows: The following section, Section 3.2 discusses the motivation for this study; Section 3.3 defines the problem; Section 3.4 describes the relevant background; in Section 3.5, we present the over-arching hypothesis and insight for the devised method; Section 3.6 describes the approach in detail; this is followed by a detailed evaluation of the source tagging quality in Section 3.7; results of the evaluation are presented in Section 3.8; results and shortcomings of the approach are discussed in detail in Section 3.9; finally, we conclude with a wrap up summary Section 3.10. 3.2 Motivation What is the use in possessing large amounts of sense-annotated data? sense-annotated data could potentially help improve many NLP applications. More- over, possessing sense-annotated data in several languages provides an interesting test bed for exploring different cross-linguistic phenomena related to lexical semantics. It has been shown that several languages pattern in the same way with respect to certain metonymic relations such as container/contained sense transfers [82]. As mentioned earlier, supervised WSD methods yield better performance than un- 38 supervised methods. But a severe bottleneck for supervised systems is the annotated data acquisition for training. Providing large amounts of sense-annotated data?albeit noisy?could potentially help alleviate this impediment. This idea is further investi- gated in Chapter 5. Possessing sense-annotated data in a language with scarce knowledge resources potentially constitutes the initial step in bootstrapping sense inventories for such a language. We explore this issue further in Chapter 4. 3.3 Problem Statement Manually sense-annotating texts is the guaranteed method of obtaining good quality tagged data. But alas, this is a very tedious and laborious job, let alone expensive [24]. Accordingly, automating the process is highly desirable. To our knowledge, all WSD systems aim at providing sense-tagged data in a single language at a time. Most methods naturally target languages with many automated resources and tools. Languages with scarce resources?referred to as well as low density languages?are left behind in the process. In this chapter, we introduce an unsupervised method, SALAAM, for word sense tagging that exploits texts in translation ? parallel corpora. The method is unsu- pervised inasmuch as it does not rely on the existence of sense-annotated data as a prerequisite. The approach aims at resolving the ambiguity of polysemous words in two languages simultaneously: a language with rich resources and one with scarce 39 resources. It aims at exploiting the asymmetry in resources for the benefit of both lan- guages; SALAAM leverages off of rich source language resources to create a seed for acquiring automated resources for a low density language. The focus of this chapter is to lay out the methodology and present a detailed evaluation of the source language sense-annotation. Evaluation of the quality of the target language annotations is inves- tigated in Chapter 4. The approach presented explores the notion that senses of polysemous words in one language are often translated into different words in some set of other languages [74, 34, 32, 23]. Current approaches view the context of a polysemous word in ques- tion in terms of local monolingual features; the features could be in terms of the words, relations of words, or sentences surrounding the polysemous word. 
In contrast, we are defining a polysemous word?s context in cross-linguistic terms. Such a novel extension to the notion of context to cross language boundaries allows for the tagging of ambigu- ous words that are traditionally off the radar for contextually monolingual approaches. 3.4 Relevant Background Using lexical translations as a source of sense distinction is an idea that has been around since the early 1990s [7, 15, 27] (see chapter 2). The key observation is that when a polysemous word in one language (L1) is translated into another language (L2), the polysemous word in L1 is translated into several distinct L2 words in different contexts corresponding to the L1 word?s various senses. The following sub-sections 40 discuss different attempts at using that same idea for the purposes of WSD. 3.4.1 A Translational Basis for Semantics. Helge Dyvik In this study [23], Dyvik examines how translational phenomena may be used as data for the development of linguistic semantics. He treats the translational relation be- tween two languages as a primitive; a phenomenon that is accessible via bilingual informants. He distinguishes a translational relation from an abstract linguistic ex- pression such as synonymy. Dyvik develops a theoretical framework for testing the validity of translational capacity as a discriminating basis for senses of ambiguous words. Dyvik conducts a qualitative study on a set of Norwegian polysemous words and their translations into English. He proposes an unsupervised method using texts in translation that does not rely on any external resources for sense distinction. He dis- covers word senses in a corpus by using translations and their reverse translations, i.e., manually locating the translations of a Norweigian polysemous word in the English text then searching for the translation of the English words that correspond to the orig- inal Norweigian word in the Norweigian text and so on, back and forth. He concludes that translation could indeed be used reliably for sense distinction since it is a linguistic primitive. Exploiting translations enabled him to discover appropriate senses for the majority of the Norweigian polysemous words investigated. 41 3.4.2 Distinguishing Systems and Distinguishing Senses: New Eval- uation Methods for Word Sense Disambiguation. Philip Resnik & David Yarowsky In this article [74], Resnik & Yarowsky make a set of observations about state-of-the- art WSD by comparing advancements in the field with progress levels achieved by other enabling technologies in the NLP. They offer several proposals with the goal of improving the evaluation criteria for automatic WSD, improving the process of acquiring training and testing materials as well as defining sense inventories. They devise an iterative protocol for dealing with all the issues that hinder the creation of a standardized benchmark for comparing the performance of WSD systems. Moreover, such a protocol addresses the obstacles faced by researchers in the field. The results of their proposals are adopted in both SENSEVAL 1 and SENSEVAL 2 exercises. Evaluation Criteria The authors criticize Exact Match as an evaluation metric that was the com- mon practice until recently. The problem, they point out, is that Exact Match is a binary measure. It does not discriminate probabilistically between a tagger that claims ignorance for a sense assignment, for example, and a tagger that as- signs an incorrect sense a lower probability; with Exact Match both systems are equally penalized. 
Therefore, they propose an evaluation measure based on cross-entropy that credits a tagger partially based on the probability assigned to the correct tag. The measure they propose is computed as

$\mathrm{Score}(A) = \frac{1}{N}\sum_{i=1}^{N}\log \Pr_A(cs_i \mid w_i)$   (3.1)

where $N$ is the number of test instances and $\Pr_A(cs_i \mid w_i)$ is the probability assigned by the algorithm $A$ to the correct sense $cs_i$ for the word $w_i$ in instance $i$. Given a hierarchical sense inventory, they further propose that the evaluation measure be sensitive to the semantic distance between the sense labels. Therefore, if a tagger assigns a sibling of the correct sense to the word in question, the tagger should be penalized less than if it assigns the label for a homonymous sense of the word. Accordingly, they devise penalty distance matrices that capture taxonomic semantic distance in hierarchical ontologies. Entries in the matrix are based on a pairwise calculation of semantic distance for all the senses of a given word. Taggers' sense assignments are to be weighted by the communicative distance per sense pair in the ontology. They give the calculation as

$\mathrm{Score}(A) = \frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{k}\Pr_A(s_{i,j} \mid w_i)\, d(s_{i,j}, cs_i)$   (3.2)

where, for any test example $i$, all $k$ senses $s_{i,j}$ of word $w_i$ are considered, weighting the probability mass assigned by the tagger $A$ to incorrect senses $s_{i,j} \neq cs_i$ by the cost or distance $d$ of the mistagging.

Multilingual Inventory

The authors put forward a proposal to restrict a sense inventory to the distinctions that are attested for (lexicalized) cross-linguistically in some minimum number of languages. They do not specify how many languages constitute a reasonable number, however. In their view, this is a mid-point between very coarse-grained listings and the very fine distinctions permeating ontologies such as WordNet.

Translation as a source of sense distinction

In order to validate their proposal for sense inventories based on multilingual evidence, Resnik & Yarowsky explore the relationship between monolingual sense inventories and translation distinctions cross-linguistically. They measure the probability of an English sense distinction being lexicalized differently in 12 diverse languages, at various levels of granularity. They analyze native speakers' annotations of 222 polysemous contexts across the 12 languages. They show that monolingual sense distinctions can be discriminated in some set of second languages. Moreover, their findings suggest a correlation between language family distance and the extent to which polysemous words express their various senses as distinct words, i.e., the farther the family distance of L1 from L2, the better the sense distinction. They cluster the resulting manual sense annotations of the English words and their corresponding translations. They obtain results that correlate well with monolingual sense distances in the hierarchical Hector sense inventory [41], thus lending support to the plausibility of hierarchical sense inventories. The clustering is performed based on a measure of Sense Proximity. This is a cross-linguistic measure for calculating the extent to which two senses of a word lexicalize differently in a given language. The measure is defined as follows:

$\mathrm{Prox}(s_i, s_j) = \Pr(\text{$s_i$ and $s_j$ are lexicalized by the same word in the given language})$   (3.3)

where $s_i$ and $s_j$ are senses of the same word in a given language.
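As a rough illustration only: under one plausible reading of such a proximity score (the fraction of cross-sense occurrence pairs that receive the same translation, averaged over the comparison languages; the exact formulation in the original work may differ, and all names and data below are illustrative), it could be estimated as follows.

```python
from itertools import product

def proximity_in_language(translations_s1, translations_s2):
    # translations_s1 / translations_s2: the translation chosen, in one comparison
    # language, for each annotated occurrence of sense s1 / sense s2 of the same word.
    # 1.0 means the language never distinguishes the two senses lexically;
    # 0.0 means it always does.
    pairs = list(product(translations_s1, translations_s2))
    if not pairs:
        return 0.0
    return sum(1 for a, b in pairs if a == b) / len(pairs)

def sense_proximity(per_language_annotations):
    # per_language_annotations: one (translations_s1, translations_s2) pair per language.
    scores = [proximity_in_language(t1, t2) for t1, t2 in per_language_annotations]
    return sum(scores) / len(scores) if scores else 0.0

# Toy annotations for two comparison languages (values are made up for illustration):
print(sense_proximity([
    (["transA", "transA"], ["transB", "transC"]),   # language 1: never the same word
    (["transX", "transX"], ["transY", "transX"]),   # language 2: sometimes the same word
]))
```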
Based on the probability of distinct lexicalization, the levels of granularity for sense lexicalizations cross-linguistically are quantified; Resnik & Yarowsky conclude that all languages make robust distinctions on the homograph level 95% of the time, on the major sense level 78% of the time, and on the fine-grained level 52% of the time.

3.4.3 Polysemy and Sense Proximity in the SENSEVAL-2 Test Suite. Irina Chugur, Julio Gonzalo, Felisa Verdejo

In an extension of the study by Resnik & Yarowsky, Chugur, Gonzalo & Verdejo investigate the possibility of characterizing sense inventories both quantitatively and qualitatively [34]. They address specific issues: in what ways do the senses of a given word relate to one another, and what is the type of that relationship; how well are individual senses defined (are they fine enough, coarse enough, etc.); and how do such issues affect the evaluation of WSD systems?

Bearing these questions in mind, the authors describe the SENSEVAL 2 WordNet 1.7 subset. They characterize the ontology based on two parameters related to granularity. The first is fine-grainedness, namely, how specific the sense distinctions are and whether it is possible for WSD systems to discriminate between senses. The second parameter examines the flip side of the first: are the sense definitions too coarse-grained? To that end, the authors devise a complementary measure to the Sense Proximity measure defined by Resnik & Yarowsky. The measure is referred to as Sense Stability. Based on cross-lingual evidence, Sense Stability measures the likelihood that a pair of occurrences of a word sense s receives the same translation in a language l, averaged over as many languages as possible. Quantitatively, Sense Stability is defined as

$\mathrm{Stab}(s) = \frac{1}{|L|}\sum_{l \in L} \Pr_l(\text{two occurrences of $s$ receive the same translation in $l$})$   (3.4)

Based on Equation (3.4), coarse-grained senses have low stability, as their contexts may lead to them lexicalizing differently across different languages.

Chugur et al. design an experiment to test the Sense Stability and Proximity of the words in WordNet 1.7 that are used in the SENSEVAL 2 exercise. They adopt the same experimental design as that of Resnik & Yarowsky's human experiment described in the previous section. They have 11 native/bilingual speakers of four different languages who are asked to translate words marked in context into their native language. There are 508 short contexts for 182 senses of 44 words in the SENSEVAL test suite. In analyzing their results, they consider four different factors.

Language Family Distance
Chugur et al. conclude that there is no significant sense difference across the four languages utilized; they do not believe that adding more languages is critical for observing a stronger indicator of language distance.

Proximity and Stability
The Stability and Proximity measures are integrated in a single matrix where the diagonal of the matrix is the Stability while the rest of the matrix cells are the pairwise Proximity measure values. They propose the integrated matrix of both measures as an evaluation criterion for SENSEVAL 2 systems.

Similarity and semantic relations between senses
Four different semantic relations are examined: Homonymy, Metaphor, Specialization/Generalization, and Metonymy. Homonymy is a no-relation case, such as bar (the law sense) and bar (the unit sense). Metaphor is a similarity case, for instance, child (the kid sense).
Specialization/generalization is a case of extend- ing or reducing the scope of the original sense, for example, fine- the greeting sense and fine - the ok sense. Metonymy is a case of semantic contiguity, for instance, yew - the tree sense and yew - the wood sense. Chugur et al. conclude that multilingual sense distinctions are reliable for homonyms. 27% of metaphors have a proximity of over 0.5, but multilingual distinctions are not sufficient indicators for them. Specialization/generalization behaved as ex- pected with medium to high proximity. For metonymy, they conclude that mul- tilingual evidence is a good first approximation but it is not sufficient as a sole criterion for metonymic distinctions. Consistency of the data The Chugur et al. experiment suffers from very low inter annotator agreement, at 54% due to inconsistencies in the tagging by the participating subjects. Some annotators tag the same sense with different translation words due to variability 48 in synonym sets in a particular language. Variable syntactic realizations cross- linguistically cause problems with the data; for instance, a noun modifying a noun in English becomes both an adjective and a noun in Russian. Such situ- ations result in slightly variable forms of a unique root, however, the counting algorithm counts them as different translation collocations when words are part of complex expressions. Finally, they examine problems with the human anno- tations of the SENSEVAL-2 data. All in all, the authors conclude that WordNet 1.7 is a good test bed for WSD systems. They confirm the conclusion drawn by Resnik and Yarowsky: that multilingual evidence is a good basis for sense disambiguation. 3.4.4 Cross-lingual Sense Discrimination: Can it work? Nancy Ide In her study [32], Ide attempts to explore questions that arise from the proposal made by Resnik & Yarowsky with respect to sense definitions. She poses the questions of how many languages are sufficient to produce reliable sense distinctions? when do we know we have sufficient sense distinctions? how can we generate such sense distinc- tions from currently available resources? She acknowledges the limited usefulness of bilingual dictionaries owing to the lack of standards and the pervasive inconsistencies among them. She concludes by stating that parallel corpora are the optimal test bed for these ideas. Ide conducts a manual experiment to investigate the feasibility of using parallel 49 corpora for identifying distinct senses of polysemous words inasmuch as they lexi- calize differently in five different languages. The parallel corpus is George Orwell?s Nineteen Eighty Four translated from English into Slovene, Estonian, Roma- nian, and Czech. The languages pertain to four different language families: Germanic, Slavic, Finno-Ugrec, and Romance. The text comprises 100,000 words translated di- rectly from the original English text. The corpus is sentence aligned. For purposes of the experiment, she picks four words: hard, line, country and head. Parallel sentences with the words in question are extracted and sent to a linguist who is a native speaker of the language of translation. The task of the linguist is to identify the translation of the ambiguous English word in the translated sentence. More than 85% of the English word occurrences have a corresponding lexical unit in any of the four translation language corpora. A manual association link is created for the English word and its translation with the WordNet a sense number. 
A Coherence Index (CI) is devised to measure the extent to which a word in English is lexicalized differently in a translated text. Given a pair of senses for a word, the CI is measured as follows:

$CI(s_a, s_b) = \frac{1}{L}\sum_{l=1}^{L}\frac{t_l}{o_a + o_b}$   (3.5)

where $L$ is the number of comparison languages under investigation; $o_a$ and $o_b$ are the numbers of occurrences of sense $a$ and sense $b$ in the English corpus; and $t_l$ is the number of times senses $a$ and $b$ are translated as the same lexical unit in language $l$. The value of CI is between 0 and 1, similar to the Sense Stability measure proposed by Chugur et al. described above: the higher the CI, the more coherent the senses; the lower the CI, the more the senses lexicalize differently.

Ide considers language relatedness and distance among the different languages in her study. No significant impact is detected based on family relatedness. Based on the CI values, Ide applies agglomerative clustering to the data to test whether structures resembling dictionary entries emerge. She finds a strong correlation between the cluster-induced hierarchies and some dictionary entries for the words hard and head at a coarse-grained level. Accordingly, Ide concludes that translation can successfully be used as a filter for sense distinction.

3.4.5 Discussion

The crux of the current chapter builds on the ideas presented in the papers described above. All four studies exploit the different cross-linguistic sense lexicalizations for sense discrimination. Each of the studies comprises a manual exploration of the feasibility of using parallel corpora for sense discrimination. In this chapter, we devise a method that takes this core idea, expounds on it, and creates a practical demonstration, on a large scale, of its empirical feasibility and validity.

3.5 Hypothesis

Inspired by previous research described in Chapter 2 and in the previous section, this investigation explores the relationship between translations of multiple instances of a polysemous word in a corpus. We emphasize two key observations:

Translation Distinction Observation (TDO)
Senses of ambiguous words in one language are often translated into distinct words in a second language.

To exemplify TDO, we consider a sentence such as I walked by the bank, where the word bank is ambiguous among several senses. A translator may translate bank into rive, corresponding to the geological formation sense, or into banque, corresponding to the financial institution sense, depending on the surrounding context of the given sentence. Essentially, translation has distinctly differentiated two of the possible senses of bank.

Foregrounding Observation (FGO)
If two or more words are translated into the same word in a second language, then they often share some element of meaning.

FGO may be expressed in quantifiable terms as follows: if several words w_1, ..., w_n in L1 are translated into the same word form in L2, then w_1, ..., w_n share some element of meaning which brings the corresponding relevant senses for each of these words to the foreground. For example, if the word rive, in French, translates in some instances in a corpus to shore and in other instances to bank, then shore and bank share some meaning component that is highlighted by the fact that the translator chooses the same French word for their translation. The word rive, in this case, is referring to the concept of land by the water side, thereby making the corresponding senses of the English words more salient.
It is important to note that the foregrounded senses of bank and shore are not necessarily identical, but they are the closest senses to one another among the various senses of both words.1 Figure 3.1 below illustrates FGO.

Figure 3.1: Common Senses Shared Between Polysemous Words

In Figure 3.1, the direction of the line fillings in the geometrical shapes is an indication of shared meaning characteristics between the senses of the words bank and shore. The difference in geometrical shape illustrates the fact that the close senses for the two words are not necessarily identical. As demonstrated in the diagram, the French word rive is the translational choice for both polysemous English words. In this diagram, rive highlights the shared meaning component for the two words in English; the shared semantic attribute is water edge/geological formation.

1FGO as currently stated makes the implicit assumption that the L2 word is not ambiguous. This assumption is fully explored in the discussion section.

Given observations TDO and FGO, the crux of the SALAAM approach aims to quantifiably exploit the translator's implicit knowledge of sense representation cross-linguistically, in effect reverse engineering a relevant part of the translation process.

3.5.1 General Hypothesis Statement

Given texts in translation with a source and target language, where a language is defined as a source language based on the fact that it has a sense inventory, we hypothesize that a target language word that is translated into distinct source language words serves as a good source of evidence for grouping the source language words.

Accordingly, in the current example, rive is a good source of evidence (an anchor) for grouping the words bank and shore.

3.6 Method

3.6.1 General Method Description

Hypothetically, if the task of sense-annotating a parallel corpus (comprising a source and a target language) is manually attempted, where the annotator is tagging a polysemous source word with its corresponding target translation word, then the study requires the annotator's knowledge of both source and target languages. S/he will create a mapping of words in L1 to words in L2. Accordingly, the words bank and shore are tagged with the word rive. Yet, tagging a source language with target words renders the annotations extremely corpus-specific. To achieve generality with the sense tagging, we tag the corpus with an independent tag set from a source sense inventory.

For illustration purposes, we assume the source language is English and the target language is French. Furthermore, the existing sense inventory is in English, corresponding to the source language.2

2In Diab and Resnik (2002), the naming of the corpora is reversed in accordance with the noisy channel naming convention. We decided to make it less confusing for the reader by following the more intuitive reading, since the source is also linked to resource availability for SALAAM.

Given a parallel corpus, a high-level view of the method is summarized in the following five steps:

1. Locate words in the English source corpus and their corresponding French target translations

2. Group source words that translate to the same target word orthographic form, thereby creating source groups

3. Measure the similarity among the different senses of the words in the source group based on their distance in a source language sense inventory

4.
Assign the selected sense tags to the respective words in the corpus 5. Project the assigned sense tags from the source language words to the corre- sponding target language words in the parallel corpus The first step in the preceding generic description would be a labor-intensive exer- cise if attempted manually. In order to automate the process on a large scale for parallel corpora, the need arises for a method that automatically discovers source-target word mappings (alignments). Once the translational correspondences are discovered, grouping source words based on their translation to the same target word is directly applied. Step 3 assumes the existence of a large independent sense inventory that is amenable to computational systems; moreover, it is assumed to have an associated quantified similarity measure between the words? senses. The similarity measure is used to cal- culate the similarity between the different source words? senses. The closest senses resulting from the previous step are chosen for tagging the source words in the source groups. Once the words in the source language are annotated with their appropriate sense 56 tags, the tags are projected to their corresponding translations in the target corpus. Ef- fectively, the sense tag assigned to the source word is the same sense tag projected onto the target word, thus creating a link for the target word from the translation language in the source inventory. 3.6.2 Required Resources Our goal is to realize the described method automatically on a large scale. Therefore, two knowledge resources are required: Large amounts of text in translation are required, hence the need for a parallel corpus. Parallel corpora exist in myriad languages, for example, in religious books such as the Quran and the Bible [73], as well as in the UN Proceed- ings, and the Canadian Parliamentary Proceedings. Moreover, researchers have devised methods to mine the internet for large amounts of parallel corpora auto- matically with relatively minimal manual labor at high accuracy levels [71]. A sense inventory is needed for the source language, where each word is repre- sented with its/as its corresponding senses. This inventory is required for only one of the languages of the parallel corpus.3 The sense inventory should be rich enough to provide maximum coverage for the parallel corpus above. 3The translator in the context of a parallel corpus is trusted to have chosen the most faithful target lexical translation that conveys the sense of the source word by preserving the salient meaning element. 57 3.6.3 Detailed Method Description As mentioned earlier, this approach, SALAAM, is unsupervised in that it does not rely on the availability of sense-annotated data for either language of the parallel corpus. Figure 3.2 provides a schematic view of the method followed by a detailed description of the individual processes. In the figure, each process is presented with an example on the right hand side. Figure 3.2: Flow chart demonstrating process flow in SALAAM method The details of the schematic figure are as follows: 58 Word Align Parallel Corpus SALAAM assumes the availability of token-aligned parallel corpora. A token is de- fined as a space delimited unit in a tokenized text. A token could be a number, a punc- tuation mark, a symbol or a word instance. Alignment is the process of discovering the translational mappings of token instances between source and target languages in a parallel corpus. 
A token instance is a unique occurrence of a token in a corpus. Figure 3.3 illustrates an example of token-aligned text expected as input by the algorithm. La maison grise est grande The gray house is big Figure 3.3: A sample token alignment in a parallel corpus Every token in the source corpus is aligned to some token or set of tokens in the target corpus. One-to-many alignments, in most cases, arise due to lexicalization di- vergences. A source token may align with the NULL token, which is the empty token indicating the non-existence of an appropriate alignment token on the target side. Create Source Type Sets The process of creating source type sets involves the following steps: Identify Aligned Tokens 59 Figure 3.4 illustrates the alignment of target French token instances to English source token instances. An instance of the French token rive aligns with the source token instance bank; the second French token instance rive, in the figure, aligns with the source token instance shore; similarly, target token instances of banque align with source tokens bank and repository; the dots indicate running text. ...rive...rive...banque...banque... ...bank...shore...bank...repository Figure 3.4: Tokens aligned in a parallel corpus Figure 3.5 shows different source token instances of bank and shore which align with instances of the French token rive; likewise, the figure illustrates source token instances of bank and repository aligning with instances of banque. The numbers in the figure demonstrate the process of bookkeeping the information for corpus location and occurrence. For example, rive#7#1#27 is the token in- stance of the target word rive, where 7 is a line identification number in the corpus, 1 is rive?s location in the line ? all token instances in a line are num- bered from  to ? and 27 is the frequency of occurrence of the token rive in the target corpus. It is worth noting that token alignment is between source and target lines with the same identification number, therefore, in Figure 3.5, the line identification number is the same for target and source token instances for all the 60 listed pairs. Given a parallel corpus, it is not always the case that a line or a sentence on the source side will correspond to a line or sentence on the target side. In many cases, we find a sentence on the source side corresponding to multiple sentences on the target side or vice versa. Several researchers devise automatic approaches for automatically discovering sentence alignment in parallel corpora [55]. Sen- tence or line alignment is an interesting problem but it is outside the scope of the current research. SALAAM assumes that the source and target corpora are line aligned (sentence aligned). Conflating Alignments Once the parallel corpus is token-aligned as shown in Figure 3.5, the target to- ken instances are conflated into target word types. Accordingly, rive#7#1#27, rive#7#5#27, rive#44#4#27, rive#7#10#27, rive#9#1#27 are conflated into the target word type RIVE and, similarly, banque#2#5#65, banque#2#69#65, banque#36#9#65, banque#12#45#65, banque#14#15#65 are conflated into the target word type BANQUE. All instances of source tokens that align with the same target word type are grouped in a source token set as illustrated in Fig- ure 3.6. 
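To make the bookkeeping and conflation just described concrete, here is a minimal sketch. It collapses the token-set and type-set steps into one pass; the record format mirrors the word#line#position#frequency convention above, and the function names are illustrative, not part of SALAAM's actual implementation.

```python
from collections import defaultdict

def word_type(token_record):
    # A token record looks like "rive#7#1#27": surface form, line id,
    # position in the line, and corpus frequency of the token.
    return token_record.split("#")[0].upper()

def build_source_type_sets(aligned_pairs):
    # aligned_pairs: list of (target_token_record, source_token_record) tuples,
    # e.g. ("rive#7#1#27", "bank#7#1#104").
    # Returns a mapping from each target word type to the set of source word types
    # whose token instances align with it.
    source_type_sets = defaultdict(set)
    for target_record, source_record in aligned_pairs:
        source_type_sets[word_type(target_record)].add(word_type(source_record))
    return dict(source_type_sets)

pairs = [
    ("rive#7#1#27", "bank#7#1#104"),
    ("rive#7#10#27", "shore#7#12#22"),
    ("banque#2#5#65", "bank#2#7#104"),
    ("banque#12#45#65", "repository#12#49#19"),
]
print(build_source_type_sets(pairs))
# e.g. {'RIVE': {'BANK', 'SHORE'}, 'BANQUE': {'BANK', 'REPOSITORY'}} (set order may vary)
```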
All the source token instances aligned with rive target token instances are grouped to form a source token set corresponding to the target word type RIVE, and those aligned with banque are grouped to form the source token set corresponding to the target word type BANQUE.

rive#7#1#27      bank#7#1#104
rive#7#5#27      bank#7#4#104
rive#44#4#27     bank#44#4#104
rive#7#10#27     shore#7#12#22
rive#9#1#27      shore#9#5#22
:
banque#2#5#65    bank#2#7#104
banque#2#69#65   bank#2#87#104
banque#36#9#65   bank#36#7#104
banque#12#45#65  repository#12#49#19
banque#14#15#65  repository#14#16#19

Figure 3.5: Aligned token instances from target to source

In order to create source type sets, the source token instances are conflated into source word types, in a manner similar to the conflation of target tokens into target types. For example, Figure 3.7 illustrates the source type sets for RIVE and BANQUE. In Figure 3.7, the word tokens bank are conflated to form the source word type BANK, and likewise for SHORE and REPOSITORY.

RIVE: bank#7#1#104, bank#7#4#104, bank#44#4#104, shore#7#12#22, shore#9#5#22
BANQUE: bank#2#7#104, bank#2#87#104, bank#36#7#104, repository#12#49#19, repository#14#16#19

Figure 3.6: Target word types and their corresponding source token sets

RIVE: BANK, SHORE
BANQUE: BANK, REPOSITORY

Figure 3.7: Source type sets for the target words RIVE and BANQUE

Sense Assignment to Source Type Sets

A distance metric is defined to measure the similarity between the senses of the source word types in the source type sets. A similarity function sim(w_i, w_j) is defined, where sim calculates the distance between all the senses of word w_i and word w_j, for all word pairs in the source set. The goal is to maximize the overall similarity among the word senses across the source word types in the source type set. The resulting similarity measure is an optimization function; it chooses the senses that are most similar among all the senses of all the words in a given source set.

Given the source set for RIVE as BANK, SHORE, all the senses corresponding to the two words in a sense inventory are compared, and the ones that are the most similar according to the defined similarity measure are chosen as the appropriate tags for the respective word types. For illustration, if we look up the words BANK and SHORE in the Collins Cobuild Dictionary [76], we find five nominal senses listed for BANK and two for SHORE. Accordingly, the sim function computes 5 x 2 = 10 comparisons, each comparison resulting in a similarity value. The sense tags that yield the highest similarity value are assigned to their corresponding word types. In fact, more than one tag may score the highest sim value. For illustration, the five senses listed for BANK are:

1. a bank is an institution where people or businesses can keep their money
2. the bank in a gambling game is the money that belongs to the dealer or to the casino management
3. a bank is the raised ground along the edge of a river or a lake
4. a bank of something such as computer data or blood is a store of it that is kept ready for use when needed
5. a bank of switches, keys, etc., on a machine

The two senses listed for SHORE are:

1. the shore of a sea, lake or wide river is the land along the edge of it
2. a particular country with a coastline is sometimes referred to in literary English as the shores of the country

The senses listed for REPOSITORY are:

1. a person or a group of people who you can rely on to look after something important
2. a place you can keep objects of a particular kind

By inspecting the definitions of the different sense entries for the source word types BANK and SHORE, we see that sense #3 of BANK and sense #1 of SHORE are the most similar among the different possible pairings of senses. We judge them to be similar based on the proximity in the meanings of the definitions rendered. Therefore, the source word types BANK and SHORE are assigned those senses, respectively. In this phase, the role of the similarity measure is to produce quantitative values for the distance between the different senses of the different words. In quantified terms, the similarity function just utilized is nothing but a computation of the overlap between the content words that make up the sense definitions [43]. The choice of senses for tagging is based on setting a sense selection criterion. For the given example, the selection criterion is set to the senses that have a maximum overlap in the words of the sense definitions. Based on the chosen sense definitions, the salient meaning element shared is land by the water edge. Similarly for the source words BANK and REPOSITORY, in correspondence with the target word BANQUE: BANK is assigned its sense #4 and REPOSITORY its sense #2. In this case, the salient meaning component is a place to keep objects of a kind. The resulting source type tag set is illustrated in Figure 3.8, and the senses propagated to the token instances corresponding to the word types are illustrated in Figure 3.9. In the figures, the subscripts indicate sense numbers.

RIVE: BANK_3, SHORE_1
BANQUE: BANK_4, REPOSITORY_2

Figure 3.8: Sense Tagged Type Source Sets

RIVE: bank_3#7#1#104, bank_3#7#4#104, bank_3#44#4#104, shore_1#7#12#22, shore_1#9#5#22
BANQUE: bank_4#2#7#104, bank_4#2#87#104, bank_4#36#7#104, repository_2#12#49#19, repository_2#14#16#19

Figure 3.9: Sense Tagged Token Source Sets

It is extremely important to note that source type sets have to have at least two members in order for a similarity function to be applied among word senses; i.e., by definition, a similarity function applies to a minimum of two items. Therefore, it is crucial to highlight the significance of variability in alignment. To illustrate: if, throughout the parallel corpus, all instances of the French target word rive align with source instances of shore, the resulting source type set, after conflation, will have a single word type SHORE, which cannot be submitted to the similarity function; consequently, neither instances of shore nor the corresponding target instances of rive are assigned sense tags.

Project Source Sense Tags to Target Tokens

Finally, source sense tags assigned to source tokens from the source sense inventory are projected onto target language corpus tokens, which is a direct mapping step. In Figure 3.10, instances of rive and banque are assigned the senses corresponding to the source language sense inventory entries, indicated by the subscripts, thereby creating links for the French words in the Collins Cobuild Dictionary.

Figure 3.10: Projecting source inventory senses onto target language instances

3.6.4 Evaluation Metrics

1. Precision (P)
A measure of accuracy for sense tagging where the tags resulting from SALAAM are evaluated against a predefined gold standard set. Quantitatively, Precision is measured as follows:

$\text{Precision (P)} = \frac{\text{correct tags}}{\text{items tagged}}$   (3.6)

2. Recall (R)
A measure of the retrieval capacity of a system where the tags resulting from SALAAM are evaluated against a predefined gold standard set.
Quantitatively, Recall is measured as follows:

$\text{Recall (R)} = \frac{\text{correct tags}}{\text{items in gold standard}}$   (3.7)

3. F-Measure (FM)
This is an Information Retrieval measure which is a summary measure of Precision and Recall. Quantitatively, F-Measure is measured as follows:

$\text{F-Measure (FM)} = \frac{2 \times P \times R}{P + R}$   (3.8)

4. Coverage (COV)
This is a measure of the number of items attempted by SALAAM out of the possible items in the gold standard. Coverage is measured as follows:

$\text{Coverage (COV)} = \frac{\text{gold standard items tagged}}{\text{items in gold standard}}$   (3.9)

5. Zscore Significance Test (Z)
This is a two-tail significance test of the difference between two proportions.4 The significance level is set to 95%, i.e., a test is significant if Z > 1.96 or Z < -1.96.

4http://franz.stat.wisc.edu/~rossini/courses/introbiomed

3.7 Evaluation

In order to formally evaluate SALAAM for English word sense tagging, we need four different components:

1. A parallel corpus with English on one side as the source language. The corpus needs to be large enough for training stochastic translation models for the automatic discovery of token mappings (translation alignments). Moreover, the corpus has to exhibit enough variability in order to render the similarity measure operational. Accordingly, the need arises for a balanced corpus. A balanced corpus is defined as a corpus that has equivalent amounts of data pertaining to diverse topics.

2. A broad coverage sense inventory for the English source language

3. A hand-annotated subset of the corpus to provide a gold standard for evaluation

4. Performance figures for other systems on the same task, evaluated against the same gold standard

Acquiring all four components simultaneously proves to be a challenge. To our knowledge, there are no balanced parallel corpora with a hand-annotated gold standard. On the other hand, the few hand-annotated sets available do not exist for parallel corpora. We therefore pose the question: which is more feasible, translating a corpus that has an associated gold standard, or hand-annotating a portion of the English side of a parallel corpus? Given how involved the process of hand-annotating a corpus is, we opt for the former solution of translating a corpus that has an associated gold standard set.

SENSEVAL

The requirement for a hand-annotated set as a gold standard, which is also used for evaluating other WSD systems, is met through the SENSEVAL 2 English All Words task.5 (See the SENSEVAL era section in Chapter 2.) In SENSEVAL 2, the English ontology is WordNet 1.7 pre (WN17pre).

5http://www.senseval.org

3.7.1 Materials

Ontology

Like previous WordNet editions [24], WN17pre is a computational semantic lexicon for English. It is rapidly becoming the community standard lexical resource for English since it is freely available for academic research. It is an enumerative lexicon that combines the knowledge found in traditional dictionaries in a Quillian (1968) style semantic network [65]. Words are represented as concepts, referred to as synsets, that are connected via different types of relations such as hyponymy, hypernymy, synonymy, meronymy, antonymy, etc. Words are represented as their synsets in the lexicon. For example, the word bank has 10 synsets in WN17pre corresponding to 10 different senses. The concepts are organized taxonomically in a hierarchical structure with the more abstract or broader concepts at the top of the tree and the specific concepts toward the bottom of the tree.
In this taxonomy, the concept FOOD is the hypernym of the concept FRUIT, for instance. Like previous WordNet taxonomies, WN17pre comprises four databases for the four major parts of speech: nouns, verbs, adjectives, and adverbs. The noun database consists of 69K concepts and has a depth of 15 nodes. It is the richest of the four databases, and the majority of its concepts are connected via the IS-A relation. In this chapter, we focus on nouns only (nothing inherent to SALAAM restricts it to a specific part of speech). An excerpt of the noun database for WN17pre is shown in Figure 3.11. In the figure, the dotted lines indicate several nodes omitted for space considerations and the subscripts indicate sense numbers.

[Figure 3.11: An excerpt from the noun database of WN17pre]

Gold Standard

To evaluate the performance of SALAAM, we use the SENSEVAL 2 English All Words tag set as a gold standard. The gold standard is manually annotated with WN17pre by the organizers of the SENSEVAL 2 exercise. In this chapter, we are only interested in the nouns in the set. The nouns in the gold standard are annotated with one or more senses from WN17pre. Some nouns are tagged with P and others with U, where the P tag indicates proper nouns and the U tag indicates unassignables, cases where the annotator could not find the appropriate sense in the list of WordNet senses. The gold standard for this evaluation comprises 1071 nouns after excluding instances annotated with U and/or P tags.

Test Set

The nouns in the SENSEVAL 2 English All Words (SV2AW) test corpus constitute the test set for this evaluation. SV2AW comprises 3 articles from The Wall Street Journal amounting to a total of 242 lines and 5815 tokens. The articles discuss three topics: culture, medicine and education.

Corpora

SV2AW is a very small corpus for SALAAM to be applied to. First, for stochastic token alignment, the need arises for a large corpus to ensure reliable alignment results. Moreover, variability in contexts is essential to produce source type sets that have several members. Therefore, the test corpus needs to be augmented with a large enough parallel corpus in order to ensure two factors: reliable alignment quality and variability in contexts. Accordingly, SV2AW is augmented with four corpora that are deemed balanced. The corpora are described as follows:

1. The Brown Corpus of American English (BC) [26]
BC comprises articles from specialized scientific journals, novel excerpts and news articles as well as non-fiction work. It is a balanced corpus of approximately one million words.

2. The SENSEVAL 1 Trial, Training and Test corpus (SV1) [40]
SV1 comprises excerpts from the following different corpora: The Wall Street Journal, which is mainly news articles; The British National Corpus, a balanced corpus of roughly 100 million words of different genres, similar to BC but in British English; and IBM technical manuals. All in all, SV1 amounts to 1.5 million tokens.

3. The SENSEVAL 2 Lexical Sample trial, training and test corpus (SV2-LS) (http://www.senseval.org/)
This corpus is similar to the SV1 corpus in constitution. It comprises 1.76 million tokens.

4. The Wall Street Journal (WSJ) corpus
WSJ comprises sections 18-24 of the Penn Tree Bank. This corpus has 1.29 million tokens. The WSJ mainly contains news articles.

Corpora    Lines     Tokens
BC-SV1     101841    2498405
SV2-LS     74552     1760522
WSJ        49679     1290297
SV2AW      242       5815
Total      226314    5555039

Table 3.1: Relative sizes of corpora used for evaluating SALAAM on the SV2AW test set
The relative sizes of the four augmenting corpora listed above, as well as the test corpus, are illustrated in Table 3.1. None of these corpora exists in translation. Resorting to human translators would have been an ideal solution, but considering the expense and time factors, we opt for off-the-shelf machine translation (MT) systems to do the job. We use commercially available MT systems as an approximation (pseudo-translation) [18]. The process of pseudo-translation is appealing from several angles: it is cheap and fast to produce large amounts of translated data in a reasonable amount of time, and one could use several MT systems for several languages. Accordingly, we pseudo-translate the four augmenting corpora as well as the test corpus into 3 different languages: Arabic, French and Spanish. We use two machine translation systems per language. For Arabic, we use two machine translation systems available on the Web, Al-Misbar (AM, http://www.almisbar.com) and Tarjim (TR, http://www.tarjim.com). For French and Spanish, we use two MT systems: Global Link 6.4 Pro (GL) and Systran Professional Premium 2.0 (SYS). The process of pseudo-translation results in six parallel corpora, two for each language. The choice of languages is mainly influenced by the claimed quality of translations for both GL and SYS in French and Spanish. Moreover, EuroWordNet exists for both French and Spanish, and could later serve as a test bed for the projected tagging on the target language side of the parallel corpus. As for Arabic, the choice is mainly because of its distance from English. As a Semitic language, Arabic is farther from English than Spanish and French, which are both Latin-based. Moreover, Arabic, by many standards, is considered a low density language, which creates a realistic test case for SALAAM.

3.7.2 Tools

Part Of Speech (POS) Tagger

There are no inherent constraints within SALAAM for a specific POS. But in order to constrain the search space in the sense inventory for this evaluation, we restrict the POS to nouns. Both the BC corpus and the test corpus SV2AW are manually POS tagged. The rest of the corpora are tagged using the Brill POS Tagger [5]. The Brill POS Tagger is trained on the manually POS tagged BC.

Tokenization

In this evaluation of SALAAM, we process four languages: English, Arabic, French and Spanish. For the English and French corpora, we use the tokenizer provided by Dan Melamed (personal communication) with some modifications. For Spanish, we use a tokenizer developed by Nizar Habash and Bonnie Dorr (personal communication), with some modifications. As for the Arabic corpora, we created a simple stemmer/tokenizer based on standard regular-expression pattern matching. The Arabic text is first transliterated into Latin script, then the tokenization separates out suffixes and prefixes; in Arabic, suffixes are typically pronouns and prefixes are usually prepositions or articles. (The tokenization is intentionally kept at a minimum in order to maintain a comparative base among the three target languages while simultaneously making minimum assumptions as far as the target language requirements are concerned.)

Stochastic Alignment Tool: GIZA++

SALAAM assumes token aligned corpora as input. However, since the field of alignment is still in its early stages, token aligned parallel corpora that meet the specific requirements for SALAAM do not exist.
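To make this input assumption concrete, here is a minimal, purely illustrative sketch of sentence-level token alignments and of the conflation step that turns them into source type sets (Section 3.6). The data structures and names are ours for illustration and do not reflect SALAAM's actual formats.

```python
# Illustrative only: build "source type sets", i.e., for each target word type,
# the set of source word types aligning with it anywhere in the parallel corpus.
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class AlignedSentence:
    source_tokens: list   # e.g., English tokens
    target_tokens: list   # e.g., French tokens
    links: list           # (source_index, target_index) pairs from the aligner

def build_source_type_sets(corpus):
    type_sets = defaultdict(set)
    for sent in corpus:
        for s_i, t_i in sent.links:
            type_sets[sent.target_tokens[t_i].lower()].add(sent.source_tokens[s_i].lower())
    # Singleton sets cannot be submitted to the similarity function (see Section 3.6.3).
    return {target: srcs for target, srcs in type_sets.items() if len(srcs) >= 2}

corpus = [
    AlignedSentence(["the", "river", "bank"], ["la", "rive"], [(0, 0), (2, 1)]),
    AlignedSentence(["the", "rocky", "shore"], ["la", "rive"], [(0, 0), (2, 1)]),
]
print(build_source_type_sets(corpus))   # {'rive': {'bank', 'shore'}}
```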
Since such ready-made token-aligned corpora do not exist, we assume the parallel corpora are sentence aligned and use an automated token alignment system, the GIZA++ package [62]. GIZA++ is part of the EGYPT statistical machine translation package [2] and is an implementation of IBM models 1-5 [8]. The models are trained in succession, where each of these models produces a Viterbi alignment; the final parameter values from one model are used as the starting parameters for the next model. Given a source and target pair of aligned sentences, GIZA++ produces the most probable token-level alignments. Multiple token alignments are allowed on the source language side, i.e., a token in English may align with multiple tokens on the French side. Tokens on either side of the parallel corpus may align with an empty token, indicated by the NULL token.

3.7.3 Sense Selection and Similarity Measure

As described earlier in Section 3.6.3, a similarity measure is needed to determine the quantitative distance between the senses of the words in question. For the purposes of this evaluation, we use the Noun Groupings (NG) distance measure for calculating the similarity values between the different senses in a source type set. NG is an algorithm proposed and implemented by Resnik [70]. The algorithm is an optimization function: given a source type set of words in English, NG calculates the pairwise similarity across all senses of the words in the source set and assigns the highest confidence scores to those senses that are the closest in the set. The confidence scores range from 0 to 1.

At the core of NG is an information theoretic similarity measure devised by Resnik [72]. Given a taxonomy of concepts and frequencies of words in a large corpus, Resnik's similarity measure calculates the distance between two concepts c1 and c2 as:

    sim(c1, c2) = max over c in S(c1, c2) of IC(c)        (3.10)

where S(c1, c2) is the set of concepts that subsume both c1 and c2, and where IC(c) = -log p(c) is the information content of node c; p(c) is estimated by observing frequencies in a corpus. Therefore, the quantity defined in this similarity measure calculates the information content of all the nodes that subsume two synsets and returns the one with the maximum information content, as well as the two senses that are the closest. Intuitively, if two senses are not similar, then the information content returned will be very small, indicating that the most informative subsumer is very high up in the taxonomy. In the worst case, where there is no similarity at all, the information content is 0, pertaining to the top node in the hierarchy. NG provides the quantitative values of the distance between the different senses of the words in the source type sets, but it does not choose the appropriate senses. Consequently, in SALAAM, we devise a sense selection criterion threshold to choose the appropriate senses to assign to the words in the source type sets.

3.7.4 Development and Testing Environment

The SALAAM system is developed and tested on a Sun Blade 1000 with 1GB of RAM running OS Solaris 2.8. The system is written in C and Perl.

3.7.5 Evaluation Measure

In this chapter, we use the SENSEVAL 2 scoring program scorer2 (we use the version fixed by Rada Mihalcea, http://www.senseval.org). scorer2 is an implementation by Cotton [13] of the Melamed and Resnik [56] metric for sense disambiguation evaluation. The measure is a principled metric for tagger evaluation given hierarchical tag sets. It rewards a system with a score of 1 if the sense assigned is correct and 0 if it is completely incorrect, and, unlike traditional measures, it allows room for partial credit.
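Stepping back to the similarity computation at the heart of NG (Equation 3.10), the following sketch shows how a Resnik-style information-content similarity can be computed with NLTK's WordNet interface. It is an off-the-shelf approximation for illustration only, not the NG implementation used in this evaluation, and the Brown-based information-content file is an assumption.

```python
# Requires the NLTK "wordnet" and "wordnet_ic" data packages.
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

brown_ic = wordnet_ic.ic("ic-brown.dat")   # concept frequencies estimated from the Brown Corpus

bank_slope = wn.synset("bank.n.01")                               # sloping land by water
shore = wn.synset("shore.n.01")                                   # land along a body of water
bank_institution = wn.synset("depository_financial_institution.n.01")

# Resnik similarity: information content of the most informative common subsumer.
print(bank_slope.res_similarity(shore, brown_ic))             # relatively high
print(bank_slope.res_similarity(bank_institution, brown_ic))  # low: the shared subsumer is very general
```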
Returning to scorer2: if a WSD system assigns an incorrect sense to a polysemous word, but that sense is a sibling of the correct sense (a direct descendant of the same hypernym), the system is rewarded with partial credit. scorer2 can report results in three different modes: fine-grain mode, coarse-grain mode, and mixed-grain mode. The fine-grain mode is the strictest evaluation metric. All the results reported here are evaluated using scorer2 in the fine-grain mode. We do not report an evaluation against a baseline since none was used for the official evaluation in the SENSEVAL 2 exercise.

3.7.6 Evaluation Parameters

Part of Speech

As mentioned earlier, SALAAM has no inherent constraints on POS tag. However, for the purposes of this evaluation, we set the POS to nouns.

Stop Word List

Another parameter is the removal of closed class items from the alignments, since they are a source of noise. We use a stop word list that contains mainly punctuation, prepositions and articles in the source language, English.

GIZA++ Parameters

The parameters are set as follows: 20 iterations for model 1, 10 iterations for HMM and 20 iterations for model 4. (Models 2 and 3 are eliminated from the alignment based on the discussion in [62]; HMM and model 4 essentially replace models 2 and 3. Model 5 is excluded owing to its excessive time requirements.) The maximum sentence length on either or both sides of a parallel corpus is set to 70 tokens.

3.7.7 Evaluation Factors

Different Target Languages

We have three different languages for this evaluation: Arabic, French, and Spanish.

Different MT Systems

Each language is pseudo-translated using two different MT systems.

Sense Selection Criterion

NG assigns a confidence score to each word sense when calculating the similarity between the different words in the source type set. If NG is not confident of the scores, it typically divides the 1.0 confidence score among all the senses of a given word, yielding a uniform confidence distribution. Consequently, we have two sense selection criteria:

1. MAX
The sense tag(s) with the highest confidence score is (are) selected. There is no minimum confidence score threshold.

2. THRESH
The sense tag(s) with the highest confidence score is (are) selected, and a minimum confidence score threshold is set.

3.7.8 Evaluation Conditions

Based on the three factors, we devise several experimental conditions (a small illustrative sketch of the two sense selection criteria follows below). In all the conditions, in accordance with the method description in Section 3.6.3, the target language is used as the source of evidence to create the source type sets. The first condition describes the default set of conditions. This is followed by a set of conditions where the output of MT systems for the same target language is merged pre-alignment or post-alignment; the idea behind merging the output of two pseudo-translations is to maximize the translation variability, assuming that two different MT systems most likely use different knowledge bases for the translation process. We conclude with a set of conditions where the output of applying SALAAM to the parallel corpora pertaining to different languages is merged in different modes. The impetus for such a merge is to test to what extent evidence from different languages aids the performance of the SALAAM tagging system.
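Here is that sketch of the MAX and THRESH sense selection criteria applied to NG-style confidence scores; the numeric cutoff used for THRESH below is an illustrative assumption, not the value used in the actual experiments.

```python
# Illustrative sense selection over NG-style confidence scores, which sum to 1.0
# over a word's senses. The THRESH cutoff of 0.5 is an assumption for illustration.
def select_senses(confidences, criterion="MAX", thresh=0.5):
    """confidences: {sense_id: confidence}; returns the selected sense id(s)."""
    best = max(confidences.values())
    if criterion == "THRESH" and best < thresh:
        return []   # uniform / low-confidence distributions are weeded out
    return [sense for sense, c in confidences.items() if c == best]

confident = {"bank#1": 0.7, "bank#2": 0.1, "bank#3": 0.1, "bank#4": 0.1}
uniform = {"bank#1": 0.25, "bank#2": 0.25, "bank#3": 0.25, "bank#4": 0.25}
print(select_senses(confident, "MAX"), select_senses(confident, "THRESH"))  # both pick bank#1
print(select_senses(uniform, "MAX"), select_senses(uniform, "THRESH"))      # MAX keeps all four; THRESH keeps none
```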
1. Default conditions: AR-TR, AR-AM, FR-GL, FR-SYS, SP-GL, SP-SYS
We have six default conditions, which result from crossing the target-language factor with the MT-system factor of Section 3.7.7, with the sense selection criterion set to MAX. The six conditions are: FR-GL and FR-SYS, for evaluating the English test set resulting from applying SALAAM to the English-French parallel corpus obtained by pseudo-translating the test corpus and the augmenting corpora using the GL MT system and the SYS MT system, respectively; SP-GL and SP-SYS, for evaluating the results obtained by applying SALAAM to the English-Spanish parallel corpus when pseudo-translated using GL and SYS, respectively; and similarly, AR-AM and AR-TR for Arabic, where the English source corpora are pseudo-translated using AM and TR, respectively.

2. Intralanguage pre-alignment merge with MAX sense selection criterion: FR-GLSYS
Translations resulting from the two MT systems are interleaved: the odd English lines are translated using the GL MT system and the even lines using the SYS MT system. This condition is only applied to the French pseudo-translated corpora. The sense selection criterion is set to MAX.

3. Intralanguage post-alignment merge with MAX sense selection criterion: GLSYS-FR M, GLSYS-SP M, AMTR-AR M
Source and target token alignments resulting from GIZA++ are merged prior to submission to the NG algorithm for calculating the similarities. This condition, as mentioned before, aims at maximizing variability in the source type sets. The sense selection criterion is set to MAX.

4. Intralanguage post-alignment merge with THRESH sense selection criterion: GLSYS-FR T, GLSYS-SP T, AMTR-AR T
Similar to Condition 3, but the sense selection criterion is set to THRESH as described in Section 3.7.7.

5. Pairwise and three-way interlanguage intersection at THRESH sense selection criterion: AR-FR I T, AR-SP I T, FR-SP I T, AR-FR-SP I T
In this set of conditions, we evaluate the results of intersecting the SALAAM tag sets resulting from evidence obtained from two and three target languages, respectively, with the THRESH sense selection criterion. This merge occurs after the application of the NG algorithm and the sense assignment to the nouns in the test set. In the intersection mode, only the senses that overlap for the commonly tagged noun instances are kept in the final tag set; unique noun instances for each language are also kept in the final tag set. In the pairwise conditions, we intersect the tag sets that result from Condition 4 for two languages at a time; therefore, AR-FR I T is the intersection of the tag sets resulting from conditions AMTR-AR T and GLSYS-FR T. In the three-way condition, AR-FR-SP I T, the same process is applied but with the tag sets resulting from all three languages in Condition 4.

6. Pairwise and three-way interlanguage union at THRESH sense selection criterion: AR-FR U T, AR-SP U T, FR-SP U T, AR-FR-SP U T
In this set of conditions, we evaluate the results of union merging the SALAAM tag sets resulting from evidence obtained from two and three target languages, respectively, with the THRESH sense selection criterion. This merge occurs after the application of the NG algorithm and the sense assignment to the nouns in the test set. In the union mode, all the senses that are assigned to the commonly tagged noun instances are kept in the final tag set; unique noun instances for each language are also kept in the final tag set.
In the pairwise conditions, we union merge the tag sets that result from Condition 4 for two languages at a time; therefore, AR-FR U T is the union of the tag sets resulting from conditions AMTR-AR T and GLSYS-FR T. In the three-way condition, AR-FR-SP U T, the same process is applied but with the tag sets resulting from all three languages in Condition 4.

3.7.9 Experimental Hypotheses

We have the following experimental hypotheses, corresponding to the experimental conditions above:

1. Hypothesis 1
SALAAM exploits translation evidence in the default Condition 1, yielding FM comparable to state-of-the-art unsupervised WSD systems.

2. Hypothesis 2
SALAAM applied in Condition 2 yields improved precision when compared to the default Condition 1, since it increases the variability in the source type sets; however, recall is comparable to the recall results obtained using the two pseudo-translations independently, since this condition, FR-GLSYS, has only half of each pseudo-translation.

3. Hypothesis 3
SALAAM applied in Condition 3 improves FM over SALAAM in the default Condition 1 because of the increase in variability in the source type sets.

4. Hypothesis 4
SALAAM applied in Condition 4 improves precision, P, over the performance of SALAAM applied in Condition 3, as the higher sense selection threshold, THRESH, weeds out senses for which NG yields a uniform confidence score distribution, indicating the lack of a bias in the similarity measure toward any of the senses involved in the sense similarity calculation; in short, it removes noise from the final tag set.

5. Hypothesis 5
SALAAM applied in Condition 5 significantly improves precision results over results obtained by SALAAM applied in Condition 4, as Condition 5 is an exclusive merging of evidence from two languages which are themselves merges of two pseudo-translations at the THRESH sense selection criterion. When the output tag sets are intersected, the precision increases as the tag set is further refined by the intersection process, but the recall decreases, as some valid senses might be weeded out if they are not shared across the tag sets of the two or more languages merged.

6. Hypothesis 6
Recall values are improved when SALAAM is applied in Condition 6 over the recall yielded by Condition 4, since it is the union of multiple tag sets pertaining to several languages, thereby including evidence from several languages which most likely cover different portions of the data. Accordingly, we expect evidence from Arabic combined with any other language to yield the better results; therefore, both the coverage and the performance of SALAAM as measured by FM improve when applied in Condition 6 over the performance of SALAAM in Condition 5, as the THRESH selection criterion removes the noise from the source type sets.

3.8 Results

In this section, we present the results of applying SALAAM in the different experimental conditions presented in Section 3.7.8 to evaluate English source language tagging of nouns in the test corpus SV2AW, evaluated against a hand tagged gold standard. The following subsections correspond to the different hypotheses described in Section 3.7.9.

3.8.1 Testing Hypothesis 1

We have six experimental conditions in the default Condition 1, corresponding to six parallel corpora.
Table 3.2 illustrates the results obtained by applying SALAAM in these default conditions, where the source of tagging evidence is a single pseudo-translated target language using a single MT system. The performance scores depicted in Table 3.2 are evaluated using the scorer2 software in the fine-grain mode.

Condition   P%     R%     COV%    FM
FR-GL       58.1   50.9   87.62   54.26
FR-SYS      58     49     84.43   53.12
SP-GL       57.9   48.6   83.86   52.84
SP-SYS      60     51.5   85.93   55.43
AR-TR       58.3   51.4   88.27   54.63
AR-AM       57.5   49.3   85.74   53.09

Table 3.2: SALAAM performance results on English source SV2AW test data in the default conditions

Figure 3.12 illustrates the relative performance of SALAAM against state-of-the-art WSD systems on the same task of sense tagging nouns (the nouns are isolated from the submitted tag sets of the different systems and evaluated using scorer2 in the fine-grain mode). All these WSD systems participated in the SENSEVAL 2 English All Words task. The X-axis is the precision percentage and the Y-axis is the recall percentage. In Figure 3.12, supervised systems are presented as gray filled diamonds, unsupervised systems as hollow triangles, gray squares are partially supervised systems, and black triangles are the results obtained by SALAAM. (The classification into supervised, unsupervised, and partially supervised systems is based on the descriptions of the respective systems published in the SENSEVAL workshop proceedings and further confirmed through personal communication with the authors.) Unsupervised systems do not rely on annotated data in the process of sense tagging. Supervised systems depend directly on WN17pre sense-tagged training data. Partially supervised systems rely on hand-annotated data from other sources such as the DSO corpus, or they back off to the most frequent sense in WN17pre in case the WSD system could not make a guess. (This is considered partially supervised because the frequency information in WN17pre is based on SemCor, which comprises 200k words of hand annotated sense-tagged running text from the Brown Corpus; moreover, there is no quantification of the number of cases where the respective system backs off to the most frequent sense.)

[Figure 3.12: SALAAM precision and recall results in the default conditions plotted against state-of-the-art WSD systems on the same test set SV2AW]

All the results obtained by SALAAM are comparable on both precision and recall, i.e., they do not significantly differ from one another according to the Z significance test at the 0.05 level. SALAAM's performance is comparable to that of state-of-the-art WSD systems. None of the unsupervised systems is better than any of the SALAAM conditions on both precision and recall simultaneously. In fact, all SALAAM default conditions are significantly higher than all the unsupervised systems on both precision and recall, except for one system which is significantly higher than the SALAAM conditions on recall. Three systems are better than SALAAM on P but significantly lower on R. It is also worth noting that the majority of the systems, including all supervised systems, have close to 100% coverage, while the highest coverage achieved by SALAAM is 88.27% in the AR-TR condition. This issue is further discussed in Section 3.9 of this chapter.
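The significance claims above, and throughout this chapter, use the two-proportion Z test introduced in Section 3.6.4. A minimal sketch of the standard pooled two-proportion test follows; the counts are illustrative, not the actual evaluation counts.

```python
# Two-tailed test of the difference between two proportions (pooled standard error).
from math import sqrt

def two_proportion_z(correct1, n1, correct2, n2):
    p1, p2 = correct1 / n1, correct2 / n2
    pooled = (correct1 + correct2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# e.g., comparing two hypothetical precision proportions over the 1071 gold standard nouns
z = two_proportion_z(623, 1071, 580, 1071)
print(round(z, 2), abs(z) > 1.96)   # 1.96 is the 95% two-tailed critical value
```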
The FM measure provides a principled way to view and compare the performance of SALAAM against the other WSD systems. Figure 3.13 shows the FM scores obtained by SALAAM and the systems participating in the SENSEVAL 2 All Words task for English. The six SALAAM default conditions are depicted as the black bars in the figure; solid gray bars are supervised systems; the checkered bars are partially supervised systems; and the hollow bars are unsupervised systems.

[Figure 3.13: SALAAM F-Measure results in the default conditions measured against state-of-the-art WSD systems on test set SV2AW]

As illustrated by the graph, SALAAM achieves the highest FM compared against the other unsupervised systems; moreover, SALAAM rivals both the partially supervised and supervised systems, with only three supervised and two partially supervised systems achieving significantly higher FM scores. Therefore, the evidence supports accepting Hypothesis 1.

3.8.2 Testing Hypothesis 2

This hypothesis tests the comparability of SALAAM's performance when the two MT systems for the target language are interleaved before alignment. Table 3.3 shows the results of condition FR-GLSYS compared against the results obtained by SALAAM in conditions FR-GL and FR-SYS.

Condition   P%     R%     COV%    FM
FR-GL       58.1   50.9   87.62   54.26
FR-SYS      58     49     84.43   53.12
FR-GLSYS    60.6   49.1   81.05   54.25

Table 3.3: SALAAM performance with pre-alignment French target pseudo-translation merge

As illustrated in Table 3.3, the precision obtained in condition FR-GLSYS is higher than that obtained in either single MT system condition; we note an increase of approximately 2.5%. Recall is at a mid point between the two recall measures for the single MT conditions FR-GL and FR-SYS. We note that the FM score is close to the high end of the range between the FM scores achieved by FR-GL and FR-SYS. The COV scores are lower in condition FR-GLSYS than in either default condition. The results lend positive support to Hypothesis 2.

3.8.3 Testing Hypothesis 3

We test the hypothesis that merging the source-token alignments before submission to the NG algorithm yields better recall results than the recall obtained by SALAAM when the evidence is from a single MT system per language. The merge of the two pseudo-translations occurs after GIZA++ alignment. The sense selection criterion is set to MAX. The intralanguage post-alignment conditions are GLSYS-FR M, GLSYS-SP M, and AMTR-AR M, for French, Spanish and Arabic, respectively. Table 3.4 illustrates the results obtained.

Condition    P%     R%     COV%    FM
AR-TR        58.3   51.4   88.27   54.63
AR-AM        57.5   49.3   85.74   53.09
AMTR-AR M    59.1   55.4   93.71   57.19
FR-GL        58.1   50.9   87.62   54.26
FR-SYS       58     49     84.43   53.12
GLSYS-FR M   59.4   54.5   91.74   56.84
SP-GL        57.9   48.6   83.86   52.84
SP-SYS       60     51.5   85.93   55.43
GLSYS-SP M   59.8   53.3   89.21   56.36

Table 3.4: SALAAM performance in the default conditions vs. the intralanguage post-alignment merge conditions at the MAX sense selection criterion

The results of interest in Table 3.4 are those achieved by SALAAM in the three merge conditions, GLSYS-FR M, GLSYS-SP M and AMTR-AR M. These results are higher than those obtained by SALAAM in the default conditions. In general, we observe a slight, insignificant increase in precision; SALAAM in condition AMTR-AR M yields better precision results than those obtained by either AR-TR or AR-AM, and similarly for GLSYS-FR M, where we note increases of 1.3% and 1.4% over FR-GL and FR-SYS. As for recall, we observe a statistically significant improvement in the intralanguage conditions over the default conditions across the board. SALAAM achieves an improvement of 4-6% in condition AMTR-AR M over the default conditions AR-TR and AR-AM. SALAAM in GLSYS-FR M achieves an improvement of 3.4-4.5% over the individual French default conditions.
[Figure 3.14: SALAAM F-Measure results in the default conditions measured against the MAX intralanguage merge condition for the three languages: AR, FR, SP]

Notably, we see a significant improvement in coverage and FM scores. These improvements are expected, as the translation variability increases, leading to the creation of more source sets with multiple members. These results support Hypothesis 3. Figure 3.14 illustrates the significant improvement in FM scores from the single MT default conditions to the intralanguage merge condition with the sense selection criterion set to MAX. Each cluster of columns pertains to a language; the black column is the intralanguage Condition 3.

We observe that the precision score for GLSYS-FR M is slightly lower than the precision for FR-GLSYS; however, this is accompanied by a significant improvement in recall from FR-GLSYS to GLSYS-FR M, which leads to a statistically significant improvement in both FM score and COV for the latter condition.

3.8.4 Testing Hypothesis 4

We report results of evaluating SALAAM in Condition 4, an intralanguage condition where the sense selection criterion is set to THRESH, against results obtained from evaluating SALAAM in Condition 3, which is also an intralanguage condition but with the sense selection criterion set to MAX. Table 3.5 illustrates the results obtained.

Condition    P%     R%     COV%    FM
AMTR-AR T    64.5   53     82.08   58.19
AMTR-AR M    59.1   55.4   93.71   57.19
GLSYS-FR T   65.6   52.1   79.46   58.08
GLSYS-FR M   59.4   54.5   91.74   56.84
GLSYS-SP T   65.7   50     76.28   56.78
GLSYS-SP M   59.8   53.3   89.21   56.36

Table 3.5: SALAAM performance in the intralanguage post-alignment merge conditions with sense selection criterion MAX vs. THRESH

The precision scores obtained by SALAAM in Condition 4, where the sense selection criterion is set to THRESH, are statistically significantly higher than those obtained by SALAAM with the MAX selection criterion. We observe a significant improvement in precision of 5.4% from AMTR-AR M to AMTR-AR T. Similar behavior is observed for the other language conditions. We notice a drop in recall, which is expected since the THRESH sense selection criterion removes some valid candidates in the noise removal process; however, the drop is not statistically significant. Furthermore, we observe an expected drop in coverage but an increase in the FM score. The results support Hypothesis 4.

3.8.5 Testing Hypothesis 5

Hypothesis 5 states that precision results obtained from applying SALAAM to the test data using evidence obtained from several languages, where the tag sets are intersected, are better than precision results based on evidence pertaining to a single language. We evaluate this hypothesis by comparing results obtained by SALAAM in Condition 5 against results obtained in Condition 4, the intralanguage merge condition. Table 3.6 illustrates the results, where the sense selection criterion is set to THRESH. The first three rows in Table 3.6 are the results obtained by SALAAM in Condition 4; the last four rows are the results obtained by SALAAM in Condition 5. As illustrated by the table, the precision increases across the board. For instance, considering condition AR-FR I T, we note that its precision, at 66.5%, is higher than that of AMTR-AR T at 64.5% and GLSYS-FR T at 65.6%.
Similarly for all the other conditions: even the three-way condition AR-FR-SP I T, at a precision of 66.6%, is higher than the precision yielded by AMTR-AR T, GLSYS-FR T, and GLSYS-SP T.

Condition      P%     R%     COV%    FM
AMTR-AR T      64.5   53     82.08   58.19
GLSYS-FR T     65.6   52.1   79.46   58.08
GLSYS-SP T     65.7   50     76.28   56.78
AR-FR I T      66.5   50.5   75.89   57.41
AR-SP I T      66.9   48.4   72.42   56.17
FR-SP I T      67.6   49.6   73.26   57.22
AR-FR-SP I T   66.6   48.2   72.42   55.93

Table 3.6: SALAAM performance for Condition 4, where evidence is obtained from the monolingual intralanguage pseudo-translation merge, vs. evidence obtained from the interlanguage pseudo-translation intersection merge in Condition 5

It is worth noting that the highest precision obtained is from the combination of evidence from French and Spanish, which is explainable by the proximity between the two languages. Owing to the exclusive nature of the intersection, we note the expected decrease in recall and coverage; moreover, we observe a decrease in FM scores for the multilingual conditions. The results in this section support Hypothesis 5.

3.8.6 Testing Hypothesis 6

Hypothesis 6 states that recall results obtained from applying SALAAM to the test data using evidence obtained from several languages, where the tag sets are union merged, are better than recall results based on evidence pertaining to a single language. We evaluate this hypothesis by comparing results obtained by SALAAM in Condition 6 against results obtained in Condition 4, the intralanguage merge condition. Table 3.7 illustrates the results, where the sense selection criterion is set to THRESH.

Condition      P%     R%     COV%    FM
AMTR-AR T      64.5   53     82.08   58.19
GLSYS-FR T     65.6   52.1   79.46   58.08
GLSYS-SP T     65.7   50     76.28   56.78
AR-FR U T      61.6   56.4   91.46   58.89
AR-SP U T      61.8   55.3   89.59   58.37
FR-SP U T      62.3   53.2   85.37   57.39
AR-FR-SP U T   60.2   55.6   92.31   57.81

Table 3.7: SALAAM performance for Condition 4, where evidence is obtained from the monolingual intralanguage pseudo-translation merge, vs. evidence obtained from the interlanguage pseudo-translation union merge in Condition 6

The first three rows in Table 3.7 are the results obtained by SALAAM in Condition 4; the last four rows are the results obtained by SALAAM in Condition 6. As illustrated by the table, we note an increase in recall and coverage from the Condition 4 cases to the Condition 6 cases. For illustration, condition AR-FR U T yields a recall score of 56.4% in comparison to the scores of 53% for AMTR-AR T and 52.1% for condition GLSYS-FR T. The increase is expected since the union is an inclusive merge. It is worth noting that the highest recall and coverage involve Arabic, both in the monolingual condition AMTR-AR T, with a recall of 53%, a coverage of 82.08% and an FM score of 58.19, and in the interlanguage merge conditions, where the highest scores are yielded by the conditions that involve Arabic: AR-FR U T, AR-FR-SP U T and AR-SP U T. We also note the drop in precision across the board for all the interlanguage union conditions compared to the monolingual conditions.

3.8.7 Overall results

We summarize the best results obtained from the different conditions in this SALAAM evaluation. The highest precision obtained is from the interlanguage intersection condition, at 67.6% for FR-SP I T. The highest recall obtained is 56.4%, in the interlanguage union merge condition AR-FR U T. The highest coverage score obtained is 92.31%, yielded by SALAAM in condition AR-FR-SP U T. The highest FM score is 58.89%, obtained in condition AR-FR U T.
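The pairwise and three-way merges evaluated in Conditions 5 and 6 amount to intersecting or unioning the per-language tag sets for each noun instance, while keeping instances tagged by only one language. A minimal illustrative sketch follows; the instance identifiers and sense labels are hypothetical, and this is not SALAAM's actual bookkeeping.

```python
# Merge per-language tag sets per noun instance, by intersection or union.
def merge_tag_sets(per_language_tags, mode="union"):
    """per_language_tags: list of {instance_id: set_of_senses} dicts, one per language."""
    merged = {}
    for tags in per_language_tags:
        for instance, senses in tags.items():
            if instance not in merged:
                merged[instance] = set(senses)   # instances seen by only one language keep their tags
            elif mode == "union":
                merged[instance] |= senses
            else:                                # intersection; may become empty if the languages disagree
                merged[instance] &= senses
    return merged

arabic = {"art.n#12": {"art%1"}, "church.n#3": {"church%2"}}
french = {"art.n#12": {"art%1", "art%2"}}
print(merge_tag_sets([arabic, french], "intersection"))  # art.n#12 keeps only art%1
print(merge_tag_sets([arabic, french], "union"))         # art.n#12 keeps art%1 and art%2
```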
Among the monolingual conditions, the highest performance is yielded by AMTR-AR T, which achieves an FM of 58.19, with a precision of 64.5%, a recall of 53% and a coverage of 82.08%.

Similar to Figure 3.12, Figure 3.15 plots the FM performance of SALAAM in the conditions that yield the highest scores against state-of-the-art WSD systems.

[Figure 3.15: SALAAM F-Measure results on the SV2AW test set in the highest yielding conditions depicted against state-of-the-art WSD systems]

The black columns in Figure 3.15 show the best FM scores obtained by SALAAM in conditions AR-FR U T, AR-SP U T and AMTR-AR T. As illustrated in the graph, SALAAM outperforms all of the unsupervised methods and is on par with the partially supervised and some of the supervised methods.

3.9 Discussion

3.9.1 Summary of the Results

We have established that SALAAM, using translational data as a source of evidence, is a very successful approach to WSD. It is worth stressing the novelty of the approach, whose source of evidence is orthogonal to the traditional sources of evidence used in the field; SALAAM is radically different from any of the other systems in this evaluation. Results obtained in the default conditions are comparable to, and in most cases better than, those obtained by state-of-the-art unsupervised methods for WSD. Moreover, merging MT systems for obtaining pseudo-translations of the same language yields even better results than the utilization of a single MT system. Precision results obtained from all the conditions that are set to the THRESH sense selection criterion are better than those obtained when the sense selection criterion is set to MAX. Intersecting the output of two target languages for SALAAM yields better precision scores than SALAAM applied using a single target language. Union merging the output of two target languages for SALAAM yields better recall values as well as significantly higher coverage of the data. The final paragraph of the results in Section 3.8 illustrates that the best results obtained by SALAAM rival a number of supervised methods and are far superior to other unsupervised methods.

The results validate the overall general hypothesis that translations aid in resolving sense ambiguity. These results are very interesting, especially given that the target translations are MT based, which are orders of magnitude worse than human translations. However, MT has the advantage of rendering relatively good quality alignments, owing mainly to the consistency of translation and fidelity to source language word ordering.

3.9.2 Analysis of Results

The results in general are consistent with our hypotheses. There are several issues worth drawing attention to before qualitatively assessing the merits of the results. The first issue worth consideration is that of language. As noted earlier, the highest precision results are yielded by the intersection merge of French and Spanish, while the highest coverage and recall are yielded by the union merge of either French or Spanish with Arabic. The first result is explainable by the shared ambiguity across French and Spanish, indicating that they target the same portion of the English data; therefore, the common noun instances that are tagged by each language independently reinforce each other. On the other hand, Arabic most likely overlaps less with either French or Spanish, leading to less reinforcement of evidence, but when we look at recall, we see a boost in that direction for exactly that reason.
The unique noun instances in the tag sets pertaining to the merge of Arabic with Spanish or French are more numerous than those found in the intersection merge of Spanish and French. This observation may also help explain the lower precision yielded in condition AR-FR-SP I T, due to the addition of Arabic, which is more distant from Spanish and French. Therefore, utilizing both a distant and a near language in the sense tagging of data with SALAAM is very beneficial.

The second issue is that of the calibration between precision and recall. We see how sensitive these two measures are and how difficult it is to achieve a significant improvement on both simultaneously. This observation is supported by the results obtained in Conditions 5 and 6, where the intersection or union of tag sets pertaining to two different languages is used as a source of evidence. However, SALAAM is able to achieve an improvement on both measures when we compare the default conditions with the intralanguage merge conditions set to the THRESH sense selection criterion: in condition AR-AM, SALAAM achieves a precision of 57.5% and a recall of 49.3%, while AMTR-AR T yields 64.5% precision and 53% recall, a statistically significant gain of 7% in precision and 3.7% in recall. Despite this encouraging result, there are endemic problems that are impediments to achieving the best possible precision and recall scores. In the following sections, we analyze some of these issues.

3.9.3 Precision

Inspecting the source type sets, we notice that they comprise many outliers. The outliers exist mainly due to noisy alignment, noisy translations, or both. The problem is aggravated when the outliers are monosemous: a monosemous word will score a confidence level of 1.0 by default according to the NG distance measure, thus biasing the sense tag assignment for the other source set words. If the monosemous word is the wrong word in that set, then the bias is detrimental to the sense choice for the other words in the set. For example, the source word types ADOLESCENCE, IDOL, TEEN, TEENAGER form a source type set for the French target word ADOLESCENCE. Obviously IDOL is an outlier: even though it is related to the other words in the source set (related and similar are different notions: car and tire are related, but car and automobile are similar [67]), it will have a negative impact on the sense assignment of the other members in the set. Another source of outliers is distant source words that may align with the same target word. For example, AMORCE in French aligns with INITIATION, BAIT, CAP, which are all correct translations of the French word, but they are distant from one another, as AMORCE is a polysemous word in French. Thus the source type set does not comprise a homogeneous set of words, leading to noise in the tagging process that results in wrong sense assignments.

These problems are mainly a reflection of SALAAM's implicit simplifying assumption that words in the target language are monosemous by default. Examining the source sets, we observe that this assumption is clearly false. For example, in source sets such as CANON: CANNON, CANNONBALL, CANON, THEOLOGIAN; BANDES: BAND, GANG, MOB, STRIP, STREAK, TAPE; and BAIE: BAY, BERRY, COVE, we can clearly find narrower subsets. Accordingly, the source set corresponding to CANON is split into {CANNON, CANNONBALL} and {CANON, THEOLOGIAN}.
Likewise, for BANDES the source set can be split into two subsets: {BAND, GANG, MOB} and {BAND, STRIP, STREAK, TAPE}. These subsets reflect the homonymy of the French words. The presence of such sub-clusters in the source set, resulting from homonymy, has a definite negative impact on the quality of the sense tagging. One way to resolve this problem is to gather distributional features for the source data and apply automatic clustering techniques in order to distinguish coarse level word distinctions in the source sets [17, 75]. Once sub-clustering is applied, coherent source type sets are discovered and, simultaneously, the process discovers, in an automated unsupervised manner, the number of homonymous senses for polysemous words in the target corpus (a small illustrative sketch of this idea is given at the end of the Recall discussion below). In order to verify this hypothesis, we randomly pick target words that have good and coherent source sets (judged by visual inspection) and evaluate them using the scorer2 software. We note a significant improvement in precision scores. Below is an example of target words and their corresponding coherent source type sets.

ABSURDITÉ: ABSURDITY, FARCICAL, NONSENSE
ACCIDENT: ACCIDENT, CRASH, WRECK
ACCUSATION: ACCUSATION, FRAMING, INDICTMENT
ADVERSAIRES: ANTAGONISTS, OPPONENTS, CONTESTANTS
ACCOMPLISSEMENTS: ACCOMPLISHMENT, ACHIEVEMENT, ATTAINMENT, COMPLETION

3.9.4 Recall

There are several issues that affect recall negatively. Owing to memory limitations, GIZA++ sets a cap on the maximum possible length allowed for a sentence. Accordingly, 0.5% of the sentences in the test corpus are excluded. This may be fixed in the future by breaking longer sentences into sub-sentences, or simply by increasing the memory of the machines in use.

The second factor that affects recall is cross-linguistic lexicalization divergence. The approach as described is limited to unit-level cross-linguistic lexicalization alignments. In some cases, a source word is not lexicalized as a unit in the target language, creating a one-to-many alignment relation in the target corpus, which is not handled by SALAAM. For instance, the English word implementation is translated into French as mise en oeuvre, but since SALAAM does not handle compounds at this stage, a word such as implementation is left untagged.

Roughly 33% of the target nouns are translated into the same source word throughout the corpus. There are several possible reasons for this. One is that the target language word preserves the ambiguity of the source word: for instance, the French word intérêt, which is a translation of interest, is ambiguous in the same way its English counterpart is. In other cases, the MT system simply does not have alternatives for the source language word, thereby rendering the same target word for the same source word throughout the corpus. For instance, the source word priest is translated as prêtre in French, and vicar is translated as curé. Both are correct translations; however, the end result is two singleton source type sets, one for prêtre and the other for curé, and singleton sets cannot be tagged by SALAAM. Translating both vicars and priests as curés would solve this problem. Singleton source sets lead to the exclusion of source words from the tagging process. One solution is to introduce more variability in the corpus genre, leading to more variability in the translation, especially if the genre of the test corpus is diversified.
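Returning to the sub-clustering idea raised in the precision discussion above, here is a minimal sketch of how an incoherent source type set might be split into coherent subsets before sense assignment, assuming some pairwise similarity function over source words. The toy similarity table and the 0.5 cutoff are pure illustration, and unlike the BANDES example in the text, this simple greedy grouping cannot place a word such as BAND in more than one subset.

```python
# Greedy single-link grouping of a source type set into coherent subsets.
def split_source_set(words, similar, cutoff=0.5):
    clusters = []
    for w in words:
        for cluster in clusters:
            if any(similar(w, other) >= cutoff for other in cluster):
                cluster.append(w)
                break
        else:
            clusters.append([w])
    return clusters

# Toy similarity table standing in for a distributional or taxonomy-based measure.
toy = {frozenset(p): 0.9 for p in [("band", "gang"), ("gang", "mob"),
                                   ("strip", "streak"), ("strip", "tape")]}
sim = lambda a, b: toy.get(frozenset((a, b)), 0.1)
print(split_source_set(["band", "gang", "mob", "strip", "streak", "tape"], sim))
# [['band', 'gang', 'mob'], ['strip', 'streak', 'tape']]
```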
3.9.5 Coverage

The maximum coverage yielded by any of the SALAAM conditions is 92.31%. This coverage figure is expected, given that SALAAM at this stage only processes unit-sized entities. In an earlier study, Ide [32] concludes that only 86.6% of the single lexical units in the novel Nineteen Eighty-Four correspond to single lexical units in translation, when looking for correspondents of words in five different languages pertaining to four different language families. This finding places an upper bound on the coverage achievable by a system such as SALAAM.

3.9.6 Complementarity with Other WSD Systems

We have repeatedly stressed the radical difference between SALAAM and other state-of-the-art WSD systems. SALAAM relies on an orthogonal source of evidence for its bias toward a specific sense assignment. This leads us to believe that there is qualitative and quantitative evidence for SALAAM's complementarity to other systems. Upon analyzing the data, we find that, indeed, there are some crucial and interesting differences. For instance, SALAAM applied in the GLSYS-FR and/or GLSYS-SP conditions correctly tags over 14% of the cases of polysemous words in the test set that could not be correctly tagged by any of the other systems, whether supervised or unsupervised. SALAAM is able to correctly sense tag over 49% of the data that none of the unsupervised systems tags correctly. This indicates that SALAAM can definitely complement a traditional monolingual WSD approach to achieve even better results.

3.9.7 Evaluation of Target Language Tagging

As a product of this evaluation, we obtain sense-tagged target data in addition to the sense-tagged source data. At this stage, we do not evaluate the projected tagging quantitatively (see Chapter 4 for a thorough evaluation of the target sense tagging), but the following example attempts to give a feel for the sense-annotation quality. We illustrate with a sentence randomly chosen from the test corpus SV2AW, with the actual annotations produced by SALAAM. The first line shows the English sentence with the sense-tagged nouns, and the second shows the corresponding French sentence as translated by SYS.

[English sentence d01.s57 from SV2AW, with SALAAM sense tags subscripted on the nouns]

[FR-SYS sentence d01.s57 from SV2AW, with the projected sense tags subscripted on the nouns]

Sense tags marked with an asterisk in the annotated French translation are incorrect. The tagging for deux and points is incorrect because they are mistranslations: deux points is the translation of the punctuation sense of colon in English; it should have been translated as colon (the organ) in French. A different problem occurs with poumon (lung), which is tagged with the disease sense (as a part of lung cancer) instead of as an organ. The case of Etats-Unis illustrates an instance of mis-tokenization, where a single lexical item is broken into three tokens, yet the tag is correct because each individual token in (etats, , unis) is aligned with U.S.

3.10 Summary

In this chapter, we present a novel unsupervised method and system, SALAAM, for WST. The method achieves very competitive levels of precision and recall when evaluated against other unsupervised systems on the same test set. SALAAM is novel in its extension of the notion of context to a multilingual dimension.
It is complementary to other state-of-the-art systems, as it targets contexts that are typically not amenable to traditional approaches exploiting monolingual contexts. In Chapter 4, we show that this method may be used for bootstrapping sense inventories for a language with scarce resources.

Chapter 4

Extensions to SALAAM

4.1 Introduction

This chapter explores several extensions to the SALAAM system (see Chapter 3). It comprises three sections. In the first section, Section 4.2, motivated by the lack of machine translation systems for most languages, we investigate the impact of applying SALAAM to naturally-occurring parallel corpora of genres unrelated to the genre of the test corpus. In Section 4.3, we explore bootstrapping the tagging process for a target language and evaluate the quality of the projected word sense tagging of both Arabic and Spanish. Finally, in Section 4.4, we discuss the feasibility of bootstrapping a WordNet style ontology for Arabic based on SALAAM data.

4.2 Using Human Translations: Naturally-Occurring Parallel Corpora

4.2.1 Introduction

In Chapter 3, we empirically demonstrate that applying SALAAM to parallel corpora is a promising novel approach to WSD; the results obtained are significantly higher than those of other state-of-the-art unsupervised WSD systems, while rivaling some supervised systems when evaluated on the same data set, SV2AW. In this section, we investigate the application of SALAAM to naturally-occurring parallel corpora as opposed to machine translated (pseudo-translated) parallel corpora. The question is how robust SALAAM is as an approach given a naturally-occurring, domain specific parallel corpus of a genre that is unrelated to the test corpus genre. Accordingly, we examine the impact of corpus genre on SALAAM's performance.

4.2.2 Motivation

In order for SALAAM to work, the need arises for variability in translation contexts, which produces rich source type sets (see Chapter 3, Section 3.6). The belief is that such variability in translation contexts should be obtained from naturally-occurring balanced parallel corpora, which unfortunately do not exist. To date, mining such corpora from the web is still a promise that has not materialized, mainly owing to copyright issues. Moreover, there are no naturally-occurring parallel corpora that are sense tagged. Hence, in Chapter 3, we rely on pseudo-translations as an approximation. But the fact is, the majority of languages do not have machine translation systems. Nonetheless, many languages do have domain specific texts in translation; for instance, the Bible exists in over 2000 languages. Therefore, in this section, we investigate applying SALAAM to naturally-occurring genre specific corpora, in lieu of the pseudo-translations of Chapter 3, for the augmentation of the same test corpus SV2AW. In effect, we are measuring the robustness of the SALAAM approach when using corpora that are incongruent with the test corpus and do not possess the expected level of variability in translation contexts.

4.2.3 Hypotheses

We have two general hypotheses:

1. Augmenting SV2AW with naturally-occurring genre specific parallel corpora while applying SALAAM yields precision results comparable to those obtained by augmenting SV2AW using pseudo-translation corpora.
Naturally-occurring parallel corpora have more translation variability than pseudo-translated corpora, as the translation process is subject to the creativity of the human translator.
Even when naturally-occurring parallel corpora are not balanced, we expect the reduction in translation variability to be moderate enough to still allow the formation of source type sets comparably variable to those produced via pseudo-translations.

2. Corpus genre has a significant impact on SALAAM recall results.
Polysemous words, if used in a genre specific corpus, will be biased toward specific senses. The absence of domain specific knowledge of the senses is a problem, which is escalated if this genre is not of the same type as that of the test corpus. Furthermore, if the genre is narrow, for example religious or political to the exclusion of other genres, it tends to be consistent in the translation of its terms, therefore decreasing the variability in translation contexts. (If the corpus genre is narrow yet of the same type as that of the test corpus, and the translator(s) use variable ways of expressing ideas, then this is a favorable condition for SALAAM. In principle, SALAAM is expected to perform well if the augmenting corpus used is balanced, regardless of the genre of the test corpus, whether balanced or not, as this condition creates variable source type sets; or, alternatively, if the augmenting corpus is of the same genre as the test corpus with variable translation contexts.) This decreased variability leads to the creation of singleton source type sets (see Chapter 3, Section 3.6) that SALAAM cannot tag, hence significantly affecting recall.

4.2.4 Evaluation

The evaluation metrics and significance testing used here are the same as those used for SALAAM in Chapter 3: Precision (P), Recall (R), F-Measure (FM) and Coverage (COV). The tokenization tools, stochastic alignment software, ontology, gold standard and test set are also the same as those used in Chapter 3. We report the evaluation of applying SALAAM using two sets of corpora for augmenting the same test corpus, SV2AW, as described in Chapter 3, Section 3.7.1: naturally-occurring parallel corpora, which are human translation (HT) corpora; and both the HT corpora and the pseudo-translated (MT) corpora used in Chapter 3. In the process, we explore the impact of pruning the alignments in the HT conditions using a bilingual dictionary. All the evaluations are on English-Spanish parallel corpora.

Corpora

In addition to the MT corpora described in Chapter 3, we have two HT corpora. Moreover, we describe the test corpus, SV2AW, here again for convenience. Table 4.1 indicates the relative sizes of the different corpora used. Here follows a description of the three parallel corpora:

The Bible (BIB)
This corpus comprises the Old and New Testaments. The English version is the NIV Bible, written in modern English and last updated in 1901 (http://www.sni.net/mpj/WEB/index.htm). The Spanish Bible is written in modern Spanish (http://www.mit.edu/afs/athena.mit.edu/activity/c/csa/www/documents/Spanish). BIB has approximately 820K tokens per side. BIB is religious text that is aligned at the verse level [73].

Proceedings of the United Nations 1989-1990 (UN)
The UN text is written in modern day English and Spanish. The portions used in this evaluation specifically date back to the years 1989 and 1990. It is a corpus of political and economic genre. The corpus is semi-automatically sentence aligned (thanks to Clara Cabezas, a bilingual native Spanish speaker). The resulting corpus has approximately 1.7 million words per language side.

The SENSEVAL 2 All Words corpus (SV2AW)
This is the test corpus. SV2AW has three articles from The Wall Street Journal.
The articles discuss culture, medicine and education, respectively. Each side has close to 6000 tokens. In this set of experiments, SV2AW is pseudo-translated into Spanish using both the GL and SYS machine translation systems in the intralanguage post-alignment merge condition; the idea is to maximize the translation variability of the test corpus, as this was established to yield a significant improvement relative to a single translation system (see Section 3.7.8, Condition 3). The English-Spanish parallel tokens are aligned and merged before submission to the NG algorithm for sense assignment.

Corpora    Lines     Tokens
BIB        30427     829031
UN         71672     1734001
SV2AW      242       5815
Total      102341    2568847

Table 4.1: Relative sizes of the English side of corpora used in the HT evaluations

Parameters

Similar to the SALAAM evaluation using MT, we set the alignment software parameters at 70 tokens per sentence. The sense selection criterion is set to MAX (see Section 3.7.3) for all the evaluations in this section.

Conditions

We explore the following experimental conditions:

1. SV2AW alone (SV2AW)
This condition aims at viewing SALAAM's raw results on the test set alone, with no augmenting corpora. We consider this condition the baseline condition; it sets an upper bound on precision and a lower bound on recall at the MAX sense selection criterion. (Ideally, the true ceiling value for precision could be found given human translations and perfect alignment.)

2. BIB with SV2AW (BIB+SV2AW)
This condition examines the results of augmenting the SV2AW test corpus with the Bible corpus, BIB.

3. UN with SV2AW (UN+SV2AW)
This condition explores the results of augmenting the SV2AW test corpus with the United Nations corpus, UN.

4. Fixed UN and SV2AW (Fixed UN+SV2AW)
Upon inspecting the token alignment quality of the HT corpora, we find severe problems due to the inconsistency in sentence length from English to Spanish, but also owing to naturally-occurring divergences in syntactic and semantic expression cross-linguistically. Therefore, in this condition, the UN alignments are fixed with some linguistically motivated rules. The rules are heuristics for basic category swapping; it is observed, for instance, that the alignment software consistently swaps adjectives and nouns in Spanish. The correction rules rely on the POS tagging on the English side and are applied in a specific order. If an English word is mapped to two Spanish words and the English word following it is mapped to the NULL token, then the first Spanish word is assigned to the following English word, rendering the alignment one-to-one in this case. If there are three Spanish words aligned with three English nouns in a row, each English noun and its Spanish alignment are checked to see if they share some prefix; if not, then the first and the third Spanish words are switched. Spanish translations of English adjectives followed by nouns in English (on the Spanish side, indicating that the English words are left untranslated by the MT system) are swapped.

5. Pruned UN and SV2AW (Pruned UN+SV2AW)
An alternative method for fixing the alignments is to use a bilingual dictionary to prune the translations. We use a generic bilingual English-Spanish dictionary which comprises 90K entries. The alignment pairs are filtered so that those that do not occur in the dictionary are removed.
Pruned and Fixed UN and SV2AW (Pruned Fixed UN+SV2AW) In this case, the UN corpus alignments are fixed according to the correction rules in Condition 4 and then pruned according to Condition 5.

7. UN, MT, and SV2AW (UN+MT+SV2AW) For this condition, we merge the UN alignments with those of the pseudo-translated (MT) corpora used in Chapter 3, comprising the Brown Corpus, SENSEVAL1 corpus, SENSEVAL2 Lexical Sample and The Wall Street Journal corpora, in addition to the test corpus SV2AW. Similar to the test corpus for this evaluation, the pseudo-translated corpora are in the post-alignment intralanguage merge condition (see Section 3.7.8, Condition 3).

8. Fixed UN, MT and SV2AW (Fixed UN+MT+SV2AW) Similar to Condition 7, but the UN alignment portion is fixed with the correction rules described in Condition 4.

9. UN, BIB, MT, and SV2AW (UN+BIB+MT+SV2AW) Similar to Condition 7, but the BIB corpus is added to the augmented parallel corpora.

10. Fixed UN, Fixed Bible, MT and SV2AW (Fixed UN+Fixed BIB+MT+SV2AW) Similar to Condition 9, but both UN and BIB corpora alignments are fixed based on the rules described in Condition 4.

Results

Table 4.2 demonstrates the results obtained by SALAAM where the test corpus, SV2AW, is augmented by HT parallel corpora. We include the results for condition GLSYS-SP (see Chapter 3, Section 3.7.8, Condition 3) for comparison of the HT results against an MT result. GLSYS-SP is chosen because it is the closest approximation to a human translation using machine translation systems; moreover, it yields the best results for the Spanish data with the MAX selection criterion. The SV2AW condition illustrates the upper bound on precision and the lower bound on recall. (Footnote 6: Due to the very small size of this test corpus, the SV2AW token alignments are obtained from aligning the entire MT corpus as described in Chapter 3.) Both SV2AW and GLSYS-SP are pseudo-translated conditions, MT, hence the bold typeface in the table. As expected, the results obtained in the SV2AW condition alone yield the highest precision and the lowest recall across all the different conditions (in fact, by comparison to all SALAAM conditions, even those of Chapter 3).

Conditions               P      R      COV     FM
GLSYS-SP                 59.8   53.3   89.21   56.36
SV2AW                    69.6   24.3   34.9    36.02
UN+SV2AW                 57.5   44.5   77.31   50.17
BIB+SV2AW                56.5   36.6   64.82   44.42
Fixed UN+SV2AW           58.6   44.8   76.45   50.78
Pruned UN+SV2AW          59.1   32.9   55.72   42.27
Pruned Fixed UN+SV2AW    59     33     55.91   42.33

Table 4.2: SALAAM Results on SV2AW for MT & HT parallel corpora

The overall results independently indicate that the use of other corpora, whether pseudo-translated or naturally-occurring parallel corpora, plays a significant role in improving the recall values while adding significant noise and thereby reducing precision. Precision for all HT conditions does not differ significantly from precision of the MT condition GLSYS-SP; according to the Zscore statistical significance test, the conditions are the same with 95% confidence. All HT experimental conditions yield lower FM results when compared with the MT experimental condition GLSYS-SP, and they do so at markedly, statistically significantly, lower coverage scores. Recall for all HT conditions is significantly lower than recall for GLSYS-SP but at the same time significantly higher than that of condition SV2AW; all HT conditions at least double the coverage achieved by condition SV2AW. Results obtained by conditions that use the UN corpus are better than those obtained using the BIB corpus.
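Operationally, the dictionary-based pruning used in the Pruned conditions amounts to a simple filter over alignment pairs. The following is a minimal sketch, assuming the 90K-entry dictionary has been loaded as a set of (English, Spanish) pairs; the function name and the toy data are illustrative and not part of the actual implementation.

    def prune_alignments(alignments, bilingual_dict):
        """Keep only the alignment pairs that are attested in the bilingual dictionary."""
        return [(en, es) for (en, es) in alignments
                if (en.lower(), es.lower()) in bilingual_dict]

    # Illustrative data: one noisy alignment pair is filtered out.
    alignments = [("bank", "banco"), ("bank", "ribera"), ("interest", "otros")]
    bilingual_dict = {("bank", "banco"), ("bank", "ribera"), ("interest", "interes")}
    print(prune_alignments(alignments, bilingual_dict))
    # [('bank', 'banco'), ('bank', 'ribera')]

Such a filter can only remove pairs, which is why pruning trades recall for precision in the results below.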
Precision and recall obtained by condition Fixed UN+SV2AW are slightly higher than the UN+SV2AW condition; the improvement is minor, but it shows that fixing alignments is a step in the right direction. We see further improvement in precision for Pruned UN+SV2AW over Fixed UN+SV2AW, yet recall is significantly reduced.

Conditions                       P      R      COV     FM
GLSYS-SP                         59.8   53.3   89.21   56.36
UN+MT+SV2AW                      60     54.7   91.18   57.23
Fixed UN+MT+SV2AW                60.8   55.4   91.18   57.97
UN+BIB+MT+SV2AW                  60.1   55     91.46   57.44
Fixed UN+Fixed BIB+MT+SV2AW      60.5   55.4   91.46   57.84

Table 4.3: SALAAM results using both HT and MT for augmenting the test corpus

Table 4.3 illustrates the results of merging the pseudo-translated MT corpora with the HT corpora for augmenting the test corpus SV2AW. We observe a slight improvement in all the results relative to condition GLSYS-SP, yet none of the results yielded by the different experimental conditions is statistically significantly better than condition GLSYS-SP. We note the minor improvement associated with fixing the alignments; we see an increase of 0.8% in precision from UN+MT+SV2AW to Fixed UN+MT+SV2AW and an increase of 0.7% in recall, maintaining the same coverage level; similarly, we note an increase of 0.4% from condition UN+BIB+MT+SV2AW to condition Fixed UN+Fixed BIB+MT+SV2AW. Adding the BIB corpus to the mix seems to slightly improve recall and coverage; for instance, comparing UN+MT+SV2AW and UN+BIB+MT+SV2AW, we observe an increase of 0.3% in recall and 0.28% in coverage.

4.2.5 Discussion

As hypothesized, we obtain comparable precision scores when augmenting SALAAM with genre-specific naturally-occurring parallel corpora and pseudo-translated corpora. This illustrates the robustness of the SALAAM approach. As expected, there is a clear correlation between corpus genre and performance. Even though the difference between precision scores yielded by the conditions UN+SV2AW and BIB+SV2AW is not statistically significant, we observe a drop of 1% from augmenting the test corpus with the UN corpus to augmenting it with BIB. On the other hand, the drop in recall is statistically significant, with a drop of 8% from UN+SV2AW to BIB+SV2AW. The decrease is due to the relative distance between the corpora genres. Qualitatively, the language of the BIB corpus is stylized, which is very different from the language style used in the UN corpus or the test corpus. In fact, just by looking at the dates of the corpora, the UN corpus and the test corpus pertain to the late 20th century, while BIB is early 20th century. This is further supported by the unigram overlap between the BIB corpus and the test corpus of 944 tokens compared to the UN corpus unigram overlap of 1249 tokens, an increase of 25% in overlap between the UN corpus and the test corpus.

When the HT corpora are merged with the pseudo-translated corpora, we observe modest improvements in the different measures. As noted earlier, there is a very sensitive balance between precision and recall, which emerges clearly in all these experimental conditions. It is a challenge to improve on both measures simultaneously. We observe a promising improvement on all metrics in Table 4.3, but less than expected. We believe there are two reasons for this. The first endemic problem comes from the nature of the HT corpora utilized. There is no genre overlap between the HT corpora and the test corpus.
The test corpus has articles about education, medicine and culture; the UN corpus is mostly economic and political in nature; the BIB corpus is religious text. Not surprisingly, these corpora added too much noise to the source type sets. In contrast, the pseudo-translated corpora used in Chapter 3 included text from relevant genres. Secondly, the automatic token alignments of the HT parallel corpora are much worse by qualitative inspection than those obtained from pseudo-translations. This is an expected drawback. Human translation tends to be more creative: Often transla- tors express sentences in different lengths in different languages; such variations in length cause havoc for the token alignment software. Upon inspecting the HT English Spanish alignments, we find on average 30% of the tokens aligning with the NULL token, compared with 10% of the tokens in the pseudo-translations. An indication of 127 the promising impact an improvement in the alignment would yield is illustrated by the modest improvement in the results from raw alignments to fixed alignments ? al- beit with ad-hoc rules and heuristics ? as presented in Tables 4.2 and 4.3. It is worth noting that fixing automatic alignments is a vast research area which falls outside the scope of this thesis [52, 31]. As expected, pruning has a negative effect on recall; it eliminates many possible valid members from the source type sets which is probably due to lack of coverage or genre variation between the test corpus and the dictionary utilized. Nonetheless, it has a positive effect on precision. Looking at the flip side of these results, we believe there are two factors that aid the pseudo-translated version of these experiments. The first factor lies in the genre of the corpora utilized; they cover a myriad of different genres which overlap with the test corpus genre. The second factor is the fact that the pseudo-translations are very consistent translations that render better alignments relative to the HT parallel corpora token alignments. 4.2.6 Summary In this section, we establish SALAAM?s robustness given naturally-occurring parallel corpora of genre types that are completely unrelated to the test set, SV2AW. The results obtained show no significant difference in performance precision for SALAAM using pseudo-translations of relevant corpora genre versus utilizing unrelated genre corpora. We also note the degraded quality of alignments when using naturally-occurring par- 128 allel corpora relative the pseudo-translated parallel corpora. 4.3 Target Language Tagging Evaluation 4.3.1 Introduction In this section, we discuss the quality of the projected sense tags onto the target lan- guage words in SALAAM. We present two quantitative evaluations of the projected tagging on two target languages: Spanish and Arabic. The tagged Spanish target text is automatically evaluated against manually annotated Spanish test data. The tagged Arabic data is manually evaluated. This section is arranged as follows: section 4.3.2 presents the motivation behind evaluating target tagging; in section 4.3.3, we present the underlying hypothesis driving the projected sense tagging evaluation; Section 4.3.4 briefly describes the required resources; section 4.3.5 explores sense tagged target Ara- bic data; Section 4.3.6 illustrates quantitative evaluation of sense tagged target Spanish data. 
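Before turning to the individual evaluations, it is worth noting that the projection step itself is mechanically simple. The sketch below illustrates it under the assumption that source tokens already carry SALAAM sense tags and that token alignments are available as position pairs; all names and the example sense ID are illustrative, not the actual data structures used.

    def project_tags(source_tags, alignments):
        # source_tags: {source token position: set of sense IDs assigned by SALAAM}
        # alignments:  [(source position, target position), ...] token alignment pairs
        # returns:     {target token position: set of projected sense IDs}
        target_tags = {}
        for src_pos, tgt_pos in alignments:
            if src_pos in source_tags:
                target_tags.setdefault(tgt_pos, set()).update(source_tags[src_pos])
        return target_tags

    # The English noun at position 3 carries one sense tag; it projects onto the
    # aligned target token at position 5. The untagged token at position 4 projects nothing.
    print(project_tags({3: {"evening#1"}}, [(3, 5), (4, 6)]))   # {5: {'evening#1'}}

The quality of the projected tags therefore depends entirely on the quality of the source tagging and of the token alignments, which is precisely what the following evaluations measure.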
4.3.2 Motivation Given a lexicon and a trained lexicographer, sense tagging texts manually is the guar- anteed method of obtaining good quality sense-annotations for words in running text. However, the task is very tedious, expensive, and, by many standards, daunting to the people involved, even when all the required resources are available [25]. The prob- 129 lem becomes ever more challenging when dealing with a language with virtually no computerized knowledge resources or tools. To date, the only way to obtain sense- annotations in a language with scarce knowledge resources is to do the job manually which constitutes a serious impediment given the sheer number of natural languages in the world. SALAAM is investigated as a method for resolving this impeding bottleneck. SALAAM provides a bootstrapping method for sense tagging a language with scarce automatic linguistic knowledge resources. As a side effect of applying SALAAM to a parallel corpus and tagging the source side, we obtain a tagged target language corpus auto- matically, with no extra effort or cost. No target resources are required except for the actual parallel corpus and a simple tokenizer for the target language. The approach serves as an elegant solution to an age old problem and a series of bottlenecks for the acquisition of automated knowledge resources for scarce languages. 4.3.3 General Hypothesis The application of SALAAM to a parallel corpus should provide a good source for creating seed target language sense-annotations for languages with scarce knowl- edge resources. This general hypothesis is based on the premise that people share basic conceptual notions that are a consequence of shared human experience and perception regardless of the languages they speak. 130 The premise is supported by the fact that we have translations in the first place. People of different linguistic backgrounds are capable of communicating through other modalities. Apart from the empirical value of labelling data with their appropriate senses for computational systems, defining or quantifying senses, first and foremost, aims to make explicit these basic human notions of meaning. Basing the target sense tagging on a source language involves nothing more than capturing that very idea of shared meaning across languages and exploiting it as a bridge to explicitly define the senses in a target language. Therefore, SALAAM is in- troducing a bias based on a sound cognitive axiom that languages share basic elements of meaning. When SALAAM is used for tagging a source language, it cashes in on the variation in translation of polysemous words. The flip side of this view is aims at quantifying meaning commonality across two languages. 4.3.4 Required Resources The current evaluation does not require additional development resources or tools over those used for the evaluation of the SALAAM performance reported in Chapter 3 and Section 4.2 in this chapter. The same corpora utilized for evaluating SALAAM on a source language are used here for the evaluation of the projected sense tagging of the target language. For an evaluation of the target sense tagging, a target test set and target gold standard are identified. 131 4.3.5 Projected Sense Tagging on Arabic Data Introduction In this section, we examine the quality of the projected tags onto Arabic target data. No WordNet ontology exists for Arabic, therefore the evaluation is manual. Arabic is a low density language with scarce automatic linguistic knowledge resources. 
In terms of data availability, more online corpora including parallel corpora are appearing on the web, yet language specific knowledge resources such as ontologies are virtually non-existent. Arabic is a Semitic language. It is spoken by at least 200 million people. It is one of the few languages that exhibit diaglossia. Diaglossia is a linguistic phenomenon where a community has two languages operating at the same time. All Arabic speaking countries have at least two main forms of Arabic: Modern Standard Arabic (MSA) and some colloquial form. MSA is predominantly used in written text and speeches, mostly in formal settings. Typically, MSA is understood by the educated class in the different Arabic societies. Furthermore, the language spoken in Malta is a derivative of Arabic yet it is written in Latin script. Arabic script is used by Farsi, Daari and Urdu, as well. Most words in Arabic have their origins in 3 or 4 letter roots. Most of the roots are verbal roots. A variety of grammatical case and parts of speech are expressed by changing the root into a stem based on one of 13 templates which are variations on the verb f3l meaning to do.7 For example, ktb, which means to write in the infinitive 7Throughout this chapter, in describing Arabic data we use 3 to indicate the letter aiyn , 2 for glottal 132 form. This uninflected form may be changed into the noun kitab, meaning a book based on the template fi3al, where the f, 3 and l correspond to the three consonants, k, t, b. Mainly, the transformation comes with the addition of vowel infixes. There are two types of vowels in Arabic: short vowels and long vowels. Short vowels are often ignored in written text. Motivation Motivated by the lack of tools for Arabic and native proficiency in the language, 8 we examine the projected sense-annotations onto MSA Arabic target tokens. Evaluation Corpora The corpus that is evaluated is the SV2AW parallel corpus, pseudo-translated into Arabic using the Al-Misbar (AM) machine translation system.9 The corpus comprises 242 lines. Preprocessing stop @, upper case characters for emphatics such as H and D, corresponding to oand , respectively; and finally P is the sh sound . The English phoneme P does not exist in Arabic. 8The author possesses native proficiency in Arabic 9http://www.almisbar.com 133 The Arabic text is transliterated into Latin script. It is tokenized and lightly stemmed;10 the prefixes and suffixes are separated out from the words; this pro- cess results in the reduction of word surface forms to stems.11 Figure 4.1 illus- trates the first sentence of the SV2AW English corpus with its translation into Arabic and in turn the Arabic is transliterated in the third sentence in the figure and finally tokenized as presented in the fourth sentence. The art of change-ringing is peculiar to the English, and, like most En glish pecu- liarities, unintelligible to the rest of the world. In fn dqAq tgyyr xAS bAlInjlyz, wmvl Akvr AlxwAS AlInjlyzyp, gyr wADH Ila bqyp AlEAlm. In fn dqAq tgyyr xAS bAl Injlyz , wmvl Akvr Al xwAS Al Injlyzyp , gyr wADH Ila bqyp Al EAlm . Figure 4.1: An example of a transliterated Arabic sentence and its tokenization Ontology 10The original tokenization script is developed in collaboration with Kareem Darwish. We do not use a morphological analyzer for Arabic since we assume the availability of minimal resources on the target side of the parallel corpus. 11It is worth noting that this is not equivalent to lemmatization in English. 
Lemmatization reduces the word to an uninflected form devoid of number information; stemming disentangles the words from the associated pronouns; in many cases, stems are left with number and case information.

WN17pre (see Chapter 3 for description) is used for the tokens' sense tagging and projection onto the Arabic text.

Application of SALAAM

SALAAM is applied to the entire corpus as described in Chapter 3. The sense selection criterion is set to MAX. The SV2AW English tokens and their corresponding Arabic token alignments are extracted. The alignments have one-to-many correspondents due to the token alignment software GIZA++, where an English token may correspond to more than one Arabic token. (Footnote 12: Note that GIZA++ allows for one-to-many alignments from the source side of the parallel corpus to the target side but not vice versa.) The one-to-many correspondents are compressed and in the process target compounds are created. The English source nouns are automatically annotated with SALAAM assigned sense tags; the sense tags are then projected onto the corresponding target Arabic tokens.

Test Set

The English noun instances and their corresponding Arabic alignments are extracted from the compressed tagged corpus, SV2AW. The English noun instances are evaluated against the gold standard described in Chapter 3. 581 English noun instances are deemed correct by scorer2 in the fine-grain mode; the correct instances, as determined by their scorer2 scores, are extracted with their corresponding Arabic alignments. (Footnote 13: We do not evaluate the Arabic correspondents of the incorrectly tagged English instances based on the simplifying assumption that if the English is not correct, then the Arabic probably is not correct.) Accordingly, the Arabic tokens are tagged with the projected sense-annotations. The 581 tagged Arabic token instances comprise the test set.

Evaluation Results

The 581 tagged Arabic token instances are manually evaluated. Upon inspection, 526 Arabic word instances are tagged correctly with appropriate senses based on the sense label fit for the Arabic word in question and its surrounding context. Moreover, another 9 Arabic word instances are tagged with approximate senses. A sense is deemed approximate if there are senses used in the tagging that do not fit the Arabic word and its context. For example, the Arabic translation for evening is msAi2yah; SALAAM tags it with WN17pre sense IDs 1, 2 and 3; judging by the ontology entries listed in Figure 4.2, senses 1 and 3 are appropriate tags for the Arabic word in this context, yet sense 2 is not a good fit, nor is it actually an appropriate sense definition for the Arabic word. Accordingly, only 2 of the 3 possible sense tags are appropriate, therefore resulting in an approximately correct sense projection onto the Arabic word.

Of the 581 correct English tagged instances, 38 instances are misalignments with the Arabic. For instance, the English token cancer aligns with the target Arabic token okhra, meaning other.

1: evening, eve, eventide: the latter part of the day (the period of decreasing daylight from late afternoon until nightfall); "he enjoyed the evening light across the lake"
2: evening: a later concluding time period; "it was the evening of the Roman Empire"
3: evening: the early part of night (from dinner until bedtime) spent in a special way; "an evening at the opera"
Figure 4.2: WN17pre entries for evening

In 12 cases, the MT system, Al-Misbar, does not translate the English words into Arabic; the English noun is rendered as is in the translation. 5 instances have the wrong Arabic translation. In one case, the English noun is aligned with an adjective that has the correct meaning, but it is not the correct POS, therefore leading to a misfit between the senses listed in WN17pre and the Arabic word. Table 4.4 summarizes the results.

Evaluation       Number of Instances   Percent
Correct          526                   90.5%
Approximate      9                     1.5%
Misalignments    38                    6.5%
Mistagged        6                     1%
Mistranslation   12                    2%

Table 4.4: Accuracy results of projected tagging onto Arabic SV2AW data measured against English WN17pre sense definitions

Discussion

These results are promising as a start for the process of bootstrapping sense tagging for Arabic. The Arabic tagged data is a result of applying SALAAM to SV2AW using Al-Misbar translations; this condition is not the highest yielding condition for Arabic; therefore, we may extrapolate that data resulting from applying SALAAM to the merged Arabic MT condition (which yields the highest scores for English tagging using evidence from Arabic in Chapter 3) will accordingly improve on the currently obtained result. Obtaining tagged target data in this manner is very appealing since it virtually comes for free as a side effect of applying SALAAM to a source language.

As shown in Table 4.4, 90.5% of the Arabic projected sense taggings are considered correct at the appropriate granularity level. If we extrapolate from these results to the entire set of English tags (as if we had an Arabic WordNet), we would expect the overall Arabic performance results to be at approximately 49% precision, counting only the tags deemed correct in our manual evaluation and without taking into consideration the potentially correct tags which could result from misalignments. Such results are extremely encouraging: plotted on the overall performance graph for the English All Words SENSEVAL 2 task, the performance on the Arabic data falls right in the middle of the range of the systems' results for English.

These results are very encouraging as a first pass. We acknowledge here the shortcoming of this evaluation as a post-hoc rather than blind evaluation; therefore, it is subject to inflated agreement rates. No matter how systematic and rigorous the annotator performing the evaluation, s/he tends to agree with the assigned sense more often than if s/he had to pick a sense independently from the full set of senses in a monolingual ontology. One way of solving this problem is by having several annotators perform a manual post-hoc evaluation of the tagging quality. Another method is to translate the corresponding WN17pre entries and glosses into Arabic and then ask an annotator or group of annotators to assign senses to the Arabic words without seeing the English translation, therefore rendering it a monolingual evaluation task.

4.3.6 Projected Sense Tagging on Spanish Data

Given the encouraging results obtained from manually evaluating the projected sense tagging on Arabic data, we perform a blind evaluation on Spanish target data sense-annotations. Spanish is one of the languages used as a target language by SALAAM in Chapter 3. Spanish is chosen as a target language for SALAAM because the utilized MT systems for producing the pseudo-translations claim good quality translations; moreover, computerized linguistic knowledge resources exist for Spanish.
Several 139 teams of computational linguists are currently working on building a Spanish WordNet as part of the EuroWordNet initiative [79]. Spanish WordNet is based on the same conceptual structure as English WordNets. The availability of such a resource allows for a blind evaluation of the projected sense tagging of Spanish target tokens. This section presents an evaluation of the projected sense tagging quality for Span- ish tokens, when used as a target language by SALAAM, against a Spanish WordNet gold standard.14 Gold Standard This all nouns gold standard (AWGS) is modelled after the gold standard for SV2AW, described in Chapter 3. Our aim is to create a comparable test set/gold standard to the SV2-AW gold standard, which comprises 242 sentences. The idea is to tag all possible noun instances in running text. A set of 250 sentences is randomly generated from the Spanish SENSEVAL 2 Lex- ical Sample (SP SV2LS) corpus provided to participants in the SENSEVAL2 Lexical sample task. The sentences are automatically POS tagged by extracting tags from the output of the Spanish parser Connexor.15 The resulting sentences are sent to one of the key sites in Spain, where Spanish WordNet is being developed. All the nouns in 14In this evaluation, we are tightly bound by the available resources. We have to bridge many re- sources owing to the fact that we do not have Spanish WordNet. 15http://www.connexor.com 140 the sentences are manually annotated by a human annotator.16 The sense-annotation is based on the most stable version of the Spanish WordNet, which is partially linked to English WordNet 1.5. The human annotator manually fixes some of the automatically assigned POS tags. All sense tags used in this tag set exist in WordNet 1.5.17 The human annotator uses ?0? to indicate an unassignable tag; mainly, for named entities. Some cases are assigned multiple sense tags, which might include the ?0? tag; these cases indicate that the appropriate sense for the noun instance does not exist in the current Spanish WordNet. We will refer to these cases as approximates. For the purposes of this evaluation, noun instances that are assigned a unique ?0? tag are excluded. Furthermore, 13 sentences, comprising 183 sense tagged noun in- stances, are excluded from AWGS as they exceed the 70 token length limit requirement for the GIZA++ stochastic token alignment software. The final AWGS tag set com- prises 1279 tagged noun instances corresponding to 233 sentences in SP-SV2LS.18 16We would like to acknowledge the annotation work by Irina Chugur, who is a native speaker of Spanish and a computational linguist working on the Spanish WordNet project under the supervision of Dr. Julio Gonzalo at UNED, Madrid, Spain. 1717 tagged noun instances are excluded as their sense IDs do not exist in sense.index file for WordNet 1.5. 18Three more sentences are excluded as they are repeated sentences resulting from the random gen- eration process. 141 Test Set The test set for this evaluation is the set all noun instances occurring in the 233 ran- domly generated sentences from SP SV2LS. Therefore, it is an all words task. We refer to this test corpus as SPSV2AW. SPSV2AW comprises 1279 noun instance test items. Corpora Similar to the experimental setup in Chapter 3 and section 4.2 above, the test cor- pus of 233 sentences is augmented by other corpora in order to apply SALAAM. 
Similar to the problem faced in Chapter 3, to our knowledge, there are no balanced English-Spanish parallel corpora; therefore, the corpora used for augmentation are the 5 corpora used in the SALAAM evaluation in Chapter 3. The 5 corpora are: Brown Corpus (BC), SENSEVAL 1 Corpus (SV1), SENSEVAL 2 English Lexical Sample Corpus (SV2-LS), The Wall Street Journal Corpus (WSJ), and SENSEVAL 2 All Words Corpus (SV2AW). Throughout the rest of this section we will refer to this corpus as BSSSJ. BSSSJ is pseudo-translated to Spanish using both GL and SYS translation systems, thereby creating two parallel corpora, one corresponding to each MT system. The resulting Spanish pseudo-translated corpus is further augmented with the Spanish SP SV2LS corpus.SP SV2LS comprises trial, training and test data that is provided to participants in the SENSEVAL 2 Spanish language Lexical Sample ex- ercise. SP SV2LS is a multi-topic collection which comprises the created test set 142 SPSV2AW and the whole corpus contains excerpts from newspapers, fiction, and sci- entific articles; like BSSSJ, SP SV2LS does not exist in translation; as manual trans- lation is extremely expensive, we opt for pseudo-translating it into English using both GL and SYS machine translation systems creating source pseudo-translations. There- fore, when augmenting the English side of the utilized parallel corpus with pseudo- translated source English, care is taken that the pseudo-translation on the Spanish side is from the same MT system, i.e. a parallel corpus will have English BSSSJ plus GL translated SP SV2LS, corresponding to the Spanish side with BSSSJ pseudo-translated using GL plus the original Spanish SP SV2LS. BSSSJ is similar in genre to SP SV2LS; they both cover similar domain topics. Table 4.5 lists the relative sizes of the corpora. The sizes presented are of the corpora in the language in which they originated; the numbers for BSSSJ are those of the English side of the corpus, and those for SP SV2LS are for the Spanish side of the parallel corpus. Corpora Lines Tokens BSSSJ 226094 5555039 SP SV2LS 6815 238339 Total 233129 5793378 Table 4.5: Relative sizes of corpora used in projected Spanish tagging evaluation 143 Ontology AWGS is tagged with Spanish WordNet, which has direct links into WordNet 1.5. Therefore, the sense inventory used in this evaluation is WordNet 1.5. WordNet 1.5 is an older version of WN17pre, described in detail in Chapter 3; WordNet 1.5 has the same attributes and structure as WN17pre. Evaluation Metrics The evaluation metrics used are the same as those described in Chapter 3. We used precision (P), recall (R), and coverage (COV). The statistical significance test is the Zscore, described earlier in Chapter 3, measured at 95% confidence level. Baseline Developing an appropriate baseline for this evaluation requires great care. The main is- sue is the degree of overlap between WordNet 1.5, the inventory used by SALAAM for sense annotation and projection, and Spanish WordNet, the inventory used for AW-GS. We acknowledge the overlap, yet there are granularity mismatches and cases where the senses in English simply do not have correspondents in Spanish and vice versa. In an ideal world, we would have the human annotator assigning senses from the proper in- tersection of the two inventories. But since we do not impose that restriction on the human annotator ? the human annotator was performing the task monolingually in Spanish ? 
and we do not have access to the actual Spanish WordNet that is used in 144 the task, we create a baseline based on WordNet 1.5 alone. The baseline comprises all the aligned Spanish translation tokens of the English noun instances in the SP SV2LS corpus. This results in a set of 34878 noun instances.19 Below, we discuss two possible options for assigning senses to the baseline.20 First Listed Sense Baseline (FSBL) As the naming indicates, FSBL annotates the Spanish word instance that corre- sponds to the English noun instance with the first listed sense ID in WordNet 1.5. Similar to other WordNet Ontologies, the first sense listed is the most frequent sense according to the sense frequency in a semantic concordance (SemCor).21 FSBL is a questionable baseline for unsupervised methods. We consider FSBL to be a supervised baseline since it is based on sense frequencies in a manu- ally annotated corpus. In this current evaluation, the fact that WordNet 1.5 is a bridge inventory ? the actual gold standard comprises sense tags from the Spanish WordNet ? may make FSBL more appealing as a baseline. Yet, we argue that first sense frequency effect carries over cross-linguistically owing to the inherent closeness between the Spanish and English languages [74]. 19We exclude noun instances that align with the NULL token. 20The most appropriate baseline would be a Lesk style annotation of noun instances based on the glosses? word overlap in the Spanish WordNet, but unfortunately we do not have access to the Spanish WordNet. 21SemCor is a corpus of roughly 200k manually sense-annotated words in running text extracted from the Brown Corpus. 145 Random Baseline (RBL) For this baseline, a sense is randomly chosen from the set of senses for a given noun instance in WordNet 1.5. The RBL baseline results are based on averaging 10 runs of the random sense generator for each instance in the baseline set. RBL is a more appropriate baseline compared to FSBL for an unsupervised method. In the absence of a Lesk based approach, it is used as the baseline for the current evaluation. Experimental Conditions For all experimental conditions, the SALAAM resulting WordNet 1.5 sense tags of the English corpus are projected onto the Spanish words in SP SV2LS which includes the test set SPSV2AW. The aim is to measure the quality of sense-annotations of the projected sense tags onto the Spanish tokens in the SPSV2AW test set. 1. Spanish All-Words with GL (AWGL) SP SV2LS and BSSSJ are pseudo-translated using the GL machine translation system. The pseudo-translated portion of the corpora is GL for both directions, i.e. GL Spanish translations corresponding to source English BSSJ and GL En- glish translations corresponding to the Spanish SP SV2LS. 2. Spanish All-Words with SYS (AWSYS) 146 Similar to Condition 1 where SP SV2LS and BSSSJ are pseudo-translated using the SYS machine translation system. 3. Spanish All-Words with intralanguage pseudo-translation merge post-alignment (AWGLSYS) The aligned corpora resulting from SYS and GL are merged before the creation of source sets in the SALAAM tagging cycle. This condition aims at increasing the variability of contexts, thereby allowing for more source type sets. 4. Spanish All-Words with post-tagging translation intersection merge (AWGLSYS-I) The tagged test set resulting from conditions 1 and 2 are intersected where only common sense tags of the shared noun instances are evaluated with the rest of the uniquely tagged words in SPSV2AW. 
The intersection of the tag sets weeds out some of the possible noisy tags from the tag set. Some noun instances are excluded if they occur in both test tag sets resulting from conditions 1 and 2 and they share no tags in common.

5. Spanish All-Words with post-tagging translation union merge (AWGLSYS-U) The tagged test set resulting from conditions 1 and 2 is merged with a union operation where all sense tags for shared noun instances are evaluated with the rest of the uniquely tagged words in SPSV2AW. This condition allows for more coverage of the data. The union of the tag sets allows for the inclusion of more noisy tags but improves coverage.

Experimental Parameters

The parameters used here are the same as those used for SALAAM in Chapter 3. Moreover, in this evaluation, the sense selection criterion is a parameter, and it is set to MAX.

Hypotheses

1. Hypothesis 1 Results from Condition 1 AWGL and Condition 2 AWSYS will illustrate significant precision improvement over the RBL baseline. SALAAM is more informative in its sense-annotations than random sense choice.

2. Hypothesis 2 Results from Condition 4 AWGLSYS-I will show precision improvement over conditions AWGL and AWSYS. AWGLSYS-I is an exclusive voting scheme that aims at improving the tagging quality in terms of precision. We expect the recall to decrease since several items will be excluded.

3. Hypothesis 3 Results from Condition 5 AWGLSYS-U will show recall improvement over Condition 1 AWGL and Condition 2 AWSYS. AWGLSYS-U is an inclusive voting scheme which aims at maximizing the coverage of the test data. Since scorer2 rewards partial credit, we expect precision to decrease owing to the introduction of noisier tags.

4. Hypothesis 4 Results from Condition 4 AWGLSYS-I will illustrate better precision than Condition 5 AWGLSYS-U.

5. Hypothesis 5 Results from Condition 5 AWGLSYS-U will illustrate better recall than Condition 4 AWGLSYS-I.

6. Hypothesis 6 Results from Condition 3 AWGLSYS will show precision and recall improvement over Condition 1 AWGL and Condition 2 AWSYS.

7. Hypothesis 7 Results from Condition 3 AWGLSYS will show comparable precision to Condition 4 AWGLSYS-I and comparable recall to Condition 5 AWGLSYS-U. The key ingredient here is the variability in translation achieved by condition AWGLSYS.

Results

Table 4.6 illustrates the results obtained by applying the different conditions to SPSV2AW. The tagged test set for each condition is evaluated against the gold standard, AWGS, using the scorer2 software set to the fine-grain evaluation mode.

Conditions    P%     R%     COV%
RBL           27.7   21.9   79.01
AWGL          38.6   18.3   47.45
AWSYS         36.9   17.2   46.76
AWGLSYS       39.1   21.5   55.02
AWGLSYS-I     39.8   18.3   45.99
AWGLSYS-U     37.1   21.2   57.18

Table 4.6: Results in % for RBL, and the different evaluation conditions of test set SPSV2AW

Results for conditions AWGL and AWSYS achieve statistically significantly better precision scores than RBL using the Zscore significance test. Hypothesis 1 is accepted. We note the significant drop in coverage from 79.01% for RBL to 47.45% and 46.76% for AWGL and AWSYS, respectively. It is worth noting the relatively low coverage of the baseline in general. This RBL coverage score demonstrates the fact that some sentences are not aligned as they exceeded the cap of 70 tokens per sentence set by the automatic alignment software. More importantly, the coverage level reflects the fact that not all tagged Spanish nouns in AWGS corresponded to nouns on the English side of the corpus.
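For reference, the Zscore comparisons invoked here and in the preceding sections can be thought of as a standard two-proportion test at the 95% level. The thesis defines its own test in Chapter 3, so the sketch below is only an illustrative stand-in, and the instance counts in the example are made up.

    from math import sqrt

    def z_score(p1, n1, p2, n2):
        # p1, p2: two proportions (e.g., precision values as fractions of 1)
        # n1, n2: number of scored instances behind each proportion
        # |z| > 1.96 indicates a difference significant at the 95% level
        pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
        se = sqrt(pooled * (1 - pooled) * (1.0 / n1 + 1.0 / n2))
        return (p1 - p2) / se

    # Comparing two precision values over (hypothetically) 1279 instances each.
    print(round(z_score(0.391, 1279, 0.277, 1279), 2))   # roughly 6.1, clearly significant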
AWGLSYS-I condition shows better precision results than AWGL and AWSYS conditions, which allows us to accept hypothesis 2. However, AWGLSYS-I maintains the same level of recall with a slight loss in coverage relative to the coverage achieved by conditions AWGL and AWSYS. In accordance with hypothesis 3, AWGLSYS-U condition exhibits statistically sig- nificantly better recall results than conditions AWGL and AWSYS, respectively. More- over, we observe a significant improvement in coverage, from a maximum of 47.45% for the AWGL condition to 57.18% for the AWGLSYS-U condition. We also note that precision for condition AWGLSYS-U is at an expected midpoint between precision for AWGL and AWSYS. Results obtained in condition AWGLSYS-I achieve better precision than AWGLSYS- U supporting hypothesis 4. A significant increase in recall and coverage is obtained in condition AWGLSYS-U over AWGLSYS-I, which supports hypothesis 5. The intralanguage post-alignment merge condition, AWGLSYS, produces the best results on all three measures relative to the individual MT system conditions AWGL and AWSYS. We see an improvement in precision and a significant improvement in both recall and coverage supporting hypothesis 6. AWGLSYS scores a precision of 39.1% as opposed to a precision of 39.8% for 151 condition AWGLSYS-I. There is no significant difference between the two conditions on precision. Similarly for recall, conditions AWGLSYS and AWGLSYS-U achieve comparable results with no significant difference. These results support Hypothesis 7. In terms of coverage, AWGLSYS condition yields comparable results to AWGLSYS-I. In addition to the results reported in Table 4.6, FSBL yields a precision of 43.2% and a recall value of 34.1% at a coverage of 79.01%. FSBL is not included in the results table as it is not an appropriate baseline for this task, as discussed earlier in the baseline section. We note, however, that FSBL achieves significantly better results than any of the SALAAM conditions on all measures. Discussion This evaluation lays the basis for a robust system of bootstrapping sense tagging for a new language with scarce automatic knowledge resources. We practice caution be- cause this evaluation has many approximations due to the limitation on resources avail- able, nonetheless, to our knowledge, this is the first attempt at bootstrapping sense tagging automatically for a language with limited resources. Despite the modest re- sults when compared to SALAAM source tagging as discussed in Chapter 3, precision results are encouraging as they indicate a significant departure from the RBL preci- sion results, which are at a noticeably lower coverage level than those seen in Chapter 3. The recall and coverage are very low compared to the results achieved for these measures on the source language. 152 But before discussing the details of these results, the results from FSBL beg the question of why use SALAAM at all if FSBL achieves higher scores on all measures. The response lies in language distance and homonymy. In cases where source and target languages are close and one of the languages has an ontology that is arranged with the most frequent sense listed first,22 using FSBL as a bootstrapping method is worthwhile. Spanish and English are relatively close languages as they have a shared ancestor among other things. The common origin results in preserving ambiguity, leading to many cases of semantic overlap. 
In these cases polysemous nouns tend to be used in the same way with regular polysemy and metonymy, even with homonymy. For example, the polysemous word interest in English has the same meaning as the word interes in Spanish. The problem arises when using FSBL where languages are distant. The languages grow apart and pragmatic differences start playing a significant role in the correct ordering of senses. It is especially worse where an English word is homonymous. For instance, the first sense of bank in WordNet 1.5 is side of the river sense. Yet, when bank is aligned with bnk in Arabic, the appropriate sense for the Arabic word is the financial institution sense of bank. The following is a discussion of the different factors that affect precision, recall and coverage. Precision The first factor that affects precision is translation quality. Both MT systems pro- 22Or default sense or most typical sense 153 duce close to gisting quality translations on the English side of the parallel cor- pus where many of the ambiguous words are not even translated. Qualitatively inspecting the English translation output suggests that the quality of translation into English is worse than the translation from English for both MT systems. For example, the following Spanish sentence: Las artes caminan hasta que se produce una quiebra; entonces su presencia rompe con lo que fueron modelos arregostados en co?modas repeticiones. is translated into English as follows: The arts walk until a crash takes place; then your presence breaks up with what you/they were model arregostados in comfortable repetitions. The presence of the Spanish words in the source type sets is noisy resulting in deflated precision. Moreover, some ambiguous words that are homonymic in nature are translated into the wrong word in English simply because the MT system defaults to the most common sense for an ambiguous word. pseudo-translating the source side of the parallel corpus has a cascading negative effect on the automatic POS tagging quality since many of the words are not 154 translated. Many tokens are mistagged as nouns which are eventually included in source type sets ? tokens being identified as nouns which are not nouns. Another issue that affects precision is the presence of faux amis. This is a phe- nomenon that is present in languages that are close to one another. Faux amis, occur when a word in Spanish is left untranslated, and it exists in English in the same orthographic form. For example, sensible in Spanish corresponds to sensitive in English, not reasonable, which is the meaning of the English word sensible. Misalignments constitute a huge bottleneck that seriously affect precision. Ad- mittedly, as mentioned in Section 4.2, MT alignments are more consistent than HT alignment, yet MT is still a source of considerable noise in the source sets. For example, in the source set (ABANDON ABANDONMENT DERELICTION DROPOUT FEELING NEGLECT), FEELING is an obvious outlier even though it is a related word; it results from a misalignment. As discussed in Chapter 3, such misalignments, especially if they are monosemous, could yield bias in the wrong direction for the NG sense selection algorithm. Sense granularity mapping between the Spanish WordNet and the English Word- Net 1.5. is an issue in this evaluation. As mentioned in Section 4.3.6, there are senses in AWGS that do not exist in WordNet 1.5, and vice versa, which is re- flective of the different granularity size of the concepts in these two languages. 
For example, for AWGS, there exist 50 approximate cases as described in the 155 section describing the gold standard. Twenty five of these cases are tagged by some SALAAM condition with only one of them tagged correctly. Recall Several factors affect recall. Due to the quality of the pseudo-translations, many of the words are left untranslated, which leaves them un-amenable to forming source type sets; consequently, they are left untagged. Approximately 20% of the potential noun instances form singleton source type sets, which means they are not passed onto the NG sense selection algorithm. For example, out of the total 57791 noun instances in experimental condition AWGLSYS, more than 9753 noun instances form singleton source type sets, therefore, they are excluded from the tagging process. Three of the sentences, comprising 24 sense tagged noun instances, are excluded from SPSV2AW since they exceed the length limit set by the token alignment software. Divergences in the POS tags between Spanish and English lead to low recall. These divergences result from both poor quality of the automatic POS tagging of the pseudo-translated English, and genuine divergences where some nouns in English are translated into other POS tags in Spanish and vice versa. Furthermore, the human annotator manually altered some of the POS tags in the corpus. She also changed the tokenization of several instances, thereby creating 156 compound nouns in Spanish. This resulted in 21 cases of compound nouns in AWGS. Only one of these compound nouns was found and correctly tagged in experimental condition AWGLSYS.23 Coverage The same factors that affect recall affect coverage. The coverage scores obtained from FSBL and RBL clearly indicate an upper bound on coverage achieved in this evaluation. The scores indicate that more than 20% of the Spanish noun instances tagged in AWGS do not exist for SALAAM. This is mainly due to the POS divergences, which is discussed above as one of the factors affecting recall. The even lower scores yielded by the SALAAM conditions are a reflection of the nouns that are excluded due to sentence length problems or singleton source type set issues. 4.3.7 General Discussion We note the difference in performance for the Arabic and Spanish projected tagging. Arabic yields better results in terms of overall precision. Yet, it is hard to compare across both evaluations. As an experimental setup, the Spanish evaluation is blind where the annotator re- lies on a monolingual resource for tagging the Spanish text without having access to 23Compounds are automatically created on the target side of the parallel corpus when there is a one- to-many correspondence between the source and target alignments produced by GIZA++. 157 the English translations at all. Yet, this evaluation suffered the effect of relying on a pseudo-translated source corpus. A more realistic approach would be to have the SP SV2LS manually translated to English and then perform the same evaluation with good quality English source data. Nonetheless, this section provides a rigorous frame- work for performing the task of evaluating projected sense tags on the target language side of a parallel corpus. 4.3.8 Summary In summary, SALAAM is devised as a new technique for word sense tagging a tar- get language with source language resources. The quality of tagging of the target language using SALAAM is evaluated for two languages: Arabic and Spanish. 
The results obtained from Arabic demonstrate that of 90.5% of the correct tags for English noun instances are correct tags for Arabic zoning in on the commonality of sense us- age cross-linguistically, in effect, quantifying meaning characterizations for a language with poor resources via its shared sense usages with rich resources. On the other hand, we perform a fully automated blind evaluation of the quality of projected tagging for Spanish data. The results obtained are modest even though they significantly improve on a random baseline. The main reason for the modest performance is attributed to the use of source pseudo-translations accompanied with inconsistencies in alignments, therefore detrimentally affecting the quality of the tagging. But nonetheless, the tech- nique presented is a new technique that is fully automated and, except for the parallel 158 corpus and gold standard set, requires minimal resources. 4.4 Feasibility of bootstrapping a WordNet style ontol- ogy for Arabic 4.4.1 Introduction Efforts in the domain of ontology creation have mostly been manual. EuroWordNet [79] exists for several languages: Dutch, Spanish, French, Czech, Italian and Estonian; EuroWordNet interfaces these different Ontologies with the Internal Language Index (ILI). The bootstrapping method starts with monolingual dictionaries for the new lan- guage, and an ontology is created in the WordNet format. Apart from the immense time investment in the bootstrapping phase, the researchers are faced with the chal- lenge of linking the created WordNet with existing WordNets and dealing with sense granularity issues which is one of the biggest challenges facing such an endeavor. Having a method that leverages existing resources is a big plus as the manual task of creating an ontology such as WordNet is extremely expensive and genuinely daunt- ing. The problem becomes even more challenging when the language in question is a language with scarce automatic knowledge resources such as Arabic. The method we are proposing here, in fact, a side effect of applying SALAAM to a parallel corpus, automatically bootstraps a WordNet for a new language by obtaining the mappings cross-linguistically, thereby bootstrapping the conceptual mapping. Given a large and 159 diverse enough parallel corpus with good quality token alignments, this method can help bootstrap a large ontology for a new language from scratch. In this section, we investigate the feasibility of bootstrapping a WordNet ontology for Arabic. The appeal of building a WordNet for Arabic is not only based on empirical grounds for computational linguistic applications, but also it allows for an exploration of interesting lexical semantic cross-linguistic variations ? albeit at this stage exclu- sively paradigmatic. Like other languages, Arabic lexemes exhibit the full range of ambiguity attributes from regular polysemy to metonymy and homonymy. Lexical ambiguity in Arabic is further compounded by the writing system; as mentioned ear- lier, written texts in Arabic typically omit the short vowels leading to more ambiguity, creating false homonyms. For instance, the word klya in the written form could refer to kidney, faculty ? college sense ? or completion. In fact, klya is pronounced differ- ently depending on the intended meaning; therefore, when it is referring to kidney, it is pronounced kilya, and when it is referring to faculty it is pronounced koleya. Yet, the writing system does not capture this difference. 
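The collapse caused by omitting short vowels can be seen in a toy example over the transliterated forms. The character set treated as short vowels below is a rough stand-in, and this is not the preprocessing actually used in the thesis.

    def strip_short_vowels(word, short_vowels="eiou"):
        # drop the characters standing in for short vowels in this toy transliteration
        return "".join(ch for ch in word if ch not in short_vowels)

    # The 'kidney' and 'faculty' pronunciations collapse to the same written form,
    # creating a false homonym.
    print(strip_short_vowels("kilya"), strip_short_vowels("koleya"))   # klya klya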
Context is constantly used by speakers and readers of Arabic text to resolve this ambiguity online. In this particular example, faculty and completion is a case of genuine homonymy, though the Arabic completion sense is more of an adjective than a noun. Methods relying on context and/or vowel restoration are very useful in this level of lexical ambiguity resolution. 160 4.4.2 Evaluation With that intent in mind, we evaluate the 526 word instances of Arabic that are deemed correctly tagged using the English WN17pre (see Section 4.3.5). Same level sense granularity: Arabic and English words are equivalent We observe that a majority of the ambiguous words in Arabic are also ambiguous in English; they preserve ambiguity in the same manner; in Arabic, 368 noun tokens corresponding to 162 noun types,24 are at the closest granularity level with their English correspondent;25 For instance, all the senses of care apply to its Arabic translation E3nAyA; this is illustrated in Figure 4.3. It is worth noting that the cases where ambiguity is preserved in English and Ara- bic are all cases where the polysemous word exhibits regular polysemy and/or metonymy. The instances where homonymy is preserved are borrowings from English. Metonymy is more pragmatic than regular polysemy [14]; for example, tea in English has the following sense: This sense of tea in Figure 4.4 does not have a correspondent in the Arabic shay. Yet, a word like lamb in English has the metonymic sense of MEAT and this is preserved in Arabic. Researchers building EuroWordNet have been able to devise a number of consistent metonymic relations that hold cross linguistically 24Arabic words are not lemmatized; therefore, some cases are included as both plural and singular forms. 25This means that all the English senses listed for WN17pre are also senses for the Arabic word. 161 1: care, attention, aid, tending: the work of caring for or attending to someone or something; ?no medical care was required?; ?the old car needed constant atten- tion? 2: caution, precaution, care, forethought: judiciousness in avoiding harm or dan- ger; ?he exercised caution in opening the door?; ?he handled the vase with care? 3: concern, care, fear: an anxious feeling; ?care had aged him?; ?they hushed it up out of fear of public reaction? 4: care: a cause for feeling concern; ?his major care was the illness of his wife? 5: care, charge, tutelage, guardianship: attention and management implying re- sponsibility for safety; ?he is under the care of a physician? 6: care, maintenance, upkeep: activity involved in maintaining something in good working order; ?he wrote the manual on car care? Figure 4.3: English WN17pre entries for care such as fabric/material, animal/food, building/organization [78, 82]. In Arabic these defined classes seem to hold, yet this specific case of tea and party does not hold. In Arabic, the specific sense is expressed as a tea party or Haflet shay. Arabic word equivalent to English word subsense In this evaluation set, there are 122 instances where the Arabic word is equivalent to a subsense only of the English word. The 122 instances correspond to 78 word types. An example is illustrated in Figure 4.5; the correct sense tag assigned by 162 3: a reception or party at which tea is served; ?we met at the Dean?s tea for newcomers? Figure 4.4: Metonymic sense of tea in WN17pre SALAAM to ceiling in English is sense 1, which is correct for the Arabic word sqf. 
Yet, the other 3 senses are not correct translations for sqf; for instance, sense 2 would be translated as Irtifa3 and sense 4 as 3low. 1: ceiling: the overhead upper surface of a room; ?he hated painting the ceiling? 2: ceiling: (meteorology) altitude of the lowest layer of clouds 3: ceiling, cap: an upper limit on what is allowed: ?they established a cap for prices? 4: ceiling: maximum altitude at which a plane can fly (under specified conditions) Figure 4.5: English WN17pre senses for ceiling This case is particularly dominant where the English word is homonymic. By definition, homonymy is when two independent concepts share the same ortho- graphic form, in most cases, by historical accident. Homonymy is typically preserved between languages that share common origins or in cases of cross- linguistic borrowings. Owing to the family distance, preserving homonymic ambiguity holds the least between English and Arabic. For example, tower in English has the following sense illustrated in Figure 4.6, which does not exist at 163 all for the Arabic word brj. 3: a powerful small boat designed to pull or push larger ships Figure 4.6: Homonymic sense for tower in WN17pre Therefore, for most homonymic polysemous words in English, the Arabic trans- lation corresponds to one of the homonymic senses only. English word equivalent to Arabic subsense 35 instances, corresponding to 18 type words in Arabic, are manually classified as more generic concepts than their English counterparts. For these cases, the Arabic word is more polysemous than the English word. As an example, Figure 4.7 shows the word experience listed with 3 senses in WN17pre. All 3 senses are appropriate meanings of the Arabic word tjrba but they do not include the SCIENTIFIC EXPERIMENT sense covered by the Arabic word. From the above points, we find that 62% of the ambiguous Arabic words evaluated are conceptually equivalent to ambiguous English words. This finding is consistent with the observation of the builders of EuroWordNet. Vossen, Peters, and Gonzalo (1999) find that approximately 44-55% of ambiguous words in Spanish, Dutch and Italian have relatively high overlaps in concept and the sense packaging of polysemous words [78]. 31% of the ambiguous Arabic words correspond to specific subsenses of the English word and 7% of the Arabic words are more generic than the English words. 164 1: experience: the accumulation of knowledge or skill that results from direct par- ticipation in events or activities; ?a man of experience?; ?experience is the best teacher? 2: experience: the content of direct observation or participation in an event; ?he had a religious experience?; ?he recalled the experience vividly? 3: experience: an event as apprehended; ?a surprising experience?; ?that painful experience certainly got our attention? Figure 4.7: WN17pre senses for experience The encouraging results obtained from the manual analysis of a sizeable sample of the Arabic tagged data suggests that bootstrapping an Arabic WordNet style ontology is a feasible task. 4.4.3 Levels of representation As mentioned earlier, Arabic has a templatic syntax; roots are transformed into stems based on a templatic fit. For example, the root ktb becomes kitab based on the tem- plate fi3al. Stems are usually embedded with prefixes and suffixes creating surface forms that are the words as they appear in text. Reducing a surface form to a stem is relatively easy given a light stemmer [16]. In traditional Arabic monolingual dictio- naries, the entries are in root form. 
Yet, the writing system hardly ever has the roots in raw form. In the following discussion, we examine issues regarding the appropriate 165 representation level for an Arabic WordNet. Roots As mentioned earlier, words in Arabic, as a Semitic language, have roots. Roots are the underlying forms from which stems and surface forms generate. The dy- namic role attributed to roots might be a result of pedagogical factors: language is taught in schools with an emphasis on roots; dictionary entries are indexed by their roots. Most words in Arabic can be reduced to 3 or 4 letter roots. Roots are typically consonant based. Arabic has generative templates that lead to the creation of stems. Roots are highly generative and typically very ambiguous. For instance, a word like sh3r means hair, poetry or to feel. This could be treated as a case of homonymy that is resolved by applying the appropriate template; therefore, the stem for hair is shA3r, for poetry shi3r and for to feel it is shaA3rA. Likewise, the root Hrm generates Haram as in shrine, sanctuary, wife or forbidden; it is also the root for the clothes worn by pilgrims as in iHram, as well as the root for thief as in Haramy. Due to the pervasive ambiguity in the root representation, one would expect a huge overlap between the different POS databases in an Arabic WordNet. We find the option of creating an ontology based on roots theoretically elegant, especially if the templates are not ambiguous. A root based ontology will have 166 to be generative and underspecified. The main bottleneck is extracting the root from a surface level representation since words do not occur in their root form in written nor spoken Arabic. Several off-the-shelf morphological analyzers may be utilized to reduce surface forms to their corresponding roots, yet coverage remains a severe bottleneck [16, 11]. Stems A stem based ontology is a more direct approach to building an enumerative WordNet style sense inventory. Empirically, Arabic stems are more accessible by computational systems. Texts are written in surface form but easily trans- formed to stems (see Section 4.3.5 above). Stems are more distinguishable as different POS tags based on the templates they correspond to in Arabic. The main problem with stems is normalization; the same words meaning the same thing may be written in various ways. For example, the word for schools in Ara- bic maybe madares or madrasat. The second form is mostly predictable but the former form is not. Issues also arise with infixing, depending on the case of the word in question; words may have different endings. For example, authors in Arabic is either mo2leffeen or mo2leffwn. These minor hurdles are surmount- able with the availability of good tokenizers and morphological analyzers. Choosing the appropriate level of representation is an issue worth in-depth investi- gation. Our preliminary qualitative assessment calls for using the most direct approach 167 at the beginning and then refining the ontology with some form of hybrid representa- tion of both roots and stems in a multidimensional WordNet representation. 4.4.4 Summary SALAAM is explored as a method for seeding a WordNet style ontology for Arabic. By quantitative inspection, the approach seems promising. We discuss different issues of representation for Arabic specifically. We conclude that stems as a first step are the appropriate level of representation for the entries in such an ontology. 
168 Chapter 5 Exploration into Bootstrapping Supervised WSD 5.1 Introduction It has been established that supervised WSD systems yield better results than unsu- pervised systems [40]. Yet, tagged data is not always available for training. Indeed, lack of training data is a very severe bottleneck for supervised systems. One of the goals of the SENSEVAL exercises is to create large amounts of sense-annotated data for supervised systems [40]. The problem is ever more challenging when dealing with a language with scarce knowledge resources. Typically, when confronted with a new low density language, researchers are preoccupied with building tools and knowledge resources before seriously observing WSD issues. Despite its central role to most NLP applications for any language, WSD is deemed too complicated. One of the goals of SALAAM is to provide large amounts of sense-annotated data in several languages simultaneously to bootstrap supervised WSD systems, thereby, 169 loosening the bottleneck on data acquisition for training supervised systems. Sense annotations yielded by SALAAM are noisy compared to manually tagged data but in the absence of an alternative they serve as a good initial launching board. Explicitly, this chapter explores the nature of the trade-off between small amounts of cleanly tagged data versus large amounts of noisy data for training in a supervised setting. Most supervised WSD systems follow the canonical training-testing paradigm. The key idea in most supervised WSD systems is that senses are viewed as classes, rendering the problem an explicit classification problem. The goal of the WSD system is to assign test data items to the correct classes based on learning properties of the classes in a training period. Most supervised systems utilize machine learning algo- rithms. The machine learning paradigm may be briefly described as follows: In the Training Phase, given sufficient training examples per class, the system extracts relevant features from the context of the word in question creating a feature vector;1 valid features could be part of speech tags [10], syntactic features [25], con- text n-grams [61, 63], or a combination of the different contextual features [12]; the machine learning system ? the learner ? learns estimated parameters based on as- sociating a class (sense) or set of classes with features extracted from the training ex- amples. In summary, the learner learns parameters from explicit associations between the class and the features, or combination of features, that characterize it. 1The features are at the crux of any classification system; different types of classifiers and ensemble classifiers have different merits, however, the importance of the features can not be over stressed. 170 In the Testing Phase, given a new test item, the supervised WSD system extracts features based on the same conditions that are used in the training phase. According to the learned estimated parameters acquired in the training phase, a prediction process takes place where the learner predicts the best class for the new test item. Conse- quently, the test item is annotated. Needless to say, such systems are very sensitive to the training data. Training data should provide ample coverage of the potential classes. Majority of approaches for WSD within a supervised framework attempt to ensure that the training and test data are from the same genre, domain, and that the training data provides sufficient coverage of the possible senses. 
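To make the training/testing paradigm just described concrete, the following minimal sketch trains one classifier per ambiguous word on bag-of-words feature vectors and predicts a sense for a new context. It is a generic illustration, not UMSST: the feature extraction, the toy contexts, and the sense labels (bank%shore, bank%institution) are assumptions introduced here, and scikit-learn is used only as a convenient stand-in for a learner.

# A minimal sketch of the canonical supervised WSD paradigm: extract
# features from labelled contexts in training, then apply the trained
# classifier to unseen test contexts. Toy data; illustrative only.
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

def context_features(tokens, position, window=3):
    """Bag-of-words features from a fixed window around the target word."""
    lo, hi = max(0, position - window), position + window + 1
    return {f"tok={t}": 1 for i, t in enumerate(tokens[lo:hi])
            if lo + i != position}

# Training phase: each labelled context becomes a (features, sense) pair.
train = [
    (context_features("he sat on the bank of the river".split(), 4), "bank%shore"),
    (context_features("she deposited cash at the bank today".split(), 5), "bank%institution"),
]
vec = DictVectorizer()
X = vec.fit_transform([feats for feats, _ in train])
clf = LinearSVC().fit(X, [sense for _, sense in train])

# Testing phase: extract the same kind of features and predict a sense.
test = context_features("the bank raised its interest rates".split(), 1)
print(clf.predict(vec.transform([test]))[0])   # prints the predicted sense label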
In this chapter, we examine various issues in connection with bootstrapping a typ- ical supervised method for WSD using SALAAM annotated data for training. This method aims at alleviating the training data annotation bottleneck for most supervised systems. We investigate the first phase in an iterative approach to bootstrapping a su- pervised system using unsupervised sense-annotations. In a bootstrapping approach, the need arises for examining the different factors affecting the supervised system?s performance. Accordingly, we discuss different parameters as components in a fitness function that can potentially be automatically applied to the unsupervised training data in order to ensure/predict good classification performance for the supervised WSD sys- tem. It is worth emphasizing that training on data that results from an unsupervised approach renders the whole approach here unsupervised for this task even though it 171 utilizes the canonical learning paradigm. The layout for this chapter is as follows. After stating the problem being addressed and the motivation behind this work in Section 5.2, we discuss related work in the area of bootstrapping WSD systems in Section 5.3; Section 5.4 reviews the particu- lar supervised model and application that is used as a test bed for the bootstrapping technique presented here; Section 5.6 describes an empirical investigation into the fea- sibility of such an approach; this is followed by a general discussion of the results with a close look at the different parameters affecting the performance of the bootstrapping approach in Section 5.8. 5.2 Motivation The availability of sense-annotated training data is a serious bottleneck for supervised WSD systems. Previously, researchers have investigated using dictionaries and super- vised methods with clean data for iteratively bootstrapping the tagging effort [58, 85]. These approaches do a good job when the resources are available for a language. The problem is ever more challenging when we migrate to a new language with scarce knowledge resources. Unlike previous approaches, we propose to start the supervised tagging system with an unsupervised seed set. Unsupervised systems do not have as much knowl- edge/tool requirements as supervised systems; moreover, they are less language and corpus dependent. Therefore, in this chapter, SALAAM is presented as a means of 172 providing large amounts of sense-annotated data with the aim of relieving supervised systems from the training data acquisition bottleneck. SALAAM is an appealing approach as it not only provides sense tagged data in one language, rather in two simultaneously. In Chapter 4, we obtain encouraging results for the projected sense tags on a second language; furthermore, the results are significantly better than a random baseline and therefore should provide the appropriate signal for supervised learners amidst the noise. Accordingly, SALAAM has the advantage of providing a multilingual framework for solving this problem. 5.3 Related Work This chapter relates to work in monolingual bootstrapping by Gale, Church and Yarowsky [27], Yarowsky [85], and Mihalcea [58]. The first study by Gale et al. (1992) is the earliest study to our knowledge which directly discusses the feasibility of bootstrapping a WSD system using noisy data. They give an empirical evaluation of the level of degradation in the WSD system?s performance as they introduce different levels of error. 
(For a review of that paper see Section 2.4.2, Chapter 2) They conclude that their system is tolerant to noise in the training data. It is worth noting that they look at 6 data items each with 2 senses only and with a bounded number of examples per item. In this chapter, we ask the same question: can we bootstrap supervised WSD systems using noisy examples for training. 173 Research by Yarowsky and later Mihalcea is different from the research presented by the previous study. The question is asked differently. The focus is more on the bootstrapping technique rather than on the quality of the data. They address the issue of data quantity while maintaining the good quality level of the training examples. Both investigations present algorithms for bootstrapping supervised WSD systems using clean data based on a dictionary or ontology resource. The general idea is to start with a clean initial seed and iteratively increase the seed size to cover more data. In Yarowsky?s work [85], he starts with a few tagged instances to train a decision list approach for tagging unlabeled data. The initial seed is manually tagged with the correct senses based on entries in Roget?s Thesaurus. The approach is unsuper- vised. He reports very successful results ? 95% ? on a handful of data items. A directly comparable study to our exploration in this chapter, however, is work by Mihalcea [57, 58]. She bases her bootstrapping approach on a generation algorithm, GenCor. GenCor creates seeds from monosemous words in WordNet, Semcor data, Sense tagged examples from the glosses of polysemous words in WordNet, and other hand tagged data if available. This initial seed set is used for querying the Web for more examples and the retrieved contexts are added to the seed corpus. The words in the contexts of the seed words retrieved are then disambiguated. The disambiguated contexts are then used for more querying of the Web for more examples, and so on. It is an iterative algorithm that incrementally generates large amounts of sense tagged data. The words that are found are restricted to either part of noun compounds or 174 internal arguments of verbs. Mihalcea reports results of 69.3% precision on the English SENSEVAL 2 allwords task using the bootstrapped corpus for training an instance-based-learning supervised WSD system. When applying GenCor results as the training examples for her super- vised system in the SENSEVAL 2 Lexical Sample English exercise, she shows that the approach yields results comparable to those obtained when training with hand tagged data. Mihalcea reports the results for 6 items of the 29 items in the Lexical Sample exercise. Table 5.1 compares her results obtained by training the learning supervised system on hand annotated examples against those obtained by training on the auto- matic generated corpus using GenCor. We show only the precision percentages using scorer2 in the fine grain mode; we have added a column here in the table where we calculate the Performance Ratio (PR) (see equation (5.2) below) of scores obtained using GenCor examples to those using hand tagged examples. 5.4 Empirical Layout Similar to the experimental design presented by Mihalcea [58] and described in Section 5.3 above, we compare results obtained by a supervised WSD system for English using human tagged examples against SALAAM tagged examples for training. We use the same test set used by Mihalcea, the data from the SENSEVAL 2 English Lexical Sam- ple task. 
The supervised system we use is the system developed and tested by the University of Maryland for the SENSEVAL2 English Lexical Sample exercise, UMSST [12].

Nouns       Hand Tagged Data          GenCor Data             PR
            Training Size   Prec.     Training Size   Prec.
art         123             65.4%     265             73.1%   1.12
chair       121             82.5%     179             87.3%   1.05
channel     78              34.1%     1472            40.9%   1.19
church      81              63.9%     189             58.3%   0.91
detention   46              87.5%     163             83.3%   0.95
nation      60              73.1%     225             69.5%   0.95

Table 5.1: Comparative results obtained by Mihalcea's bootstrapping system when training an instance-based learning supervised WSD system using both human tagged data and GenCor tagged data as training examples

This supervised system utilizes a Support Vector Machine (SVM) learning paradigm for the classification and tagging of the items in question.

5.5 University of Maryland Supervised Sense Tagging system (UMSST)

The UMSST system is created in the classic supervised learning framework. Each word (test item) that will be tagged is considered an independent classification problem. In the learning phase, the training examples pertaining to an ambiguous word are analyzed, creating feature-value pairs labelled with the correct class (sense). These data are used to estimate the parameters for the learner in the supervised system, producing a trained classifier. The classifier is then applied to the unseen test data instances; for each instance, the classifier predicts the appropriate sense tag.

The features used for UMSST are contextual features with weight values associated with each feature. The features are extracted from the immediate context of the labelled word. The text is tokenized with an English-specific tokenizer. Then three types of features are extracted from the tokenized text: wide context features, narrow context features, and grammatical features.

The wide context features use all the tokens in the paragraph where the labelled instance occurs. The narrow context feature is a collocational feature; it only takes the tokens within a fixed window size surrounding the labelled word. In this implementation of UMSST, the window size is set to 3 tokens on each side of the labelled instance. As for the grammatical features, the context of the instance is parsed using a dependency parser [48, 51] and syntactic tuples such as verb-obj, subj-verb, etc. are extracted.

For example, if the word feature in the second sentence of the preceding paragraph is the word of interest, its narrow collocational features are the, narrow, context and is, a, collocational for the 3 tokens to the left and to the right of the word feature. The grammatical features are subj-of(be, feature) and mod-n(feature, context). The wide context features are all the words in the paragraph: the, wide, context, feature, use, all, token, in, paragraph, where, the, label, instance, occur, ., narrow, is, a, collocational, ;, it, only, take, ..., etc.

Each extracted feature is associated with a weight value. The weight calculation is a variant on the Inverse Document Frequency (IDF) measure in information retrieval. The weighting in this case is an Inverse Category Frequency (ICF) measure, where each token is weighted by the inverse of its frequency of occurrence across the contexts of the labelled instances of a given word. For example, if the feature is the token this and it co-occurs with instances of all the senses of a given word, the ICF value for this is very low, indicating that this does not contribute significant information as a discerning feature for any of the senses of the ambiguous word.
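Since the exact ICF formula is not spelled out here, the following minimal sketch uses an IDF-style logarithmic weighting as an illustrative assumption: tokens that co-occur with contexts of every sense of a word receive the lowest weights, while tokens restricted to the contexts of one sense receive the highest.

# A minimal sketch of an inverse-category-frequency style weight,
# analogous to IDF. The log-based formula below is an assumption for
# illustration; it is not necessarily the exact weighting used by UMSST.
import math
from collections import defaultdict

def icf_weights(contexts_by_sense):
    """contexts_by_sense: {sense: [list of token lists]}.
    Returns {token: weight}; tokens that appear with every sense of the
    word (e.g. 'this') receive the lowest weights."""
    n_senses = len(contexts_by_sense)
    sense_count = defaultdict(int)          # number of senses a token co-occurs with
    for sense, contexts in contexts_by_sense.items():
        seen = {tok for ctx in contexts for tok in ctx}
        for tok in seen:
            sense_count[tok] += 1
    return {tok: math.log(n_senses / c) for tok, c in sense_count.items()}

weights = icf_weights({
    "sense1": [["this", "vase", "care"], ["this", "attention"]],
    "sense2": [["this", "forethought", "caution"]],
})
print(weights["this"])      # log(2/2) = 0.0 -> uninformative feature
print(weights["caution"])   # log(2/1) > 0  -> more discriminative feature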
On the other hand, if the ICF of a feature is high, it indicates that the feature has discriminatory power among the senses of the polysemous word.

The learning approach used by UMSST is a Support Vector Machine (SVM) algorithm. [Footnote 2: See http://www.computer.org/intelligent/ex1998/pdf/x4018.pdf for interesting discussions about the advantages of Support Vector Machines.] Similar to other learning paradigms, the system takes in the training instances for the word in question and yields a classifier; the classifier takes a feature vector as input and produces a confidence function over all possible categories observed in the training data. SVM is chosen as an appropriate learning framework because it can achieve high performance with very large numbers of features. Moreover, SVMs are known for their interpretability and robust theoretical basis. The version used in the UMSST experiments is an off-the-shelf implementation, SVMlight, by Joachims [36]. [Footnote 3: SVMlight is available at http://www.ai.cs.uni.dortmund.de/svmlight.]

For each word in the Lexical Sample task, a family of classifiers is constructed, one for each of a given word's senses. All the positive examples for a given sense are treated as negative examples for each of the word's other senses. In the testing phase, similar feature vectors are created for the test data; the feature vectors are run through the SVM classifiers using the parameters estimated in the training (learning) phase. The sense that yields the strongest positive response is selected. UMSST trains on the hand-tagged English Lexical Sample training data provided and is tested on the test data for the same exercise. The contexts for the English Lexical Sample data are defined by the organizers of the SENSEVAL2 task; a typical context spans 2 to 4 sentences in length on average.

The approach as described so far yields the results for UMSST trained using human annotated examples. For purposes of the current evaluation, we are only interested in nouns.

5.6 Bootstrapping Evaluation

In this evaluation, we use the large amounts of English data sense tagged by SALAAM as described in Chapter 3. We create a system that generates the contexts for the SALAAM tagged examples automatically in the format of the SENSEVAL2 Lexical Sample training and testing data.

5.6.1 Test data

The SENSEVAL2 Lexical Sample test data (SV2LS-test) comprises 29 nouns with varying numbers of example contexts per noun. The nouns range in polysemy from 2 senses to 19 senses per noun. Table 5.2 lists some of the test items and their characteristic features. Entropy is measured as follows [40]:

H(w) = -\sum_{s \in S} p(s) \log_2 p(s)    (5.1)

where s is a sense of the polysemous noun w and S is the set of its senses; perplexity is then calculated as 2^{H(w)}. Like entropy, perplexity is a measure of the bias in the distribution of the contexts (the training examples) over the different senses. The lower the perplexity, the lower the entropy and the greater the bias; conversely, the higher the perplexity, the closer the distribution of contexts is to a uniform distribution. Perplexity can be thought of as the weighted average number of choices a random variable has to make [37]. Therefore, if the senses are distributed uniformly, then every sense is equally likely to be chosen. For example, if there are 3 senses to choose from and the numbers of training contexts are the same for all 3, then the perplexity is 3, which means that the learner is choosing from all 3 senses with equal probability. The average perplexity in the test set is 3.47. The average number of senses is 7.93.
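The following minimal sketch computes the entropy and perplexity of equation (5.1) for one noun, under the assumption that p(s) is estimated as each sense's share of the noun's training contexts; the context counts below are toy values.

# A minimal sketch of the entropy and perplexity computation in
# equation (5.1), assuming p(s) is each sense's share of the contexts.
import math

def sense_entropy(context_counts):
    """context_counts: {sense: number of training contexts for that sense}."""
    total = sum(context_counts.values())
    probs = [c / total for c in context_counts.values() if c > 0]
    return -sum(p * math.log2(p) for p in probs)

def perplexity(context_counts):
    return 2 ** sense_entropy(context_counts)

# A uniform distribution over 3 senses gives perplexity 3 (maximal
# confusability); a heavily skewed distribution gives a value near 1.
print(perplexity({"s1": 10, "s2": 10, "s3": 10}))   # 3.0
print(perplexity({"s1": 28, "s2": 1, "s3": 1}))     # approximately 1.3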
The total number of contexts for all senses of all words in the test set is 1773.

5.6.2 Hand-Tagged Training Data

The hand-tagged training data is obtained from the SENSEVAL2 Lexical Sample data provided by the organizers of the SENSEVAL exercise. [Footnote 4: http://www.senseval.org/] This training data corpus comprises 44856 lines and 917740 tokens. Table 5.3 illustrates the characteristics of the hand-tagged training data. As depicted in Table 5.3, similar to the test set, there are 29 nouns ranging from 2 senses to 19 senses per noun; they average 7.93 senses per noun. Perplexity is measured by equation (5.1); the average perplexity across the different nouns in the hand-tagged training data is 3.75. The total number of contexts is 3587 for the whole set.

As expected, there is a close affinity between the test set and the hand-tagged data used for training; the perplexity values are close, with 3.47 in the test set and 3.75 in the hand-tagged training data. The number of contexts in the training data nearly doubles that of the test set, with 3587 contexts in the hand-tagged data and 1773 contexts in the test set. Figure 5.1 plots the Pearson R correlation coefficients of the sense distributions for the test set and the hand-tagged training data across the 29 nouns. As depicted by the graph, the correlations are all positive, ranging from 0.90 to 1.0.

Figure 5.1: Sense distribution correlations across different nouns in the test data and hand-tagged training examples

Figure 5.2 further emphasizes the relationship between the hand-tagged training data and the test data. In the graph, we plot the trend lines, with the test data perplexity values depicted as a solid line and the hand-tagged training data perplexity values represented as a hashed line. The X-axis represents the nouns in alphabetical order while the Y-axis shows the perplexity values. As shown in the graph, the two lines almost overlap. The close correspondence between the training and the test data in terms of sense distribution and perplexity shows that the former is a very good representation of the latter.

Figure 5.2: Trend lines of the perplexity measure for test data and hand-tagged training data

5.6.3 Gold Standard

In this evaluation, we compare the performance of the same supervised system, UMSST, trained on SALAAM tagged data (UMSST-SALAAM) against the performance of the system when trained on hand-tagged data (UMSST-human). Therefore, UMSST-human's performance is the gold standard. It is worth noting that in terms of the SENSEVAL 2 results, UMSST-human is a very representative supervised system, as its scores are middle-of-the-pack scores. Table 5.4 shows the scores obtained by UMSST-human on the test data described in Section 5.6.1 when trained on the data described in Section 5.6.2. The scores are calculated using scorer2 in the fine grain mode. The average precision score over all the items is 65.3%. Throughout the rest of this chapter, we only show precision scores; coverage is 100%, so recall and precision are the same.

5.6.4 SALAAM Training Data Corpora

We use the corpora sense tagged in Chapter 3 and the sense tagged HT corpora from Chapter 4 as training data. For the purposes of this evaluation, we are using the English corpora only. The training data in this case is not hand annotated. We exclude the SENSEVAL2 Lexical Sample test data (SV2LS-test) (see Section 5.6.1) from the training data in this evaluation. The corpora are tagged with senses from WN17pre,
the ontology used for SV2LS-test data.

We create three sets of SALAAM tagged examples based on the different corpora utilized:

SV2LS-TR: English SENSEVAL2 Lexical Sample trial and training corpora
MT: the English Brown Corpus, SENSEVAL1 trial, training and test corpora, Wall Street Journal corpus, and SENSEVAL2 All Words corpus
HT: UN English corpus

Table 5.5 shows the relative sizes of the different corpora SALAAM-tagged and used as training data.

5.6.5 SALAAM-tagged Training Data Creation

The goal is to render the SALAAM-tagged corpora in a format similar to that of the human-tagged training corpus provided by the organizers of SENSEVAL2 and also utilized by UMSST. These training corpora are sense tagged using SALAAM. The sense tagging for MT and SV2LS-TR is based on using SALAAM with machine translated parallel corpora. The HT corpora are tagged based on using SALAAM with the English-Spanish parallel corpus where the alignments are fixed (see Chapter 4, experimental condition 4).

The SALAAM-tagged corpora are divided into instance contexts for the different nouns in the test set. A context ranges from 2 to 4 lines long while respecting document boundaries. For example, in the Wall Street Journal, the context for a word in the last line of a document within the corpus includes the two lines before the line where the word of interest occurs; the context does not span the beginning of the following document. Instance contexts may overlap. Once the corpora are in the appropriate format, features are extracted as described in Section 5.5 for training the SVM learning algorithm.

5.6.6 Experimental Conditions

Several factors are varied to create the different experimental conditions for this evaluation. The factors are based on the SALAAM tagging parameters.

Corpus
We vary the corpus used for training. We have 4 different combinations for the training corpus: MT and SV2LS-TR (MT+SV2LS-TR); MT and SV2LS-TR and HT (MT+HT+SV2LS-TR); HT and SV2LS-TR (HT+SV2LS-TR); or SV2LS-TR alone (SV2LS-TR).

Language
This is the context language of the parallel corpus used by SALAAM to obtain the sense tags. We have 3 options: French (FR), where the sense tags are obtained using SALAAM in condition GLSYS-FR M or GLSYS-FR T (see condition 3, Chapter 3); Spanish (SP), where the sense tags are obtained using SALAAM in condition GLSYS-SP M or GLSYS-SP T (see condition 3, Chapter 3); or Merged languages (ML), where the results are obtained by merging the output of FR and SP.

Threshold
This is the threshold to which SALAAM's sense selection criterion is set. We have two options: MAX (M) and THRESH (T) (see Section 3.7.7 in Chapter 3 for a detailed description).

Merge Type
For the ML cases, we merge the results of the two languages in 4 different ways:

- Intersection (I): The tagged noun instances that are common to the results of the sense tagging of the two merged languages are intersected. Noun instances unique to either language are kept in the tagging set.
- Union (U): The tagged noun instances that are common to the results of the sense tagging of the two merged languages are union merged. Noun instances unique to either language are kept in the tagging set.
- Strict Intersection (SI): This is similar to the I case above in that it is the intersection of the results of sense tagging the two languages, but it is more restrictive in that only noun instances common to both languages are included in this set; instances unique to either language are excluded from the tag set.
- Strict Union (SU): This is similar to the U case above in that the SALAAM tagging results of the two languages are union merged, but only noun instances that are common to both languages are included in this set.

These factors result in 48 conditions. [Footnote 5: We are not listing the conditions here for space considerations. An Excel workbook instanceStats.xls that details the different conditions for the different nouns in the test set may be viewed at http://www.umiacs.umd.edu/mdiab.] We exclude 7 of these conditions due to the extreme sparsity of the contexts, i.e. nearly none of the nouns in the test set have contexts for training. We are left with 41 conditions.

5.6.7 Evaluation Metric

In this evaluation, we use the Performance Ratio (PR) between two precision scores on the same test data. Precision is measured as described in Chapter 3.

PR = \frac{Precision_{automatically\ tagged\ training}}{Precision_{hand\ tagged\ training}}    (5.2)

5.6.8 General Experimental Hypothesis

Training a supervised WSD system using large amounts of SALAAM-tagged data for training results in a good PR when measured against the same supervised system using hand-tagged data.

For the purposes of this evaluation, we consider a PR of 0.65 or higher an acceptable, good, performance ratio, as this corresponds to the average score of UMSST's performance in the SENSEVAL2 exercise.

5.7 Results

In this evaluation, we view the different conditions as independent automatic taggers that may be activated simultaneously. [Footnote 6: Even though there are conditions that seem to be very highly related, these conditions are independent.] Therefore, we present the results in two different ways. In Table 5.6, we illustrate the maximum results obtained by any of the conditions (taggers). In Table 5.7, we show the five best individual conditions.

Table 5.6 shows the max results achieved by any of the taggers. The first column in the table shows the precision (P) obtained on the test data SV2LS-test when training UMSST using SALAAM-tagged data, UMSST-SALAAM. These are the best results obtained per noun. [Footnote 7: The results in these columns are from applying different conditions.] These results are compared against the second column in the table, which shows the results of UMSST using hand-tagged training data, UMSST-human. The third column is PR measured according to equation (5.2); PR is measured between the UMSST-SALAAM and UMSST-human precision results. The rows are sorted in descending order by PR, for convenience. The last row in the table shows the overall performance average. UMSST-SALAAM achieves 45.1% average precision on the test set SV2LS-test; UMSST-human yields 65.3% average precision across the different nouns on the same test set. This results in an overall PR of 0.69 for UMSST-SALAAM against UMSST-human.

As shown in Table 5.6, 9 nouns yield the highest ratios, at or above 1.00. In fact, in the first
yielded by any individual condition is 12. In Table 5.7, the ratios that exceed 0.65 are typed in italics. The last row in the table shows the average PR across the different conditions. The five conditions yield similar average PRs. We note the presence of extremes in these results, for instance, mouth has a PR of 0.73 for condition 1 but approximately zero PR across the board, which is a reflection on the lack of training data for that item in the respective condition. It is worth noting that they are not as high as the average ratio taken over the top 16 test nouns in Table 5.6 since some of the test nouns in these conditions do not yield the maximum possible precision which is reported in Table 5.6; for instance, none of these 5 conditions achieve the max precision for the noun child of 57.1% at a ratio of 0.97, in all these conditions the PRs for this noun range from 0.87 to 0.89. We note that these five conditions use the HT corpus and four of the five conditions 190 are the result of merged languages in the tagging using SALAAM. Comparing our approach to that of Mihalcea, she reports success for her bootstrap- ping approach ? achieving   ratio ? on 6 nouns out of the 29 nouns in the SV2LS-test set using clean data as presented in Table 5.1. In our study with noisy SALAAM training data, we achieve similar success rates with 12 nouns out of the 29 nouns in the test set, including 4 of the nouns used in her study. Moreover, 4 additional nouns yield ratios  . Figure 5.3 illustrates the results, specifically PR values, obtained using SALAAM training data versus PR values obtained by Mihalcea?s system. The hashed bars in the graph are the PR values obtained by Mihalcea while the solid bars are those of SALAAM. 5.8 Discussion The results obtained are very encouraging, especially when compared against clean data approaches such as the study by Mihalcea (see Section 5.3 above). An interesting observation worthy of noting is the precision obtained on the SV2LS- TR corpus for the 29 nouns using SALAAM. SV2LS-TR is common to all the training conditions in this evaluation. We obtain an evaluation of the quality of the tagging in the training data by evaluating the SV2LS-TR tagging since it constitutes a signifi- cant sample of the UMSST-SALAAM training data set. The idea in carrying out this comparison is to get a feel for how noisy is the tagging for the different items under 191 Figure 5.3: Comparison between Mihalcea?s results and SALAAM results on the same test set investigation. Since we have the hand tagged annotations for the different contexts of the nouns for the SV2LS-TR corpus provided by the organizers of the SENSEVAL2 Lexical Sample task, we calculate the precision of the SALAAM tagged words using scorer2 in the fine grain mode. Figure 5.4 plots the trend lines for the results presented in Table 5.8. Figure 5.4 shows the trend lines of the performances of both the supervised UMSST- SALAAM system and the quality of the SALAAM unsupervised (SALAAM-SV2LS- TR) annotations on a portion of the training data used for the 29 nouns in this eval- 192 Figure 5.4: Trend lines for the precision obtained by SALAAM-SV2LS-TR and UMSST-SALAAM uation. The X-axis is the nouns and the Y-axis is the precision score. The solid line is the trend line for the precision scores obtained by UMSST-SALAAM-tagged in the conditions that yield the max results illustrated in Table 5.6; the hashed line represents the precision of SALAAM-SV2LS-TR. 
As expected, the trend lines nearly overlap for several nouns such as chair, channel, day, and detention. Yet, surprisingly, we observe nouns where UMSST-SALAAM outperforms SALAAM-SV2LS-TR, i.e. the precision obtained by UMSST-SALAAM is higher than the precision of the sample it trains on. This is further illustrated in Table 5.8 below. The rows in Table 5.8 are sorted in descending order by the SALAAM-SV2LS-TR column. 11 out of 29 nouns have lower precision in the SALAAM-SV2LS-TR training data sample than that obtained by the supervised system UMSST-SALAAM. The rest of the nouns show higher precision in the sample training data.

We expect noisy training data to produce good quality results, as in the cases of detention, spade, stress and yew, where the training data is noisy but the precision of SALAAM-SV2LS-TR is still higher than that of UMSST-SALAAM. Nouns such as dyke and fatigue exhibit intriguing behavior: they both yield PRs of 1.0 according to Table 5.6, but SALAAM-SV2LS-TR's precision is less than that of UMSST-SALAAM; for dyke, SALAAM-SV2LS-TR has a precision of 82.9% while UMSST-SALAAM achieves a precision of 89.3%. All the more surprising are the scores for the word art; for UMSST-SALAAM, art achieves a PR value of 0.98 against UMSST-human, yet the SALAAM-SV2LS-TR precision is very low at 25%. We count, all in all, 11 nouns where this phenomenon occurs.

Such seemingly contradictory results allow us to entertain two hypotheses about the robust performance of the supervised WSD system. The first hypothesis lays the burden of the difference in performance on the rest of the tagged corpora used for training. The idea is that the augmented corpora, MT and HT, have better quality tagging for those noun items than SV2LS-TR, leading to an increase in precision for UMSST-SALAAM on the test set for the 11 nouns where the performance of UMSST-SALAAM exceeds that of SALAAM-SV2LS-TR; similarly, augmenting SV2LS-TR with HT and MT introduces noise to the tagging quality for nouns such as restraint (which has a SALAAM-SV2LS-TR precision of 100%, yet UMSST-SALAAM achieves only 33.3%), contributing to the reduced performance of UMSST-SALAAM.

The second hypothesis is in tune with an interesting observation noted by Yarowsky [85] about noise in a machine learning environment, namely that noise is usually tolerated. In such an environment, correct parameters pertaining to a certain class (sense) are obtained from all of its occurrences, while incorrect parameters are distributed among all the different classes and therefore do not produce statistically significant patterns. The results obtained in this evaluation lend support to Yarowsky's observation, as they show the robustness of the learning algorithm and the discriminatory power of the utilized features.

One can easily visualize using other types of noisy data, for example a Lesk based approach [43], as training data and obtaining similar performance. [Footnote 8: It is worth stressing the inability of "first sense" approaches to yield good results in classical machine learning approaches to WSD. Machine learning approaches need the variation in environments, negative and positive, of training examples in order to make predictions.] Nonetheless, using SALAAM annotated data is interesting for two main reasons. First, SALAAM is able to annotate large amounts of source and target language data; therefore, we can visualize bootstrapping a supervised WSD system for Arabic using the tags projected from the English WordNet ontology (see Chapter 4). Second, owing to the exploitation of multilingual evidence, SALAAM is able to sense
tag words that are not typically accessible to methods that use monolingual contexts alone. This is illustrated by the complementary results presented and discussed in Chapter 3, Section 3.9.

5.8.1 Analysis of factors affecting PR

The results obtained are very encouraging indeed as an initial investigation into bootstrapping a supervised learning WSD system using noisy data. But we would like to take this approach one step further. Our goal is to automatically predict, given noisy training data, which data items are taggable by this approach and which are not. We quantify taggable as possessing the potential to yield acceptable PR values, where acceptable is set at 0.65. The approach aims to alleviate the tagging acquisition bottleneck; it would therefore be valuable to have a system capable of predicting ahead of time which candidate items need to be hand annotated and which could be annotated automatically in the manner described in this chapter. The following analysis closely examines the different factors that affect the bootstrapping effort. This exploration aims at garnering a better understanding of the factors in order to be able to utilize them in a future automated system.

We classify the factors into two types: first, characteristics of the training data alone, such as the number of training examples, the number of senses per noun, the perplexity of the senses, and semantic translation entropy; second, factors that are attributes of the relation between the test data, SV2LS-test, and the SALAAM training data, such as the correlation between their respective perplexity measures and the correlation between their sense distributions across the different nouns. Finally, we address an attribute of the senses of the nouns in question and attempt to devise a measure of their context confusability based on sense similarity within a word. We discuss these factors by focusing on the best results obtained from all the automatic taggers (experimental conditions) as illustrated in Table 5.6. In the remainder of this section, we explore the effect of these different factors on the best PR values obtained by UMSST-SALAAM as reported in Table 5.6.

1. Number of senses & number of training examples in SALAAM training data

It is well known within supervised learning paradigms that there is a close relationship between the number of examples given and the performance of the learning algorithm. Table 5.9 shows the number of training contexts and the PR yielded. Also listed in the table is the number of senses for each word. In Table 5.9, the average number of senses is 7.9 and the median is 7 senses. The cases with many senses, such as art with 17 senses, material with 16 senses, mouth with 10 senses and post with 12 senses, exhibit good performance at PRs of 0.98, 0.92, 0.73 and 0.66, respectively. The linear Pearson R correlation coefficient between the number of senses and the PR of UMSST-SALAAM is weakly negative and not statistically significant, indicating that as the number of senses increases, PR (weakly) tends to decrease. The correlation coefficient between the number of contexts and PR is even weaker and also not significant, indicating the lack of any significant linear correlation between the number of training contexts and performance.
More interestingly, we observe cases where there are only 5 training examples yet the PR is 1.0, as with the noun hearth shown in Table 5.9. The lack of any significant correlation between the number of training instances and the PR contradicts the firmly held belief in machine learning circles that there is a direct correlation between supervised systems' performance and the number of examples seen by the learner.

2. Perplexity of senses in SALAAM training data

Perplexity is measured according to equation (5.1). As explained earlier, there is a direct relation between perplexity and entropy. Entropy is a measure of confusability in the senses' context distributions; if the distribution is uniform, entropy is high. A skew in the sense context distributions indicates low entropy, and therefore low perplexity. The lowest possible perplexity value is 1, indicating an entropy of 0. Perplexity, accordingly, is the number of senses that are confusable due to the level of uncertainty in the sense context distributions.

This characteristic is directly measurable on the SALAAM-tagged training data. For example, bar has the highest perplexity value of 9.85 for its 19 senses, while day with 16 senses has a relatively much lower perplexity of 1.3. Figure 5.5 illustrates the trend lines for the sense context distributions of bar and day. The solid line depicts the distribution for bar, which is almost a straight line indicating a close to uniform distribution, thereby reflecting the expected high perplexity. The hashed line depicts the sense context distribution for day; the spike in the graph reflects the skew in day's sense context distribution, hence the low perplexity.

Figure 5.5: A plot of the distribution of the senses' contexts of bar and day

Table 5.10 lists the perplexity per noun in the SALAAM-tagged training data and the PRs obtained by UMSST-SALAAM. The rows are sorted in ascending order by the perplexity values. On the one hand, we observe nouns with high perplexity, such as bum with a perplexity value of 3.03, that achieve a high PR value of 1.0. On the other hand, nouns with relatively low perplexity values, such as grip, yield a very low PR of 0.26. Moreover, nouns with the same perplexity and a similar number of senses yield very different PR scores. For example, bum and feeling both have a perplexity value of 3.031; bum has 4 senses and feeling has 5, but the former yields a PR of 1.0 while the latter achieves a PR of only 0.59. Furthermore, nature and art have the same perplexity of 2.297; art has 17 senses while nature has only 7, yet art yields a PR of 0.98 while nature yields a PR of only 0.44. Similarly, examining holiday and child, both nouns have the same perplexity of 2.144 and a close number of senses, with 6 senses for holiday and 7 senses for child, yet the performance is very different; the performance ratio for holiday is 0.08, while that of child is 0.97.

Consequently, the data is inconclusive. It does not support the existence of a correlation between the perplexity measure and the performance ratio, PR. These observations are further solidified by the low negative linear Pearson correlation coefficient between perplexity and PR, which is not significant. At first blush, one is inclined to hypothesize that the combination of low perplexity and a large number of senses (since it is an indication of high skew in the distribution)
is a good indicator of high PR, but reviewing the data, this hypothesis is dispelled by day, which has 16 senses with a perplexity of 1.3 yet still yields a very low PR of 0.08. In summary, the data does not suggest that there exists any correlation between PR and perplexity. As it stands, perplexity alone is not a good indicator of performance.

3. Semantic Translation Entropy in SALAAM training data

Semantic translation entropy is a special characteristic of the SALAAM-tagged training data. Since the source of evidence for SALAAM tagging is multilingual translations, it is natural to evaluate the impact of translation on the tagging. Semantic translation entropy measures the amount of translational variation for the source word in the target language. Semantic Translation Entropy is introduced by Melamed [54]. This measure is a variant on the entropy measure described in Section 5.6.1. The equation utilized is expressed as follows:

H(w) = -\sum_{t \in T} p(t|w) \log_2 p(t|w)    (5.3)

where t is a translation in the set T of possible translations of the source word w. The probability of a translation is calculated directly from the alignments of the source nouns with their target language translations; it is the maximum likelihood estimate of the probability of the translation for the word in question.

As mentioned in Chapter 3, variation in translation is a desirable feature for SALAAM tagging. Therefore, we would expect a positive correlation between the quality, exemplified by the precision of tagging SV2LS-TR, and semantic entropy, since variation in translation indicates that the source word has several possible translations in the target language. Indeed, we obtain a positive correlation of 0.33 between the precision of the unsupervised tagging of SALAAM-SV2LS-TR and semantic entropy. The correlation coefficient value would be expected to be higher given good quality translations and good quality alignments. Table 5.11 shows the obtained semantic entropy values per noun.

The row entries in Table 5.11 are sorted in descending order based on the Semantic Translation Entropy column. Based on the values presented in the table, there exists no clear correlation between Semantic Translation Entropy and the Performance Ratio, PR. The linear correlation coefficient is 0.22, which is not significant. Several nouns in the table that have a high semantic entropy value also exhibit a high PR. This is the case for bum, detention, dyke, stress, and yew. Other data points exhibit a mismatch between Semantic Translation Entropy and PR, such as child and holiday.

Examining the latter two nouns individually, we observe that child has a semantic translational entropy of 0.08 and yields a very high performance ratio of 0.97. The low semantic entropy indicates a lack of translational variation for this word even though it has 7 senses. Based on condition MT THRESH ML U, which rendered this result for child, we see that child is translated to enfant, enfantile, niño, and niño-pequeño, which preserve the ambiguity in both French and Spanish. Moreover, from Table 5.8, the SALAAM-SV2LS-TR precision for child is only 56.1%, but the perplexity is low at 1.69, probably contributing to the good performance ratio. Examining holiday, on the other hand, we notice that it has a relatively high Semantic Translation Entropy value of 0.66, yet it yields the lowest PR of 0.08.
Furthermore, upon inspecting the precision results obtained by SALAAM-SV2LS-TR from Table 5.8, holiday has a precision of 10.3%, which is partially explainable by the quality of the alignments, which are very noisy. A sample of the alignments is listed as follows: fiesta, vacances, fête, día-fiesta, preserve, nouveau-engagé, holiday, congé, los, las, les, assistance. Therefore, it seems that Semantic Translation Entropy alone is not a good indicator of PR.

4. Sense Distributional Correlation between test data and SALAAM-tagged training data

Sense Distributional Correlation (SDC) is an attribute that results from comparing the test data sense context distributions with the SALAAM-tagged data sense context distributions. We noted earlier, in Section 5.6.3, that the correlation between the human tagged sense distributions and those of the test data is very strong, with correlation coefficients ranging from 0.9 to 1 across the different nouns in this evaluation.

In this subsection, we compare the correlation coefficients between the SALAAM-tagged training data and the test data, SDC, against the performance ratio, PR. Table 5.12 presents those results. Row entries in the table are sorted in descending order by the SDC column. Observing the data in the table, we notice a strong correlation between SDC and the performance ratio, PR. This is further confirmed by a Pearson correlation coefficient of 0.87, which is significant.

Figure 5.6: A plot of the SDC and performance ratio on the 29 nouns

Figure 5.6 further illustrates the strong correlation. The hashed line in the figure is a plot of the SDC values against the solid line depicting the performance ratios. Upon close inspection of the nouns in the table, we observe that the nouns that have a high performance ratio have high SDC values. Yet, it is not always the case that high SDC values predict high performance ratios. For example, circuit and post have relatively very high SDC values, 0.794 and 0.859, respectively, but they score lower performance ratios than detention, which has a comparatively lower SDC value of 0.776. Examining these 3 data items in the previous tables, we notice that both circuit and post have many senses, 13 and 12, respectively, while detention has only 4 senses. Moreover, detention has a higher semantic translation entropy, as illustrated in Table 5.11, and a lower perplexity, as shown in Table 5.10. Therefore, we conclude that SDC is a very good indicator of PR but it still lacks vital information if used alone to make the correct prediction consistently.

5. Absolute Difference between Perplexity of senses of test data and SALAAM-tagged training data: PerpDiff

There exists a very high Pearson correlation coefficient of 0.96 between the perplexity measures of the test data and the human-tagged training data, in contrast to a relatively low correlation coefficient of 0.43 between the perplexity of the SALAAM-tagged training data and that of the test data. The two perplexity measures are illustrated in Figure 5.7.

Figure 5.7: A comparative view of the different perplexity measures in SALAAM-tagged training data and the test data for the 29 nouns

In Figure 5.7, the hashed line is the SALAAM-tagged perplexity and the solid line is the test data perplexity. Table 5.13 illustrates the absolute difference between the SALAAM-tagged perplexity and the test data perplexity, PerpDiff, and the performance ratios, PR, per noun. In Table 5.13, row entries are sorted by PerpDiff values in descending order.
Examining the data, the correlation between PerpDiff and PR is low. PerpDiff alone is not a good predictor of PR. We observe cases such as holiday with a very low difference, indicating that the two perplexity measures are close for that data item, yet the PR is also very low at 0.08, while circuit has a PerpDiff of 7.23 but achieves a relatively higher PR of 0.44. On the other hand, items such as art and bum have a relatively high PerpDiff but achieve very high PR scores.

6. Sense Context Confusability

This is a characteristic of the words in the training and test sets. Many of the senses of words in WordNet are similar, as illustrated by the sims relationship in WN17pre [24, 13]. Similar senses typically lead to similar usages and therefore similar contexts. Such similarity in contexts causes problems for the learning algorithm, since the features extracted from the corresponding contexts of the polysemous word instances will tend to be very similar, thereby detracting from the learning algorithm's discriminatory power. Such cases are referred to here as confusable sense contexts. A situation of sense context confusability arises when two senses are confusable and they are represented roughly uniformly in the training corpus.

Upon examining the 29 polysemous nouns in the training and test sets, we realize that a significant number of the words have similar senses according to a manual grouping provided by Palmer in 2002 as part of the SENSEVAL2
Moreover, the four words have a close perplexity to the perplexity observed in the test data as illustrated in Table 5.13 with PerpDiff of 0.71, 0.4, 0.16 and 1.31, respectively. However, upon inspecting spade, we note the existence of similar senses, as mentioned above, senses that are not considered in the manual grouping by Palmer since they are subsenses in WN17pre. All the automatic taggers (exper- imental conditions) in this investigation achieve an SDC above 0.95 for spade, and none of them have multiple instances in the similar senses, i.e. all the tag- gers have either 0 or 1 contexts for sense 4 and 5, with over 40 contexts for sense 2, indicating that there is always a skew in the distribution of the contexts for the potential similar sense contexts. Furthermore, the PR scores for all the taggers is  . Accordingly, this observation leads us to believe that if the senses are not con- 209 fusable and there is a sufficient number of contexts available, with a relatively close perplexity to that of the test data, then regardless of the SDC value, a tag- ger should be able to achieve high performance ratio; this is indeed the case with detention, where some of the taggers have a SDC of 0.26, yet they still yield a performance ratio of 1.0. As for the case of church, the perplexity of SALAAM-tagged training data is much lower than that of the test data as illustrated in Table 5.13, SALAAM- tagged perplexity is 1.15 while that of the test data is 2.83. This observation may lead us to conclude that perplexity is the determiner of the PR in this case. However, comparing stress to church, we note that both exhibit similar behavior with regards to the different factors examined here as illustrated in Table 5.15, yet stress achieves a PR of 1.0. As we can see in the Table 5.15, the characteristics for stress and church are very similar with the major difference being the Semantic Translation Entropy and PerpDiff. We inspect the contexts for stress for confusability and we discover no confusability in the contexts exists. Therefore, we may conclude that indeed the predicting factors are a combination of PerpDiff and Semantic Translation Entropy. The nouns that are intriguing in Table 5.14 are the ones that have relatively high SDC values yet their performance ratios are low such as post, nation, channel and circuit. For instance, nation has a very high SDC of 0.962, a low perplexity 210 of 1.3, relatively close to the 1.6 perplexity of the test data, a sufficient number of contexts (4350), yet its performance ratio is at a relatively low 0.59. According to the manual sense grouping listed in the table, senses 1 and 3 are similar, and indeed when we inspect the context distributions, we find the bulk of the senses? instances from senses 1 and 3 which create confusable contexts for the learning algorithm. We have established the importance of taking Sense Context Confusability into consideration when attempting to predict the performance of a tagger. But we are faced with the problem of quantifying this factor. Fortunately, we are able to use Resnik?s information theoretic similarity measure to quantifiably measure the similarity between the senses of polysemous words (see Chapter 3 for details) [68]. We conducted a preliminary experiment to investigate to what extent does the automatic similarity measure concur with the manual similarity grouping of senses. 
We evaluated 7 words, ranging from 2-7 senses per word: yew, dyke, spade, church, holiday, nature, and nation.10 The automatic measure assigned high similarity scores to 7 out of the 8 groups deemed similar by manual group- ings including our additional manual grouping of senses 2,4, and 5 for spade, as 10We use Resnik?s measure as a first step, but for this task the Lin [51] information theoretic measure is also appropriate. The difference between the two measures lies in normalization, the Lin similarity measure produces a similarity score between 0 and 1; while that of Resnik is not normalized. 211 well as senses 4 and 5 for church. In fact, the one case that was not considered similar by the automatic measure was a case of metonymy for yew where sense 1 means tree and sense 2 wood, which is a debatable case. But crucially the auto- matic measure did not group any senses that were not manually deemed similar. Therefore, one can easily use an information theoretic based similarity metric to quantify sense similarity. Using such a measure in conjunction with inspections of the uniformity of the distributions among the similar senses, allows for the quantification of the Sense Context Confusability. 5.9 Combining factors We have analyzed the different potential factors that could have an impact on the PR score. Our ultimate goal is to be able to automatically predict which words are good candidates for bootstrapping. Implementing a learning model to automatically predict good performance ratios is a matter of future work but we will consider several relevant features based on our discussion above. A first step would be to assign weights to the different factors affecting the perfor- mance ratio, PR, by training a learning model on the relevant ones. Such a learning framework will have to be implemented for each word independently. The learning model deduces the significant weight for each predictor. The overarching question is which predictors are relevant. From our detailed ex- ploration, the data suggests that SDC, Sense Context Confusability have a direct im- 212 pact on PR, yet, Semantic Translation Entropy and PerpDiff play an indirect role in the prediction. The impact of the number of examples seems more relevant in a clean training/testing environment; we observe cases that only have 5 training examples, yet achieved a PR of 1.0; moreover, there is the notion of sufficient number of examples, we could not deduce from the data what that magic number is except to note that it has to be more than 5 given an SVM learning paradigm for this specific application type. Therefore, we would want to combine the number of contexts as a factor. The number of senses does not seem directly relevant based on our analysis. Fortunately, there are several learning candidates robust enough for such an in- vestigation that range from a simple regression model to algorithms such as Decision Trees, Instance Based Learning, and Nai?ve Bayes frameworks. The predictors in such a framework will be a combination of nominal and numeric values. The nominal predictors for this evaluation are: language, for example, we have French, Spanish and Merged languages; corpus type, for instance, HT+SV2LS-TR, MT+SV2LS-TR, SV2LS-TR, or SV2LS-TR; and sense selection criterion, MAX or THRESH. The numeric predictors are SDC, PerpDiff, Number of Contexts, Semantic Translation Entropy, and Sense Context Confusability. 
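To make the proposed prediction setup concrete, the following sketch casts the problem as binary classification over the numeric predictors just listed, using a decision tree as one of the candidate learners named above. The feature values, the PR acceptability threshold, and the training rows are hypothetical; the intent is only to show the shape of such a framework, not a model actually trained in this thesis. Nominal predictors such as language or corpus type could be one-hot encoded and appended to each row.

```python
from sklearn.tree import DecisionTreeClassifier

# One row per tagger/word combination: [SDC, PerpDiff, #contexts,
# semantic translation entropy, sense context confusability].
X = [
    [0.99, 0.40, 28, 0.47, 0.0],    # hypothetical values
    [0.97, 0.16, 32, 0.49, 0.1],
    [0.96, 0.34, 4350, 0.30, 0.8],
    [0.20, 2.38, 1247, 0.25, 0.6],
]
# Binary target: 1 if the observed PR exceeds an acceptability threshold
# (e.g. PR >= 0.9), 0 otherwise.
y = [1, 1, 0, 0]

clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(X, y)

# Predict whether a new word/tagger combination is a good bootstrapping candidate.
print(clf.predict([[0.85, 0.70, 300, 0.40, 0.2]]))
```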
The learning framework will be given the predictors based on all the taggers and the value the learner is trying to predict is an acceptable performance ratio. This may be cast in a binary framework by setting a threshold on the acceptability value. It is worth emphasizing that two of the identified factors are dependent on the 213 test data, SDC and perp diff. Given the fact that the test data size is small relative to the hand tagged training data size required by a classical supervised system for WSD, SALAAM-tagged training is still a viable solution to the annotation acquisition bottleneck. 5.10 Summary In this chapter, we have introduced a new approach that combines an unsupervised and supervised learning method for WSD that makes significant strides toward easing the annotation bottleneck. This is accomplished by means of a trade-off between qual- ity and quantity. SALAAM produces large amounts of noisy data for training. We demonstrate the value of this approach using a precision ratio metric, PR, comparing a supervised WSD system trained on hand tagged data against that same system trained on SALAAM tagged data. The bootstrapping approach evaluated yields superior re- sults to those obtained by the only comparable approach which is tested on the same data set but bootstrapped using clean data. Essentially, the method introduced here is entirely unsupervised yet it is able to rival results obtained by a supervised method for 12 out of 29 noun items. Moreover, we explore, in depth, the question of when it is safe to use SALAAM tagged data as training data, since the approach works less optimally for several data items. We render a detailed analysis of the different factors affecting the performance ratio, PR. Finally, we make suggestions on how to combine the relevant factors toward 214 the goal of automatically predicting good performance ratio. 
215 Nouns # Senses Perp # Cont Nouns # Senses Perp # Cont art 17 4.92 104 fatigue 6 1.88 40 authority 9 3.73 99 feeling 5 2.83 52 bar 19 6.5 136 grip 6 3.25 50 bum 4 1.78 40 hearth 3 2.14 33 chair 7 1.66 63 holiday 6 1.73 31 channel 7 4.92 71 lady 8 2.83 52 child 7 2.14 63 material 16 4.92 74 church 6 2.46 65 mouth 10 4 71 circuit 13 9.19 85 nation 4 1.64 36 day 16 3.73 162 nature 7 4.59 52 detention 4 2.64 32 post 12 5.66 72 dyke 2 1.4 28 restraint 8 4.92 47 facility 5 2.83 58 sense 8 4.92 54 spade 6 2.3 32 stress 6 3.25 43 yew 3 1.85 28 Table 5.2: Test items for SV2LS-test 216 Nouns # Senses Perp # Cont Nouns # Senses Perp # Cont art 17 6.5 219 grip 6 3.48 100 authority 9 4.92 196 hearth 3 2.46 66 bar 19 7.46 271 holiday 6 1.93 65 bum 4 1.99 81 lady 8 2.64 99 chair 7 2 138 material 16 6.06 149 channel 7 5.28 138 mouth 10 3.48 130 child 7 2.83 130 nation 4 1.8 78 church 6 2.46 136 nature 7 4.59 100 circuit 13 8.57 174 post 12 5.66 145 day 16 4.59 321 restraint 8 4.92 94 detention 4 2.3 63 sense 8 4.92 111 dyke 2 1.61 59 spade 6 2.83 65 facility 5 3.25 120 stress 6 3.48 89 fatigue 6 2.14 77 yew 3 1.65 57 feeling 5 2.83 116 Table 5.3: Characteristics of hand-tagged training data for the SENSEVAL2 English Lexical Sample task 217 Nouns UMSST-human% Nouns UMSST-human% art 47.9 grip 58.8 authority 62 hearth 75 bar 60.9 holiday 86.7 bum 85 lady 72.5 chair 83.3 material 55.9 channel 62 mouth 55.9 child 58.7 nation 78.4 church 73.4 nature 45.7 circuit 62.7 post 57.6 day 62.5 restraint 60 detention 65.6 sense 39.6 dyke 89.3 spade 75 facility 54.4 stress 50 fatigue 80.5 yew 78.6 feeling 56.9 Table 5.4: Current evaluation gold standard precision results obtained by UMSST- human 218 Corpora Lines Tokens SV2LS-TR 61879 1084064 MT 151762 37945517 HT 71672 1734001 Total 285313 40763582 Table 5.5: SALAAM-tagged training corpora sizes 219 Nouns UMSST-SALAAM P. UMSST-human P. PR detention 68.8 65.6 1.05 chair 84.8 83.3 1.02 bum 85 85 1.00 dyke 89.3 89.3 1.00 fatigue 80.5 80.5 1.00 hearth 75 75 1.00 spade 75 75 1.00 stress 50 50 1.00 yew 78.6 78.6 1.00 art 46.9 47.9 0.98 child 57.1 58.7 0.97 material 51.5 55.9 0.92 church 56.2 73.4 0.77 mouth 40.7 55.9 0.73 authority 43.5 62 0.70 post 37.9 57.6 0.66 nation 45.9 78.4 0.59 feeling 33.3 56.9 0.59 restraint 33.3 60 0.56 channel 32.4 62 0.52 facility 27.6 54.4 0.51 circuit 27.7 62.7 0.44 nature 19.6 220 45.7 0.43 bar 18 60.9 0.30 grip 15.7 58.8 0.27 Nouns 1 2 3 4 5 detention 1.05 1.05 1.05 1.05 1.05 chair 1.02 1.02 1.02 1.02 1.02 bum 0.09 0.15 0.94 0.03 0.03 dyke 1.00 1.00 1.00 1.00 1.00 fatigue 0.03 1.00 1.00 0.03 0.03 hearth 1.00 1.00 0.17 1.00 1.00 spade 0.92 1.00 1.00 0.96 1.00 stress 0.11 0.05 0.05 1.00 1.00 yew 1.00 1.00 1.00 1.00 1.00 art 0.98 0.98 0.98 0.98 0.98 child 0.89 0.87 0.89 0.87 0.89 material 0.87 0.82 0.58 0.89 0.84 church 0.75 0.75 0.75 0.72 0.75 mouth 0.73 0.00 0.03 0.03 0.03 authority 0.60 0.60 0.58 0.67 0.70 post 0.63 0.66 0.66 0.47 0.58 Ave. 
PR 0.73 0.75 0.73 0.73 0.74 Table 5.7: PRs of the best individual conditions using UMSST-SALAAM train- ing data on the top 16 test nouns in Table 5.6: 1 is condition MT+HT+SV2LS- TR THRESH SP; 2 is condition HT+SV2LS-TR THRESH ML I; 3 is condition HT+SV2LS-TR THRESH ML U; 4 is condition MT+HT+SV2LS-TR MAX ML I; and 5 is condition HT+SV2LS-TR MAX ML I 221 Noun SALAAM-SV2LS-TR P UMSST-SALAAM P hearth 100 75 restraint 100 33.3 chair 86.7 84.8 yew 83 78.6 dyke 82.9 89.3 spade 75.4 75 fatigue 75 80.5 detention 71.9 68.8 bum 69.5 85 stress 63.6 50 material 60.3 51.5 child 56.2 57.1 bar 52.8 18 church 50.9 56.2 mouth 46.2 40.7 feeling 42.7 33.3 authority 40.7 43.5 channel 36.7 32.4 circuit 34.1 27.7 lady 31.9 11.8 nation 30.9 45.9 grip 25.3 15.7 art 25 222 46.9 sense 20.9 9.4 post 20.6 37.9 Noun # Senses # Contexts PR art 17 111 0.98 authority 9 656 0.70 bar 19 175 0.30 bum 4 122 1.00 chair 7 432 1.02 channel 7 213 0.52 child 7 694 0.97 church 6 342 0.77 circuit 13 213 0.44 day 16 1247 0.08 detention 4 148 1.05 dyke 2 37 1.00 facility 5 358 0.51 fatigue 6 95 1.00 feeling 5 217 0.59 grip 6 195 0.27 hearth 3 5 1.00 holiday 6 177 0.08 lady 8 404 0.16 material 16 681 0.92 mouth 10 55 0.73 nation 4 4350 0.59 nature 7 223 457 0.43 post 12 365 0.66 restraint 8 11 0.56 Noun # Senses Perplexity PR dyke 2 1.000 1.00 facility 5 1.050 0.51 grip 6 1.058 0.27 church 6 1.149 0.77 channel 7 1.189 0.52 nation 4 1.301 0.59 day 16 1.347 0.08 fatigue 6 1.357 1.00 chair 7 1.385 1.02 yew 3 1.580 1.00 holiday 6 1.682 0.08 child 7 1.693 0.97 sense 8 1.853 0.24 detention 4 1.932 1.05 circuit 13 1.959 0.44 spade 6 2.144 1.00 stress 6 2.144 1.00 lady 8 2.144 0.16 nature 7 2.297 0.43 art 17 2.297 0.98 authority 9 2.462 0.70 hearth 3 2.639 1.00 post 12 224 2.639 0.66 mouth 10 2.828 0.73 bum 4 3.031 1.00 Noun Semantic Translation Entropy PR bum 0.799 1.00 detention 0.758 1.05 holiday 0.664 0.08 restraint 0.623 0.56 post 0.615 0.66 stress 0.542 1.00 spade 0.492 1.00 yew 0.491 1.00 dyke 0.474 1.00 hearth 0.462 1.00 circuit 0.42 0.44 nature 0.418 0.43 fatigue 0.385 1.00 authority 0.381 0.70 grip 0.364 0.27 channel 0.361 0.52 sense 0.332 0.24 chair 0.327 1.02 feeling 0.317 0.59 mouth 0.299 0.73 material 0.297 0.92 bar 0.251 0.30 lady 2205.249 0.16 church 0.246 0.77 art 0.236 0.98 Noun SDC PR dyke 1.000 1.00 bum 0.999 1.00 chair 0.998 1.02 fatigue 0.995 1.00 hearth 0.990 1.00 yew 0.989 1.00 spade 0.975 1.00 mouth 0.964 0.73 nation 0.962 0.59 material 0.887 0.92 authority 0.860 0.70 post 0.859 0.66 art 0.830 0.98 child 0.800 0.97 church 0.795 0.77 circuit 0.794 0.44 detention 0.776 1.05 stress 0.770 1.00 channel 0.647 0.52 restraint 0.532 0.56 feeling 0.329 0.59 lady 0.215 0.16 bar 2206.200 0.30 sense 0.169 0.24 facility 0.156 0.51 Noun PerpDiff PR holiday 0.05 0.08 spade 0.16 1 feeling 0.20 0.59 yew 0.27 1 chair 0.28 1.02 nation 0.34 0.59 dyke 0.40 1 child 0.45 0.97 hearth 0.50 1 fatigue 0.52 1 lady 0.69 0.16 detention 0.71 1.05 stress 1.11 1 mouth 1.17 0.73 material 1.19 0.92 restraint 1.19 0.56 bum 1.25 1 authority 1.27 0.7 church 1.31 0.77 facility 1.78 0.51 grip 2.19 0.27 nature 2.29 0.43 day 2227.38 0.08 art 2.62 0.98 post 3.02 0.66 Noun SDC Grouped Senses PR detention 0.776 1.05 chair 0.998 (1,4)(2,3) 1.02 dyke 1.000 1.00 bum 0.999 (1,2,3) 1.00 fatigue 0.995 (1,3) 1.00 hearth 0.990 (1,3) 1.00 yew 0.989 (1,2) 1.00 spade 0.975 1.00 stress 0.770 (1,4)(2,5) 1.00 art 0.830 (1,4)(2,3) 0.98 child 0.800 (1,3)(2,4) 0.97 material 0.887 (1,4)(2,3) 0.92 church 0.795 0.77 mouth 0.964 (1,2)(3,4,8) 0.73 authority 0.860 (1,6)(2,5)(3,7) 0.70 
post 0.859 (1,2)(4,6)(5,7,8) 0.66 nation 0.962 (1,3) 0.59 feeling 0.329 (2,6)(4,5) 0.59 restraint 0.532 (1,3,4) 0.56 channel 0.647 (1,7)(2,4,6) 0.52 facility 0.156 (1,4,5)(2,3) 0.51 circuit 0.794 (2,3)(5,6) 0.44 nature 0.013 228 (1,4)(2,3) 0.43 bar 0.200 (1.2.13)(3,5,10) 0.30 grip -0.020 (1,6)(2,3) 0.27 Factor stress church SDC 0.770 0.795 Semantic Translation Entropy 0.54 0.246 PerpDiff 1.11 1.31 Number of Senses 6 6 Number of Contexts 302 342 Table 5.15: Characteristics of the nouns stress and church 229 Chapter 6 Facets of Similarity 6.1 Introduction The notion of similarity is endemic to most scientific endeavors. The research agenda boils down to seeking generalizations about the world; in most cases generalizations are made based on explicit or implicit groupings of phenomena or ideas. Word Sense Disambiguation (WSD) is not different from any other scientific pursuit in that respect. Similarity plays a vital role in this field; majority of WSD systems have within them formulated some variant on a similarity measure, a way for mapping observables to a set of predetermined or undetermined classes. All WSD systems, whether supervised or unsupervised, have an embedded similarity (generalization) function that maps a set of features from some defined Context onto a set of classes. In this chapter, we examine ways of automatically modeling semantic similarity; we are interested, in particular, in how they compare to human similarity judgments. 230 There is a close relationship between understanding how linguistic representations are used and acquired and the manner in which semantic similarity is modeled. Models differ in their assumptions about features. Some models of similarity, such as Tver- sky?s (1977), assume an explicit set of features over which a similarity measure can be calculated; some distributional methods for measuring word similarity may be viewed as an empirical implementation of such a model [9, 75]; in these methods, distribu- tional features of words are acquired from the analysis of large corpora. Other se- mantic models define the features implicitly focusing more on relations among lexical items in a semantic network type framework, a? la? Quillian (1968); methods that utilize such models compute similarities among words represented in a taxonomy exploiting its hierarchical structure. In several of these methods, the measure takes into account some corpus-based features such as frequency information (e.g., Rada, Mili, Bicknell, & Blettner, 1989; Lin, 1999; Resnik, 1999). There are a myriad of semantic similarity measures, which tend to look at differ- ent sources of evidence. Resnik [73] proposes using human similarity judgments as a reference point for comparing the different measures. Indeed, humans think in multi- dimensional space, there is plenty of evidence supporting the hypothesis that people tap into different knowledge resources to make judgments about the world [42]. In Resnik?s 1999 study, he focuses on nouns. The key question asked is how do different automatic similarity measures compare with one another against human judgments. In this chapter, we present a similar type of investigation but the focus is on verbs. We 231 design an experiment for human similarity using verbs and we compare the results obtained from different automated similarity measures against the obtained human judgments. 
This chapter is laid out as follows: in Section 6.2, we draw attention to the different intrinsic facets of verbs and motivate the experiment; Section 6.3 examines the differ- ent automated similarity models that are evaluated in this investigation; Section 6.4 describes the experiment, discusses the results obtained from both the human ratings independently and then in relation with the automatic measures; Section 6.5 describes a framework for incorporating the different automated measures as an approximation to human judgments in SALAAM; finally, the chapter concludes with a summary of the findings in Section 6.6. 6.2 Motivation Upon inspecting the performance of the different state-of-the-art WSD systems on the set of English verbs in the SENSEVAL 2 All Words task, we note a severe drop in the results obtained when compared to the results yielded for the nouns by those same systems. The average precision scored for the verbs is 30.8 % with the highest preci- sion score at 55.4% and the lowest precision score at 6.5%, with a standard deviation of 14.2. Likewise, recall scores are significantly lower with a range of 0.2% to 49.4%. These results, if nothing else, are indicative of the complexity of the task. It is hardly surprising that results obtained from both supervised and unsupervised systems alike 232 are low. This is attributed mainly to the sheer number of senses for verbs in WordNet; but also more importantly such results are a reflection of the fact that verb senses differ along more dimensions than simply paradigmatic ones. If a WSD system is not at least implicitly sensitive to these variations, it is subject to mis-tagging verb instances. Verbs are different from nouns in many respects.1 They vary paradigmatically, like nouns, however, in addition, verbs possess syntagmatic attributes. Typically, syntag- matic properties of verbs are characterized in terms of different dimensions. Being more relational in nature, verbs lay restrictions on the type and properties of the words that are associated with them. Some of these syntagmatic attributes are defined in terms of syntactic subcategorization properties and thematic restrictions, aspectual class at- tributes, and selectional preferences. These characteristics are not independent from each other in most cases. Given the complex nature of verbs, we design an experiment to obtain human judg- ments on verb semantic similarity. The design and choice of the experiment items pay close attention to the different dimensions of meaning associated with verbs. Such an experiment is intended to serve as a point of reference of automatic semantic similarity measures; moreover, guided by insights from such a study, we lay a more cognitively salient framework for utilizing the different automated similarity measures within the area of WSD. In the process, we explore different semantic similarity measures and 1We acknowledge that the distinction between verbs and nouns is not that discrete; we understand that a range does exist; but for purposes of this chapter we are not discussing the full spectrum. 233 compare them with human ratings on verb similarity. 6.3 Models of Verb Similarity Automated semantic similarity measures based on different similarity models typically take advantage of and are sensitive to one of or more of the many paradigmatic and syntagmatic attributes of verbs. Such methods may do so either implicitly or explicitly. 
For example, methods that depend on word adjacency collocations in text implicitly take advantage of the word order of the verb and its arguments, yet, depending on the prespecified window of text of interest, they may not be able to capture long-distance syntactic relations. Methods that depend on WordNet and WordNet-style ontologies tend to be sensitive to the IS-A relationship within the taxonomy. The IS-A relation captures the manner dimension of meaning for verb semantics. Such methods are known to yield good performance with nouns. Unlike the noun taxonomy, however, the verb hierarchy is very shallow and broad. Similarity methods that measure similarity via syntactic relations may be sensitive to adjunction relations as well as locative and temporal modifiers when measuring verb similarity.

Accordingly, in this chapter we consider three classes of similarity measures, corresponding to three types of lexical representation ranging in level of syntactic depth from paradigmatic (syntactic-light) to heavily dependent on syntax. In the first class, verbs are associated with nodes in a hierarchical ontology; at one end of the spectrum, this class is shallowest in explicit syntactic representation. In the second, verbs are represented by distributional syntactic cooccurrence features obtained by parsing a large corpus. Finally, in the third class, verb entries are represented according to a theory of lexical conceptual structure. This class represents the other end of the syntactic spectrum, with explicit coding of syntactic facets of meaning.

6.3.1 Class 1: Taxonomic Models

WordNet represents taxonomic models in this study. As mentioned earlier, this type of model focuses on the paradigmatic aspects of meaning for verbs. It is in clear contrast to efforts that classify verbs based on their syntagmatic behavior. For the purposes of this investigation, we use WordNet 1.5 (see Chapter 3 for a full description of WordNet-style ontologies). We present three measures of verb similarity as follows:

Edge Count Similarity

Given the hierarchical nature of the WordNet verb taxonomy, the simplest method of calculating similarity between two verbs is to count the number of intervening edges. The total number of edges is subtracted from the maximum possible number of edges in the taxonomy. Accordingly, edge sim is calculated for two verbs $v_1$ and $v_2$ as follows:

$$\mathit{edge\_sim}(v_1, v_2) = \max_{s_i, s_j} \left[ 2 \times \mathit{MAX} - \mathit{len}(s_i, s_j) \right] \qquad (6.1)$$

where $s_i$ ranges over all the senses (synsets) of verb $v_1$, and $s_j$ ranges over all the synsets of verb $v_2$; $\mathit{MAX}$ is the maximum depth in the taxonomy, and $\mathit{len}(s_i, s_j)$ is the length of the shortest path from $s_i$ to $s_j$. Intuitively, if the synsets for the two verbs are not from the same portion of the taxonomy, the value of edge sim is 0.

Edge counting is well known for its problems in overestimating and underestimating similarity between concepts in WordNet. This is mainly owing to the fact that subtrees vary in their bushiness, i.e., some trees are shallower than others.

Information Theoretic Similarity

Information theoretic similarity measures address the problem associated with the edge counting measure. Information based measures typically assign weights to nodes in the taxonomy. The weights are quantified in terms of information content. We describe two variants of information based similarity: res info, devised by Resnik [73], and lin info, devised by Lin [51]. The inherent structure of the taxonomy arranges the nodes such that the more abstract concepts in the tree are higher than the more specific concepts.
For instance, the concept FRUIT is higher than the concept APPLE. Both measures exploit this feature of hierarchical taxonomies by assigning lower information content to the broader concepts than to the more specific concepts. Intuitively, this is a reasonable assumption, since the amount of informativeness (information contribution) is lower in a broader concept than in a more specific concept. Both measures calculate information content based on the unigram frequencies of the concepts in a corpus. The amount of information in a node is calculated as follows:

$$IC(c) = -\log p(c) \qquad (6.2)$$

where $c$ is a concept in the tree and $p(c)$ is its probability estimated from corpus frequencies. The more frequent a concept in a taxonomy, the lower its information content.

However, not all concepts in the tree occur in the corpus. Accordingly, the frequencies of more specific nodes are propagated upwards in the tree such that a parent node's frequency is the aggregate of the frequencies of all its children, in addition to its own frequency in the corpus if it happens to occur. Consequently, the broader concepts in the tree are automatically assigned lower information content.

Based on this characterization of information content, the similarity between two concepts in a taxonomy is measured according to res info as

$$\mathit{res\_info}(c_1, c_2) = \max_{c \in S(c_1, c_2)} IC(c) \qquad (6.3)$$

where $S(c_1, c_2)$ is the set of concepts that subsume both $c_1$ and $c_2$, and $IC(c)$ is the information content of node $c$ obtained in the manner described above. In the res info measure, the most informative subsumer of the two concepts is the subsumer with the highest information content among all possible subsumers.

The lin info measure between two concepts closely resembles the res info measure, but it normalizes the shared information content by the sum of the information content of the two concepts being compared. Therefore, lin info similarity between two concepts $c_1$ and $c_2$ is calculated as follows:

$$\mathit{lin\_info}(c_1, c_2) = \frac{2 \times \max_{c \in S(c_1, c_2)} IC(c)}{IC(c_1) + IC(c_2)} \qquad (6.4)$$

where the numerator picks out the maximally specific superclass shared by $c_1$ and $c_2$. The range of this similarity is 0 to 1.

The most important feature of both measures is the definition of similarity as a function of the shared information content. Similarity measured in this manner is not sensitive to the number of edges intervening between two concepts in a tree; therefore, it is not prone to the problems associated with edge sim.

The equations above characterize the measure of similarity for two concepts (synsets/senses) in the taxonomy, not verbs. In order to obtain the similarity of two verbs $v_1$ and $v_2$, the following calculation is utilized by both res info and lin info:

$$\mathit{sim}(v_1, v_2) = \max_{c_1 \in \mathit{senses}(v_1),\; c_2 \in \mathit{senses}(v_2)} \mathit{sim}(c_1, c_2) \qquad (6.5)$$

where $\mathit{sim}(c_1, c_2)$ is either res info or lin info. According to equation (6.5), the most informative subsumer is the subsumer with the highest information content across all sense pairings for the two verbs $v_1$ and $v_2$.

6.3.2 Class 2: Distributional Co-occurrence Model

Lin [50] demonstrates the generality of information theoretic similarity by showing how such measures can be used to measure not only taxonomic distance but also string similarity and the distance between feature sets. This approach is illustrated by representing words as collections of syntactic cooccurrence features obtained by parsing a large corpus. For example, both the noun duty and the noun sanction would have feature sets containing the feature subj_of(include), but only sanction would have the feature adj_mod(economic), since economic sanctions appears in the corpus but economic duties does not.
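A toy rendering of such feature sets may make the representation concrete. The features shown below are invented for illustration; in Lin's actual model they are harvested from the syntactic analysis of a parsed corpus.

```python
# Hypothetical syntactic co-occurrence features for two nouns.
features = {
    "duty":     {"subj_of(include)", "obj_of(assume)", "adj_mod(fiduciary)"},
    "sanction": {"subj_of(include)", "obj_of(impose)", "adj_mod(economic)"},
}

# Only the features attested for both words count as shared evidence.
shared = features["duty"] & features["sanction"]
print(shared)
```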
Because these features include both labeled syntactic relationships and the lexical items filling argument roles, the underlying representational model can be thought of as capturing both syntactic and semantic components of verb meaning.

Lin computes the quantity of shared information as the information in the intersection of the distributional feature sets for the two items being compared. This yields the following measure, lin dist:

$$\mathit{lin\_dist}(w_1, w_2) = \frac{2 \times I\big(F(w_1) \cap F(w_2)\big)}{I(F(w_1)) + I(F(w_2))} \qquad (6.6)$$

where $F(w)$ is the feature set associated with word $w$, and where $I(S)$, the quantity of information in a feature set $S$, is computed as $I(S) = -\sum_{f \in S} \log P(f)$. Lin assumes that the features are independent, therefore allowing the summation of the log probabilities in equation (6.6). In the experiments described here, we use similarity values obtained for verb pairs using Lin's implementation of his model, with his feature sets and probabilities obtained by analyzing a 22 million word corpus of newswire text, the San Jose Mercury.

6.3.3 Class 3: Semantic Structure Model

The third automated method for assessing the semantic similarity of verbs relies on detailed representations of verb semantics according to the theory of lexical conceptual structure, or LCS [19, 35]. In the theory of LCS, a verb representation is defined in terms of its semantic structure and semantic content. The semantic structure represents the different dimensions of syntactic (syntagmatic) features for a verb, while semantic content is defined as the idiosyncratic (paradigmatic) information associated with it. In this model, a clear distinction is made between semantic content and semantic structure. This same distinction plays a central role in current studies of lexical representations [28, 64, 66].

We take advantage of the explicit distinction between semantic structure and semantic content to derive a measure that focuses exclusively on similarity of semantic structure, independent of semantic content.

To illustrate LCS representations, we present the definition of the verbs run and jog, which may have the same semantic structure, indicating a change of location, depending on the senses of the verbs, e.g.,

(go x (from x (at x y)) (to x (at x z)) (manner (M)))

The verbs differ only in the value (M), an element of semantic content within the semantic structure, indicating the manner of motion (either (jogging) or (running)). Such regularities in semantic structure are argued to provide an explanation for systematic relationships between meaning and syntactic realization [47, 44, 45, 46].

Given such LCS structures, we observe the patterning in semantic structure between verbs that are highly similar. We devise an algorithm inspired by Lin's adaptation of his information theoretic measure to calculate similarity on parsed tuples.2 In his approach, he decomposes parse trees into (pseudo) independent features and uses his information theoretic similarity measure, lin info, to calculate similarity. The algorithm we devise performs the same task: it converts each LCS entry into a series of tuples, conflating hierarchical information in the LCS structure. We recursively create an independent feature from each primitive component of the LCS representation and the head of its subordinates. So, for example, the feature set representation of run contains six features: (go from to manner) (from x at) (at x y) (to x at) (at x z) (manner RUNNING).
The created features of jog are identical except for the manner feature, expressed as the last feature above, which instead would be (manner JOGGING). Therefore, we observe a complete overlap between the feature sets for the two verbs apart from the manner value, which captures the fact that the semantic distinction between this particular pair of verbs rests entirely on semantic content, and not semantic structure.

2 We are grateful to Dekang Lin for the idea (personal communication).

We have available to us a large lexicon of LCS representations for verbs in English [20], containing thousands of lexical entries. The probability of each feature is estimated by counting feature occurrences within the lexicon. We acknowledge that the probability estimate calculated in this manner counts features within a set of entries in a large lexicon (types) rather than verb instances in a large corpus (tokens), but inspection of the estimated probabilities suggests that frequent features are relatively discounted, having low information content, and rare features have high information content.

Accordingly, lcs sim is calculated for two entries, one for verb $v_1$ and one for verb $v_2$, using the shared information content of their feature sets:

$$\mathit{lcs\_sim}(e_1, e_2) = I\big(F(e_1) \cap F(e_2)\big) \qquad (6.7)$$

where $F(e)$ is the tuple feature set of entry $e$, and $I\big(F(e_1) \cap F(e_2)\big)$ is measured as in equation (6.6). Then the similarity lcs sim between two verbs is calculated over all their entries corresponding to the different senses, as the maximum value of equation (6.7) taken over the cross product of the verbs' lexical entries.

Therefore, this similarity measure considers only semantic structure, not semantic content; accordingly, only syntagmatically relevant features take part in the computation. When comparing run and jog in the specific example mentioned above, they differ only in their semantic content, their paradigmatic idiosyncrasies, which are not captured by this model.

6.4 Human Judgment Experiment

We design a human experiment to collect ratings of semantic similarity between verbs. The intent of the experiment is to establish a reference point for comparing different automated similarity measures. The design follows that of Miller and Charles (1991), which is a design for noun similarity. However, when comparing verbs, considering their multidimensional nature and the complex intertwining of their attributes, the choice of experimental material has to be controlled. Therefore, we pay close attention to syntactic subcategorization, thematic grids, and aspectual class information, as described below, in order to limit the possible dimensions across which the two verbs in a pair could differ and to focus on semantic similarity.

Moreover, we create two versions of the experiment: one with the verbs presented to the participants with no context (No-Context), and the second with the verbs presented in context. The idea is to examine to what extent Context has an effect on verb similarity ratings.

6.4.1 Participants

We have a cohort of 10 volunteer subjects, 5 women and 5 men, ranging in age from 24 to 53. They are all native speakers of English who participated by email. None of the participants has a significant background in psychology or linguistics.

6.4.2 Materials

As mentioned earlier, in order to capture semantic similarity between verbs, we need to control for the different dimensional variations associated with verbs. Fortunately, we have available to us a large lexicon of English LCS structures comprising 4900 entries [20].
We control for three syntagmatic dimensions in the choice of the verb pairs:

Aspectual Class

Each verb entry in the LCS lexicon contains information about its aspectual features: whether the verb entry is dynamic, durative or telic [21]. Verbs are classified into four main aspectual classes: Activities, States, Achievements and Accomplishments. The combination of aspectual features predicts the aspectual class of the verb. Table 6.1 is taken from (Dorr & Olsen, 1997) [22], where a 1 indicates the presence of a feature and a 0 indicates its absence.

Aspectual Class   Telicity  Dynamicity  Durativity  Example Verbs
State             0         0           1           know, have, be
Activity          0         1           1           march, paint, dance, chase
Accomplishment    1         1           1           destroy, eat, build
Achievement       1         1           0           notice, win, break

Table 6.1: Aspectual features determining aspectual class for verbs

Thematic Grid

The thematic grid information identifies whether or not a verb takes an Agent, Theme, Goal, etc.

Subcategorization Frames

This identifies whether a verb takes an object or two objects, for example. For information on subcategorization, we used the subcategorization frame for the first listed verb sense provided in the Collins Cobuild Dictionary [76].

Accordingly, a verb such as broil requires both an Agent and a Theme, is marked as both Durative and Telic but not Dynamic, and has the subcategorization frame (v+o).

In order to build verb pairs, we first remove all verbs whose thematic grids do not require a theme, so as to limit the range of variation in thematic grids. All verbs require an agent, so the remaining variation lies in the presence or absence of oblique roles such as goal. The next phase is to group the full set of verbs into lists corresponding to the eight possible combinations of the three aspectual features; the lists are then reduced to the four most numerous ones, which are {Durative}, {Durative, Dynamic}, {Dynamic, Telic}, and {Durative, Dynamic, Telic}. Verbs may and do appear on multiple lists. Within each of those four lists, all possible pairings of verbs that match in terms of subcategorization frames are created, and 12 pairs are selected that range from low- to high-similarity pairings. In summary, a set of 48 verb pairs is constructed such that:

1. both verbs in every pair require a theme,
2. both verbs have the same subcategorization frame, and
3. both verbs come from the same aspectual class.

Verbs on the list are all presented to the participants in the past tense. In order to avoid ordering effects, the order of the verb pairs is randomized; half the subjects in either of the two conditions (see Section 6.4.3) are shown the items in a specific order, and the other half are shown the same items in the reverse order.

6.4.3 Conditions

We create two experimental conditions: Context and No-Context. The materials just described are duplicated in order to create two distinct sets of conditions. The conditions are exactly the same, with the exception that in the Context condition each verb in the verb pairings presented to a participant appears within an example sentence which demonstrates the verb's intended sense. The contextual sentences are taken from the corresponding verb entry in the Collins Cobuild Dictionary. For example, the sentence for enrich is They enriched the library with new books.

6.4.4 Procedure

Human Ratings Experiment

The 10 participants are divided evenly into two groups corresponding to the experimental conditions, the Context and No-Context groups.
Subjects in the No-Context group are given the set of 48 verb pairs, without example sentences. They are asked to compare their meanings on a scale of 0-5, where 0 indicates that the verbs are not similar at all and 5 indicates maximum similarity. Participants are explicitly asked to ignore similarities in the sound of the verbs and similarities in the number and type of letters that make up the verbs. Participants are also asked specifically to rate similarity rather than relatedness, with the instructions giving an example of the distinction: for instance, spend and eat are related since they are associated with shopping malls, but they are not semantically similar. As some verbs in the set are of low frequency, a don't know option is included for subjects to mark if they are unsure of the meaning of either verb. We impose no time limit on the subjects for carrying out the task; however, the experiment tends to take approximately 20 minutes.

From the full set of 48 verb pairs, 10 are excluded because some participant did not know the definition/intent of one or the other verb in a verb pair item. Furthermore, upon inspecting the materials closely after the experiment, we discover that 11 items did not match the strict rules with which we controlled the variability in the set. The 21 pairs of verbs excluded are distributed evenly across the four aspectual classes we consider for this experiment; therefore, we do not believe that the exclusion has a significant impact on the general observations. Accordingly, we report only on 27 verb pairs. Table 6.2 shows the final 27 verb pairs.

bathe, kneel             loosen, open
chill, toughen           neutralize, energize
compose, manufacture     obsess, disillusion
compress, unionize       open, inflate
crinkle, boggle          percolate, unionize
displease, disillusion   plunge, bathe
dissolve, dissipate      prick, compose
embellish, decorate      swagger, waddle
festoon, decorate        unfold, divorce
fill, inject             wash, sap
hack, unfold             weave, enrich
initiate, enter          whisk, deflate
lean, kneel              wiggle, rotate
loosen, inflate

Table 6.2: The final verb pairs used in the human judgments experiment

Participants in the Context group are given exactly the same task, but using the Context materials as described before.

Automatic Experiment

The different computational similarity measures are calculated based on the descriptions rendered in Section 6.3. The set of 48 verb pairs is submitted to the respective similarity measures with no contextual information. In summary, each verb pair gets a set of 5 similarity scores computed based on the different automated measures.

6.4.5 Results

We calculate the correlation coefficients of each automated similarity measure with both human conditions, in order to assess the extent to which sets of similarity ratings can predict one another. The correlation metric utilized is Pearson's r. Table 6.3 shows the resulting correlations. The Combined row of Table 6.3 shows the value of a linear multiple regression R when the five computational measures are compared with the human ratings (see below); and the InterRater row of the table shows average human inter-rater agreement, measured by r, using leave-one-out resampling according to Weiss & Kulikowski (1991).

Inspecting each of the similarity measures individually, we observe that for both the Context and No-Context conditions the taxonomic measures res info, lin info and edge sim outperform the distributional measure lin dist and the LCS measure lcs sim in their correlations with the human ratings.
Sim Measure   Context  No-Context
edge sim      0.720    0.675
res info      0.779    0.658
lin info      0.768    0.668
lin dist      0.453    0.433
lcs sim       0.313    0.385
Combined      0.872    0.785
InterRater    0.793    0.764

Table 6.3: Comparing the different automated similarity measures to the two human conditions

6.4.6 Discussion

We are not surprised by the low correlation attained by the LCS measure, lcs sim, since we control in our experimental design for the most salient features that make LCSs interesting; we control the variation along the semantic structure dimension.

The results obtained by lin dist are also quite low relative to both human conditions. This measure depends on the syntactic analysis of a large corpus. The low correlation is a reflection of the fact that the measure is dependent on the corpus being utilized and on the frequency in that corpus of the verbs selected for this experiment. For example, some low frequency verb pairs such as decorate/embellish and dissolve/dissipate show wide differences from the human ratings.

Examining the correlations obtained via the taxonomic measures, we note the superiority of the information measures to the edge counting measure in the Context condition, which is in congruence with the results obtained by Resnik (1999) on nouns; however, in the No-Context condition edge sim is no different from either of the information theoretic measures. At this point we are not certain why this is the case, and it will need to be investigated with a larger set of verb pairs before the results can be conclusive.3 There is no significant difference between the two information theoretic measures, lin info and res info; they have a Pearson's r of 0.96.

3 For the time being we entertain the hypothesis that edge counting is more correlated with the No-Context condition since people are more liberal in assigning similarity scores in the No-Context condition than in the Context condition, which matches the behavior of the edge counting measure, which is not sensitive to the weighting of edges; i.e., small distances such as the distance between pen and ballpen have the same weight as the edge between toy and artifact.

Quantitative analysis

Several interesting observations come to the fore when comparing the human judgments. First, a comparison of the Context and No-Context mean ratings by human participants yields r = 0.89, which provides some reassurance that participants in the No-Context condition are generally interpreting the verbs in the same sense as participants in the Context condition, where, as previously stated, the Context sentence encouraged interpretation according to the first listed verb sense in the Collins Cobuild Dictionary. These results also indicate that the first sense listed in the dictionary is indeed the default sense for participants.

Secondly, average inter-rater agreement in the two conditions (0.79 and 0.76) is much lower than that obtained in a noun ratings experiment using the same method and leave-one-out resampling [73]. This supports our hypothesis that judging verb similarity is a harder task than judging noun similarity; owing to the multidimensional facets of meaning for verbs, quantifying their similarity is a more involved task.

Thirdly, we find that participants in the No-Context condition have a very strong tendency to assign higher similarity ratings to the same pair when compared to participants in the Context condition, as determined using a paired t-test.
This last observation is in line with the notion that participants in the No-Context condition are accommodating the verb comparisons, providing room for more flexible interpretations of verb meaning, in a manner that is not conveniently accessible to participants in the Context condition, since their interpretations are constrained by the context sentence.

Finally, we combine the five measures in a multiple regression model, R, where the measures predict the human ratings. The basic idea is to validate the notion that combining different sources of information (the different models) for verb similarity will yield higher correlations with the human ratings. Therefore, we use the similarity scores yielded by each measure as independent variables, and the human ratings in each condition, independently, as the dependent variable. The score reported in the Combined row is the multiple regression value resulting from using all five measures as predictors. Looking across the different correlation scores obtained, the one yielded by the combination of all five measures is the best predictor of the human ratings. Although lcs sim and lin dist do not yield high correlations with the human ratings, they do seem to contribute to the predictive power of the regression model, since they rely on different sources of information.

In summary, supported by the performance of the models, as well as the improved predictive power of the multiple regression, we interpret the outcomes as evidence that human ratings of similarity are sensitive to both paradigmatic and syntagmatic facets of verb representation, and we posit that the computational models are capturing important aspects of verb representation in order to make predictions about similarity judgments.

Qualitative Analysis

We qualitatively examine the cases where none of the similarity measures assigns a similarity score and compare those with the ratings assigned by participants in the experiments; we speculate that the subjects are tapping into dimensions of meaning that are not captured by any of the similarity models and possibly not yet characterized or fully formalized. For example, for verb pairs such as unfold/divorce, chill/toughen, and initiate/enter, all five measures assign scores of zero, yet the mean human ratings are 1.6, 1.4, and 3.2, respectively, in the No-Context condition. The ratings are low, but they are still higher than those of some other pairs that did get more than a zero from one of the automated similarity measures, such as the verb pair open/inflate, which gets a mean human rating of 0.6. In many of the cases, the apparent sense extensions seem to verge on the metaphorical: one can describe divorce as the unfolding of a marriage, a person may chill and toughen when being insulted, and one may enter a group by being initiated into it. Attempting to integrate these dimensions of similarity into our models requires a better understanding of how word meanings are portrayed and intertwined, which is a very intriguing line of inquiry for future research but falls outside the scope of our current research.

6.5 Application to SALAAM

Given the very interesting insights obtained from the relation between the different automated similarity measures and the human ratings, we devise a framework for combining different automated measures of verb similarity within SALAAM. As mentioned above, the results obtained on the verb portion, in general by all participating systems, are significantly lower than those yielded for the nouns.
For SALAAM, the results yielded follow the same significant drop when com- paring verbs to nouns. We examine the results obtained by condition 4, in Chapter 3, GLSYS-SP T, which is the intralanguage merge of the Spanish with the sense selection criterion set to THRESH. We use a variant on the NG algorithm ? described in Chap- 255 ter 3 ? adapted for the verb WN1.7pre sense inventory called Verb Grouping (VG), also implemented by Resnik. VG is based on the same exact information theoretic similarity measure used for nouns res info described in Section 6.3.1. It specifically targets the IS-A taxonomy in WN1.7pre. SALAAM yields a precision of 32.4% and recall of 8.9% for verbs, in striking contrast to the 65.7% precision score and 50% re- call score for nouns in the same experimental condition. We examine the source type sets, they are very clean, in fact, qualitatively better than noun source type sets within the same condition. We observe that they do not have the pervasive noise attested for in the noun source type sets. Figure 6.1 illustrates a random sample from the verb source sets yielded by SALAAM.4 6.5.1 Integrating Human Ratings in SALAAM: A Cognitive Based Feasibility Study The obtained results are a reflection of the utilization of a similarity model that relies on a single dimension of meaning for verbs, the paradigmatic dimension. As explained earlier, verbs have a very rich multidimensional set of characteristics that are of sig- nificant relevance when comparing verbs. Such a reduction of the multidimensionality of the verbs to a single facet of meaning does not capture the possible fine variations between the different senses of the verb. The different senses of a verb are represented 4An observation worthy of noting, many of the source sets have the same members since we do not lemmatize the words on the target side of the parallel corpus. 256 in the taxonomy, however, with no explicit distinction for the different syntagmatic characterizations of the verb senses, at least within an IS-A style taxonomy such as WordNet, which is the target of the similarity measure. Framework Guided by the observation based on the human experiment that combining different automated similarity measures is a good approximation of human judgments, we set out to describe a framework that is based on the human experiment above for improv- ing verb similarity, primarily for the benefit of SALAAM. Crucially, the combination of the different measures, in order to be effective, has to rely on various sources of information, various models of the different dimensions of verb meaning. In the experiment described in Section 6.4, we compare human ratings against verb similarity measures that consider paradigmatic and syntagmatic information; the measures differ in their source of data, from WordNet sense inventory to a parsed corpus to LCS representations. We obtain the correlations between the different automated similarity measures based on a linear regression model. Linear regression models are predictor models. Put simply, the linear regression model yields the best fit of the data to a straight line.5 5A valid argument could be that the relation between the different sources of information is not necessarily linear; however it is proven that when we expand a continuous function in a Taylor series, often the lowest terms, which are the linear terms, are the most important, resulting in the best simple approximation yielded by a linear model. 
[29] It estimates the weights contributed by a set of predictor variables in order to predict a response variable. Regression models are compelling because they are easy to understand and compute, and quite often they outperform more complicated prediction models. The regression equation is as follows:

$$\hat{y} = \beta_0 + \sum_{i} \beta_i x_i \qquad (6.8)$$

where $\hat{y}$ is the predicted value of the response variable $y$, $\beta_0$ is the intercept of the regression plane, $\beta_i$ are the regression coefficients (weights), and $x_i$ are the variables used to predict $y$ [29].

Accordingly, the framework we are proposing applies this model to the SALAAM similarity calculation phase, to the VG algorithm. The idea is to use different similarity measures on the verb source type groups and to combine the resulting similarity values, crucially weighted by the coefficients obtained from a human judgment experiment. As we note earlier in this chapter, the various similarity measures contribute differently to the overall correlation with the human ratings. Therefore, weighting the different similarity measures based on their expected predictive value for the human similarity judgment renders a cognitively based framework for automatically measuring semantic similarity.

We set out to explore the requirements for testing the proposed framework in order to investigate its impact on SALAAM's performance.

Hypothesis

Combining different automated similarity measures modeling different aspects of verb semantics, according to linear coefficients obtained from a human judgment experiment, should yield better SALAAM performance than if the different measures are used separately or simply added together with equal weight.

Resources

Since the end goal is to produce sense tagged data, the need arises for obtaining similarity values between the different senses of the verbs in the verb source type groups. This does not constitute a problem for res info, lin info, or edge sim, since they are applied to the verb taxonomy in WN1.7pre. The problem is encountered by lin dist: since it is based on parsing a large corpus, it calculates the similarity between verbs at a coarser level than sense granularity. In order to apply lin dist directly, we would require the availability of a large enough corpus tagged with verb senses, which does not exist; therefore, lin dist is excluded from this experiment. As for LCS similarity, we would need a large lexicon of LCS entries, crucially marked with WN1.7pre verb sense ids.

Feasibility Experiment

We conduct a feasibility experiment to test our hypothesis. We use the same metrics defined for SALAAM in condition 4 in Chapter 3, GLSYS-SP T. The test set is the set of 543 verb instances in the SV2AW English corpus (for details, see Section 3.7.1 in Chapter 3). The ontology used is WN17pre, since it is the sense inventory used for the gold standard.

Experimentation

We adapt the VG algorithm to calculate several different similarity values and combine them according to the regression equation, equation (6.8).

We use the similarity measures res info, lin info and edge sim as implemented in the publicly available package WordNet-Similarity 0.03 by Patwardhan & Pedersen (2003).6

We add another similarity measure not mentioned above, an adapted Lesk measure, lesk [3]. The adapted Lesk measure uses the basic Lesk algorithm of measuring the amount of overlap between the definitions of two words in a dictionary as a measure of their similarity. In lesk, the algorithm is applied to the glosses of the synsets in the WordNet ontology.
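The gloss-overlap idea behind the adapted Lesk measure can be sketched as follows. This uses NLTK's WordNet interface rather than the WordNet-Similarity package employed in the experiment, and the simple token overlap shown ignores the refinements of the adapted measure (such as overlap with glosses of related synsets); it is meant only as an illustration of the basic idea.

```python
from nltk.corpus import wordnet as wn

def gloss_overlap(synset_a, synset_b):
    """Crude Lesk-style score: number of word types shared by the two glosses."""
    gloss_a = set(synset_a.definition().lower().split())
    gloss_b = set(synset_b.definition().lower().split())
    return len(gloss_a & gloss_b)

# Compare the first-listed verb senses of "run" and "jog".
v1 = wn.synsets("run", pos=wn.VERB)[0]
v2 = wn.synsets("jog", pos=wn.VERB)[0]
print(gloss_overlap(v1, v2))
```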
The rationale behind using such a measure in this context is as an approximation to lin dist, since lesk considers the words surrounding the verb in the calculation, therefore implicitly coding syntactic features.7

6 http://www.d.umn.edu/cs/~tpederse/research/

7 We may obtain data that could be used for the lin dist measure from the sense tagged training data of the SENSEVAL exercises, in addition to the verbs' glosses in WordNet and SemCor, yet the size of such a corpus would still be relatively small.

Fortunately, for the LCS measure, lcs sim, many of the approximately 10K lexicon entries are marked with WordNet 1.6 sense ids. We map the sense ids to WN1.7pre using the publicly available mapper.8 The entries are expanded to roughly 27K entries indexed by the WN1.7pre sense ids. The entries are converted to the tuple format described in Section 6.3. The measure is applied to the pairs of senses of the verbs according to equation (6.7).

8 http://www.lsi.upc.es/~nlp/tools/mappings.html

Conditions

1. Default
We examine the performance of SALAAM with the individual similarity measures: SALAAM-edge sim, SALAAM-lin info, SALAAM-res info, SALAAM-lesk, and SALAAM-lcs sim.

2. Combined-Equal
In this condition, the different measures are combined with the same weight; therefore, the coefficients in equation (6.8) are $\beta_i = 1$ for all $i$ and the intercept is $\beta_0 = 0$.

3. Combined-Weighted
In this condition, the coefficients obtained from the regression model applied to the different similarity measures predicting the human ratings are used.9

9 We calculate the coefficients for lesk using the 27 pairs of verbs used in applying the other automatic similarity measures.

Results

Table 6.4 shows the regression coefficients obtained for the different similarity measures applied to the 27 verb pairs used in the experiment described in Section 6.4. In this preliminary study, we only consider the Context condition from the human judgments experiment for comparison with the different similarity measures. lesk's correlation with the human Context condition is 0.44, which is close to the correlation of lin dist with the same human condition. It is worth noting that the combined correlation coefficient excluding lin dist and including lesk is decreased relative to the Combined value in Table 6.3; the regression also yields a value for the intercept $\beta_0$.10 We note that the three taxonomic similarities edge sim, res info and lin info have the highest coefficients in magnitude, corresponding to their correlations with the human judgments as illustrated in Table 6.3. lcs sim follows with a very small coefficient, and the coefficient for lesk is almost negligible.

10 We experimented with removing either res info or lin info to avoid co-linearity effects, since there was a high correlation between the two measures; the results yielded are the same with either metric.

Measure     Coefficient
edge sim    0.247
lin info    -0.353
res info    0.325
lcs sim     -0.014
lesk        -0.001

Table 6.4: Regression Coefficients for the automatic similarity measures

The SALAAM performance results for the different conditions described in Section 6.5.1 are shown in Table 6.5. The results depicted in Table 6.5 are inconclusive due to the lack of statistical significance, but we note some qualitative phenomena: SALAAM-lcs sim and SALAAM-edge sim yield identical performance though they rely on different sources of information; as expected, the two information theoretic similarity measures SALAAM-res info and SALAAM-lin info yield extremely similar results; and SALAAM-lesk produces the best recall results.
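For concreteness, the Combined-Weight condition amounts to evaluating equation (6.8) over the individual similarity scores for a candidate sense pair, using the coefficients of Table 6.4. The sketch below assumes the individual scores have already been computed elsewhere; the score values shown are hypothetical, and the intercept is set to zero for brevity.

```python
# Regression coefficients from Table 6.4.
COEFFICIENTS = {
    "edge_sim": 0.247,
    "lin_info": -0.353,
    "res_info": 0.325,
    "lcs_sim": -0.014,
    "lesk": -0.001,
}

def combined_weight(scores, intercept=0.0):
    """Weighted combination of similarity scores, as in equation (6.8)."""
    return intercept + sum(COEFFICIENTS[name] * value for name, value in scores.items())

# Hypothetical similarity scores for one pair of verb senses.
scores = {"edge_sim": 0.6, "lin_info": 0.4, "res_info": 5.1, "lcs_sim": 0.2, "lesk": 3.0}
print(round(combined_weight(scores), 3))
```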
We observe that the precision of the Combined-Weighted condition exceeds that of all the other conditions. Despite the modest results, we note the increase in precision from Combined-Equal to Combined-Weighted.

Condition            Precision   Recall
SALAAM-edge sim      43.2%       2.9%
SALAAM-lcs sim       43.2%       2.9%
SALAAM-res info      32.1%       7.9%
SALAAM-lin info      31.1%       7.9%
SALAAM-lesk          28.4%       8.8%
Combined-Equal       43.2%       2.9%
Combined-Weighted    45.8%       2.0%

Table 6.5: SALAAM performance results with the different similarity measure conditions

Discussion

In fact, the precision results obtained by Combined-Weighted and Combined-Equal are close to the high end of the scores obtained by state-of-the-art WSD systems, as exemplified by the SENSEVAL 2 exercise. The VG algorithm assigns high confidence to several senses, leading to partial credit by scorer2, which affects precision negatively. Nonetheless, the precision score obtained by Combined-Weighted qualitatively exceeds that of any of the individual similarity score conditions as well as Combined-Equal.

The drop in recall for both combined conditions is potentially explained by the fact that all the measures return similarity values for the same verb senses, but since the coefficient weights are not taken as absolute values (some are negative), they neutralize each other, leading to a loss in confidence scores. This hypothesis is supported by the Combined-Equal results, where all of the measures are given equal weight.

Recall in general is extremely low, which is mainly attributed to the shallowness and bushiness of the taxonomy. This is reflected in all the recall values obtained. The observation is confirmed by the results obtained by SALAAM-edge sim: edge counting is the most affected by the shallow depth of the ontology. If all the senses are at similar levels, they are equidistant, so there is no bias to choose one sense over another, which is reflected in the uniform confidence scores obtained by VG. As for SALAAM-lcs sim, the problem is most likely one of coverage: the majority of the WN1.7pre senses did not exist in the LCS lexicon. For SALAAM-lesk, the problem is the lack of distinctive glosses in WN1.7pre. Many of the glosses for the verb senses overlap very highly, rendering the choice among the different senses very hard and leading to low confidence scores from the VG algorithm.

All in all, we conclude with the following observations:

- SALAAM for verbs produces very good quality source sets.
- The IS-A taxonomy in WN1.7pre is not sufficient as a knowledge representation source, since it only focuses on the manner aspect of the verb. Therefore, it would be interesting to apply these similarity measures to other relations in the taxonomy such as meronymy.
- The lack of a measure that depends on an explicit syntactic model seems to play a distinct role in the distribution of the regression coefficient values.

6.6 Summary

In summary, we presented a novel design for obtaining human ratings on verb semantic similarity with Context and with No-Context. We use the human similarity judgments as a pivot for comparing different automatic similarity measures. Crucially, we conclude that combining evidence from different similarity measures that rely on different sources of information yields the highest correlation with human ratings in both the Context and No-Context conditions of the human experiment. We then present a framework for incorporating these observations in an actual WSD system, SALAAM, where the goal is to improve results for verbs.
We have concluded that the WordNet IS-A verb taxonomy is not the best source of semantic similarity, especially if it is the only source of semantic relations between verb senses for all measures.

Figure 6.1: Random sample of verb source type sets yielded by SALAAM

Chapter 7

Conclusions & Future Directions

7.1 Conclusions

The overall theme in this thesis is the search for non-traditional sources of evidence and the combination of these sources in functional ways, for the benefit of gaining insight into quantifiable aspects of word meaning. To that end, we investigate the characterization of an age-old complex problem, functionally resolving word ambiguity (WSD), using evidence from translations into different languages. We address the problem of WSD from a multilingual perspective; we expand the notion of context to encompass multilingual evidence. We devise a new approach to resolve word sense ambiguity in natural language, using a source of information that was never exploited on a large scale for WSD before. We develop an algorithm that empirically investigates the feasibility and the validity of utilizing translations for WSD. The algorithm is an unsupervised approach, SALAAM, for word sense tagging of large amounts of text given a parallel corpus and a sense inventory for one of the languages in the corpus. We evaluate the approach using machine translated parallel corpora, pseudo-translations. The performance for English nouns, as the source language in the parallel text, on a SENSEVAL 2 defined test set is rigorously evaluated using community-wide available tools and compared against state-of-the-art WSD systems. The results yielded are superior to all unsupervised methods evaluated on the same test set; moreover, SALAAM rivals some of the supervised and partially supervised methods. We observe that evidence from several languages aids SALAAM's performance. We conclude that language distance does have some impact on the quality of the results obtained. We quantifiably show the complementarity of a multilingual approach to monolingual approaches. We empirically establish that translation is a good source of sense distinction, thereby lending solid support to the characterization of word meaning using translational correspondence.

Figure 7.1 summarizes SALAAM's performance against state-of-the-art WSD systems, which all rely on monolingual contexts. Furthermore, Table 7.1 and Table 7.2 illustrate the significant departure of multilingual WSD from toy-system evaluation to the large scale standardized evaluation in the current thesis.

SALAAM's robustness is tested with naturally occurring parallel corpora of genre types that are unrelated to the test set.
The results obtained show no significant difference in performance precision for SALAAM using pseudo-translations of a relevant corpus genre versus utilizing unrelated genre corpora for augmenting the test corpus.

Figure 7.1: SALAAM F-Measure results depicted against state-of-the-art WSD systems

Having established SALAAM as a good tagger for the source language, we investigate its tagging quality for the target language of the parallel corpus. We examine two target languages: Arabic and Spanish. The results obtained for Arabic demonstrate that 90.5% of the correct tags for English noun instances are appropriate tags for Arabic. SALAAM, as a tagger for the target language, zones in on the commonality of sense usage cross-linguistically, in effect quantifying meaning characterizations for a language with poor resources, such as Arabic, via its shared sense usages cross-linguistically through a language with rich resources such as English. This usage of SALAAM is quite different from its application to the source language; in tagging a source language, SALAAM exploits divergences of meaning representation, but in tagging the target language, SALAAM exploits commonality. On the other hand, we perform a fully automated blind evaluation of the quality of projected tagging for Spanish data, despite a severe lack of resources for this experiment. The results obtained are modest, even though they significantly improve on a random baseline. The main reason for the modest performance is attributed to the use of source pseudo-translations accompanied by inconsistencies in alignments, which detrimentally affect the quality of the tagging. Nonetheless, the technique presented is a new technique that is fully automated and, except for the parallel corpus, requires minimal resources.

System         Method   Corpus Type   Inventory       Label     Linguistic Tools
Brown et al.   Sup.     Parallel      --              2 words   Tokenizers
Gale et al.    Sup.     Parallel      --              2 words   Tokenizers
Dagan & Itai   Unsup.   Comparable    Biling. Dict.   words     Tokenizers & Parsers
Kikui          Unsup.   Comparable    Biling. Dict.   words     Tokenizers
SALAAM         Unsup.   Parallel      WordNet         senses    Tokenizers

Table 7.1: Summary of multilingual WSD systems' required resources

System         Languages             Metric            Size                       GS    Performance
Brown et al.   En-Fr                 improv.           100 instances              No    8% improv.
Gale et al.    En-Fr                 acc.              6 words, 140 inst./word    No    90%
Dagan & Itai   Heb-En, Ger-En        prec., applic.    103 Heb, 54 Ger inst.      No    91% prec., 63% applic.
Kikui          En-Jap                acc.              120 inst.                  Yes   79.1% acc.
SALAAM         En-Fr, En-Sp, En-Ar   prec., rec., FM   1071 noun inst.            Yes   64.5% prec., 53% rec.

Table 7.2: Summary of multilingual WSD systems' evaluation

Furthermore, SALAAM, as an algorithm, is explored as a method for seeding a WordNet style ontology for Arabic. By quantitative inspection the approach seems promising. We discuss different issues of representation for Arabic specifically. We conclude that stems, as a first step, are the appropriate level of representation for a taxonomic style ontology for Arabic.

We view SALAAM from yet another angle: it is exploited as an inexpensive source of large amounts of acceptable quality sense annotated data. The produced annotated data enables us to investigate the trade-off between quantity and quality of annotated data for supervised learning WSD. We undertake a study to empirically explore the feasibility of bootstrapping a supervised learning WSD system using SALAAM tagged training data instead of human tagged data. In essence, we use SALAAM as an unsupervised learning approach for WSD.
SALAAM produces several differently tagged data sets in an unsupervised manner. Bootstrapping a machine learning WSD system using noisy data from SALAAM is shown to yield better performance than the state-of-the-art bootstrapping performance of Mihalcea [58], which uses clean tagged data on the same data set in a completely supervised learning experimental set up. For 12 of the data items tested we obtain PRs at a level that Mihalcea's system reaches on only 6 data items. Figure 7.2 illustrates the results, specifically the PR values obtained using SALAAM training data versus the PR values obtained by Mihalcea's system. The hashed bars in the graph are the PR values obtained by Mihalcea, while the solid bars are those of SALAAM.

Figure 7.2: Comparison between Mihalcea's results and SALAAM results on the same test set

Moreover, SALAAM rivals the performance of UMSST, a canonical supervised learning WSD system trained on human tagged data, on these 12 items. We analyze the different factors affecting the performance of the bootstrapping system with the intent of quantifying good predictors of good performance given noisy data. We conclude that different factors play some role, but the major contributors are Sense Context Confusability and Sense Distribution Contexts.

With the central role played by similarity in the SALAAM sense disambiguation method, we set out to investigate different aspects of semantic similarity. Driven by the fact that all WSD systems participating in the SENSEVAL 2 All-Words exercise did much worse on verbs than on nouns, and also by the fact that verbs are very interesting and complex entities in natural language in and of themselves, we explore dimensions of semantic similarity of verbs. We devise a novel experimental design for obtaining human judgments of verb similarity, where the subjects are given test items in Context and with No Context, in an attempt to measure the effect of context on human similarity ratings. We compare different automated similarity measures that crucially rely on various sources of information with the obtained human judgments. The intent is to provide a framework for quantifying the amount of contribution that should be attributed to these different sources of information, based on insights derived from cognitively based studies of verb similarity. As expected, the combination of all automated measures examined in this investigation yields the highest correlation with the human ratings. Accordingly, we provide a cognitively based framework for combining evidence from multidimensional verb similarity measures that may be utilized to improve results obtained by WSD systems in general, and SALAAM in particular, in the task of verb disambiguation.

7.2 Thesis Problems & Limitations

In the current implementation of SALAAM, we note that it is limited by its dependence on the availability of parallel texts. SALAAM's performance is sensitive to the alignment quality and to translation variability in the text; it is sensitive to noise in the source type sets. The quality of the projected sense tagging of Arabic is tested with only one annotator; more annotators need to inspect the data, with the results being judged taking inter-annotator agreement into consideration.
SALAAM is tested with a poor quality source for the Spanish target data set, which does not allow us to draw conclusive results regarding the quality of the projected Spanish sense annotations. In the semantic similarity experiment in Chapter 6, the data is limited: only 27 verb pairs are tested, on 10 participants.

7.3 Research Contributions

This thesis contributes the following to the field of computational linguistics:

- A novel, robust unsupervised approach to WSD which constitutes a significant departure from the traditional monolingual approaches. The approach is a validation of a sound linguistic assumption that meaning characterizations can be captured cross-linguistically. We contribute a novel multilingual perspective on the notion of context for addressing the problem of WSD; the scope of context is no longer confined to the monolingual setting.

- The thesis provides a detailed description of an end-to-end, fully operational, modularly designed system for producing large amounts of good quality sense annotated data in both the source and target languages of a parallel corpus. Given a token aligned parallel corpus, SALAAM can produce a fully annotated corpus in less than an hour.

- The thesis investigates the quality of automatic sense annotations for a language with few computerized linguistic resources, such as Arabic.

- The thesis provides an operational, end-to-end automatic framework for testing the quality of projected automatic sense annotations for Spanish.

- The thesis examines the feasibility of automatically bootstrapping a WordNet style ontology for Arabic via projected sense tags from English and concludes that it is a tractable task given a large balanced parallel corpus.

- The thesis investigates the feasibility of bootstrapping WSD within a supervised learning paradigm using noisy data, based on the results obtained using SALAAM data, thereby introducing a novel unsupervised approach for WSD, since even in bootstrapping mode SALAAM does not require any hand tagged data. SALAAM yields results that are superior to those obtained by the state-of-the-art bootstrapping method on the same test set; simultaneously, SALAAM rivals the performance of a canonical supervised system, UMSST, on 12 out of 29 noun items of the SENSEVAL 2 test set.

- The thesis contributes a novel design for attaining human judgments of semantic similarity for verb pairs using contextual and non-contextual data. The thesis compares the results obtained by several automated semantic similarity measures against the human similarity ratings.

- The thesis utilizes insights derived from the human similarity judgment experiment to motivate an operational, cognitively based framework for exploiting similarity in a novel way in order to improve WSD results obtained for verbs.

7.4 Future Directions

Combining Evidence from Monolingual Context with Multilingual Evidence: The source of evidence for word sense tagging using SALAAM is orthogonal to typical monolingual approaches that rely on the monolingual contexts of the polysemous words to resolve their ambiguity. In SALAAM, at this stage, monolingual evidence is disregarded. For instance, if the word bank occurs in a sentence such as "She walked by the river bank", the fact that bank is preceded by river does not play a role in the sense selection phase of SALAAM for bank. Moreover, given the encouraging quantitative results on the complementarity of SALAAM with monolingual approaches, we can envisage explicitly exploiting the monolingual contextual information of the polysemous words as a means of constraining the sense inventory search space. In the given example, where the river bank sense is intended, only senses related to the geographical sense of bank would take part in the sense selection phase. A simple method of implementing this extension is to use the Lesk algorithm, matching the overlapping context of words in the corpus against glosses in WordNet, as a preliminary sense tagging step before applying SALAAM; a sketch of such a filtering step follows this item. This has the advantage of reducing the search space, and it introduces a bias in the source type sets which could potentially aid the sense selection process for the other words in the set. Such monolingual evidence can be obtained by bracketing a corpus, or even parsing it, in order to attain even more linguistically interesting biases for the appropriate sense of a word in question.
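The following is a minimal sketch of what such a preliminary Lesk-based filtering step might look like, using NLTK's implementation of the Lesk algorithm over WordNet. It is an illustration of the proposed extension only, not part of the SALAAM implementation; the function name restrict_sense_inventory, the keep parameter, and the overlap ranking are hypothetical choices, and NLTK with the WordNet data is assumed to be installed.

    # Sketch: constrain the sense inventory of a polysemous word using its
    # monolingual context, before SALAAM's sense selection phase.
    from nltk.corpus import wordnet as wn
    from nltk.wsd import lesk

    def restrict_sense_inventory(context_tokens, word, pos='n', keep=2):
        """Return a reduced list of candidate synsets for `word`: the
        Lesk-preferred synset plus the senses with the largest gloss overlap."""
        candidates = wn.synsets(word, pos=pos)
        preferred = lesk(context_tokens, word, pos=pos)
        context = set(context_tokens)
        # Rank the candidate senses by gloss overlap with the context.
        ranked = sorted(candidates,
                        key=lambda s: len(context & set(s.definition().split())),
                        reverse=True)
        reduced = [preferred] + [s for s in ranked if s != preferred]
        return [s for s in reduced if s is not None][:keep]

    # Example: constrain the senses of "bank" in a river context, then hand
    # only the surviving senses to SALAAM's sense selection.
    sentence = "She walked by the river bank".lower().split()
    print(restrict_sense_inventory(sentence, "bank"))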
Subclustering Source Sets: We noted in the discussion section in Chapter 3 the detrimental effect of noise in the source sets on the performance of SALAAM. Many of the problems emerged from the presence of multiple clusters in the source sets. Therefore, it would be worthwhile to use quantitative clustering techniques on the source data token sets to split them into more coherent sub source sets.

Application to Comparable Corpora: Comparable corpora are more widely available than parallel corpora. A corpus is considered comparable if the two corpora are of the same genre, the same time frame and the same size. Methods of finding translation equivalents in comparable corpora are very promising. One such method, by Diab & Finch [17], introduced a novel unsupervised greedy algorithm that produces very reliable results for comparable corpora. Once we have the translation equivalents and a sense inventory for one of the languages of a comparable corpus, SALAAM may be directly applicable to the corpora at hand.

Evaluating the quality of the projected Arabic sense annotations with more human annotators.

Evaluating the quality of the projected Spanish sense annotations using good quality source English text. This is achievable by human translation of the Spanish Lexical Sample corpus into English.

Implementing a system for predicting the conditions per item that would yield good enough examples, with the right pedigree, for narrowing the gap between training with noisy tagged data and hand tagged data.

Testing the bootstrapping approach for supervised WSD with the Spanish SENSEVAL 2 data.

More analysis of the verb data and operational incorporation of the results for improving the performance of SALAAM on verb sense tagging. This is achievable by adding more verb pairs as well as participants to the human judgments experiment.

Building resources of sense annotated corpora in order to render the lin dist part of this verb similarity investigation feasible (see Chapter 6).

Using the SALAAM multilingual framework to distinguish homonymy and polysemy in the WordNet ontology: we believe that homonymy should be explicitly marked in WordNet, thereby creating a multidimensional WordNet taxonomy.

BIBLIOGRAPHY

[1] Eneko Agirre, Jordi Atserias, Luis Padró, and German Rigau. Combining Supervised and Unsupervised Lexical Knowledge Methods for Word Sense Disambiguation. Computers and the Humanities, 34:50-58, 2000.
[2] Yaser Al-Onaizan, Jan Curin, Michael Jahr, Kevin Knight, John Lafferty, I. Dan Melamed, Franz-Josef Och, David Purdy, Noah A. Smith, and David Yarowsky. Statistical Machine Translation: Final Report. In Summer Workshop on Language Engineering. Johns Hopkins University Center for Language and Speech Processing, 1999.

[3] Satanjeev Banerjee and Ted Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence, Acapulco, Mexico, August 2003.

[4] Leo Breiman, Jerome H. Friedman, Richard A. Olshen, and Charles J. Stone. Classification and Regression Trees. Wadsworth & Brooks/Cole Advanced Books & Software, Monterey, California, 1984.

[5] Eric Brill. Transformation-based tagger, version 1.14, 1995.

[6] P. Brown, J. Cocke, S. Della Pietra, V. Della Pietra, F. Jelinek, R. Mercer, and P. Roossin. A statistical approach to machine translation. Computational Linguistics, 16(2):79-85, 1990.

[7] P. Brown, S. Della Pietra, V. Della Pietra, and R. Mercer. A statistical approach to sense disambiguation in machine translation. In Fourth DARPA Workshop on Speech and Natural Language, Pacific Grove, CA, February 1991.

[8] Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. The Mathematics of Machine Translation: Parameter Estimation. Computational Linguistics, 1993.

[9] P.F. Brown, S.A. Della Pietra, V.J. Della Pietra, J.D. Lafferty, and R.L. Mercer. Analysis, Statistical Transfer, and Synthesis in Machine Translation. In Proceedings of the Fourth International Conference on Theoretical and Methodological Issues in Machine Translation of Natural Languages, pages 83-100, Montreal, Canada, 1992.

[10] Rebecca Bruce and Janyce Wiebe. Word-sense disambiguation using decomposable models. In Proceedings of the 32nd Annual Conference of the Association for Computational Linguistics, Las Cruces, New Mexico, June 1994.

[11] Tim Buckwalter. Buckwalter Arabic Morphological Analyzer Version 1.0, LDC Catalog No. LDC2002L49. Linguistic Data Consortium, University of Pennsylvania, 2000.

[12] Clara Cabezas, Philip Resnik, and Jessica Stevens. Supervised sense tagging using support vector machines. In Proceedings of the Second International Workshop on Evaluating Word Sense Disambiguation Systems (SENSEVAL-2), Toulouse, France, July 2001.

[13] Scott Cotton, Phil Edmonds, Adam Kilgarriff, and Martha Palmer, editors. SENSEVAL-2: Second International Workshop on Evaluating Word Sense Disambiguation Systems, Toulouse, France, July 2001. ACL SIGLEX. http://www.sle.sharp.co.uk/senseval2/.

[14] D. Cruse. Lexical Semantics. Cambridge University Press, 1986.

[15] Ido Dagan and Alon Itai. Word Sense Disambiguation Using a Second Language Monolingual Corpus. Computational Linguistics, 20(4):563-596, 1994.

[16] Kareem Darwish. Building a shallow Arabic morphological analyzer in one day. In Proceedings of the ACL Workshop on Semitic Languages, Pennsylvania, USA, 2002.

[17] M. Diab and S. Finch. A Statistical Word-Level Translation Model for Comparable Corpora. In Proceedings of the RIAO 2000 Conference, April 2000.

[18] Mona Diab. An unsupervised method for multilingual word sense tagging using parallel corpora: A preliminary investigation. In SIGLEX2000: Word Senses and Multi-linguality, Hong Kong, October 2000.

[19] Bonnie J. Dorr. Machine Translation: A View from the Lexicon. The MIT Press, Cambridge, MA, 1993.

[20] Bonnie J. Dorr.
Large-Scale Dictionary Construction for Foreign Language Tutoring and Interlingual Machine Translation. Machine Translation, 12(4):271-322, 1997.

[21] Bonnie J. Dorr. LCS Verb Database. Technical Report Online Software Database, University of Maryland, College Park, MD, 2001. http://www.umiacs.umd.edu/~bonnie/LCS_Database_Documentation.html.

[22] Bonnie J. Dorr, M. Antonia Martí, and Irene Castellón. Spanish EuroWordNet and LCS-Based Interlingual MT. In Proceedings of the Workshop on Interlinguas in MT, MT Summit, New Mexico State University Technical Report MCCS-97-314, pages 19-32, San Diego, CA, October 1997.

[23] Helge Dyvik. Translations as semantic mirrors, 1998.

[24] Christiane Fellbaum. WordNet: An Electronic Lexical Database. MIT Press, 1998. http://www.cogsci.princeton.edu/~wn [2000, September 7].

[25] Christiane Fellbaum, Martha Palmer, Hoa Trang Dang, Lauren Delfs, and Susanne Wolff. Manual and Automatic Semantic Annotation with WordNet. In Proceedings of the NAACL Workshop on WordNet and Other Lexical Resources: Applications, Customizations, Carnegie Mellon University, Pittsburgh, PA, 2001.

[26] W. Francis and H. Kučera. Frequency Analysis of English Usage. Houghton Mifflin Co.: New York, 1982.

[27] William A. Gale, Kenneth W. Church, and David Yarowsky. Using Bilingual Materials to Develop Word Sense Disambiguation Methods. In Proceedings of the Fourth International Conference on Theoretical and Methodological Issues in Machine Translation, pages 101-112, Montréal, Canada, June 1992.

[28] Jane Grimshaw. Semantic Structure and Semantic Content in Lexical Representation. 1993.

[29] David Hand, Heikki Mannila, and Padhraic Smyth. Principles of Data Mining. MIT Press, Boston, 2001.

[30] Philip Hayes. On Semantic Nets, Frames and Associations. In Proceedings of the 5th International Joint Conference on Artificial Intelligence, 1977.

[31] Rebecca Hwa, Philip Resnik, Amy Weinberg, and Okan Kolak. Evaluating Translational Correspondence using Annotation Projection. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, 2002.

[32] Nancy Ide. Cross-lingual sense discrimination: Can it work? Computers and the Humanities, 34:223-234, 2000.

[33] Nancy Ide and Jean Véronis. Introduction to the special issue on word sense disambiguation: The state of the art. Computational Linguistics, 24(1):1-40, 1998.

[34] Irina Chugur, Julio Gonzalo, and Felisa Verdejo. Polysemy and sense proximity in the SENSEVAL-2 test suite. In Proceedings of the Workshop on Word Sense Disambiguation: Recent Successes and Future Directions, University of Pennsylvania, Pennsylvania, July 2002.

[35] Ray Jackendoff. Semantics and Cognition. The MIT Press, Cambridge, MA, 1983.

[36] Thorsten Joachims. Text categorization with support vector machines: Learning with many relevant features. In Proceedings of the European Conference on Machine Learning. Springer, 1998.

[37] Daniel Jurafsky and James H. Martin. Speech and Language Processing. Prentice Hall, New Jersey, USA, 2000.

[38] Edward Kelly and Philip Stone. Computer Recognition of English Word Senses, 1975.

[39] Genichiro Kikui. Resolving translation ambiguity using non-parallel bilingual corpora. In Proceedings of the ACL99 Workshop on Unsupervised Learning in Natural Language Processing, College Park, Maryland, 1999.

[40] A. Kilgarriff and J. Rosenzweig. Framework and Results for English SENSEVAL. Computers and the Humanities, 34:15-48, 2000.

[41] Adam Kilgarriff.
SENSEVAL: An Exercise in Evaluating Word Sense Disambiguation Programs, 1977.

[42] Adam Kilgarriff. Inheriting Polysemy. In P. Saint-Dizier and E. Viegas, editors, Computational Lexical Semantics, pages 319-335. Cambridge University Press, England, 1995.

[43] Michael E. Lesk. Automated Sense Disambiguation Using Machine-readable Dictionaries: How to Tell a Pine Cone from an Ice Cream Cone. In Proceedings of the SIGDOC Conference, 1986.

[44] Beth Levin. English Verb Classes and Alternations: A Preliminary Investigation. University of Chicago Press, Chicago, IL, 1993.

[45] Beth Levin and Malka Rappaport Hovav. The Elasticity of Verb Meaning. In Proceedings of the Tenth Annual Conference of the Israel Association for Theoretical Linguistics and the Workshop on the Syntax-Semantics Interface, University of Haifa, Israel/Ben Gurion University of the Negev, Be'er Sheva, Israel, June 12-13, 1994.

[46] Beth Levin and Malka Rappaport Hovav. From Lexical Semantics to Argument Realization. Technical report, Northwestern University, October 1996. http://www.ling.nwu.edu/~beth/pubs.html.

[47] Beth Levin and Malka Rappaport Hovav. Building Verb Meanings. In M. Butt and W. Geuder, editors, The Projection of Arguments: Lexical and Compositional Factors, pages 97-134. CSLI Publications, Stanford, CA, 1998.

[48] Dekang Lin. Government-Binding Theory and Principle-Based Parsing. Technical report, University of Maryland, 1995. Submitted to Computational Linguistics.

[49] Dekang Lin. Using syntactic dependency as local context to resolve word sense ambiguity. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and 8th Conference of the European Chapter of the Association for Computational Linguistics, Madrid, Spain, July 1997.

[50] Dekang Lin. Automatic Retrieval and Clustering of Similar Words. In Proceedings of COLING-ACL98, Montreal, Canada, 1998.

[51] Dekang Lin. Dependency-Based Evaluation of MINIPAR. In Proceedings of the Workshop on the Evaluation of Parsing Systems, First International Conference on Language Resources and Evaluation, Granada, Spain, May 1998.

[52] Adam Lopez, Michael Nossal, Rebecca Hwa, and Philip Resnik. Word-level alignment for multilingual resource acquisition. In Workshop on Linguistic Knowledge Acquisition and Representation: Bootstrapping Annotated Language Data, at the Third International Conference on Language Resources and Evaluation (LREC-2002), Las Palmas, Canary Islands, Spain, June 2002.

[53] M. McCord. Slot Grammar: A System for Simpler Construction of Practical Natural Language Grammars. In R. Studer, editor, Natural Language and Logic, pages 118-145, Berlin, Heidelberg, 1990.

[54] Dan I. Melamed. Measuring semantic entropy. In SIGLEX Workshop on Tagging Text with Lexical Semantics. ACL, 1997.

[55] I. Dan Melamed. Models of Translational Equivalence among Words. Computational Linguistics, 26(2):221-249, June 2000.

[56] I. Dan Melamed and Philip Resnik. Evaluation of sense disambiguation given hierarchical tag sets. Computers and the Humanities, (1-2), 2000.

[57] R. Mihalcea and D. Moldovan. A method for word sense disambiguation of unrestricted text, 1999.

[58] Rada Mihalcea. Bootstrapping large sense tagged corpora. In Proceedings of the 3rd International Conference on Language Resources and Evaluation (LREC-2002), Las Palmas, Canary Islands, Spain, June 2002.

[59] Rada Mihalcea and Dan I. Moldovan. Word sense disambiguation based on semantic density.
In Sanda Harabagiu, editor, Use of WordNet in Natural Language Processing Systems: Proceedings of the Conference, pages 16-22. Association for Computational Linguistics, Somerset, New Jersey, 1998.

[60] Arthur Nadas. A decision-theoretic formulation of a training problem in speech recognition and a comparison of training by unconditional versus conditional maximum likelihood. IEEE Transactions on Acoustics, Speech and Signal Processing, ASSP-31(4):814-817, August 1983.

[61] Hwee Tou Ng and Hian Beng Lee. Integrating Multiple Knowledge Sources to Disambiguate Word Sense: An Exemplar-Based Approach. In Proceedings of the 34th Annual Conference of the Association for Computational Linguistics, pages 40-47, Santa Cruz, CA, June 1996.

[62] Franz J. Och and Hermann Ney. Improved Statistical Alignment Models. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics (ACL'00), pages 440-447, Hong Kong, China, October 2000.

[63] Ted Pedersen. Machine learning with lexical features: The Duluth approach to SENSEVAL-2. In Proceedings of SENSEVAL-2: Second International Workshop on Evaluating Word Sense Disambiguation Systems, Toulouse, France, July 2001.

[64] Steven Pinker. Learnability and Cognition: The Acquisition of Argument Structure. The MIT Press, Cambridge, MA, 1989.

[65] M.R. Quillian. Semantic Memory. In M. Minsky, editor, Semantic Information Processing. The MIT Press, Cambridge, MA, 1968.

[66] Malka Rappaport Hovav et al. Levels of Lexical Representation. Semantics and the Lexicon, pages 37-54, 1983.

[67] Philip Resnik. Selection and Information: A Class-Based Approach to Lexical Relationships. PhD thesis, University of Pennsylvania, December 1993.

[68] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proceedings of IJCAI-95, pages 448-453, Montreal, Canada, August 20-25, 1995.

[69] Philip Resnik. Selectional Preference and Sense Disambiguation. Technical report, University of Maryland, 1997.

[70] Philip Resnik. Disambiguating Noun Groupings with Respect to WordNet Senses. In S. Armstrong, K. Church, P. Isabelle, S. Manzi, E. Tzoukermann, and D. Yarowsky, editors, Natural Language Processing Using Very Large Corpora, pages 77-98. Kluwer Academic, Dordrecht, 1999.

[71] Philip Resnik. Mining the Web for Bilingual Text. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (ACL'99), University of Maryland, College Park, Maryland, June 1999.

[72] Philip Resnik. Semantic Similarity in a Taxonomy: An Information-Based Measure and its Application to Problems of Ambiguity in Natural Language. Artificial Intelligence Research, (11):95-130, 1999.

[73] Philip Resnik, Mari Olsen, and Mona Diab. The Bible as a Parallel Corpus: Annotating the Book of 2000 Tongues. Computers and the Humanities, (33):129-153, 1999.

[74] Philip Resnik and David Yarowsky. Distinguishing Systems and Distinguishing Senses: New Evaluation Methods for Word Sense Disambiguation. Natural Language Engineering, 1(1):1-25, 1998.

[75] Hinrich Schütze. Automatic word sense discrimination. Computational Linguistics, 24(1):97-124, 1998.

[76] John Sinclair, editor. Collins Cobuild English Dictionary. Collins, 1995. Patrick Hanks, managing editor.

[77] Steven Small. Word Expert Parsing: A Theory of Distributed Word Based Natural Language Understanding, September 1980. Doctoral Dissertation, Computer Science Department, University of Maryland.

[78] P. Vossen, W. Peters, and J. Gonzalo.
Towards a Universal Index of Meaning. Pages 1-24, 1999.

[79] Piek Vossen, Pedro Diez-Orzas, and Wim Peters. The Multilingual Design of EuroWordNet. In Proceedings of the ACL/EACL-97 Workshop on Automatic Information Extraction and Building of Lexical Semantic Resources for NLP Applications, Madrid, Spain, 1997.

[80] W. Weaver. Translation (1949). In Machine Translation of Languages. MIT Press, Cambridge, MA, 1955.

[81] Yorick Wilks. Preference Semantics. In E.L. Keenan, editor, Formal Semantics of Natural Language. Cambridge University Press, Cambridge, MA, 1975.

[82] Louise Guthrie, Wim Peters, and Yorick Wilks. Cross-linguistic discovery of semantic regularity, 2001.

[83] D. Yarowsky. Word-Sense Disambiguation: Using Statistical Models of Roget's Categories Trained on Large Corpora. In Proceedings of the Fourteenth International Conference on Computational Linguistics, pages 454-460, Nantes, France, 1992.

[84] David Yarowsky. Word-sense disambiguation using statistical models of Roget's categories trained on large corpora. In Proceedings of COLING-92, Nantes, France, 1992.

[85] David Yarowsky. Unsupervised Word Sense Disambiguation Rivaling Supervised Methods. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics (ACL'95), pages 189-196, Cambridge, MA, 1995.