ABSTRACT

Title of dissertation: COMBINING LINGUISTIC AND MACHINE LEARNING TECHNIQUES FOR WORD ALIGNMENT IMPROVEMENT

Necip Fazıl Ayan, Doctor of Philosophy, 2005

Dissertation directed by: Professor Bonnie J. Dorr, Department of Computer Science

Alignment of words, i.e., detection of corresponding units between two sentences that are translations of each other, has been shown to be crucial for the success of many NLP applications such as statistical machine translation (MT), construction of bilingual lexicons, word-sense disambiguation, and projection of resources between languages. With the availability of large parallel texts, statistical word alignment systems have proven to be quite successful on many language pairs. However, these systems are still faced with several challenges due to the complexity of the word alignment problem, lack of enough training data, difficulty of learning statistics correctly, translation divergences, and lack of a means for incremental incorporation of linguistic knowledge.

This thesis presents two new frameworks to improve existing word alignments using supervised learning techniques. In the first framework, two rule-based approaches are introduced. The first approach, Divergence Unraveling for Statistical MT (DUSTer), specifically targets translation divergences and corrects the alignment links related to them using a set of manually-crafted, linguistically-motivated rules. In the second approach, Alignment Link Projection (ALP), the rules are generated automatically by adapting transformation-based error-driven learning to the word alignment problem. By conditioning the rules on the initial alignment and linguistic properties of the words, ALP manages to categorize the errors of the initial system and correct them.

The second framework, Multi-Align, is an alignment combination framework based on classifier ensembles. The thesis presents a neural-network based implementation of Multi-Align, called NeurAlign. By treating individual alignments as classifiers, NeurAlign builds an additional model to learn how to combine the input alignments effectively.

The evaluations show that the proposed techniques yield significant improvements (up to 40% relative error reduction) over existing word alignment systems on four different language pairs, even with limited manually annotated data. Moreover, all three systems allow an easy integration of linguistic knowledge into statistical models without the need for large modifications to existing systems. Finally, the improvements are analyzed using various measures, including the impact of improved word alignments in an external application: phrase-based MT.

COMBINING LINGUISTIC AND MACHINE LEARNING TECHNIQUES FOR WORD ALIGNMENT IMPROVEMENT

by Necip Fazıl Ayan

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2005

Advisory Committee:
Professor Bonnie J. Dorr, Chair
Professor Rebecca Green
Professor Rebecca Hwa
Professor Philip Resnik
Professor Amy Weinberg

© Copyright by Necip Fazıl Ayan, 2005

DEDICATION

This thesis is dedicated to my parents, Muradiye and Cevdet Ayan, for having a dream for me and doing everything they could to make that dream come true, to my wife, Burcu Karagöl-Ayan, for loving and supporting me for the last 12 years and making me a better person, and to my son, Alperen Ayan, for bringing a new meaning to my life.
ACKNOWLEDGMENTS

First of all, I would like to thank my advisor, Dr. Bonnie Dorr, for always pushing me hard, challenging my work, teaching invaluable lessons, being available all the time, and, most importantly, helping me become a better researcher. I'm grateful to her for her understanding, patience and support in bad times over the last five years. I would also like to thank Dr. Christof Monz for his guidance and help while developing the algorithms in this thesis.

I thank all my committee members, Dr. Rebecca Green, Dr. Rebecca Hwa, Dr. Philip Resnik and Dr. Amy Weinberg, for sparing their invaluable time to read my dissertation and for making very helpful comments.

I thank Adam Lopez for providing me with his word alignment system and making several insightful comments on my work, Nitin Madnani for pointing out several resources that made my life easier while implementing the algorithms in this thesis, and Dr. Nizar Habash for his help in the early stages of my work. I should also express my gratitude to Dr. Franz Och and Dr. Philipp Koehn for making GIZA++ and Pharaoh public, and allowing me to use them in my experiments.

I would like to thank Muhammet Pakdil for listening to me day and night, supporting me through all the good and bad times, and being a true friend. I also thank my friends, Füsun Yaman, Evren Şirin, Mustafa Murat Tıkır, Okan Kolak, Kemal Akkaya, Çağdaş Dirik and several others, for making my life at College Park easier and enjoyable. I thank the Karagöl family, who accepted me as their own son or brother, for their moral support and encouragement. I thank all the organizations and individuals who made this research possible by financially supporting me.

Most of all, I thank my family for helping me to be where I am right now. I owe a great debt to my parents, who always believed in me, supported me and did everything possible to make sure I got a good education. I thank my wife, Burcu, for letting me become a part of her life, being on my side through all the difficult times in my graduate life, and making me a better person. This thesis would not have materialized without her constant love, trust, support, and encouragement. Finally, I thank my son Alperen, the best thing that ever happened to me in my life, for filling my heart with joy and happiness, and being an inspiration to me. This thesis is dedicated to them.

TABLE OF CONTENTS

List of Tables
List of Figures
1 Introduction
  1.1 Motivation
  1.2 Techniques for Improving Existing Word Alignments
    1.2.1 Improving One Word-Alignment System
    1.2.2 Combining Multiple Alignments
  1.3 Contributions
  1.4 Thesis Layout & Brief Overview of Chapters
2 Related Work
  2.1 Sentence-Level Alignment
  2.2 Word Level Alignment
    2.2.1 Heuristic Methods
    2.2.2 Statistical Methods
  2.3 Phrase Level Alignment
  2.4 Combining Word Alignments
  2.5 Applications
  2.6 Evaluation of Word Alignments
  2.7 Discussion
3 DUSTer: Divergence Unraveling for Statistical MT
  3.1 Translation Divergences
  3.2 Relating Alignment Errors to Divergence Types
  3.3 DUSTer: System Description
    3.3.1 Parameters
    3.3.2 Universal Rule Set
    3.3.3 Setting Up DUSTer for a New Language
    3.3.4 Improving Alignments Using DUSTer
  3.4 Evaluation
    3.4.1 Settings
    3.4.2 Results
  3.5 Summary
4 Alignment Link Projection (ALP)
  4.1 Transformation-based Error-driven Learning
  4.2 Notation
  4.3 Description and Parameters of ALP
    4.3.1 Initial Alignment
    4.3.2 TBL Templates
    4.3.3 Instantiation of Templates
    4.3.4 Best Rule Selection
  4.4 Evaluation Data and Settings
  4.5 Experimental Results
    4.5.1 Results for Different Initial Alignments
    4.5.2 Results for Different Sets of Templates
    4.5.3 Using Different Types of Instantiation
    4.5.4 Using Different Methods for Best Rule Selection
    4.5.5 Effects of Training Size for Input Aligners
    4.5.6 Analysis of ALP Rules
  4.6 Summary
5 Multi-Align: A Framework for Combining Multiple Alignments
  5.1 Classifier Ensembles
    5.1.1 Why Ensembles Work
    5.1.2 When Ensembles Work
    5.1.3 Creating Ensemble Members
    5.1.4 Combining Ensemble Members
    5.1.5 Ensembles in NLP
  5.2 Multi-Align
  5.3 Preliminary Study: Alignment Combination Using Weighted Summation
    5.3.1 Set of Features
    5.3.2 Perceptron Learning
    5.3.3 Using Perceptrons for Alignment Combination
  5.4 Evaluation Data and Settings
  5.5 Experimental Results
    5.5.1 Effects of Feature Selection for Link Classes
    5.5.2 Effects of Different Feature Functions
    5.5.3 Effects of Training Size for Input Aligners
  5.6 Summary
6 NeurAlign
  6.1 Neural Networks
  6.2 NeurAlign in Multi-Align Framework
  6.3 NeurAlign Approach
    6.3.1 Extracting Features
    6.3.2 Learning A Classifier
  6.4 Evaluation
    6.4.1 Training and Evaluation Data
    6.4.2 Neural Network Settings
    6.4.3 Description of Classification Data
  6.5 Experimental Results
    6.5.1 Effects of Using Different Features for NeurAlign1
    6.5.2 Effects of Partitioning Data (NeurAlign1 vs. NeurAlign2)
    6.5.3 Effects of Number of Alignment Systems
    6.5.4 Effects of Training Size for NeurAlign
    6.5.5 Effects of Training Size for Input Aligners
    6.5.6 Stability of Results
    6.5.7 Experiments on Languages With Scarce Resources
  6.6 Summary
7 Analysis of Alignments and MT Evaluation
  7.1 Analysis of Improvements in Alignments
    7.1.1 Precision vs. Recall
    7.1.2 Comparison of Number of Alignment Links
    7.1.3 Comparison of Fertilities
    7.1.4 Resolution of Ambiguous Links
  7.2 Phrase-based Machine Translation
  7.3 MT Evaluation: Results and Analysis
    7.3.1 Phrase Table Analysis
    7.3.2 MT Evaluation using BLEU
    7.3.3 Parameters
  7.4 Summary
8 Conclusions
  8.1 Contributions
  8.2 Limitations
  8.3 Future Work
A DUSTer Parameters
B DUSTer Universal Rules for English-Spanish
C ALP Rules (on English-Spanish)
D ALP Rules (on English-Chinese)

LIST OF TABLES

2.1 Summary of IBM Models
3.1 Human and GIZA++ Alignments Between an English and Spanish Sentence
3.2 Number and Percentage of Alignment Errors by GIZA++ (on English-Spanish)
3.3 Semantic and Syntactic Categories of Words from Missing, Added and Replaced Links (on English-Spanish)
3.4 DUSTer Parameters in English
3.5 Set of Universal Rules in English-Spanish
3.6 Times Needed To Prepare DUSTer for 3 Languages
3.7 GIZA++ and DUSTer Results (on English-Spanish)
4.1 Templates for Expanding the Alignment A According to a Validation Alignment V
4.2 Templates for Deleting Spurious Links in a Given Alignment A
4.3 Number of Words with at Least One Correct Alignment Link (on English-Spanish)
4.4 Templates for Handling Multi-Word Correspondences in a Given Alignment A
4.5 Templates for Correcting One-to-One Correspondences in a Given Alignment A
4.6 GIZA++ Results (on English-Spanish and English-Chinese)
4.7 ALP Results Using Different Initial Alignments (on English-Spanish)
4.8 ALP Results Using Different Initial Alignments (on English-Chinese)
4.9 ALP Results Using GIZA++(int) as Initial Alignment (on English-Spanish and English-Chinese)
4.10 ALP Results Using Different Template Instantiations (on English-Spanish and English-Chinese)
4.11 ALP Results Using Different Rule Selection Methods (on English-Spanish and English-Chinese)
4.12 ALP Results Using Different Initial Alignments When GIZA++ is Trained on More Data (on English-Chinese)
5.1 Multi-Align: Effects of Feature Selection for Link Classes (on English-Chinese)
5.2 Multi-Align: Effects of Different Feature Functions (on English-Chinese)
5.3 Multi-Align: Effects of Using More Training Data for Initial Alignments (on English-Chinese)
6.1 Distribution of Instances According to GIZA++ Outputs (on English-Spanish)
6.2 Error Rates According to POS Tags for GIZA++ (e → s) (on English-Spanish)
6.3 GIZA++ Results (on English-Spanish, English-Chinese, English-Arabic and English-Romanian)
6.4 SAHMM Results (on English-Chinese)
6.5 NeurAlign1: Effects of Feature Set for Combining Alignments (on English-Spanish)
6.6 NeurAlign1: Effects of Feature Selection for Partitioning (on English-Spanish)
6.7 NeurAlign2: Effects of Feature Set for Combining Alignments (on English-Spanish)
6.8 NeurAlign: Effects of Feature Selection for Partitioning Using GIZA++ and SAHMM Alignments (on English-Chinese)
6.9 NeurAlign: Effects of Using a Higher Number of Input Alignments (on English-Chinese)
6.10 NeurAlign: Effects of Training Size (on English-Chinese)
6.11 NeurAlign: Effects of Using More Data to Train Input Alignments (on English-Chinese)
6.12 Stability of NeurAlign (on English-Chinese)
6.13 GIZA++ vs. NeurAlign (on English-Arabic)
6.14 GIZA++ vs. NeurAlign (on English-Romanian)
7.1 Relative Improvements in AER by ALP and NeurAlign over 5 Different Alignments (in Percentages)
7.2 Number of Alignment Links for Different Alignments (on English-Chinese)
7.3 Percentage of English Words with Different Fertilities (on English-Chinese)
7.4 Percentage of Chinese Words with Different Fertilities (on English-Chinese)
7.5 Resolution of Ambiguous Cases Among 2 Alignments (on English-Chinese)
7.6 Resolution of Ambiguous Cases Among 4 Alignments (on English-Chinese)
7.7 Upper Bounds (Assuming Perfect Resolution of Ambiguous Cases) and NeurAlign Results for 3 Sets of Input Alignments (on English-Chinese)
7.8 Size of Phrase Tables Generated by Different Alignments (on English-Chinese)
7.9 Evaluation of Pharaoh with Different Initial Alignments (on English-Chinese)
7.10 MT Parameters (on English-Chinese)

LIST OF FIGURES

1.1 DUSTer: Example Rule Application
1.2 Alignment Link Projection (ALP)
1.3 Multi-Align Framework for Combining Multiple Alignments
2.1 Word Alignment Example
3.1 Graphical Illustration of Three Types of Alignment Errors
3.2 Components of DUSTer
3.3 Universal Rules in BNF Format
3.4 Universal Rule Application Example
3.5 Example Sentences, Dependency Tree and Initial Alignment
3.6 DUSTer's Inferred Alignments from Initial GIZA++ Output
4.1 TBL Architecture
4.2 Pseudo-code for Alignment Link Projection
4.3 Graphical Representation of a Template
4.4 Illustration of Deletion Templates
4.5 Graphical Illustration of Two Multi-word Correction Templates
5.1 Multi-Align Framework for Combining Multiple Alignments
5.2 Representation of Grow-diag-final Method in Multi-Align Framework
5.3 A Perceptron Network with R Input Units and S Output Units
6.1 Multilayer Perceptron Overview
6.2 NeurAlign in Multi-Align Framework
6.3 An Example of Transforming Alignments into Classification Data
6.4 NeurAlign1: Alignment Combination Using All Data At Once
6.5 NeurAlign2: Alignment Combination with Partitioning
7.1 Precision and Recall for Initial Alignments and ALP
7.2 Precision and Recall for Initial Alignments and NeurAlign
7.3 Examples of Valid and Invalid Phrase Pairs
Chapter 1
Introduction

Parallel texts are texts accompanied by their translation in one or more languages. The first few attempts at using parallel texts, such as machine translation (MT) in the late Fifties, did not succeed because of the limited storage and computing capacities of computers, along with the difficulty of creating electronic texts. However, the incredibly rapid growth of the Web, and the development of several techniques to collect data from the Internet, have led to a huge increase in parallel resources in electronic form (Resnik, 1998; Koehn, 2002; Resnik and Smith, 2003). Since then, parallel texts have become one of the most commonly used resources in natural language processing (NLP), especially in statistical MT.

The most critical task in parallel text processing is alignment, i.e., detection of corresponding units between two texts of different languages. Given parallel texts in two different languages, alignment can be extracted at the level of sections, paragraphs, sentences, phrases, collocations, or words. Although the alignment between other units is also an important problem, this thesis will focus on the alignment of words.

Alignment of words, i.e., detection of corresponding words between two sentences that are translations of each other, is usually an intermediate step of statistical MT that has been shown to be crucial for the success of statistical MT systems (Brown et al., 1993; Vogel et al., 1996; Och and Ney, 2003; Koehn et al., 2003), and for many other NLP applications such as construction of bilingual lexicons (Gale and Church, 1991b; Melamed, 1997c), automatic generation of transfer rules for machine translation (Menezes and Richardson, 2001; Carbonell et al., 2002), word-sense disambiguation (Brown et al., 1991a; Gale et al., 1992; Chang et al., 1996; Diab and Resnik, 2002), projection of resources (such as morphological analyzers, part-of-speech taggers, and parsers) from a resource-rich language into other resource-poor languages (Yarowsky et al., 2001; Hwa et al., 2002), and cross-language information retrieval (Fluhr, 1995; Oard and Dorr, 1996).

The word alignment problem is difficult for the following reasons:

1. Words are not always aligned one-to-one because some languages are more verbose than others. Moreover, nearly 50% of the tokens in any text are function words, which frequently translate into an affix, positional information, part of expressions, phrases, or even nothing at all. There is even less of a one-to-one correspondence between function words than for content words.
2. Some languages make morphological distinctions that are absent in the other. German, for example, makes a number of case distinctions, especially in adjectives, that are not reflected in the morphology of English.
3. For various reasons, such as language typology, style, and cultural differences, a translator does not always translate literally on a word-by-word basis. Most of the time, words are added or deleted.
4. Idiomatic expressions and phrases in two different languages usually contain words that are not translations of each other.

Given two sentences e = e1, ..., ei, ..., eI and f = f1, ..., fj, ..., fJ that are translations of each other, there are I × J different connections that can be drawn between e and f because each of the J words in f can be connected to any of the I words in e.
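As a rough illustration of the scale of this search space, the following sketch (using hypothetical sentence lengths; not part of the thesis) computes the quantities discussed in the next paragraph: the number of possible connections, the number of alignments when any subset of connections is allowed, and the much smaller but still exponential count under an IBM-style one-link-per-source-word restriction.

```python
# Hypothetical sentence lengths, purely for illustration.
I, J = 5, 6  # number of words in e and in f

connections = I * J                 # possible word-to-word connections
all_alignments = 2 ** connections   # every subset of connections is a distinct alignment
ibm_style = (I + 1) ** J            # each f-word links to one e-word or to NULL

print(connections)     # 30
print(all_alignments)  # 1073741824 -- over a billion, for two short sentences
print(ibm_style)       # 46656 -- far smaller, but still exponential in J
```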
Since an alignment is determined by the connections that it contains, and since a subset of the possible connections can be chosen in 2^(I×J) ways, there are 2^(I×J) possible alignments. Different word alignment models reduce the search space by making further assumptions on the types of connections between the words. For example, in the IBM Models (Brown et al., 1993), each source word fj can align to exactly one target word ei, or the null word. Alternatively, the target words can link to any number of source words. As a result, the number of possible alignments is O(I^J), so the search space is still exponential.

Since Brown et al. (1993) proposed the widely-used IBM models, several researchers have developed various word alignment systems that are based on different models, such as hidden Markov models (HMMs), maximum entropy models, log-linear combinations, and similarity-based heuristic methods. Taking advantage of large parallel texts, statistical systems have been shown to outperform their counterparts significantly. Because of their recent success and their ability to handle different languages more easily than linguistically-motivated techniques, most researchers working on word alignment have shifted their focus to statistical methods in recent years.

In this thesis, instead of building a new word alignment model from scratch, two frameworks are presented to improve existing statistical word alignment systems. One goal is to enrich statistical models with linguistic knowledge. Another goal is to enable the use of different word alignments without heavy modifications to their alignment modeling. The rest of this chapter describes the motivation behind the alignment frameworks presented in the later chapters. Two different approaches to improving word alignments are presented. Finally, the contributions of this thesis are highlighted and the organization of the remainder of this thesis is outlined.

1.1 Motivation

Current statistical alignment systems are faced with five important challenges:

1. Complexity of the word alignment problem,
2. Lack of enough training data (in the face of limited linguistic resources),
3. Difficulty of learning statistics correctly,
4. Translation divergences, and
5. Lack of a means for incremental incorporation of linguistic knowledge.

The rest of this section discusses each of these five challenges in detail.

The first issue is the complexity of the word alignment problem. As mentioned above, the word alignment problem is an exponential problem. Like all other machine learning systems, statistical systems need to make certain assumptions, or biases, about the hypothesis to be learned from the training data to reduce the search space of hypotheses. Such biases enable the learning algorithm to perform well in some domains, but not in others. As a result, all systems tend to make similar errors on unseen data because either the model is deficient when handling certain cases (because of its biases) or the generalization learned by the system on training data is not reflected in unseen data.

The second hurdle for statistical systems is the lack of enough linguistically annotated training data for capturing generalizations. This has led to the tendency for the development of statistical systems that use large parallel texts without any linguistic knowledge about the languages. However, it is unclear how one determines how much data is sufficient for different language pairs.
For example, in some initial experiments presented in this thesis, a training set consisting of 47K English-Spanish sentence pairs yields lower error rates than a training set consisting of 107K English-Chinese sentence pairs using the same word alignment system.

The recent trend in statistical systems is to incorporate all the data that is available and let the system take care of redundant or noisy data. Ultimately, with billions and billions of sentences, this is equivalent to memorizing all possible word (or phrase) correspondences so that there is nothing left unseen for any given test set. In practice, this is nearly impossible to achieve, at least for a majority of language pairs. That is, there will always be language pairs for which there is only a limited amount of data. Moreover, statistical systems are still susceptible to their biases in modeling alignments; therefore, it is highly unlikely that they will produce 100% correct alignments even with infinite data.

The third problem is related to infrequent words and frequently seen function words. The rareness of some words, coupled with the too-frequent occurrence of other words in the training data, makes it very difficult to choose what statistics to use (Dunning, 1993; Moore, 2004). Moreover, most expressions in the languages are only "semi-frozen" and can still undergo a number of linguistic operations (such as inflection, insertion of adjectives and adverbs, conversion to the passive voice, etc.). For example, one of the major problems with the IBM models is their tendency to align rare words to several words on the other side in an attempt to reduce the number of unaligned words. The problem with the function words is more dramatic: statistical systems are often unable to pick up the correct word correspondences for function words because function words occur several times in every sentence and statistical models are based on co-occurrence statistics. Experiments on English-Spanish data that are presented in Chapter 3 support this claim. The alignment of function words might not be important in certain applications (e.g., bilingual lexicon construction), but it is vital in machine translation because function words usually translate into an affix, positional information, or a sub-part of an expression or phrase.

The fourth problem is the handling of translation divergences, i.e., structural differences between languages. Divergences between languages occur when the underlying meaning of a set of words in one language is distributed over a set of words with different syntactic and lexical semantic properties in the other language; e.g., the English verb fear corresponds to tener miedo de (have fear of) in Spanish. The most common types of alignment errors related to divergences occur when a word in one language is translated into a phrasal structure with the addition or deletion of function words (i.e., conflational, inflational and structural divergences) or into words that have different parts of speech (i.e., categorial divergence). This thesis demonstrates empirically that the existence of translationally divergent word pairs is the most significant factor contributing to statistical alignment errors.

The last problem is the difficulty of incremental incorporation of linguistic knowledge into statistical systems.
The most common method for injecting linguistic knowledge into the alignment process is to build a new alignment model to account for additional linguistic knowledge in the form of features (or feature functions) (Toutanova et al., 2002; Cherry and Lin, 2003; Liu et al., 2005). In a sense, this is equivalent to discarding already existing systems and building a new system from scratch. There are two problems with this "build-everything-from-scratch" approach: First, the resulting system may lose valuable information that is inherent in the architecture of previous systems, even when the new system produces better alignments. Second, it is quite difficult to assess the usefulness of different pieces of linguistic knowledge.

In addressing the five challenges above, this thesis demonstrates that it is possible to improve on word alignments generated by existing word alignment systems. The core of the work presented in this thesis is the construction of automatic techniques for: (1) detection of common errors produced by existing word alignment systems and (2) correction of those errors to obtain better word alignments.

1.2 Techniques for Improving Existing Word Alignments

This thesis presents two different frameworks to improve existing word alignments. Both of them are based on the assumption that existing alignment systems succeed at capturing some word correspondences but fail at handling certain phenomena, resulting in similar alignment errors. Both approaches attempt to identify alignment links where the input systems perform well, and places where the systems make repeated errors. This is achieved by employing additional linguistic features, such as part-of-speech (POS) tags, dependency relations and semantic-based word classes, or alignment-based features that are extracted from the outputs of the input alignments.

The difference between the two frameworks is that one starts with only one alignment and categorizes the frequently occurring errors systematically, whereas the other makes use of multiple alignments. In the first framework, either new alignment links are added based on already existing links, or some of the links that share a common linguistic or alignment-based property are deleted. In contrast, the second framework involves cross-validation of different word alignments. This is achieved by identifying places where the alignments agree or disagree and making a decision about which alignment link to pick in each case according to the features of the words that are involved.

1.2.1 Improving One Word-Alignment System

In the first framework, two new rule-based systems are presented. Divergence Unraveling for Statistical MT (DUSTer) is a system that combines linguistic and statistical knowledge to resolve structural differences between languages during the process of alignment. The goal is to identify places where two sentences are divergent, and to correct the alignment links related to them using a set of manually crafted, linguistically motivated rules.

Previous work on divergences (Dorr et al., 2002) showed that at least 10% of the sentence pairs in Arabic/English and Spanish/English are divergent in some way. Moreover, it has been demonstrated that different divergence types frequently co-occur. For instance, for the following two sentences in English and Spanish,

    Different politicians please Maria.
    Maria tiene gustos de políticos diferentes.
there are four co-occurring divergences: categorial (please is a verb in English but gusto is a noun in Spanish), conflational (please is conflated to tener gusto), thematic (Maria and politicians switch thematic-to-syntactic realization order), and structural (politicians is an argument in English but an oblique in Spanish). Thus, it is important to handle divergences concurrently rather than tackling each divergence one at a time, and to pay attention to distinctions in both content and structure.

A preliminary study on analysis of alignment errors made by a statistical system demonstrated that missing alignment links are the most frequent alignment errors introduced by GIZA++, accounting for nearly 80% of the errors. These are precisely the cases that correspond to the divergence classes specified above, most notably conflational and structural divergences, where a single word in one language corresponds to multiple words in the other language.

The key idea behind DUSTer is to relate one or more linguistically-motivated categories (specifically, POS tags and semantic word classes) associated with the English input words to those of the foreign language (FL); the resulting match sets are used to infer corrected alignment links. DUSTer achieves this by employing a set of manually crafted rules. Consider the following example:

    English: John fears Mary
    Spanish: John tiene miedo de Mary
    (Gloss): John has fear of Mary

[Figure 1.1: DUSTer: Example Rule Application (Type 1.B.X: John fears Mary <--> John has fear of Mary)]

The rule below (also presented graphically in Figure 1.1) corresponds to the mapping between "j fears k" in English and "j has fear of k" in Spanish.

    1.B.X [ English{2 1 3} Spanish{2 1 3 4 5} ]
    [PsychV<1,i,CatVar:V_N,Verb> [Noun<2,j,Subj>] [Noun<3,k,Obj>]]
    <-->
    [LightVB<1,Verb,C:i> [Noun<2,j,Subj>] [Noun<3,i,Obj>]
     [Oblique<4,Pred,Prep,C:i> [Noun<5,k,PObj>]]]

Note that, in addition to the simple indices i, j, etc., conflated indices (C:i, C:j, etc.) are used to refer to semantically light words that co-occur with high-content words but are generally unaligned in the initial alignments. Nodes marked C:i are taken to be related structurally to a (single) high-content node marked i. Nodes marked C:i inherit their alignment links from the node marked i during alignment correction. For example, the conflated index C:i on the right-hand side (RHS) of rule 1.B.X associates two semantically light nodes, node 1 (LightVB) and node 4 (Oblique), with node 3 (Obj), which has the same alignment index (i) as node 1 on the left-hand side (LHS) of the rule. In this case, the initial alignment associated with these two nodes (node 3 in the RHS and node 1 in the LHS) is copied to the two C:i nodes in the RHS. That is, the conflated indices provide the appropriate mechanism to copy an alignment link from a single (high-content) word pair (i.e., (fear, miedo)) to one or more light-content words in the FL (i.e., (fear, tener) and (fear, de)).

Employing such a set of linguistically motivated rules yields promising results, but the improvement over the initial alignment is relatively low, primarily because the coverage of the rules is not adequate to capture common errors made by the initial alignment system. Specifically, the hand-constructed rules are both too general and not general enough:
1. The rules are too general in that they are not tailored to accommodate the specific characteristics of individual alignment systems; thus, common errors made by the initial alignment system are not necessarily addressed by the rules.
2. The rules are too specific in that they are designed to handle a small set of divergence examples that are not necessarily represented by the cases that arise in most data sets.

To overcome these problems, this thesis presents a second rule-based approach where rules are tailored to the initial alignment and are generated automatically using machine learning techniques. This new rule-based approach, Alignment Link Projection (ALP), reduces the number of alignment errors by categorizing the errors made by an initial alignment system. ALP starts with an initial alignment and then fills out (i.e., projects) new word-level alignment relations (i.e., links) from existing alignment relations. ALP then deletes certain alignment links associated with common errors, thus improving precision and recall.

ALP adapts transformation-based error-driven learning (TBL) (Brill, 1993) to the problem of word alignment. Following the TBL formalism, ALP attempts to find an ordered list of transformation rules (within a pre-specified search space) to improve a baseline annotation. The transformation rules decompose the search space into a set of consecutive words (windows) within which alignment links are added to, or deleted from, the initial alignment. This window-based approach exploits the clustering tendency of alignment links, i.e., when there is a link between two words, there is frequently another link in close proximity.

As shown in Figure 1.2, ALP starts with an unannotated parallel corpus (which might be enriched with linguistic features such as POS tags), a ground-truth alignment for this corpus, and a set of rule templates. The first step is word-level alignment of the corpus using an initial annotator, which is usually an existing word alignment system, to obtain a word-aligned (annotated) corpus. Next, for each rule template, ALP goes over the corpus and finds the words that satisfy the condition of the template. If the template is applicable to the words in question, a new rule is generated by instantiating the constituents of the template with features of the words. As part of the template instantiation process, ALP records whether the action taken by the rule results in a correct or incorrect alignment link by comparing it against the ground truth. After all the templates are instantiated with all possible values, ALP decides which rule results in the best score by applying the rule to a copy of the current state of the annotated corpus and evaluating against the ground truth. If the best rule results in a higher score than the previous alignment, ALP updates the final state of the annotated corpus by applying the rule. This process of instantiating templates, finding the best rule, and comparing its application against the ground truth is repeated until the best rule found in the current iteration yields a lower score than the previous alignment, or until the maximum number of rule applications is reached.

[Figure 1.2: Alignment Link Projection (ALP)]

The initial templates consider consecutive words (of size 1, 2 or 3) in both languages. The condition portion of a rule template tests for the existence of an alignment link between two words. The action portion involves the addition or deletion of an alignment link.
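To make the condition/action mechanism concrete, here is a minimal sketch (a hypothetical Python rendering, not the thesis implementation) that applies one such template to a set of links; the template it encodes is the one walked through in the next paragraph.

```python
# Hypothetical sketch of applying one ALP-style template.
# A link (i, j) means English word e_i is aligned to foreign word f_j (0-based here).

def f_unaligned(j, links):
    """True if foreign word f_j carries no link in the given alignment."""
    return all(jj != j for _, jj in links)

def apply_add_template(links):
    """Condition: (NULL, j), (i, j+1) -- f_j is unaligned and (i, j+1) is a link.
    Action: add the link (i, j)."""
    new_links = set(links)
    for i, jj in links:
        j = jj - 1
        if j >= 0 and f_unaligned(j, links):
            new_links.add((i, j))
    return new_links

# e = [fears], f = [tiene, miedo]; the initial aligner linked only fears-miedo.
print(apply_add_template({(0, 1)}))  # {(0, 1), (0, 0)}: fears-tiene is projected
```

In the full system, a rule instance would also test linguistic features of the words involved (POS tags, dependency relations, word classes) before the action fires; the sketch keeps only the positional part of the condition.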
For example, assuming an alignment link between the words ei and fj is represented by (i, j), the rule template

    Condition: (NULL, j), (i, j+1)
    Rewrite rule: add (i, j)

is applicable only when a word (ei) in one language is aligned to the second word (fj+1) of a phrase (fj, fj+1) in the other language, and the first word (fj) of the phrase is unaligned in the initial alignment. The action taken by this rule template is to add a link between ei and fj.

ALP employs three different sets of templates to project new alignment links or delete existing links in a given alignment:

1. Expansion of the initial alignment according to another alignment,
2. Deletion of spurious alignment links, and
3. Correction of multi-word (one-to-many or many-to-one) correspondences.

ALP requires only a small amount of manually aligned data for extracting transformation rules, a major strength of the system. Any existing word-alignment system may be used for the initial annotation. Note that these initial aligners are treated as black boxes in the ALP framework. The instantiation of the templates is based on linguistic features of words, such as POS tags, dependency relations, and semantic-based word classes, rather than the words themselves. Using these features is what sets ALP apart from existing combination approaches, such as the refined method (Och and Ney, 2000b; Koehn et al., 2003). The selection of the best rules may be accomplished using two different metrics: the accuracy of the rule, or the overall impact of the application of the rule on the entire data.

As shown in Chapter 4, ALP provides significant improvements over various initial aligners and, moreover, the improvements are not specific to any language pair. The alignment link projection approach yields a significantly lower alignment error rate than that of the best-performing alignment system (a 22.6% relative reduction on English-Spanish data and a 23.2% relative reduction on English-Chinese data).

1.2.2 Combining Multiple Alignments

The second alignment framework combines existing word alignments into an improved alignment by taking advantage of the merits of different systems rather than building a new system from scratch. This method relies on the concept of classifier ensembles and attempts to reduce the number of alignment errors by cross-validating a given alignment system against another one. Classifier ensembles have been shown to perform better than individual systems in several machine learning applications, but have never been used on word alignments.

At the core of the multiple-alignment framework is an engine called Multi-Align, which treats each alignment system as a classifier, so that the alignment combination problem is transformed into a classifier ensemble problem. Multi-Align combines any number of word alignment systems, regardless of the assumptions the individual aligners make or the resources they employ. One of the motivations behind Multi-Align is to be able to incorporate linguistic knowledge into the alignment modeling. As described in Section 2.2, different pieces of linguistic knowledge, such as POS tags and dependency trees, have been shown to improve word alignments. Multi-Align provides a mechanism for combining linguistically-informed alignment approaches with statistical aligners without the need for complex modifications to existing systems. Figure 1.3 illustrates the Multi-Align design.

[Figure 1.3: Multi-Align Framework for Combining Multiple Alignments]
In this framework, first, n different word-alignment systems, A1, ..., An, generate word alignments between a given English sentence and an FL sentence. Then a Feature Extractor takes the output of these alignment systems and the parallel corpus (which might be enriched with linguistic features) and extracts a set of feature functions based on linguistic properties of the words and the input alignments. Each feature function hm is associated with a model parameter λm. Next, an Alignment Combiner uses this information to generate a single word-alignment matrix based on the extracted feature functions and the model parameters associated with them. The contribution of each feature function to a particular alignment link is proportional to the model parameter associated with that feature. In a final step, a Filterer filters the alignment links according to their confidence in the final alignment matrix. The decision to include a particular alignment link in the final alignment is based on a confidence threshold.

The parameters of Multi-Align can be set manually or learned via machine learning algorithms. To illustrate the generality of the framework, these parameters are set manually to implement three alignment combination approaches: intersection, union, and a refined alignment approach called grow-diag-final (Koehn et al., 2003). Later, two different methods are presented for learning the model parameters automatically using machine-learning techniques. The first method is a combination module based on weighted summation of the model parameters and feature functions. A simple but effective method based on (single-layer) perceptrons (Rosenblatt, 1958) is used to learn the model parameters and the confidence threshold. Perceptrons enable word-alignment improvements, but they are only capable of solving problems where a linear solution exists. Unfortunately, word alignment is not a linearly separable problem.

To overcome this problem, the Multi-Align framework has been implemented as a neural-network based classifier-ensemble approach called NeurAlign. Neural nets with two or more layers and non-linear activation functions are capable of learning any function of the feature space with arbitrarily small error, and they have been shown to be effective for combining classifiers (Hansen and Salamon, 1990). Neural nets have been shown to be effective especially with (1) high-dimensional input vectors, (2) relatively sparse data, and (3) noisy data with high within-class variability, all of which apply to the word alignment problem.

Results presented in Chapter 6 indicate that, even with only two input alignments, NeurAlign yields a significant 28-39% relative error reduction over the best of the input alignment systems and a significant 20-34% relative error reduction over the best known alignment combination technique on English-Spanish and English-Chinese data. Also, using as many input alignments as possible produces improved word alignments.

Multi-Align is a general enough framework to be easily tunable to applications where alignments are utilized. Users are able to tune parameters to control the effects of input alignment systems or additional resources. Moreover, Multi-Align is an easy-to-adapt, robust framework that takes advantage of already existing word alignment systems.
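Pulling the pieces of this section together, the sketch below (hypothetical feature functions and weights, not the thesis code) scores each candidate link as a weighted sum of feature-function values and keeps the links whose score clears the confidence threshold. With membership in each input alignment as the only features and unit weights, a threshold equal to the number of aligners yields the intersection and a threshold of one yields the union, which is one way the manually-set intersection and union configurations mentioned above could be realized.

```python
# Hypothetical sketch of the weighted combination-and-filter step (not the thesis code).
# Each feature function h_m scores a candidate link (i, j); its weight plays the
# role of the model parameter associated with that feature.

def combine(candidates, feature_fns, weights, threshold):
    """Keep a link when the weighted sum of its feature values reaches the threshold."""
    kept = set()
    for link in candidates:
        score = sum(w * h(link) for h, w in zip(feature_fns, weights))
        if score >= threshold:
            kept.add(link)
    return kept

# Two input alignments, used as the only feature functions.
A1, A2 = {(0, 0), (1, 1)}, {(1, 1), (2, 2)}
features = [lambda link: link in A1, lambda link: link in A2]

candidates = A1 | A2
print(combine(candidates, features, weights=[1.0, 1.0], threshold=2.0))  # intersection: {(1, 1)}
print(combine(candidates, features, weights=[1.0, 1.0], threshold=1.0))  # union of A1 and A2
```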
The impact of improved word alignments on machine translation is also investigated in this thesis. As shown in Chapter 7, improved word alignments lead to a better MT system, based on automated MT evaluation metrics.

1.3 Contributions

This thesis yields the following contributions:

• Error analysis on word alignments based on linguistic properties of the words, such as POS tags, dependency relations, and semantic-based classes.
• Introduction and implementation of a word alignment improvement module that targets translation divergences.
• A novel word alignment correction system that characterizes alignment errors systematically using machine learning, and improves word alignments by correcting frequently occurring errors.
• The first use of classifier ensembles on the word alignment problem and the introduction of a new framework for combining different word alignments, which can take any number of aligners as input, regardless of their underlying models and the resources they employ.
• The easy integration of linguistic knowledge into statistical models without the need for large modifications to existing word alignment systems.
• The implementation of two neural-network based alignment combination algorithms to improve word alignments within the combination framework.
• The analysis of alignments that elucidates the sources of improvements by one alignment over another.
• The investigation of the impact of improved word alignment on machine translation (specifically on phrase-based MT).

Although this thesis focuses on improving word alignments, it also contributes to other fields in Computer Science by:

1. Demonstrating that error-driven learning techniques are useful for improving existing systems and that they can be used easily to eliminate the need for handling each system separately.
2. Introducing an application of classifier ensembles in a new field and demonstrating that classifier ensembles perform better than individual classifiers in this new field.
3. Demonstrating that neural networks can be successfully and easily applied to a new domain.
4. Demonstrating that linguistic knowledge can be easily integrated into existing systems without changing the internals of existing systems.

1.4 Thesis Layout & Brief Overview of Chapters

This thesis includes eight chapters, as briefly described below:

• Chapter 2 presents earlier word-alignment work. The well-known IBM models and HMM-based alignment systems are described, as well as different techniques to extend or combine these models. Evaluation of word alignments is also discussed.
• Chapter 3 presents a novel method, DUSTer, to identify translation divergences, i.e., structural differences between languages, and to correct word alignments related to these divergences using linguistically motivated rules.
• Chapter 4 presents a new technique, ALP, to systematically categorize the errors made by an initial alignment system, and to correct those errors for an improved word alignment. The technique of transformation-based error-driven learning is adapted to the problem of learning word alignments, and the learned patterns are then used to project new alignment links from existing links.
• Chapter 5 first presents a brief survey on ensembles of classifiers, which are at the core of the work presented in this thesis, and their application to NLP problems. A new and general framework, Multi-Align, is introduced that combines the outputs of different word alignment systems. A simple but effective combination method based on linear weighting of the outputs of individual aligners is described.
• Chapter 6 presents a new method, NeurAlign, to combine different word alignments using an additional, more complex model. A brief overview of neural networks, the core of the combination technique, is given. An investigation into how neural networks can be used effectively for generating an ensemble of word alignments is presented.
• Chapter 7 presents an analysis of improvements using various measures, and investigates the effects of improved word alignments in the context of another application, specifically phrase-based machine translation.
• Chapter 8 concludes with overall observations and conclusions, limitations of the methods, and future directions.

Chapter 2
Related Work

Parallel texts are texts accompanied by their translation in one or more languages. History abounds with parallel texts, such as contracts, treaties, sacred writings, and literature, dating from just about every period and involving nearly every pair of languages. In many cases, the parallelism was only virtual (in today's terms) because the texts and their translations were not written on the same physical medium. The first known source of parallel texts is the Rosetta stone, which conveyed the honors presented to King Ptolemy V by the temples of Egypt, in two languages (Greek and Egyptian) and three writing systems (Véronis, 2000).

The first few attempts at using parallel texts, such as machine translation in the late Fifties, were limited by the storage and computing capacities of computers, along with the difficulty of creating electronic texts. Between 1970 and 1990, two groups (Bell Communications Research, and the IBM T. J. Watson Research Center) collected a French-English corpus containing over fifty million words taken from transcriptions of debates in the Canadian Parliament, which is known as "The Hansards." In addition to a verbatim record of the proceedings and its translation, the Hansards include session numbers, names of speakers, time stamps, question numbers, and indications of the original language in which each speech was delivered. Consequently, this corpus became a de facto gold standard for developing and testing systems, opening the way for parallel text processing. The incredibly rapid growth of the Web and the development of several techniques to collect data from the Internet have led to a huge increase in parallel resources in electronic form (Resnik, 1998; Koehn, 2002; Resnik and Smith, 2003). Since then, parallel texts have become one of the most commonly used resources in natural language processing.

The most critical task in parallel text processing is alignment, i.e., detection of corresponding units between two texts of different languages. Given parallel texts in two different languages, alignment can be extracted at the level of sections, paragraphs, sentences, phrases, collocations, or words. Other logical approaches involve aligning the parse trees of a sentence and its translation (Matsumoto et al., 1993; Meyers et al., 1996), or simultaneously generating parse trees and alignment arrangements (Wu, 1995).

The rest of this chapter will describe some of the techniques that have been used to solve three alignment problems: sentence-level alignment, word-level alignment, and phrase-level alignment. The applications that use word alignment are then described. The chapter concludes with a discussion of techniques for evaluating word alignments.
2.1 Sentence-Level Alignment

Alignment of parallel texts at the sentence level is an important problem, and it is usually the first step toward aligning parallel texts at the word level. Due to the success of sentence alignment algorithms on the Hansards corpus in the early 1990s, sentence alignment was considered a trivial task, but it was later noticed that it is not easy, for the following reasons:

• The electronic texts may contain physical noise, such as OCR errors.
• A single sentence in one language may be translated into two or more sentences in the other language, or the content may be distributed across multiple translated sentences (Wu, 1994).
• A sentence, or even a whole passage, may be missing from one or the other of the texts.
• The texts may have presentational differences, such as different places for floating figures and tables, footnotes, headers, different orders in glossaries, etc. (Véronis, 2000).

Various methods have been proposed for solving sentence alignment problems. These methods can be classified into two groups:

1. Lexical methods: Word correspondences in the aligned sentences are used (Catizone et al., 1989; Kay and Roscheisen, 1993). Which word in one text corresponds to which word in the other text is based on the similarity of their distributions, where the similarity measure is usually the Dice coefficient (Dice, 1945).
2. Statistical methods: Only internal information given in the parallel text is used, without making any direct assumptions about the lexical content of the sentences (Gale and Church, 1991a; Brown et al., 1991b). The common strategy is to use the lengths of sentences (either in number of words or number of characters) to guide the search space.

There has been additional research on enrichment of statistical methods with various linguistic resources, such as dictionaries, lists of cognates[1], etc. (Simard et al., 1992; Church, 1993; Melamed, 1996a; McEnery and Oakes, 1995). When sentence-length correlation between the languages is not high because of different character sets and grammatical and rhetorical differences between the two languages (for example, Chinese and English), lexical knowledge has been shown to be necessary for more accurate sentence alignment (Fung and Church, 1994; Wu, 1994; Haruno and Yamazaki, 1996; Davis et al., 1995; Dagan et al., 1993).

[1] The term cognate refers to pairs of tokens of different languages that share obvious phonological or orthographic and semantic properties, leading to their use in mutual translations.

Much of the work on sentence alignment is based on alignment at the word level. For example, the Smooth Injective Map Recognizer (SIMR) first identifies likely points of correspondence between the two texts using translation lexicons or cognates, and then uses these word correspondences to align sentences (Melamed, 1996a). Thus, the problems of word alignment and sentence alignment are interconnected.

Although sentence-level alignment is still an important issue, in this thesis it is assumed that the parallel texts have already been aligned at the sentence level. This thesis focuses on improving word alignments.

2.2 Word Level Alignment

Word alignment is arguably a more complicated problem than sentence alignment. Figure 2.1 shows an alignment of words between the English sentence She will fear her enemies and the Spanish sentence Ella tendrá miedo de sus enemigos.

[Figure 2.1: Word Alignment Example]
Figure 2.1: Word Alignment Example

The alignments may be one-to-many (fear is aligned with three Spanish words: tendrá miedo de) and many-to-one (will fear is aligned with the Spanish verb tendrá). It is often difficult to decide, even for a human, just which words in an original are responsible for a given word in a translation. Moreover, some words apparently translate morphological or syntactic phenomena rather than other words, i.e., functional words. However, it is relatively easy to establish correspondences between words such as proper nouns and technical terms, so that partial alignment at the word level is often possible (Véronis, 2000).

Some of the issues causing problems for word alignment are mentioned in (Kay and Roscheisen, 1993; Ker and Chang, 1997; Véronis, 2000):

• Words are not always aligned one-to-one because some languages are wordier than others.

• Some languages make morphological distinctions that are absent in the other (German, for example, makes a number of case distinctions, especially in adjectives, that are not reflected in the morphology of English).

• Paraphrased and free translations: for various reasons, such as language typology, style, and cultural differences, a translator does not always translate literally on a word-by-word basis. Words are frequently added or deleted.

• There is a high percentage of function words (about 50% of the tokens in any text), for which there is even less of a one-to-one correspondence than for content words. Function words frequently translate into an affix, positional information, part of an expression, a phrase, or even nothing at all.

In addition, the same problems that arise with sentence alignment arise for word alignment because many sentence alignment methods use (often partial) word alignments as anchor points (Kay and Roscheisen, 1993; Dagan et al., 1993; Fung and Church, 1994). A variety of techniques, including bootstrapping and relaxation, perform the two types of alignment at the same time. In fact, the bitext-correspondence problem (a bitext correspondence is a pair of tokens on the two sides of the text that can be translated to each other; the unit of correspondence need not be a single word but may be a multi-word unit (Melamed, 1996a)) has often been viewed as just one instance of the more general translation analysis problem, which is known to be AI-complete (Isabelle et al., 1993). In other words, solving the bitext correspondence problem is no easier than any significant AI problem. The first issue is robustness: as pointed out in (Church, 1993), "real texts are noisy." For instance, parts of the source text are omitted in the target text or end up in a different order. The second issue is accuracy: even when the texts are clean, there are hard decisions to make in the alignment (Simard and Plamondon, 1998). During sentence alignment, word alignment is not the primary goal, so the methods are usually tolerant of noise in the text. In contrast, when the primary goal is word alignment, one can no longer settle for rough and partly erroneous alignments. For this purpose, various researchers have focused directly on filtering out noise in alignments and extractions (Dagan and Church, 1994; Melamed, 1996b; Resnik and Melamed, 1997; Jones and Alexa, 1997).

Given two sentences e = e_1,...,e_i,...,e_I and f = f_1,...,f_j,...,f_J that are translations of each other, there are I × J different connections that can be drawn between e and f, because each of the J words in f can be connected to any of the I words in e. Since an alignment is determined by the connections that it contains,
and since a subset of the possible connections can be chosen in 2^{I × J} ways, there are 2^{I × J} possible alignments. Different word alignment models reduce the search space by making further assumptions about the types of connections allowed between the words. For example, in IBM Model 4 (Brown et al., 1993), each source word f_j can align to exactly one target word e_i, or to the null word, while target words can link to any number of source words. As a result, the number of possible alignments is (I + 1)^J, which is still exponential in the sentence length.

Word alignment techniques fall into two groups: similarity-based heuristic methods and statistical (or corpus-based) methods. In the rest of this section, both approaches are described in detail.

2.2.1 Heuristic Methods

The common strategy is to find similarities between words in two texts using correlation measures based on co-occurrence counts. The key idea for robust statistical modeling is to look not only at the places where two words co-occur, but also at the places where only one of the two occurs and the places where neither occurs. In one of the early studies, the correlation measure χ² was used for this purpose (Gale and Church, 1991b). (According to Melamed (2000), the G² statistic suggested by Dunning (1993) slightly outperforms χ².)

\chi^2(w_1, w_2) = \frac{(ad - bc)^2}{(a+b)(a+c)(b+d)(c+d)}

where a is the number of places where w_1 and w_2 co-occur, b is the number of places where w_1 occurs but not w_2, c is the number of places where w_2 occurs but not w_1, and d is the number of places where neither occurs. The χ² measure is used progressively to align words with each other. That is, on a portion of the corpus, χ² is computed for all pairs occurring in that portion. After selecting the best pairs, the same procedure is repeated on another subset of the corpus, enhancing the set of best pairs at each iteration. The final set of best pairs is used to align the words in each pair.

The Smooth Injective Map Recognizer (SIMR) is a greedy algorithm for finding bitext correspondences (Melamed, 1996a). SIMR produces sentence-level alignments, but it makes use of word alignment techniques as a component of the algorithm. SIMR relies on the high correlation between the lengths of mutual translations, as in (Gale and Church, 1991a; Brown et al., 1991b), and infers bitext maps from likely points of correspondence between the two texts (Church, 1993). Unlike previous methods, SIMR searches for a handful of points of correspondence at a time, choosing only those points whose geometric arrangement most resembles the typical arrangement of true points of correspondence (TPCs). This selection involves localized pattern-recognition heuristics for guessing whether a given point in the bitext space is a TPC. A translation lexicon or a set of cognates can be used for this decision. Melamed claims that a set of TPCs leads to alignment more directly than a translation model or a translation lexicon because TPCs are a relation between token instances, not between token types. Another advantage of SIMR is that it handles inversions and word-order differences. Moreover, a set of correspondence points, supplemented with sentence boundary information, can be used to express sentence correspondences. It has been demonstrated that SIMR can easily be ported to a new language pair and is applicable to any text genre in any pair of languages (Melamed, 1997a).
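As a concrete illustration of the co-occurrence-based association measures used by these heuristic methods, the following is a minimal sketch (not any published implementation) of computing the χ² score from a 2×2 co-occurrence table collected over aligned sentence pairs; the helper names and toy data are purely illustrative.

```python
def chi_square(a, b, c, d):
    """(ad - bc)^2 / ((a+b)(a+c)(b+d)(c+d)) from a 2x2 co-occurrence table:
    a = regions where both words occur, b = only w1, c = only w2, d = neither."""
    num = (a * d - b * c) ** 2
    den = (a + b) * (a + c) * (b + d) * (c + d)
    return num / den if den else 0.0

def cooccurrence_table(w1, w2, sentence_pairs):
    """Count a, b, c, d for source word w1 and target word w2 over a list of
    (source_tokens, target_tokens) sentence pairs."""
    a = b = c = d = 0
    for src, tgt in sentence_pairs:
        if w1 in src and w2 in tgt:
            a += 1
        elif w1 in src:
            b += 1
        elif w2 in tgt:
            c += 1
        else:
            d += 1
    return a, b, c, d

# Toy data: two aligned sentence pairs (hypothetical).
pairs = [("she will fear her enemies".split(),
          "ella tendra miedo de sus enemigos".split()),
         ("she will win".split(), "ella ganara".split())]
score = chi_square(*cooccurrence_table("fear", "miedo", pairs))
```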
SIMR?s output quality is highly influenced by the coverage of the transla- tion lexicon used to find correspondence points. In subsequent studies, Melamed extended SIMR to eliminate the problem of indirect associations, i.e., frequent co-occurrences of words that are not translations of each other, by taking into account several features of the tokens such as POS information, word ordering, word classes, and existence in another bilingual lexicon, using simple heuristics (Melamed, 1997b, 2000). The proposed model generates translation lexicons with precision and recall both exceeding 90%. In a recent study (Probst and Brown, 2002), the effects of the dependence on the translation lexicon are examined. A simple bilingual lexicon is enhanced by taking advantage of inflectional and derivational morphology. Improved dictio- naries were shown to yield statistically significant improvements in the quality of word alignment. In other approaches (Simard and Plamondon, 1998), the robustness of char- acter based methods such as char align (Church, 1993) and the accuracy of lexical- based methods, such as (Chen, 1993; Dagan et al., 1993), are combined. This is a two-step strategy: first compute a bitext map, working on robustness rather than 33 accuracy, and second use this map to constrain the search space for the computa- tion of sentence alignment, this time relying on a method that favors accuracy over robustness or efficiency. The initial bitext mapping is computed using a program that is called Jacal (Just another cognate alignment program) that is similar in flavor to Church?s char align and Melamed?s SIMR (Melamed, 1996a). One differ- ence is that, instead of using translation lexicon, Jacal uses isolated cognates, which results in an improvement in robustness. The results are comparable with other methods for Hansards corpus but not quite accurate for other corpora. Simard claims that this is caused by the segmentation errors in the corpus. Similarity-based approaches on their own are usually successful for frequent words that are consistently aligned to one translation. However, words that are less frequent or exhibit diverse translations generally do not have statistically sig- nificant evidence for a confident alignment, resulting in incomplete or incorrect alignments. Ker presents a word alignment algorithm based on thesaurus classifi- cation and relies on an automatic procedure to acquire class-based alignment rules (Ker and Chang, 1997). The basic motivation is that a word?s translational devi- ation is mostly bound within the relevant semantic classes. The word classes are adopted from the categories in Roget?s thesaurus.4 Experimental results indicate that a classification based on existing thesauri is highly effective in broadening coverage (over 80%) while maintaining a high precision rate. 4Word classes can also be obtained by using parts-of-speech classes (Chang and Chen, 1994). 34 2.2.2 Statistical Methods At the core of any statistical machine translation system is a word-level alignment model. The IBM models (Brown et al., 1993) were among the first statistical alignment approaches, in which the goal was to translate a foreign-language (FL) text into English, i.e., a FL string f = f1,...,fj,...,fJ is translated into an English string e = e1,...,ei,...,eI. Among all possible English translations, the one with the highest probability ?e is chosen by an application of Bayes? 
decision rule:

\hat{e} = \arg\max_e \, p(e|f) = \arg\max_e \, \{ p(e) \cdot p(f|e) \}

p(e) is called the language model and p(f|e) is the string translation model. The argmax operation can be characterized as a search problem. A key issue in modeling the string translation probability p(f|e) is the definition of the correspondence between the words of the English sentence and the words of the foreign-language sentence. Typically, a pairwise dependence is inferred by considering all possible pairs of words. Models describing these types of dependencies are referred to as alignment models. Five IBM-style alignment models (IBM Model 1, IBM Model 2, and so on) were designed to assign a probability to each of the possible word-by-word alignments (Brown et al., 1993). The models use minimal linguistic information, so they are applicable to any language pair as long as a sufficiently large parallel text is available.

The IBM models are constrained by assigning each FL word to at most one English word. A hidden alignment variable a = a_1^J is introduced, where each a_j takes a value between 0 and I, and I and J are the lengths of the English and FL sentences, respectively. The value of a_j represents the position of the target word e_{a_j} to which the source word f_j corresponds: if the word in position j of the FL string is connected to the word in position i of the English string, then a_j = i, and if it is not connected to any English word, then a_j = 0. Using this hidden alignment variable, the likelihood p(f|e) is written as follows:

p(f|e) = \sum_a p(f, a|e)

A Viterbi alignment \hat{a} of a specific model is an alignment that maximizes p(f, a|e):

\hat{a} = \arg\max_a \, p(f, a|e)

Model 1 assumes that the alignment positions are independent of each other and that the alignment distribution is uniform. In this case,

p(f|e) = \sum_a \prod_{j=1}^{J} p(a_j) \, p(f_j|e_{a_j}) = \frac{\epsilon}{(I+1)^J} \prod_{j=1}^{J} \sum_{i=0}^{I} p(f_j|e_i)

It is easy to show that for Model 1 the most likely alignment \hat{a} of e and f is given by:

\hat{a} = \arg\max_a \prod_{j=1}^{J} p(f_j|e_{a_j})

Since the alignment positions are independent of each other, the most likely alignment \hat{a} can be found by choosing, for each j, the value of a_j that leads to the highest value of p(f_j|e_{a_j}).

Model 2 discards the uniform alignment distribution by making a_j dependent on the positions of the words and on the lengths of the sentences:

p(f|e) = \sum_a \prod_{j=1}^{J} p(a_j|j, I, J) \, p(f_j|e_{a_j})

Models 1 and 2 capture some interesting correspondences between words, but they are usually incapable of connecting one word to multiple words successfully. Models 3 and 4 introduce an additional fertility model p(\phi_i|e_i), describing the number of words aligned to the English word e_i, and a distortion model d(j|i, I, J), which describes the movement of words. In Model 3, for instance,

p(f, a|e) = \binom{J - \phi_0}{\phi_0} \, p_0^{J - 2\phi_0} \, p_1^{\phi_0} \, \prod_{i=1}^{I} \phi_i! \, p(\phi_i|e_i) \, \prod_{j=1}^{J} p(f_j|e_{a_j}) \, d(j|a_j, I, J)

with \sum_f p(f|e) = 1, \sum_j d(j|a_j, I, J) = 1, \sum_\phi p(\phi|e) = 1, and p_0 + p_1 = 1.

Model 5 was designed to address the deficiency problem, i.e., the tendency in Models 3 and 4 to align an increasingly higher number of words to empty words on each iteration, and was shown to be better than the previous models. (The alignment to empty words arises when the model assigns a high probability to generalized strings, i.e., strings where some positions are aligned with several words and others with none.) However, Model 4 performs nearly as well as Model 5, at a much lower computational cost.
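Because Model 1 treats the alignment positions independently, its Viterbi alignment can be computed with a simple per-position argmax, as described above. The sketch below assumes a translation table t given as a plain dictionary mapping (source word, target word) pairs to probabilities; this is an illustrative data structure, not the format used by any particular toolkit.

```python
NULL = "<null>"

def model1_viterbi(f_words, e_words, t):
    """IBM Model 1 Viterbi alignment: for each source position j, pick the
    English position i (0 = the empty word) that maximizes t[(f_j, e_i)].
    Returns a list a where a[j] is the chosen English position."""
    targets = [NULL] + list(e_words)          # position 0 is the empty word
    alignment = []
    for f in f_words:
        best_i = max(range(len(targets)),
                     key=lambda i: t.get((f, targets[i]), 1e-12))
        alignment.append(best_i)
    return alignment

# Hypothetical translation probabilities for the running example.
t = {("ella", "she"): 0.9, ("tendra", "will"): 0.2, ("tendra", "fear"): 0.3,
     ("miedo", "fear"): 0.6, ("de", NULL): 0.4, ("sus", "her"): 0.5,
     ("enemigos", "enemies"): 0.7}
a = model1_viterbi("ella tendra miedo de sus enemigos".split(),
                   "she will fear her enemies".split(), t)
# a = [1, 3, 3, 0, 4, 5]  (indices into [NULL] + English words)
```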
The key ideas of each model are summarized in Table 2.1 (taken from (Och and Ney, 2000b)).

IBM-1: All alignments have the same probability.
IBM-2: A zero-order alignment model p(a_j|j, I, J) is applied, in which different alignment positions are independent of each other.
IBM-3: An inverted zero-order alignment model p(j|a_j, I, J) is applied, with an additional fertility model p(\phi|e) that describes the number of words \phi aligned to an English word e.
IBM-4: An inverted first-order alignment model p(j|j') with an additional fertility model p(\phi|e) is applied.
IBM-5: A reformulation of IBM-4 with a refined alignment model to address the deficiency problem.

Table 2.1: Summary of IBM Models

The IBM models were later extended by learning word classes automatically and using those classes instead of the actual words during modeling (Och and Weber, 1998). Researchers have also investigated enhancements to the IBM-3 and IBM-4 models for the purpose of addressing the deficiency problem (Och and Ney, 2000b).

Another problem with the IBM models is that they consider only one direction in the translation model, i.e., p(f_j|e_i). Zens et al. (2004) improved the IBM models with a symmetric translation model, applying linear and log-linear interpolation to the probabilities in the two directions. The same study also proposes a discounted smoothing technique for an improved translation model to overcome the data-sparseness problem seen in highly inflected languages such as German. The goal is to smooth the lexicon probabilities of the full-form words using a probability distribution that is estimated from the word base forms, i.e., stems.

Another improvement to the IBM models is the addition of context dependencies using a maximum entropy (ME) model, which is directly integrated into the EM training (García-Varea et al., 2002). The goal is to include more dependencies, i.e., a larger context, in the translation model rather than using only p(f_j|e_{a_j}). In an ME framework, the properties of e that are deemed useful are described in the form of feature functions \phi_{e,k}(x, f), where x is the context of e. For example, the presence or absence of a specific word e'_k in the context of an English word e, which can be translated as f'_k, can be represented by the following feature function:

\phi_{e,k}(x, f) = \begin{cases} 1 & \text{if } f = f'_k \text{ and } e'_k \in x \\ 0 & \text{otherwise} \end{cases}

Assuming that p_e(f|x) is the probability that the ME model associated with e assigns to f in the context x, and that the feature functions for a specific word e are \phi_{e,k}(x, f), k = 1, ..., K_e,

p_e(f|x) = \frac{1}{Z_{\lambda_e}(x)} \exp\left( \sum_{k=1}^{K_e} \lambda_{e,k} \, \phi_{e,k}(x, f) \right)

Here, \lambda_e = \{\lambda_{e,1}, ..., \lambda_{e,K_e}\} and Z_{\lambda_e}(x) is a normalization factor. The parameter values \lambda_{e,k} that maximize the likelihood of a given training corpus can be optimized using generalized iterative scaling (Darroch and Ratcliff, 1972) or improved iterative scaling (Berger et al., 1996). Although a different ME model must be learned for each target word e, this process is usually subject to thresholding. The context of a word e is defined as the three words to its left and the three words to its right. In addition to the dependence on the actual words, the approach also models dependence on word classes. Alignment error rate is reduced with respect to the IBM models.
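To make the ME formulation concrete, the following is a minimal sketch (an illustration, not García-Varea et al.'s implementation) of scoring candidate translations f of a word e in a context x with binary feature functions and a locally normalized exponential model; the candidate translations, features, and weights are hypothetical.

```python
import math

def me_prob(f, x, candidates, features, weights):
    """p_e(f | x) = exp(sum_k lambda_k * phi_k(x, f)) / Z(x), where Z(x) sums
    the same exponentiated score over all candidate translations of e.
    `features` is a list of binary functions phi(x, f) -> 0 or 1."""
    def score(fc):
        return math.exp(sum(w * phi(x, fc) for w, phi in zip(weights, features)))
    z = sum(score(fc) for fc in candidates)
    return score(f) / z

# Hypothetical features for e = "fear": is "will" in the context and is the
# candidate the Spanish noun "miedo"; is the candidate the verb "temer"?
features = [lambda x, f: 1 if ("will" in x and f == "miedo") else 0,
            lambda x, f: 1 if f == "temer" else 0]
weights = [1.2, 0.4]
p = me_prob("miedo", x=["she", "will", "her", "enemies"],
            candidates=["miedo", "temer", "temor"],
            features=features, weights=weights)
```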
Moore (2004) describes two limitations of Model 1 that are not deeply structural and that can be addressed merely by changing how the parameters of Model 1 are estimated. The first of these nonstructural problems is that rare words in the source language tend to act as "garbage collectors" (Brown et al., 1993; Och et al., 2004), aligning to too many words in the target language. The second nonstructural problem is that the model aligns too few target words to the null source word. To address the first problem, at each iteration of EM all the translation probability estimates are smoothed by adding virtual counts according to a uniform probability distribution over all target words. The second problem is addressed by adding extra null words to each source sentence and multiplying all the translation probabilities for the null word by the number of null words per sentence. Moore (2004) also proposes a heuristic model based on the log-likelihood-ratio statistic recommended by Dunning (1993), which has also been shown to be useful for other applications (Melamed, 2000). The goal is to improve Model 1 alignment accuracy by starting with a better set of initial parameter estimates. These three simple modifications achieve excellent results, nearly comparable to those of the more complex IBM models.

The five IBM models were re-implemented during the JHU Workshop in 1999, and the resulting package, GIZA, has become a classic in the MT literature (Al-Onaizan et al., 1999). GIZA was later improved by addressing several problems in the implementation and deficiencies of the models (described below). Since then, the resulting software, GIZA++ (Och, 2000), has become the state-of-the-art statistical alignment package.

An alternative and widely used alignment model is a hidden Markov model, in which the alignment probabilities depend on the differences between alignment positions rather than on absolute positions (Vogel et al., 1996). The basic motivation is the existence of strong localization effects when aligning the words in parallel texts, especially for language pairs from the Indo-European family, where words are not distributed arbitrarily over the sentence positions but tend to form clusters. For some languages (German, for example), there is an even stronger restriction because the difference in the position index is usually smaller than 3. The basic idea behind the HMM-based alignment model is that the English word aligned to a FL word depends on the English word to which the previous FL word is aligned. (If f_j is aligned with e_i, then a_j = i; the HMM states that a_j depends on the previous alignment a_{j-1}.) Formally,

p(f|e) = \sum_a \prod_{j=1}^{J} p(a_j | a_{j-1}) \, p(f_j | e_{a_j})

HMM-based alignment models, like the IBM models, also suffer from the restriction of aligning each source word to at most one target word. An alternative HMM-based alignment model makes use of the following observation (Tillmann et al., 1997): over large portions of the source string, the alignment is monotone. For the alignment model, the monotonicity property allows only transitions from a_{j-1} to a_j with a jump width \delta. If \delta = 0, a target word is aligned with two or more source words; if \delta = 1, a single new target word is generated; and if \delta = 2, there is a word in the target string with no aligned word in the source string.

The HMM-based alignment model has been further extended (Och and Ney, 2000a) in three different ways:
1. Use of equivalence classes over both languages for a refined alignment model.

2. Empty-word handling: the HMM network is extended by I empty words, i.e., the English word e_i has a corresponding empty word e_{i+I}.

3. Smoothing:

p'(a_j | a_{j-1}, I) = \alpha \cdot \frac{1}{I} + (1 - \alpha) \cdot p(a_j | a_{j-1}, I)

The results show that more sophisticated alignment models are crucial for reducing word-alignment error. Consistently, the use of a first-order alignment model, the modeling of an empty word, and fertilities result in better alignments.

Toutanova et al. (2002) reported further improvements on word alignment by extending HMMs as follows:

1. Incorporating POS tag information for the source and target languages in the translation model,

2. Approximately modeling fertility, and

3. Better modeling of alignments to the NULL target word.

HMMs provide good-quality alignments, better than IBM Models 1-3 and comparable to Model 4, despite the simplicity of the model. (In fact, standard implementations of IBM Model 4, e.g., GIZA++, use HMMs to obtain initial alignments. Lopez and Resnik (2005) demonstrated that their implementation of Model 4 derived most of its performance benefits from the underlying HMM, and that the improvements contributed by Model 4 itself were relatively small compared to those contributed by the HMM.) A log-linear combination of the HMM and IBM Model 4, recently introduced as Model 6, was shown to improve alignments further (Och and Ney, 2003), but at a higher computational cost.

The complexity of the word alignment problem can be reduced by enforcing syntactic bracketing constraints between two sentences using an Inversion Transduction Grammar (ITG) (Wu, 1997). Reordering of the words is modeled by binary branching trees using rules that reverse the order of the words. The trees are generated by synchronously parsing a parallel corpus, using a simple grammar in which a non-terminal symbol generates either a terminal symbol or two non-terminal symbols (either keeping the order of the words the same or reversing it). Generation of a terminal symbol is guided by lexical translation probabilities at the leaves. Inversion transduction grammars cannot handle complex syntactic structures such as noun phrases or verb phrases, but they reduce the complexity of word alignment to polynomial time. Later, Zhao and Vogel (2003) extended the bilingual bracketing approach to enforce additional constraints on the structure of one language by using English POS tags and base noun-phrase boundaries. In a more recent study, ITGs were extended to condition the grammar production probabilities on lexical information throughout the tree; this study demonstrated that lexicalization yields further improvements on word alignment (Zhang and Gildea, 2005).

Yamada and Knight (2001) used a similar model, transforming a source-language parse tree into a target-language string by applying stochastic operations that capture linguistic differences, such as word order and case marking, at each node. As a result, the space of possible word alignments is constrained by the structure of the input trees. The input trees are generated by an existing parser, and the output of the model is a string, not a parse tree; therefore, parsing is needed only on the channel input side. The tree transformation is achieved by three operations:

1. Reordering: intended to model translation between languages with different word orders, such as SVO languages (English or Chinese) and SOV languages (Japanese or Turkish).
2. Word insertion: intended to capture linguistic differences in specifying syntactic cases (for example, English and French use structural position to specify case, while Japanese and Korean use case-marker particles).

3. Translating leaf words.

The model parameters are estimated using the EM algorithm, and the resulting model produces word alignments that are better than those produced by IBM Model 5. Later, Gildea (2003) extended the tree-to-string alignment model by introducing a clone operation, which copies an entire subtree of the source-language syntactic structure and moves it anywhere in the target-language sentence, in order to handle structural divergences between languages (Dorr, 1994) and free translations in the training data. The same work also introduced a new tree-to-tree alignment model, in which the space of possible alignments is constrained by parse trees on both sides rather than only on the source side. In order to provide enough flexibility, a single node in the source tree may produce two nodes in the target tree, or two nodes in the source tree may be grouped together to produce a single node in the target tree. Tree-to-tree models are trained on parallel treebanks and are thus limited by the size of the training data. Although the tree-to-tree model did not provide significant improvements over the unstructured IBM models, it brought a huge savings in computational complexity. In a later study, Gildea (2004) investigated whether dependency trees were more useful than constituency trees but found, interestingly, that the latter significantly outperformed the dependency-tree model. An empirical comparison of ITGs and tree-to-string models revealed that ITGs significantly outperform the latter (Zhang and Gildea, 2004). Using trees has proven fruitful for improving alignments, but a data-derived tree structure (i.e., the ITG) gives better results than projecting automatic English parser output onto the Chinese string (i.e., the tree-to-string model). Zhang and Gildea (2004) claim that tree-to-string models perform poorly because of parsing errors and the non-isomorphism of parse trees due to structural differences between languages.

A primary drawback of the generative alignment models is that the incorporation of arbitrary (whether linguistic or not) features of the data is difficult. The most common strategy to overcome this problem is to represent words with a set of features and associated weights, and then to model the alignment based on these features and weights. For example, in an early study, associations between two words are viewed as word alignment clues, which can be estimated from association measures on training data (Tiedemann, 2003). Alignment clues are assumed to be independent of each other, and they can be computed from arbitrary feature functions between the words. Possible alignment clues include co-occurrence (the Dice coefficient), string similarity, and translation probabilities extracted from pre-aligned data according to features of the words such as POS tags, phrase categories, word positions, and any other kind of contextual feature. Each alignment clue C_k(e_i, f_j) is associated with a probability p(a_k). The overall clue is C_all(e_i, f_j) = p(a_all) = p(a_1 \cup a_2 \cup ... \cup a_n). For instance, given only two clues, C_all(e_i, f_j) = p(a_1) + p(a_2) - p(a_1 \cap a_2). The clues are assumed to be mutually independent; therefore, C_all(e_i, f_j) = p(a_1) + p(a_2) - p(a_1) \cdot p(a_2).
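Under the mutual-independence assumption, the two-clue combination above generalizes to one minus the product of the clue complements. The following is a minimal sketch of that combination step only; in Tiedemann's clue alignment approach the individual clue probabilities come from association measures such as the Dice coefficient or string similarity, which are replaced here by hypothetical values.

```python
def combine_clues(clue_probs):
    """Overall clue value p(a_1 u ... u a_n) under mutual independence:
    1 - prod_k (1 - p(a_k)).  With two clues this reduces to
    p(a_1) + p(a_2) - p(a_1) * p(a_2)."""
    prob_no_link = 1.0
    for p in clue_probs:
        prob_no_link *= (1.0 - p)
    return 1.0 - prob_no_link

# Hypothetical clue probabilities for the word pair (fear, miedo): a Dice-based
# clue, a string-similarity clue, and a POS-based transfer clue.
c_all = combine_clues([0.55, 0.05, 0.30])   # -> 0.70075
```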
Once a clue confidence matrix has been obtained by computing the overall clue for each pair of words, the best alignment is extracted using a dynamic programming approach.

Similarly, in a recent study, a confidence score s(e_i, f_j) is computed for each pair of words e_i and f_j using a set of features (Taskar et al., 2005). Based on these confidence scores, the word alignment problem is viewed as a maximum-weight bipartite matching problem, where the nodes correspond to the words in the two sentences. The best alignment is taken to be the one that maximizes the sum of the edge scores in the graph while satisfying certain constraints, such as one-to-one alignment. The score of each link is a weighted sum of the features, and the weights are learned using large-margin estimation via support vector machines on a small manually aligned data set. The feature set for a pair of words e_i and f_j includes:

1. The Dice coefficient between e_i and f_j,
2. The absolute difference between i and j,
3. Word-similarity features such as cognateness, substring matching, etc.,
4. Word-frequency features, and
5. IBM Model 4 alignment outputs.

Moore (2005) presents another example of discriminative training. The word alignment is modeled as a weighted sum of several features, and the alignment with the highest sum is selected as the word alignment for that sentence pair. Formally,

\hat{a} = \arg\max_a \sum_{i=1}^{M} \lambda_i \, h_i(a, e, f)

where h_i represents a feature and \lambda_i the weight associated with the feature h_i. The weights of the features are computed using a version of weighted perceptron learning (Collins, 2002). This model is very similar to the model of Taskar et al. (2005): instead of using Dice coefficients and absolute position differences as features and support vector machines to compute the weights, Moore (2005) uses log-likelihood-ratio and non-monotonicity measures as features and averaged perceptron learning to compute the weights. This discriminative model achieves results comparable to the more complex IBM models. The biggest advantages of the model are its low computational complexity and the ease of integrating different pieces of linguistic knowledge into alignment modeling.

As an alternative to the IBM and HMM models, where the alignment is a hidden variable, the alignment can be modeled directly using a different decomposition of the terms (Cherry and Lin, 2003; Ittycheriah and Roukos, 2005; Liu et al., 2005). In this framework, the alignment problem is defined as finding the alignment a that maximizes p(a|e, f), instead of finding the Viterbi alignment as in the IBM models. (The IBM and HMM models search for the alignment that maximizes p(f, a|e), which is equivalent to maximizing p(a|e, f).) Cherry and Lin (2003) define an alignment a to be a set of t links \{l_1, ..., l_t\}, where each l_k = (e_{i_k}, f_{j_k}) for some i_k and j_k. A consecutive subset of a is denoted by l_i^j = \{l_i, l_{i+1}, ..., l_j\}. Given this notation, p(a|e, f) can be decomposed as follows:

p(a|e, f) = p(l_1^t | e, f) = \prod_{k=1}^{t} p(l_k | e, f, l_1^{k-1})

Assuming the context of the k-th link is C_k = \{e, f, l_1^{k-1}\}, each term in the product can be decomposed into two factors: (1) a link probability given a co-occurrence of the two words, and (2) a context probability given a link, divided by the context probability given a co-occurrence of the two words. Formally,

p(a|e, f) = \prod_{k=1}^{t} p(l_k | e_{i_k}, f_{j_k}) \, \frac{p(C_k | l_k)}{p(C_k | e_{i_k}, f_{j_k})}

Unfortunately, C_k = \{e, f, l_1^{k-1}\} is too complex for the context probabilities to be estimated directly.
Suppose FT_k is a set of context-related features such that p(l_k | C_k) can be approximated by p(l_k | e_{i_k}, f_{j_k}, FT_k). Assuming that the features ft \in FT_k are conditionally independent of each other given l_k, and given the word pair (e_{i_k}, f_{j_k}), the problem reduces to:

p(a|e, f) = \prod_{k=1}^{t} \left[ p(l_k | e_{i_k}, f_{j_k}) \prod_{ft \in FT_k} \frac{p(ft | l_k)}{p(ft | e_{i_k}, f_{j_k})} \right]

Unfortunately, this alignment model has very few factors to prevent undesirable alignments, such as having all source words align to the same target word. To guide the search over all possible alignments, two constraints are employed: a one-to-one constraint, which requires each word to participate in exactly one alignment link, and a cohesion constraint, which restricts the possible link combinations by using an English dependency tree. Two types of context-specific features are fed to the system: (1) adjacency features, which encourage source-language words in close proximity to remain close in the target language; and (2) dependency features, which capture regularities among grammatical relationships between the languages, using the dependency tree. The biggest advantage of this model is that new information can be incorporated modularly by adding features, which is similar to the motivation behind maximum entropy (ME) models. In fact, Cherry and Lin (2003) claimed that the estimates of p(l_k | e_{i_k}, f_{j_k}, C_k) can be improved using an ME model.

Recently, Ittycheriah and Roukos (2005) followed this idea and proposed an ME-based alignment model using a direct formulation of alignments, as in Cherry and Lin (2003). As in the IBM models, it is assumed that each source word is generated by a target word; therefore, each alignment includes J links, where the value of a_j represents the position of the target word e_{a_j} to which the source word f_j corresponds. Then,

p(a|e, f) = p(a_1^J | e, f) = \prod_{j=1}^{J} p(a_j | e, f, a_1^{j-1}) = \frac{1}{Z} \prod_{j=1}^{J} p(a_j | a_{j-1})^{\alpha} \, p(a_j | e, f, a_1^{j-1})^{1-\alpha}

The first term in the product corresponds to the transition model, which tends to keep alignment links close to each other. The second term is the observation model, which measures the linkage of source and target words using a set of feature functions defined on the words and their context. (The formulation of the observation model is similar to the ME model of García-Varea et al. (2002), so it is not repeated here.) The context of a target word e_i is defined as the words e_1, ..., e_i and the WordNet synsets associated with e_i. Five sets of feature functions are employed:

1. Lexical features, which are similar to IBM Model 1 probability estimates,
2. Segmentation features,
3. WordNet features,
4. Spelling-correction features, and
5. Dynamic features, which are related to already established alignment links.

The values of the feature functions are extracted from manually aligned data (nearly 10K sentence pairs). Ittycheriah and Roukos (2005) reported significant improvements over the IBM models and HMM-based alignments.

Another attempt to directly model p(a|e, f) also uses an ME model, but differs from Ittycheriah and Roukos's approach by modeling the alignment directly as a log-linear combination of feature functions (Liu et al., 2005). In the ME framework, there are M feature functions h_m(a, e, f), m = 1, ..., M, and for each feature function there is a model parameter \lambda_m, m = 1, ..., M.
The direct alignment probability is given by

p(a|e, f) = \frac{\exp\left( \sum_{m=1}^{M} \lambda_m \, h_m(a, e, f) \right)}{\sum_{a'} \exp\left( \sum_{m=1}^{M} \lambda_m \, h_m(a', e, f) \right)}

The best alignment \hat{a} is given by

\hat{a} = \arg\max_a \sum_{m=1}^{M} \lambda_m \, h_m(a, e, f)

In order to incorporate a new dependency that contains extra information beyond the bilingual sentence pair, the model above is modified by introducing a new variable v, which corresponds to the set of additional dependencies:

p(a|e, f, v) = \frac{\exp\left( \sum_{m=1}^{M} \lambda_m \, h_m(a, e, f, v) \right)}{\sum_{a'} \exp\left( \sum_{m=1}^{M} \lambda_m \, h_m(a', e, f, v) \right)}

\hat{a} = \arg\max_a \sum_{m=1}^{M} \lambda_m \, h_m(a, e, f, v)

Any dependency between the aligned words can be represented as a feature function in this framework. Liu et al. (2005) used IBM Model 3 alignment probabilities (in both directions), POS tag transition probabilities, and bilingual dictionary coverage as their feature set. Once the model parameters are learned using GIS (Darroch and Ratcliff, 1972), a greedy search is applied to find the best alignment, using a gain function that scores the addition of a link to the set of existing links. It has been shown that a log-linear combination of these features yields a significant improvement over the IBM models; however, most of the improvement comes from using the IBM Model 3 probabilities alone.

Another approach to word alignment is the transformation of the problem into orthogonal non-negative matrix factorization, based on a probabilistic model of the alignment data (Goutte et al., 2004). By introducing a set of hidden nodes h between the two sentences e and f, the alignment of e and f is viewed as the product of an alignment of e to h and an alignment of h to f. In order to use matrix factorization, the following constraints are enforced:

1. Coverage: every word on either side must be aligned to at least one word on the other side, possibly taking NULL words into account.

2. Transitive closure: if f_j is aligned to e_i and e_k, then any f_l aligned to e_k must also be aligned to e_i.

Transitive closure is satisfied by connecting each word to a node in the hidden layer. Coverage is guaranteed by ensuring that each word is connected to at least one node in the hidden layer. For example, general M-to-N alignments correspond to the alignment of multiple English words to one node in the hidden layer, and then the alignment of that hidden node to multiple FL words.

An alignment between e and f may be represented by a J × I alignment matrix A = [a_{ji}], such that a_{ji} > 0 if f_j is aligned to e_i and a_{ji} = 0 otherwise. Similarly, given K hidden nodes, the alignment of words to hidden nodes may be represented by a J × K matrix F and an I × K matrix E, with positive elements indicating the existence of alignment links. An alignment can then be represented as A = F \cdot E^T. Assuming the translation matrix is represented by a J × I matrix M = [m_{ji}], where m_{ji} \geq 0 measures the strength of the association between f_j and e_i, finding a suitable alignment matrix A corresponds to finding factors F and E such that

M \approx F \cdot S \cdot E^T

where S is a diagonal K × K scaling matrix that gives different weights to the hidden nodes. Since F and E contain only positive elements, this is an instance of non-negative matrix factorization. To guarantee proper alignments, each word is associated with exactly one hidden node, and each hidden node is associated with at least one word on each side. Given this propriety constraint, F and E are orthogonal matrices.
Finding the best alignment starting from M therefore reduces the alignment problem to performing a decomposition into orthogonal non-negative factors.

Another commonly used technique for improving alignments is preprocessing the training data. The motivation is that an input source text that is closer to the target text in terms of length and word order is useful for increasing word alignment accuracy. This assumption serves as the basis for approaches that transform the texts by grouping some words into classes, joining or splitting words, and reordering some words (Tillmann et al., 1997). This technique has been shown to produce better alignments than using the original (untransformed) training data. In another study, the effects of lemmatization (splitting words into their root forms and suffixes) and dictionary usage in the training data were investigated for IBM Model 4 (Dejean et al., 2003). Applying the lemmatizer reduces the parameter space for the alignment algorithm by reducing the vocabulary size, while adding bilingual lexicon entries was intended to bias the alignment model parameters. Interestingly, applying the lemmatizer to all the data reduces performance; when lemmatization is applied only to rare words (with varying thresholds), it performs better. Adding dictionary entries did not improve alignment quality much, although slight improvements were obtained on some sets (Och and Ney (2000b) also reported similar results).

2.3 Phrase Level Alignment

Another rapidly growing research area is segment alignment, i.e., the alignment of contiguous textual units longer than one word but shorter than a full sentence. Examples of segments are clauses, phrases, syntax tree fragments, and skeleton sentences. An alignment of this type would be very useful for a variety of applications, including example-based translation, language teaching, and comparative linguistics. However, the problem is extremely hard to solve due to the difficulty of detecting language-specific clause boundaries, the complexity of partial syntactic analysis, and substantial structural differences across languages, even related ones (Véronis, 2000).

Phrasal (or multi-word) alignment is an important capability because it provides a mechanism for characterizing certain types of differences between languages. For example, in English many technical terms are multi-word compounds, while the corresponding terms in other Germanic languages are often single-word compounds. Also, many common adverbials and prepositions are multi-word units, which may or may not be translated as such (for example, between English and Swedish). The usability of bilingual concordances would be greatly improved if phrases could be looked up with the same ease and precision as single words (Macklovitch and Hannan, 1996). However, this is not the case in practice, and identifying phrases and aligning them is crucial for improved alignment quality.

The alignment of multi-word expressions or phrases has been investigated by various researchers. Some approaches use preprocessing only on the source side (Melamed, 1997c; Smadja et al., 1996); target correspondences are then estimated during the linking stage. In other approaches, both the source and target texts are preprocessed independently, and candidate lists of both source and target multi-word units are created to be used in the linking process (Ahrenberg et al., 1998; Och and Weber, 1998).
Phrase-level correspondences have been obtained by coupling phrases of the two languages obtained by CKY parsing (Kaji et al., 1992). An iterative method based on the Dice coefficient has been investigated and yields high recall if partially correct translations are included (Kitamura and Matsumoto, 1996). Similarly, a method involving the Dice coefficient has been used to align collocations between English and French (Smadja et al., 1996).

One of the best-performing phrase alignment systems is a two-level alignment model that systematically considers whole phrases rather than single words as the basis for the alignment models (Och et al., 1999). In the first level, the source sentence f and the target sentence e are decomposed into sequences of phrases, v^f_k, k = 1, ..., K_f, and v^e_k, k = 1, ..., K_e. The translation probability is based on these phrases v^e and v^f instead of the actual words e and f. In the second level, the translation of shallow phrases is modeled using alignment templates, based on an automatic method for the extraction of bilingual classes (Och, 1999). The use of classes instead of actual words has the advantage of better generalization. This alignment template approach produces better translation results than the single-word based approach.

In a recent study, an HMM-based probabilistic model of word-to-phrase alignment was proposed (Deng and Byrne, 2005). In this generative model, each target sentence f = f_1^J is realized as a sequence of K phrases: f = v_1^K. Each phrase in f is generated as the translation of one word in e, which is determined by the alignment sequence a_1^K: e_{a_k} \rightarrow v_k. The length of each phrase is specified by the process \phi_1^K, which is constrained so that \sum_{k=1}^{K} \phi_k = J. To allow phrases to be generated by a NULL word, a binary hallucination sequence h_1^K is defined such that if h_k = 0, the NULL word generates v_k; otherwise e_{a_k} generates v_k. With all these quantities gathered into an alignment a = (\phi_1^K, a_1^K, h_1^K, K), the modeling objective is to realize the conditional distribution p(f, a|e). Under the assumption that p(f, a|e) = 0 if f \neq v_1^K,

p(f, a|e) = p(v_1^K, \phi_1^K, a_1^K, h_1^K, K | e)
          = \underbrace{\epsilon(J|I)}_{\text{sentence length}} \; \underbrace{p(K|I, J)}_{\text{phrase count}} \; \underbrace{p(a_1^K, \phi_1^K, h_1^K | K, I, J)}_{\text{word-to-phrase alignment}} \; \underbrace{p(v_1^K | a_1^K, \phi_1^K, h_1^K, K, I, J)}_{\text{word-to-phrase translation}}

In other approaches, texts are assumed to have structure at many different levels (character, word, or partially ordered bag of lexical units). In one such approach (Ahrenberg et al., 1998), an automatic phrase-extraction program provides bilingual input to a system that uses a combination of K-Vec (Fung and Church, 1994) and a greedy word-to-word algorithm (Melamed, 1997b) to produce the final alignment. The basic algorithm associated with this approach is enhanced by a number of modules:

1. A morphological module that groups expressions that are identical except for a specified set of suffixes,

2. A weight module that adjusts the likelihood of a candidate translation according to its position in the sentence, and

3. A phrase module that includes multi-word expressions generated in the preprocessing stage as candidate expressions for alignment.

Multi-word expressions are linked with relatively high recall, but the precision of these links is not as high as for single words.
Other work on phrase alignment includes structural matching of phrasal texts (Matsumoto et al., 1993), phrasal translation example extraction (Wu, 1995), identification of phrasal sequences (Wang, 1998; Och et al., 1999), and synchronous parsing of two texts for phrase identification and alignment, including collocations and recurrent strings (Wu, 1997; Alshawi et al., 2000; Bangalore and Riccardi, 2001).

2.4 Combining Word Alignments

One of the major problems with the IBM models (Brown et al., 1993) and the HMM models (Vogel et al., 1996) is that they are restricted to aligning each source-language word to at most one target-language word. For example, when aligning German-English texts with German as the source language, the frequently occurring German word compounds cannot be aligned correctly, as they typically correspond to two or more English words. The standard method to overcome this problem is to run the model in both directions (interchanging the source and target languages) and apply heuristic combination techniques to produce a refined alignment (Och and Ney, 2000b; Koehn et al., 2003). The motivation behind the refined alignment method is that the alignment links form clusters for most language pairs, i.e., when there is an alignment link, there is usually another one in its neighborhood.

For combining two sets of alignments A_1 and A_2, the straightforward solutions are to take the union, A = A_1 \cup A_2, or the intersection, A = A_1 \cap A_2. In addition to these two methods, Och and Ney (2000b) introduced a refined combination of two sets of alignments A_1 and A_2. In a first step, the intersection A = A_1 \cap A_2 is determined. Then A is extended by adding alignment links (i, j) occurring only in A_1 or A_2 if

1. Neither f_j nor e_i has an alignment in A, OR

2. Both of the following conditions hold:

   • The link (i, j) has a horizontal or vertical neighbor, i.e., (i-1, j), (i+1, j), (i, j-1), or (i, j+1), that is already in A.

   • The set A \cup \{(i, j)\} does not contain alignments with both horizontal and vertical neighbors.

In a later study, Koehn et al. (2003) proposed another refined alignment approach called grow-diag-final (sketched in code below). This method extends the intersection of two alignments A_1 and A_2 by adding alignment links (i, j) occurring only in A_1 or A_2 if:

1. Diag-step: neither f_j nor e_i has an alignment in A, AND the alignment link (i, j) has a neighbor in A. In contrast to Och and Ney's refined alignment method, diagonal adjacencies are considered part of the neighborhood.

2. Final-step (after no more links can be added in the diag-step): either f_j or e_i does not have an alignment in A.

Of these three methods, the intersection of the two alignments consists of only one-to-one alignment links and yields a high-precision, low-recall alignment. Conversely, the union of the two alignments yields a high-recall, low-precision alignment. The refined alignment methods attempt to find a balance between the intersection and the union by starting with the more reliable alignment links and then adding less reliable but highly probable links (i.e., links that occur in one of the alignments and have another alignment link in their neighborhood). As stated by Och and Ney (2003), whether higher precision or higher recall is preferred depends on the final application for which the word alignment is intended. For example, in statistical machine translation, higher recall is deemed to be more important.
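The following is a minimal, simplified sketch of the combination strategies just described, treating each alignment as a set of (i, j) index pairs with i indexing the English sentence and j the foreign one. It follows the conditions as summarized above rather than any particular toolkit's implementation, so details may differ from the exact GIZA++- or Moses-style code; the union and intersection baselines correspond directly to `a1 | a2` and `a1 & a2`.

```python
def grow_diag_final(a1, a2):
    """Symmetrize two directional alignments (sets of (i, j) links).
    Starts from the intersection, grows it with neighboring links from the
    union (diagonal neighbors included), then applies the final step."""
    neighbors = [(-1, 0), (1, 0), (0, -1), (0, 1),
                 (-1, -1), (-1, 1), (1, -1), (1, 1)]
    union = a1 | a2
    a = set(a1 & a2)                       # high-precision starting point

    def e_aligned(i): return any(ii == i for ii, _ in a)
    def f_aligned(j): return any(jj == j for _, jj in a)

    added = True
    while added:                           # diag-step
        added = False
        for (i, j) in sorted(union - a):
            if (not e_aligned(i) and not f_aligned(j)
                    and any((i + di, j + dj) in a for di, dj in neighbors)):
                a.add((i, j))
                added = True
    for (i, j) in sorted(union - a):       # final-step
        if not e_aligned(i) or not f_aligned(j):
            a.add((i, j))
    return a

# Example: combine two directional alignments of a three-word sentence pair.
a_ef = {(0, 0), (1, 1), (2, 2)}            # English -> foreign direction
a_fe = {(0, 0), (1, 1), (1, 2)}            # foreign -> English direction
symmetrized = grow_diag_final(a_ef, a_fe)  # {(0, 0), (1, 1), (2, 2)}
```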
In lexicography applications, on the other hand, higher precision would be preferred. Koehn et al. (2003) claimed that choosing the right heuristic for refining alignments is more important than the choice of model for the initial word alignments. In this thesis, grow-diag-final is used as the baseline in Chapters 3, 4, 5, and 6. Throughout this thesis, the notation Aligner(gdf), e.g., GIZA++(gdf), represents the alignment generated by the grow-diag-final method from the outputs of Aligner in the two directions (i.e., the first input to the grow-diag-final method is the alignment generated by Aligner using English as the source language, and the second input is the alignment generated by Aligner using the FL as the source language).

A recent attempt to symmetrize two alignments obtained by training the same model in two different directions treats the alignment problem as the task of finding the edge cover with minimal cost in a bipartite graph, where the cost of aligning a specific target and source word is computed using the parameters of the input model (Matusov et al., 2004). Once the state occupation probabilities have been computed for each input model in both directions, the cost of a link is defined as

c_{ij} = \sum_{m=1}^{M} \lambda_m \, h_m

where h_m is the negated logarithm of the state occupation probability for each model and direction. To obtain a more symmetric estimate of the costs, the state occupation probabilities are interpolated log-linearly. For instance, to symmetrize the alignments generated by training a specific model (e.g., IBM Model 4) in two different directions, the following cost function can be used:

c_{ij} = \alpha \, (-\log p_j(i|e, f)) + (1 - \alpha) \, (-\log p_i(j|e, f))

Additional feature functions can be included in the computation of c_{ij}. Given an I × J cost matrix C, finding the best alignment is equivalent to finding a minimum-weight edge cover (or a maximum-weight bipartite matching) in a complete bipartite graph, where the two node sets correspond to the source and target sentence positions, respectively, and the costs of the edges are the elements of C. Matusov et al. (2004) reported slightly better results than the heuristic-based combination methods on two different language pairs.

2.5 Applications

Word-level alignment of parallel text corpora is a critical capability for a wide range of NLP applications. In statistical MT, translation models rely directly on word alignments (Brown et al., 1993; Vogel et al., 1996; Och and Ney, 2003; Koehn et al., 2003). The extraction (or construction) of bilingual lexicons is a direct application of word alignments, as shown in various studies (Gale and Church, 1991b; Church and Gale, 1991; Fung and Church, 1994; Dagan and Church, 1994; Melamed, 1997c). In addition, word-level alignments are used for:

1. Automatic generation of transfer rules (or mappings) for MT (Menezes and Richardson, 2001; Carbonell et al., 2002);

2. Word-sense disambiguation (Brown et al., 1991a; Gale et al., 1992, 1993; Chang et al., 1996; Diab and Resnik, 2002; Bhattacharya et al., 2004);

3. Projection of resources (such as morphological analyzers, part-of-speech taggers, and parsers) from a resource-rich language into resource-poor languages (Hwa et al., 2002; Yarowsky et al., 2001); and

4. Cross-language information retrieval (Fluhr, 1995; Oard and Dorr, 1996).

The quality of word-level alignments plays a crucial role in the success of these applications.
For example, in statistical MT, it has been shown that improved word alignment directly affects the output quality of statistical MT systems (Och and Ney, 2003; Callison-Burch et al., 2004).11 2.6 Evaluation of Word Alignments Nearly all word alignment evaluation techniques compare generated word align- ments with the manually annotated alignments. However, it is well known in NLP that manually performing a word alignment is a complicated and ambiguous task, even for humans (Melamed, 1998). The major difficulty is that alignments are not always 1-to-1, especially between languages that are structurally different. In fact, it could be argued that, ultimately, text alignment is no easier than the more general problem of natural language understanding (Langlais et al., 1998). There are various factors that affect word alignment evaluation (Ahrenberg 11It is worth noting that the impact of improved word alignments on MT quality is a subject of debate. Other researchers have proposed techniques that improved word alignments in terms of alignment evaluation metrics but the impact of those improved alignments on MT output have been shown to be relatively smaller (Koehn et al., 2003; Ittycheriah and Roukos, 2005). In Chapter 7 of this thesis, it will be shown that the alignment improvement techniques yield a huge error reduction on word alignments but very little improvement on MT output in terms of the well-known BLEU metric. 63 et al., 2000; Merkel and Ahrenberg, 1999): 1. The purpose of the alignment system. 2. Unit of comparison. (Is it a word-level comparison or are multi-word units included? Are all words included or are some set of words excluded, such as functional words or frequent words?) 3. Resources used. 4. The use of a gold standard. (Is this prepared before the actual alignment or do the experts evaluate a sample of the output after the alignment? Is a com- plete alignment of the sample generated or are only a subset of words/phrases aligned? What is the size of the gold standard and what is the distribution of the samples?) 5. Metrics and scoring method. (How are partial alignments measured?) 6. Error analysis. (What is the nature of the mistakes that a particular system makes?) Once the word alignments are produced manually, it is easier to evaluate the quality of any word alignment with respect to this reference set. This evalu- ation can be performed automatically and it results in a very precise and reliable evaluation criterion. If the text only consists of single words, the straightforward solution is to use the usual precision and recall metrics in information retrieval, by viewing word alignment as a retrieval problem (i.e., find all correspondences 64 at the lexical level that exist in a given parallel text or corpus). The precision gives the proportion of segments in the proposed alignment that is considered to be correct. The recall gives the coverage of the alignments in the gold standard by the proposed alignment. However, these metrics may not be suitable when align- ments are not one-to-one, for instance when collocations are involved, as well as deletions, insertions, segmentation errors and paraphrases. In a simple approach, the scoring for precision and recall can be adjusted to handle partial alignments by using weighted score (Merkel and Ahrenberg, 1999). Some approaches to evaluation of word alignment are: ? Measuring results relative to a gold standard, using either a sample of con- tinuous text (Melamed, 1998) or spot checks (V?eronis, 1998). ? 
Evaluating the type of links created by a full-text alignment system as a bilingual dictionary and measuring recall and precision based on a sample of this dictionary (Ahrenberg et al., 1998).

• Measuring precision only on the n highest-ranked candidates suggested by the system (Kitamura and Matsumoto, 1996; Gaussier, 1998).

The state-of-the-art method for evaluating word alignments is to measure precision and recall at the level of alignment links (pairs of word indices) instead of words (Langlais et al., 1998; Simard and Plamondon, 1998; Véronis, 1998). Assuming an alignment A = \{a_1, a_2, ..., a_m\} and a reference alignment G = \{g_1, g_2, ..., g_n\}, where each a_i and g_i is a pair of word indices, precision and recall can be computed at the level of alignment links. The advantage of this approach is that partial alignments of multi-word units are rewarded.

A widely used strategy is to divide the alignments into two sets: Probable (P) alignments and Sure (S) alignments. The P relation is used especially to align words within idiomatic expressions, free translations, and missing function words. Note that it is assumed that S \subseteq P. Using A and G as the alignment and the reference alignment, A = A_P \cup A_S and G = G_P \cup G_S. Corresponding measures of recall and precision are defined (Mihalcea and Pedersen, 2003), where T is the alignment type (either P or S):

Recall_T = \frac{|A_T \cap G_T|}{|G_T|} \qquad Precision_T = \frac{|A_T \cap G_T|}{|A_T|} \qquad Fscore_T = \frac{2 \cdot P_T \cdot R_T}{P_T + R_T}

Under the same settings, the alignment error rate (AER) is defined as follows (Och and Ney, 2000a):

AER = 1 - \frac{|A \cap G_S| + |A \cap G_P|}{|A| + |G_S|}

When there is only one category of alignments in the gold standard G (i.e., no distinction is made between "sure" and "probable" alignments), the formula generalizes to the following:

AER = 1 - \frac{2 \cdot |A \cap G|}{|A| + |G|} = 1 - Fscore

AER is heavily biased toward sure alignments. If an alignment A contains exactly the set of sure alignments in the gold standard, it will yield an AER of 0, even if it misses many of the probable links. Therefore, if the majority of the links in the gold standard are probable, it may not be a good idea to evaluate word alignments using AER, especially if the recall of the alignments is important for the end application (Goutte et al., 2004).

One of the important decisions in word alignment evaluation is whether to include or exclude links involving unaligned words (i.e., links where a word is aligned to the NULL word). In one setting, each word is required to belong to at least one alignment link; if a word does not belong to any link, a NULL Probable alignment is assigned by default. This set of evaluations pertains to full-coverage word alignments. In another setting, all NULL alignments are removed from both A and G (Mihalcea and Pedersen, 2003). It has been common practice in the NLP community to evaluate without NULL alignments and to report the precision of probable alignments, the recall of sure alignments, and the alignment error rate. This is the strategy adopted in this thesis, as we will see in Chapters 3, 4, 5, and 6.

The evaluation metrics discussed above are useful for studying the accuracy of an alignment. However, it may be more informative to evaluate alignments in a larger context. While the figures obtained with these metrics are informative, more telling figures can only be obtained by measuring the effect of the alignment system on some specific task (Ahrenberg et al., 1998). For example, word alignments have been evaluated by computing their effect on bilingual lexicon extraction (Véronis, 1998; Ahrenberg et al., 2000).
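Returning briefly to the link-level metrics defined above, the following is a minimal sketch of the commonly reported quantities: precision against the probable links, recall against the sure links, and AER, with all alignments given as sets of (i, j) index pairs. The toy gold standard is hypothetical.

```python
def alignment_scores(a, sure, probable):
    """Link-level evaluation of a hypothesized alignment `a` against a gold
    standard with sure links `sure` and probable links `probable` (S is a
    subset of P). All arguments are sets of (i, j) word-index pairs."""
    precision = len(a & probable) / len(a) if a else 0.0
    recall = len(a & sure) / len(sure) if sure else 0.0
    aer = 1.0 - (len(a & sure) + len(a & probable)) / (len(a) + len(sure))
    return precision, recall, aer

# Toy gold standard: two sure links plus one additional probable link.
sure = {(0, 0), (2, 2)}
probable = sure | {(1, 1)}
hypothesis = {(0, 0), (1, 1), (2, 3)}
p, r, aer = alignment_scores(hypothesis, sure, probable)
# p = 2/3, r = 1/2, aer = 1 - (1 + 2) / (3 + 2) = 0.4
```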
The evaluation metrics discussed above are useful for studying the accuracy of an alignment in isolation. However, it may be more informative to evaluate alignments in a larger context: more telling figures can be obtained by measuring the effect of the alignment system on some specific task (Ahrenberg et al., 1998). For example, word alignments have been evaluated by computing their effect on bilingual lexicon extraction (Véronis, 1998; Ahrenberg et al., 2000). Another approach is to evaluate statistical alignments by looking at the quality of the corresponding MT output. Word alignments can also be used to project parse trees from one language to another (Yarowsky et al., 2001). In this case, word alignments can be evaluated by comparing the projected trees against a manually treebanked corpus (Goodman, 1996; Hwa et al., 2002) or by examining parser output after training a parser on the projected trees (Carroll et al., 1998). In Chapter 7 of this thesis, word alignments are evaluated within a phrase-based machine translation system to investigate the effects of improved alignments inside another application.

2.7 Discussion

As described in Chapter 1, current statistical word alignment systems are faced with five important challenges:

1. Complexity of the word alignment problem,

2. Lack of enough training data (in the face of limited linguistic resources),

3. Learning statistics correctly,

4. Translation divergences, and

5. Lack of a means for incremental incorporation of linguistic knowledge.

The rest of this section outlines how these five challenges have contributed to a number of shortcomings in existing alignment systems.

The complexity of the word alignment problem forces statistical systems to constrain the search space of hypotheses by employing certain assumptions, or biases, about the hypothesis to be learned from the training data. For example, the IBM models and their variants (Brown et al., 1993; Och and Ney, 2003) and HMM models (Vogel et al., 1996; Tillmann et al., 1997) restrict source-language words to be aligned to at most one target word. Other approaches take this restriction even further by allowing only one-to-one alignments (Cherry and Lin, 2003; Goutte et al., 2004). As a result of these deficiencies in modeling alignments, such systems fail to generate some subset of the true word alignments. Alignment combination approaches eliminate some of the problems related to the biases of existing systems (Och and Ney, 2000b; Koehn et al., 2003; Matusov et al., 2004), but these approaches are based on simple heuristics and often provide little improvement over existing systems (a small illustration of such heuristics appears below).

The lack of linguistically annotated data impairs the ability of alignment systems to capture linguistic generalizations. This has led to a tendency toward statistical systems that use large parallel texts without any linguistic knowledge of the languages involved. Several researchers have shown that more training data yields better word alignments (Och and Ney, 2003; Koehn et al., 2003), yet it is unclear how one determines how much data is sufficient for different language pairs. The recent trend in statistical systems is to incorporate all the data that is available and let the system take care of redundant or noisy data (Och, 2005). Ultimately, with billions and billions of sentences, this is equivalent to memorizing all possible word (or phrase) correspondences so that there is nothing left unseen for any given test set. In practice, this is nearly impossible to achieve, at least for a majority of language pairs. That is, there will always be language pairs for which only a limited amount of data is available. Moreover, statistical systems are still susceptible to their biases in modeling alignments; therefore, it is highly unlikely that they will produce 100% correct alignments even with infinite data.
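As a concrete illustration of the simple combination heuristics referred to above, the following sketch (an assumed intersection/union symmetrization in the style of Och and Ney (2000b) and Koehn et al. (2003), not code from this thesis) combines two directional alignments represented as sets of (English index, foreign index) links:

```python
def intersect(e2f, f2e):
    """High-precision combination: keep only the links proposed in both directions."""
    return e2f & f2e

def union(e2f, f2e):
    """High-recall combination: keep every link proposed by either direction."""
    return e2f | f2e

# Toy example.  Links are (English index, foreign index) pairs.  A one-to-many
# correspondence (English word 0 should link to foreign words 0 and 1) cannot be
# produced by a directional model that allows at most one foreign link per
# English word, but the reverse model can propose both links.
e2f = {(0, 1)}
f2e = {(0, 0), (0, 1)}
print(intersect(e2f, f2e))   # {(0, 1)}
print(union(e2f, f2e))       # both links survive (set print order may vary)
```

The intersection favors precision and the union favors recall; neither recovers a link that both directional models fail to propose, which is one reason such heuristics often yield only limited improvement.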
The rareness of some words, coupled with the too-frequent occurrence of other words in the training data, makes it very difficult to choose which statistics to use (Dunning, 1993; Moore, 2004). Moreover, most expressions in a language are only "semi-frozen" and can still undergo a number of linguistic operations (such as inflection, insertion of adjectives and adverbs, conversion to the passive voice, etc.). For example, one of the major problems with the IBM models (Brown et al., 1993) and HMM models (Vogel et al., 1996) is their tendency to align rare words to several words on the other side in an attempt to reduce the number of unaligned words. The problem with function words is more dramatic: statistical systems are often unable to pick out the correct word correspondences for function words, because function words occur several times in nearly every sentence and statistical models are based on co-occurrence statistics. Such shortcomings of the models can be addressed by changing the way the alignments are modeled (Moore, 2004; Tillmann et al., 1997; Och and Ney, 2000a). The rareness of words due to morphological distinctions between languages can be addressed using stemming or data transformation techniques (Tillmann et al., 1997; Dejean et al., 2003). To solve the problem with the alignment of function words, recent studies focus on the alignment of phrases rather than words (Marcu and Wong, 2002; Deng and Byrne, 2005). However, the dependence on statistics collected from training data remains an issue for all of these approaches.

Translation divergences, i.e., structural differences between two languages (Dorr et al., 2002), have a significant impact on alignment systems. Divergences occur when the underlying meaning of a set of words in one language is distributed over a set of words with different syntactic and lexical semantic properties in the other language; e.g., the English verb fear corresponds to tener miedo de (have fear of) in Spanish. The most common types of alignment errors related to divergences occur when a word in one language is translated into a phrasal structure with the addition or deletion of function words (i.e., conflational, inflational and structural divergences) or into words that have different parts of speech (i.e., categorial divergence). Chapter 3 demonstrates that translationally divergent word pairs are the most significant factor contributing to statistical alignment errors. The current-best approach to handling translation divergences aligns phrases directly rather than aligning words and extracting phrases from word-aligned corpora (Och et al., 1999; Marcu and Wong, 2002; Deng and Byrne, 2005), but these approaches are not very successful at capturing long-distance dependencies.

The lack of a means for incremental incorporation of linguistic knowledge into statistical systems has resulted in a proliferation of "build-everything-from-scratch" approaches. Several researchers have demonstrated that linguistic knowledge such as POS tags, dependency relations, bilingual lexicons, and morphological analyzers improves word alignment (Toutanova et al., 2002; Cherry and Lin, 2003; Ittycheriah and Roukos, 2005).
Current approaches to injecting linguistic knowledge into word alignments include changing the model to include dependencies on the linguistic features of the words (Toutanova et al., 2002; Cherry and Lin, 2003) and representing linguistic knowledge as feature functions in a maximum entropy model (García-Varea et al., 2002; Liu et al., 2005; Ittycheriah and Roukos, 2005). There are two problems with this "build-everything-from-scratch" approach: first, the resulting system may lose valuable information that is inherent in the architecture of previous systems, even when the new system produces better alignments; second, it is quite difficult to assess the usefulness of different pieces of linguistic knowledge.

The remainder of this thesis addresses these challenges, demonstrating that it is possible to improve alignments by constructing techniques for detecting and correcting errors in existing word alignments. Chapter 3 introduces a framework called DUSTer that addresses the challenges of translation divergences and lack of training data. Chapter 4 addresses the first four challenges in a system called ALP that identifies and corrects frequent alignment errors automatically and conditions the alignment rules on linguistic knowledge. In Chapters 5 and 6, the shortcomings of existing systems due to the first four challenges above are handled by taking advantage of multiple alignments (Multi-Align) and by using linguistic knowledge in the form of feature functions during the combination step (NeurAlign).

Chapter 3

DUSTer: Divergence Unraveling for Statistical MT

This chapter describes the first of two rule-based approaches to the alignment problem, a system called DUSTer (Divergence Unraveling for Statistical Translation) that uses parallel texts for divergence unraveling in word-level alignment. DUSTer combines linguistic and statistical knowledge to resolve structural differences between languages, i.e., translation divergences, during the process of alignment. The goal is to improve word-level alignments produced by existing state-of-the-art statistical systems using linguistic knowledge.

Early alignment algorithms incorporated linguistic cues to address standard alignment issues, e.g., one-to-many/many-to-one mappings, but such systems were often narrow in their coverage. Statistical algorithms (Och and Ney, 2003) are more broadly applicable, requiring only that a large parallel corpus exists for the languages in question, but they are not able to accommodate certain types of MT phenomena. For instance, statistical aligners are incapable of handling complex phrasal constructions because dependencies are not captured between non-consecutive, widely distributed words.

At the heart of the problem is the lack of linguistic knowledge about structural differences between languages, e.g., the English verb fear vs. Spanish tener miedo de (have fear of). A statistical word aligner usually aligns fear to miedo, leaving the main verb tener unaligned, because tener and de are frequent enough in the Spanish corpus to be aligned to a variety of other high-frequency words. Learning statistics from a huge corpus is not sufficient to align words correctly in such cases. In DUSTer, this deficiency is addressed by relating one or more linguistically-motivated categories associated with the (English) input words to those of another language (henceforth, foreign language, or FL); the resulting match sets are used to infer corrected alignments.
DUSTer employs a set of rules to identify and handle translation divergences. The rules utilize the dependency relationships between words, the POS tags of the words, and a set of related semantic-based classes to identify the places where two sentences are divergent. The application of the rules relies on the existence of alignment links between some of the words in the rule. If a rule is found to be applicable, DUSTer adds additional alignment links as indicated by the rule. The rules can be tailored to any language easily once the divergences between two languages are identified.[1]

Footnote 1: DUSTer served as an initial study to test the feasibility of alignment corrections for certain divergence types. The goal in this thesis is to move toward automatic construction of these rules. Chapter 4 presents a machine learning approach to demonstrate how similar rules can be generated automatically.

The rest of this chapter describes different types of translation divergences and relates them to the word alignment errors made by statistical alignment systems. Then an approach to resolving translation divergences is described. The chapter concludes with a presentation of a set of experiments on English-Spanish data.

3.1 Translation Divergences

Divergences between languages occur when the underlying meaning of a set of words in one language is distributed over a set of words with different syntactic and lexical semantic properties in the other language. Translation divergences can be divided into 5 groups (Dorr et al., 2002):

Categorial Divergence: A categorial divergence is the translation of words in one language into words that have different parts of speech, as in the translation of an adjective into a noun. In the following examples, the adjectival phrase is translated into a light verb accompanied by a nominal version of the adjective:

    to be jealous → tener celos (to have jealousy)
    to be fully aware → tener plena conciencia (have full awareness)

Conflational/Inflational Divergence: A conflation is the translation of two or more words in one language into one word in another language. Inflation is the reverse image of conflation. Common forms of this divergence type are the light-verb construction and manner conflation. The light-verb construction is the translation of a single verb in one language into a combination of a semantically "light" verb and some other meaning unit (maybe a noun or a preposition). For example,

    to kick → dar una patada (give a kick)
    to end → poner fin (put end)

Manner conflation is the translation of a single manner verb (e.g., float) in one language into a light verb of motion and a manner in the other language. Some examples in English-Spanish are:

    to float → ir flotando (go (via) floating)
    to pass → ir pasando (go passing)

Structural Divergence: A structural divergence is the realization of verb arguments in different syntactic configurations in different languages, e.g., the realization of incorporated arguments (such as subject and object) as obliques (i.e., headed by a preposition in a PP). Some English-Spanish examples are:

    to enter the house → entrar en la casa (enter in the house)
    ask for a referendum → pedir un referéndum (ask-for a referendum)

Head Swapping Divergence: Head swapping is the inversion of a structural dominance relation between two semantically equivalent words when translating from one language to another. An example is the demotion of the head verb and the promotion of one of its modifiers to the head position.
For example, an English motion verb and a preposition are translated as a directed motion verb and a progressive verb in Spanish:

    to run in → entrar corriendo (enter running)
    fly about → andar volando (go-about flying)

Thematic Divergence: A thematic divergence occurs when a verb's arguments are realized in syntactic configurations that reflect different thematic hierarchies (thematic-to-syntactic mapping orders). For example, the experiencer and theme of a verb can be realized as subject and object in one language and as dative (object of to) and subject in another language.

    I like grapes → Me gustan uvas (to-me please grapes)
    I have a headache → Me duele la cabeza (to-me hurt the head)

Previous work on divergences (Dorr et al., 2002; Habash and Dorr, 2002) showed that at least 10% of the sentence pairs in Arabic/English and Spanish/English are divergent in some way. In another study of 2K human-confirmed Spanish/English divergence pairs, 98% of the pairs contained categorial divergences, 83% contained conflational divergences (Light-Verb and Manner conflations), 35% contained structural divergences, 8% contained head swapping divergences, and 6% contained thematic divergences. These numbers clearly indicate that different divergence types frequently co-occur. For instance, categorial divergence co-occurs with almost all other divergence types. To illustrate the co-occurrence of divergences, consider the following English and Spanish sentences:

    Different politicians please Maria.
    Maria tiene gustos de políticos diferentes.

There are four co-occurring divergences in this example: categorial (please is a verb in English but gusto is a noun in Spanish), conflational (please is conflated to tener gusto), thematic (Maria and politicians switch thematic-to-syntactic realization order), and structural (politicians is an argument in English but an oblique in Spanish). This indicates that it is important to handle divergences concurrently rather than tackling each divergence one at a time.

Examples of the type given above are what motivate the idea of divergence unraveling for inducing word-alignment improvements. Divergence handling requires careful attention to distinctions in both content and structure. Historically, MT approaches have achieved this by means of transfer rules (Han et al., 2000; Lavoie et al., 2000) or complex interlinguas (Dorr, 1993). However, these techniques rely heavily on resources that are difficult to obtain, e.g., annotated parallel treebanks, parses, or large, complex, manually checked lexicons. In the divergence-unraveling approach of this chapter, resources on the FL side are kept minimal; knowledge-intensive resources are required only in English.

3.2 Relating Alignment Errors to Divergence Types

This section examines the relation between the alignment errors made by a state-of-the-art alignment system (GIZA++ (Och and Ney, 2003)) and the linguistic categories (syntactic and semantic) associated with translation divergences. The results of this analysis validate the use of a set of linguistically-motivated universal rules described in Section 3.3.2.

Table 3.1 presents an alignment example (between English and Spanish) taken from a development set of 99 randomly-selected sentence pairs.[2] Each row shows all alignment links for a given English word. Two different alignments (one by a human and one by GIZA++) are shown in Spanish.
Certain additional features, e.g., part-of-speech (POS) labels and semantic word classes (such as Psych Verbs, abbreviated as PsyV, and Directional Prepositions, abbreviated as DirP), are also shown here.

Footnote 2: The development set was selected from a mixture of Bible and UN Corpus sentence pairs. There is no limitation on sentence length: the length of the English sentences is between 7 and 46 words, while the length of the Spanish sentences is between 6 and 50 words. The average English sentence length is 25, while the average Spanish sentence length is 27. For all 99 sentences, the English/Spanish word pairs that are misaligned were extracted for the current analysis.

Table 3.1: Human and GIZA++ Alignments Between an English and Spanish Sentence

    English Word         Human                              GIZA
    1. Domenech/Noun     [1.el/FuncD 2.dirigente/Noun]      [1.el/FuncD 2.dirigente/Noun]
    2. also/Adv          [3.también/Adv]                    [3.también/Adv]
    3. complained/Verb   [4.se/FuncN 5.quejó/PsyV]          [5.quejó/PsyV]
    4. about/Oblique     [6.de/DirP,Oblique]                [6.de/DirP,Oblique]
    5. Argentine/Adj     [8.argentinos/Adj]                 [8.argentinos/Adj]
    6. officials/Noun    [7.funcionarios/Noun]              [7.funcionarios/Noun]

Analysis of the GIZA++ alignments reveals that all alignment errors fall into one of 3 categories:[3]

Footnote 3: (i,j) is used to represent the alignment link between e_i and f_j.

1. Type 1 (Missing Links): For a given English word e_i, GIZA++ omits one or more links (i,j) specified in the human alignment. Alternatively, for a given FL word f_k, GIZA++ omits one or more links (i,k) specified in the human alignment.

2. Type 2 (Added Links): For a given English word e_i, GIZA++ includes one or more links (i,k) not specified in the human alignment.

3. Type 3 (Replaced Links): For a given English word e_i, GIZA++ substitutes one or more links (i,j) for all links (i,k) specified in the human alignment.[4]

Footnote 4: Although this error could be viewed as a combination of the missed link in Type 1 errors and the added link in Type 2 errors, it is counted as a separate error type for this analysis.

Figure 3.1 illustrates these alignment errors graphically, where dashed lines indicate human alignment links missed by GIZA++, solid lines indicate human alignment links matched by GIZA++, and crossed-out solid lines indicate GIZA++ alignment links not matched by the human alignment.

Figure 3.1: Graphical Illustration of Three Types of Alignment Errors

Table 3.2: Number and Percentage of Alignment Errors by GIZA++ (on English-Spanish)

    Error Type          Number   Percentage   Example
    1. Missing links    833      79.7%        fear → tener miedo
    2. Added links      122      11.7%        foreign → relaciones exteriores
    3. Replaced links   90       8.6%         stolen → apacientan
    Total               1045     100.0%

Table 3.2 shows the error rate associated with each of these types (for the development set) and their relation to the divergence categories. Missing alignment links are the most frequent alignment error introduced by GIZA++, accounting for 79.7% of the cases. These are precisely the cases that correspond to the divergence classes specified above, most notably conflational, inflational and structural divergences, where a single word in one language corresponds to multiple words in the other language.
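The three error types can be made concrete with a small sketch (an illustration, not the analysis code used for this thesis) that compares a system alignment with a human reference on the English side only; Type 1 errors involving unaligned FL words would need the symmetric check as well:

```python
from collections import defaultdict

def classify_errors(human, system):
    """Classify link differences per English word into the three error types
    described above.  Both inputs are iterables of (i, j) pairs, where i is an
    English word index and j a foreign-language word index."""
    h, s = defaultdict(set), defaultdict(set)
    for i, j in human:
        h[i].add(j)
    for i, j in system:
        s[i].add(j)
    missing, added, replaced = [], [], []
    for i in set(h) | set(s):
        h_i, s_i = h[i], s[i]
        if h_i and s_i and not (h_i & s_i):
            # Type 3: the system's links for e_i entirely substitute the human links
            replaced.append((i, sorted(h_i), sorted(s_i)))
        else:
            missing += [(i, j) for j in h_i - s_i]   # Type 1: human links the system omits
            added += [(i, j) for j in s_i - h_i]     # Type 2: system links not in the reference
    return missing, added, replaced

# Example from Table 3.1: the human aligns "complained" (e_3) to both "se" (f_4)
# and "quejó" (f_5), while GIZA++ links it only to "quejó": a Type 1 error.
print(classify_errors({(3, 4), (3, 5)}, {(3, 5)}))   # ([(3, 4)], [], [])
```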
On the other hand, Added and Replaced alignment links occur in the much rarer cases where the statistical alignment is tripped up not by issues concerning translation divergences, but by idiosyncratic, domain-specific statistical correspondences from the training corpus.

Further analysis of the errors above allows us to identify the linguistic categories of words that are most likely to be associated with alignment errors. Table 3.3 provides a count of the semantic classes (Complement, DirectionP, FunctionalDet, FunctionalNoun, Negation, Oblique, Sem-Class Verbs) and POS labels (Adjective, Adverb, Determiner, Noun, Pronoun, Other Verbs) associated with the English word(s) involved in missing/added/replaced links.[5]

Footnote 5: The numbers in Table 3.3 reflect double counting in some cases: if an alignment link that is missed for a word is added to another word, it is counted as one missed and one added link.

Table 3.3: Semantic and Syntactic Categories of Words from Missing, Added and Replaced Links (on English-Spanish)

    Linguistic Category   Missing   Added   Replaced   Example
    Adjective             34        14      5          big
    Adverb                20        2       3          very
    Complement            38        6       4          that
    Determiner            11        3       -          the
    DirectionP            47        2       4          to
    FunctionalDet         99        2       9          a
    FunctionalNoun        88        13      4          he
    Negation              8         1       -          not
    Noun                  141       34      14         book
    Oblique               134       7       18         on
    Pronoun               24        5       5          it
    Sem-Class Verbs[6]    85        17      9          run
    Other Verbs           104       15      15         have
    TOTAL                 833       121     90

Footnote 6: Sem-Class Verbs refer to verbs that occur in widely recognized semantic classes, specifically Aspectual, Change of State, Directional, Light, Locational, Modal, Motion, Psych, and Tense. See Section 3.3.1 for more discussion.

Table 3.3 suggests the following issues with GIZA++ alignments:

• GIZA++ handles Obliques poorly (most commonly associated with categorial and structural divergences), leaving them unaligned most of the time. For example, in the translation of fear to tener miedo de, the Oblique de is left unaligned.

• Alignment of verbs (most commonly associated with conflational, head swapping, and thematic divergences) is one of the biggest challenges for GIZA++. For example, in the translation of fear to tener miedo de, the Light-Verb tener is left unaligned.

• Functional determiners and functional nouns (most commonly associated with categorial and structural divergences) are another source of misalignment that arises quite frequently. For example, in the translation of complained to se quejó, the Functional-Noun se is left unaligned.

The divergence-unraveling approach addresses these issues more specifically.

3.3 DUSTer: System Description

This section describes the DUSTer system and demonstrates how the system handles divergences effectively in the context of word-level alignment. Previous work demonstrated that mapping words into word classes (Och and Weber, 1998) and grouping/splitting words (Tillmann et al., 1997) are techniques that have proved useful for the task of statistical alignment. DUSTer takes this one step further by applying a set of general, linguistically-motivated rules to induce alignment improvements.

The key idea behind DUSTer is to relate one or more linguistically-motivated categories associated with the English input words to those of the FL; the resulting match sets are used to infer corrected alignment links. To achieve this, DUSTer relies heavily on English resources and a set of rules for identifying and resolving common divergence types between languages.

Figure 3.2: Components of DUSTer
Figure 3.2 shows the major components of DUSTer (inside the dotted rectangle). The input to DUSTer is an English sentence, a foreign-language sentence, a word alignment between those two sentences, a dependency tree corresponding to the English sentence, and a set of universal rules for handling translation divergences (which will be described in Section 3.3.2 in detail). The initial alignments may be produced by any existing word alignment system. The dependency tree may be produced by a standard parser.

First, the English dependency tree is preprocessed (adding certain linguistic features of the words) to produce an enhanced dependency tree. After this, the enhanced tree is passed to a universal rule-application module, along with the original sentences and the initial alignment. Application of the rules results in a list of match sets, i.e., the indices of words matching both the English and the FL components of the universal rules. These match sets are then used to infer a final set of partial alignments. These partial alignments may be combined with alignments produced by other algorithms to induce an overall improved result.

The remainder of this section presents the resources that are necessary for the application of DUSTer to a language pair (i.e., the parameters and universal rules) and then illustrates the application of DUSTer to alignment correction.

3.3.1 Parameters

DUSTer assumes that certain types of words are grouped together into parameter classes based on semantic-class knowledge, e.g., classes of verbs including Aspectual, Change of State, Directional, etc. (Levin, 1993). Each parameter is associated with a list of words, and a word may be part of more than one parameter. The parameter classes play an important role in identifying and handling translation divergences. The current classification includes 16 classes of parameters. Table 3.4 presents the parameters and some associated English words.

Table 3.4: DUSTer Parameters in English

    Parameter Name            Examples in English
    Aspect Verb               begin, complete, end, keep, prevent, quit, repeat
    Change of State Verb      age, alter, change, empty, grow, reverse, tighten
    Complement                as, because, since, than, that, while, when, yet
    Direction Verb            arrive, come, cross, enter, invade, return
    Directional Preposition   across, at, down, from, in, to, toward, up
    Functional Determiner     a, each, every, his, many, some, that, which
    Functional Noun           all, each, every, it, he, him, his, what
    Light Verb                be, do, give, have, make, put, take
    Location Verb             belong, consist, enclose, insert, put, touch
    Modal Verb                can, could, do, have, may, must, should
    Motion Verb               bounce, cut, float, jump, kick, move, run, turn
    Negation                  no, not, none
    Oblique                   across, by, for, in, of, over, to, under, with
    Pleonastic                it, there
    Psych Verb                admire, bother, envy, love, mean, respect, try
    Tense Verb                shall, will

Because the parameters are based on semantic knowledge, the English values can be projected to their corresponding values in a new language. With few exceptions, this can be done simply by translating the words of a parameter class to those of another language. For example, English light verbs (be, do, give, have, make, put, take) are translated to Spanish light verbs (estar, ser, hacer, dar, tomar, poner, tener), respectively. Note that it is not necessary to list all morphological variants of the same word in the parameter classes.[7] For example, in English, the verb begin is listed among the Aspectual Verbs, but not all of its variants begins, began, begun.
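As a small illustration of how parameter classes might be consulted (a sketch with made-up data structures and a toy lemmatizer, not DUSTer's actual resources), class membership can be checked on the root form of a word so that inflectional variants need not be listed:

```python
# Hypothetical parameter-class data; only a few of the 16 classes are shown.
PARAMETERS_EN = {
    "Light Verb":  {"be", "do", "give", "have", "make", "put", "take"},
    "Aspect Verb": {"begin", "complete", "end", "keep", "prevent", "quit", "repeat"},
    "Oblique":     {"across", "by", "for", "in", "of", "over", "to", "under", "with"},
}

def lemmatize(word):
    """Stand-in for the separate module that maps inflectional variants
    to their root (e.g. 'began' -> 'begin')."""
    irregular = {"began": "begin", "begun": "begin", "gave": "give", "took": "take"}
    if word in irregular:
        return irregular[word]
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def parameter_classes(word):
    """Return every parameter class the word (via its root) belongs to;
    a word may belong to more than one class."""
    root = lemmatize(word)
    return {name for name, members in PARAMETERS_EN.items() if root in members}

print(parameter_classes("began"))   # {'Aspect Verb'}
print(parameter_classes("gave"))    # {'Light Verb'}
```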
Appendix A lists all parameters in English and Spanish. Different morphological variants are mapped to the root word in a separate module, and each variant is treated as a member of the associated parameter class during the execution of the system.

Footnote 7: DUSTer considers only inflectional morphology; i.e., derivational morphology is ignored inside DUSTer.

3.3.2 Universal Rule Set

DUSTer makes use of a "universal rule set," i.e., general rewriting rules that relate one or more linguistically-motivated categories in English (specifically, part-of-speech (POS) labels and semantic word classes) to those of the FL. The term "universal" refers to the structure of the rule and implies that the same rule may be applied to different language pairs that exhibit the same phenomena. For example, the following rules may be used to handle two forms of conflation (Tense-Verb and Light-Verb) between English and 3 other languages:

    0.AVar.X [English{2 1} Chinese{1} Spanish{1} Hindi{1}]
    [Verb<1,i> [TenseV<2,Mod,Verb,C:i>]] <--> [Verb<1,i>]

    1.B.X [English{2 1 3} Spanish{2 1 3 4 5}]
    [PsychV<1,i,CatVar:V_N,Verb> [Noun<2,j,Subj>] [Noun<3,k,Obj>]]
    <--> [LightVB<1,Verb,C:i> [Noun<2,j,Subj>] [Noun<3,i,Obj>]
          [Oblique<4,Pred,Prep,C:i> [Noun<5,k,PObj>]]]

These rules correspond to the mappings "will eat" ↔ "eats" and "j fears k" ↔ "j has fear of k", respectively. The first line of each rule shows the rule name, the languages to which the rule may be applied, and the relative linear order of the nodes in the surface form for each language involved. The ordering numbers (1, 2, 3, ...) correspond to the node identifiers in the rule specified in the subsequent lines. Each rule thus follows a small formal syntax: a rule name (e.g., 0.AVar.X), a bracketed list of languages with their surface-order lists in braces, and two bracketed node structures, one for English and one for the FL, separated by <-->, where each node consists of a category, a node identifier, and optional features in angle brackets, with children nested in square brackets.
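Purely as an illustration of how such a rule could be held in memory (the class names and fields below are assumptions for exposition, not DUSTer's implementation), the Tense-Verb conflation rule above might be represented as follows:

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Node:
    category: str                          # POS label or semantic class, e.g. "Verb", "TenseV"
    node_id: int                           # identifier referenced by the surface-order lists
    features: List[str] = field(default_factory=list)    # e.g. ["i"] or ["Mod", "Verb", "C:i"]
    children: List["Node"] = field(default_factory=list)

@dataclass
class UniversalRule:
    name: str                              # e.g. "0.AVar.X"
    orderings: Dict[str, List[int]]        # surface order of node ids, per language
    english_side: Node                     # English component of the rule
    fl_side: Node                          # foreign-language component of the rule

# The Tense-Verb conflation rule shown above ("will eat" <--> "eats"):
tense_rule = UniversalRule(
    name="0.AVar.X",
    orderings={"English": [2, 1], "Chinese": [1], "Spanish": [1], "Hindi": [1]},
    english_side=Node("Verb", 1, ["i"],
                      children=[Node("TenseV", 2, ["Mod", "Verb", "C:i"])]),
    fl_side=Node("Verb", 1, ["i"]),
)
```

Under this representation, applying a rule would amount to matching the English-side nodes against the enhanced dependency tree and recording the matched word indices as a match set, from which the corrected partial alignment links are then read off.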