The Web as a Parallel Corpus

Philip Resnik and Noah A. Smith

Abstract

Parallel corpora have become an essential resource for work in multilingual natural language processing. In this report, we describe our work using the STRAND system for mining parallel text on the World Wide Web, first reviewing the original algorithm and results and then presenting a set of significant enhancements. These enhancements include the use of supervised learning based on structural features of documents to improve classification performance, a new content-based measure of translational equivalence, and adaptation of the system to take advantage of the Internet Archive for mining parallel text from the Web on a large scale. Finally, the value of these techniques is demonstrated in the construction of a significant parallel corpus for a low-density language pair.

1 Introduction

Parallel corpora, bodies of text in parallel translation (also known as bitexts), have taken on an important role in machine translation and multilingual natural language processing. They represent resources for automatic lexical acquisition (e.g. Gale and Church, 1991; Melamed, 1997), they provide indispensable training data for statistical translation models (e.g. Brown et al., 1990; Melamed, 2000), and they can provide the connection between vocabularies in cross-language information retrieval (e.g. Davis and Dunning, 1995; Landauer and Littman, 1990; see also Oard, 1997).
More recently, researchers at Johns Hopkins University and the University of Maryland have been exploring new ways to exploit parallel corpora in order to develop monolingual resources and tools, using a process of annotation, projection, and training: given a parallel corpus of English with a less resource-rich language, we project English annotations across the parallel corpus to the second language, using word-level alignments as the bridge, and then use robust statistical techniques in learning from the resulting noisy annotations (Cabezas, Dorr, and Resnik, 2001; Diab and Resnik, 2002; Hwa et al., 2002; Lopez et al., 2002; Yarowsky, Ngai, and Wicentowski, 2001; Yarowsky and Ngai, 2001; Riloff, Schafer, and Yarowsky, 2002).

For these reasons, parallel corpora can be thought of as a critical resource. Unfortunately, they are not readily available in the necessary quantities. Until very recently, for example, statistical work in machine translation focused heavily on French-English translation because the Canadian parliamentary proceedings (Hansards) in English and French were the only large bitext available. Things have improved somewhat, but it is still fair to say that for all but relatively few language pairs, parallel corpora tend to be accessible only in specialized forms such as United Nations proceedings,[1] religious texts (Resnik, Olsen, and Diab, 1999), localized versions of software manuals (Resnik and Melamed, 1997), and the like. Even for the top handful of majority languages, the available parallel corpora tend to be unbalanced, representing primarily governmental or newswire-style texts.

(Affiliations: Philip Resnik, Department of Linguistics and Institute for Advanced Computer Studies, University of Maryland, College Park, MD 20742, resnik@umiacs.umd.edu; Noah A. Smith, Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, nasmith@cs.jhu.edu.)
In addition, like other language resources, parallel corpora are often encumbered by fees or licensing restrictions. For all these reasons, it is difficult to follow the "more data are better data" advice of Church and Mercer (1993), abandoning balance in favor of volume, with respect to parallel text.

And then there is the Web. People tend to see the Web as a reflection of their own way of viewing the world: as a huge semantic network, or an enormous historical archive, or a grand social experiment. We are no different: as computational linguists working on multilingual issues, we view the Web as a great big body of text waiting to be mined, a huge fabric of linguistic data interwoven with parallel threads.

This report describes our techniques for mining the Web in order to extract the parallel text it contains. It presents, in revised and considerably extended form, our early work on mining the Web for bilingual text (STRAND; Resnik, 1998; Resnik, 1999), incorporating new work on content-based detection of translations (Smith, 2001; Smith, 2002) and efficient exploitation of the Internet Archive (Resnik and Smith, 2002). In Section 2 we lay out the STRAND architecture, which is based on the insight that translated Web pages tend quite strongly to exhibit parallel structure, permitting their exploitation even without looking at content; we also show how we have improved STRAND's performance by training a supervised classifier using structural parameters rather than relying on manually tuned thresholds. In Section 3 we present an approach to detecting translations that relies entirely on content rather than structure, demonstrating performance comparable to STRAND's using this orthogonal source of information. In Section 4 we describe how we have adapted the STRAND approach to the Internet Archive, dramatically improving our ability to identify parallel Web pages on a large scale.
Section 5 puts all the pieces together, using structural and combined content-structure matching of pages on the Internet Archive in order to obtain a sizable corpus of English-Arabic Web document pairs. In Section 6 we present our conclusions and thoughts on future work.

[1] E.g., via LDC, http://www.ldc.upenn.edu.

Figure 1: Example of a candidate pair.

2 The STRAND Web-Mining Architecture

STRAND (Resnik, 1998; Resnik, 1999) is an architecture for Structural Translation Recognition, Acquiring Natural Data. Its goal is to identify naturally occurring pairs of Web pages in parallel translation. In order to do this, it exploits an observation about the way that Web page authors disseminate information in multiple languages: when presenting the same content in two different languages, authors exhibit a very strong tendency to use the same document structure (e.g. Figure 1). STRAND therefore locates pages that might be translations of each other, via a number of different strategies, and filters out page pairs whose structures diverge by too much.

In this section we describe how STRAND works, and we also discuss several related Web-mining methods, focusing on the overall architecture these systems have in common and the important system-specific variations. We then show how tuning STRAND's structural parameters using supervised training can significantly increase its performance.

2.1 STRAND

Finding parallel text on the Web consists of three main stages:

- Location of pages that might have parallel translations
- Generation of candidate pairs that might be translations
- Structural filtering-out of non-translation candidate pairs

We consider each of these in turn.

2.1.1 Locating Pages.

The original STRAND architecture accomplished the first step by using the AltaVista search engine (www.av.com) to search for two types of Web pages. A parent page is one that contains hypertext links to different-language versions of a document; e.g.
if we were looking for English and French bitexts, the page at the left in Figure 2 would lead us to one such candidate pair. A sibling page is a page in one language that itself contains a link to a version of the same page in another language; e.g. the page at the right of Figure 2 contains a link at the top that says "English".

More recent versions of STRAND (unpublished) added a "spider" component for locating pages that might have translations. Given a list of Web sites thought to contain bilingual text for a given language pair, it is possible to download all the pages on the site, any of which might have a translation on that site. Although simple to implement, this method of locating pages shifts the burden of narrowing down the possibilities to the process of generating candidate document pairs.

2.1.2 Generating Candidate Pairs.

Pairing up potentially translated pages is simple when a search engine has been used to generate parent or sibling pages: one simply pairs together the two child pages linked to by the parent, or the sibling page together with the page it links to. When all the pages on a site are under consideration, the process is rather different. The simplest possibility is to separate the pages on a site into the two languages of interest using automatic language identification (e.g. Dunning (1994)), throwing away any pages that are not in either language, and then to generate the cross product. This potentially leads to a very large number of candidate page pairs, and for most of them there is no particular reason to believe they are parallel translations, other than the fact that they appear on the same Web site. The "spider" component of STRAND adds a URL-matching stage, exploiting the fact that the directory structure on many Web sites reflects parallel organization when pages are translations of each other. Matching is done by manually creating a list of substitution rules (e.g.
english → big5), and, for each English URL, applying all possible combinations of rules to generate URLs that might appear on the list of pages for the other language. If such a URL is found, the pair with matching URLs is added to the list of candidate document pairs. For example, if an English-Chinese site contains a page with URL http://mysite.com/english/home_en.html, one combination of substitutions might produce the URL http://mysite.com/big5/home_ch.html, and if a page with that URL exists on the site, the two pages are probably worth considering as a likely candidate pair.[2] Owing to the combinatorics (an exponential number of possible substitutions), only a fixed number of substitution combinations could be tried per English URL; however, in Section 4.3 we describe a more scalable URL-matching algorithm.

Another possible criterion for matching is the use of document lengths. Texts that are translations of each other tend to be similar in length, and it is reasonable to assume length(E) ≈ f(length(F)), where f is a linear function whose parameters are tuned for languages E and F. The use of a document-length filter is described in Smith (2001), where such a filter is shown, at the sentence level, to reduce the size of the search space exponentially in the confidence (p in a (1 - p)-confidence interval for a linear regression model) with only linear loss of good pairs.

[2] Big5 is the name of a commonly used character encoding for Chinese.

Figure 2: Examples of a parent page and sibling page used to find candidate document pairs. (The sibling-page example shown is a French-language page, "Faits saillants des pratiques exemplaires: Séminaire sur l'autoréglementation", with an "English" link at the top.)

2.1.3 Structural Filtering.

The heart of STRAND is a structural filtering process that relies on analysis of the pages' underlying HTML to determine a set of pair-specific structural values, and then uses those values to decide whether the pages are translations of each other.
The first step in this process is to linearize the HTML structure and ignore the actual linguistic content of the documents. Both documents in the candidate pair are run through a markup analyzer that acts as a transducer, producing a linear sequence containing three kinds of token:

  [START:element_label]   e.g. [START:A], [START:LI]
  [END:element_label]     e.g. [END:A]
  [Chunk:length]          e.g. [Chunk:174]

The second step is to align the linearized sequences using a standard dynamic programming technique (Hunt and McIlroy, 1975). For example, consider two documents that begin as follows:

  <HTML>                           <HTML>
  <TITLE>Emergency Exit</TITLE>    <TITLE>Sortie de Secours</TITLE>
  <BODY>                           <BODY>
  <H1>Emergency Exit</H1>          Si vous êtes assis à
  If seated at an exit and         côté d'une ...
  ...

The aligned linearized sequence would be as follows:

  [START:HTML]    [START:HTML]
  [START:TITLE]   [START:TITLE]
  [Chunk:12]      [Chunk:15]
  [END:TITLE]     [END:TITLE]
  [START:BODY]    [START:BODY]
  [START:H1]
  [Chunk:12]
  [END:H1]
  [Chunk:112]     [Chunk:122]

Using this alignment, we compute four scalar values that characterize the quality of the alignment:

  dp  The difference percentage, indicating non-shared material, i.e. alignment tokens that are in one linearized file but not the other.
  n   The number of aligned non-markup text chunks of unequal length.
  r   The correlation of lengths of the aligned non-markup chunks.
  p   The significance level of the correlation r.

The difference percentage (dp) quantifies the extent to which there are mismatches in the alignment, that is, sequence tokens on one side that have no corresponding token on the other side. In the example above, one document contains an H1 header that is missing from the other document. Large numbers of such mismatches can indicate that the two documents do not present the same material to a great enough extent to be considered translations. This can happen, for example, when two documents are translations up to a point, e.g. an introduction, but one document goes on to include a great deal more content than the other. Even more frequently, the proportion is high when two documents are prima facie bad candidates for a translation pair.

The number of aligned non-markup text chunks (n) helps characterize the quality of the alignment. The dynamic programming algorithm tries to optimize the correspondence of identical tokens, which represent markup.[3] As a side effect, the non-markup text chunks are placed in correspondence with each other (e.g. the "Emergency Exit" and "Sortie de Secours" chunks in the above example). The more such pairings are found, the more likely the candidate documents are to represent a valid translation pair.
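As a concrete illustration, the linearization and alignment just described can be sketched in Python. This is an approximation rather than STRAND's original implementation: difflib's SequenceMatcher stands in for the Hunt-McIlroy dynamic program, and the pairing of text chunks inside mismatched regions is a crude positional heuristic. The function returns dp and n directly, along with the aligned chunk-length pairs from which the correlation features would be computed.

```python
import difflib
from html.parser import HTMLParser

class Linearizer(HTMLParser):
    """Markup analyzer: turns HTML into [START:...], [END:...], and
    [Chunk:length] tokens, discarding the actual linguistic content."""
    def __init__(self):
        super().__init__()
        self.tokens = []
    def handle_starttag(self, tag, attrs):
        self.tokens.append(f"[START:{tag.upper()}]")
    def handle_endtag(self, tag):
        self.tokens.append(f"[END:{tag.upper()}]")
    def handle_data(self, data):
        text = data.strip()
        if text:
            self.tokens.append(f"[Chunk:{len(text)}]")

def linearize(html):
    parser = Linearizer()
    parser.feed(html)
    return parser.tokens

def chunk_len(token):
    return int(token[len("[Chunk:"):-1]) if token.startswith("[Chunk:") else None

def dp_n_and_chunk_pairs(toks_a, toks_b):
    """Align two token sequences and compute dp, n, and the aligned
    (length_a, length_b) chunk pairs used for the r and p features."""
    matched = 0
    pairs = []
    matcher = difflib.SequenceMatcher(a=toks_a, b=toks_b, autojunk=False)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            matched += 2 * (i2 - i1)
            pairs += [(chunk_len(t), chunk_len(t))
                      for t in toks_a[i1:i2] if chunk_len(t) is not None]
        else:
            # pair up the text chunks left inside a mismatched region
            ca = [chunk_len(t) for t in toks_a[i1:i2] if chunk_len(t) is not None]
            cb = [chunk_len(t) for t in toks_b[j1:j2] if chunk_len(t) is not None]
            matched += 2 * min(len(ca), len(cb))
            pairs += list(zip(ca, cb))
    total = len(toks_a) + len(toks_b)
    dp = (total - matched) / total          # fraction of unshared tokens
    n = sum(1 for a, b in pairs if a != b)  # aligned chunks of unequal length
    return dp, n, pairs
```

On a pair of toy pages like the example above, the two title chunks come out as an aligned pair of unequal lengths (contributing to n), while the unmatched H1 markup contributes to dp.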
The remaining two parameters (r and p) quantify the extent to which the corresponding non-markup chunks are correlated in length. When two documents are aligned with each other and are valid translations, there is a reliably linear relationship in the lengths of translated chunks of text: short pieces correspond with short pieces, medium with medium, and long with long. The Pearson correlation coefficient r for the lengths will be closer to 1 when the alignment has succeeded in lining up translated pieces of text, and the p value quantifies the reliability of the correlation; e.g. the standard threshold of p < .05 indicates 95% confidence that the correlation was not obtained by chance.

In our original work, we used fixed thresholds, determined manually by inspection of development (non-test) data for English-Spanish, to decide whether a candidate pair should be kept or filtered out. Thresholds of dp < 20% and p < 0.05 were used.

[3] "Non-markup" tokens with exactly the same length almost always turn out to be pieces of identical markup, e.g. key=value pairs within HTML tags.

2.2 STRAND Results

Like most search tasks, performance at finding parallel Web pages can be evaluated using standard measures of precision and recall, and by combining those figures using the F-measure. It is not possible for us to measure recall relative to the entire set of document pairs that should have been found; this would require exhaustive evaluation using the entire Web, or pooling of results from a large number of different systems as done in the TREC information retrieval evaluations. Therefore, recall in this setting is measured relative to the set of candidate pairs that was generated.

Since the "truth" in this task is a matter for human judgment, we rely on bilingual speakers to independently judge whether page pairs are actually translations of each other for any given test set.
In our experience no bilingual speaker is 100% comfortable saying that another person's translation is a good translation, so in creating the gold standard we instead ask, "Was this pair of pages intended to provide the same content in the two different languages?" Asking the question in this way leads to high rates of inter-judge agreement, as measured using Cohen's κ measure.

2.2.1 Using Manually-Set Parameters.

Using the manually set thresholds for dp and p, we have obtained 100% precision and 64.1% recall (extrapolated to include a set of unjudged items) in an experiment using STRAND to find English-French Web pages (Resnik, 1999). A modified version of STRAND was used to obtain English-Chinese pairs (see related work, below), and in a similar formal evaluation, we found that the resulting set had 98% precision and 61% recall for Chinese.[4] Both these results are consistent with our preliminary findings for English-Spanish using a less rigorous evaluation (using the judgments of the first author rather than independent bilingual evaluators) and a very small test set, where precision was near ceiling and recall was in the vicinity of 60% (Resnik, 1998).

2.2.2 Optimizing Parameters using Machine Learning.

Based on experiments with several language pairs, it appears that STRAND's structure-based filter consistently throws out a bit more than one third of the candidate document pairs it has found in order to maintain its precision in the 98-100% range. It does so by respecting parameter thresholds that were determined manually using English-Spanish development data; the same parameters seem to have worked reasonably well not only for English-Spanish, but also for English-French and English-Chinese pairs. It is possible, however, that classification can be tuned for better performance.
In order to investigate this possibility, we took a machine-learning approach: we used the four structural values (dp, n, r, and p) as features characterizing each document pair, and treated the problem as a binary decision task, using supervised learning to make an attempt at better predicting human judgments.

Using the English-French data, we constructed a threefold cross-validation experiment using decision tree induction to predict the class assigned by the human judges. The decision tree software was C5.0.[5] Each fold had 87 test items and 174 training items. Precision and recall computations include extrapolation from the judged set to an additional 16,328 unjudged items; the results are reported in Table 1 together with baseline results from STRAND's untuned classifier as reported above.

             precision   recall
  Untuned    1.000       0.641
  Fold 1     0.913       0.834
  Fold 2     1.000       0.943
  Fold 3     1.000       0.724
  Average    0.971       0.834

Table 1: Effects of parameter tuning.

Without tuning, the manually set parameters result in good document pairs being discarded 36% of the time. Our cross-validation results indicate that tuning the parameters cuts that figure by more than half: only 17% of the good pairs will be discarded, at a cost of admitting 3 false positives from every 100 candidate pairs.

[4] http://umiacs.umd.edu/~resnik/strand/
[5] Available at http://www.rulequest.com/demoeula.html.

2.3 Related Work

Several other systems for discovering parallel text, developed independently, can be described as operating within the same three-stage framework as STRAND.

Parallel Text Miner (PTMiner; Chen and Nie (2000)) exploits already-existing Web search engines to locate pages by querying for pages in a given language that contain links to pages that are likely to be in the other language of interest.
Once bilingual sites are located, they are crawled exhaustively. In order to generate candidate pairs, PTMiner uses a URL-matching process similar to the one described above; for example, the French translation of a page with URL http://www.foo.ca/english-index.html might be at http://www.foo.ca/french-index.html. Their matching process uses a mapping of language-specific prefixes and suffixes, and does not handle the case where URL-matching requires multiple substitutions. PTMiner also applies a length filter and automatic language identification to verify that the pages are in the appropriate languages. Chen and Nie report a 95%-precise English-French corpus of 118MB/135MB of text and a 90%-precise English-Chinese corpus of 137MB/117MB of text, based on inspection.

Bilingual Internet Text Search (BITS; Ma and Liberman (1999)) starts with a given list of domains to search for parallel text. It operates by sampling pages from each domain and identifying their languages; if a domain is deemed to be multilingual, all pages on the site are crawled exhaustively. BITS appears to consider all possible combinations of Web page pairs in the two languages (i.e. the full cross product within each site), and filters out bad pairs by using a large bilingual dictionary to compute a content-based similarity score and comparing that score to a threshold. For each page pair, the similarity score is

  similarity(A, B) = (# translation token pairs) / (# tokens in A)    (1)

Translation token pairs are identified within a fixed window (i.e., a distance-based measure of cooccurrence is used).[6] In addition to cross-lingual lexical matching, BITS filters out candidate pairs that do not match well in terms of file size, anchors (numbers, acronyms, and some named entities), or paragraph counts.
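Equation (1) can be sketched as follows. Since Ma and Liberman leave details such as the window size and dictionary handling unspecified, the windowing scheme here, a fixed-size band around the proportionally corresponding position in B, is our own assumption, not a description of BITS itself.

```python
def bits_similarity(tokens_a, tokens_b, lexicon, window=5):
    """Sketch of equation (1): the fraction of tokens in A that can be
    paired with a translation in B within a fixed window around the
    proportionally corresponding position (each B token used at most once)."""
    used = set()
    pairs = 0
    for i, a in enumerate(tokens_a):
        # assumed band: position in B proportional to position in A
        center = round(i * len(tokens_b) / max(len(tokens_a), 1))
        for j in range(max(0, center - window),
                       min(len(tokens_b), center + window + 1)):
            if j not in used and (a, tokens_b[j]) in lexicon:
                used.add(j)
                pairs += 1
                break
    return pairs / max(len(tokens_a), 1)
```

With a toy lexicon containing (the, le), (dog, chien), and (cat, chat), the English sentence "the dog sees the cat" scores 4/5 against "le chien voit le chat": four of its five tokens find a windowed translation pair.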
Using an English-German bilingual lexicon of 117,793 entries, Ma and Liberman report 99.1% precision and 97.1% recall on a hand-picked set of 600 documents (half in each language) containing 240 translation pairs (as judged by humans). This technique yielded a 63MB parallel corpus of English-German.

Other work on Web mining has been done by Jinxi Xu of BBN (personal communication), who began with our STRAND implementation and added a module for automatically learning string-substitution patterns for URLs, and also implemented a different dynamic programming algorithm for assessing structural alignment. Xu used the modified STRAND to obtain 3376 Chinese-English document pairs, which we evaluated formally (see above), determining that the set has 98% precision and 61% recall.

In addition, STRAND has been re-implemented by David Martinez and colleagues at Informatika Fakultatea in the Basque Country (personal communication), in order to perform exploratory experiments for discovering English-Basque document pairs.

It is worth noting that STRAND, PTMiner, and BITS are all largely independent of linguistic knowledge about the particular languages, and therefore very easily ported to new language pairs. With the exception of the use of a bilingual dictionary in BITS, these systems require, at most, a set of URL substring patterns for the URL pattern-matching stage (e.g. big5 ↔ english in the example above; see further discussion in Section 4.3), and a modest amount of monolingual data for training n-gram based language identifiers (typically 50,000 to 100,000 characters of text per language). Word-level translations are worth exploiting when they are available, as was the case for the BITS system. In Section 3 we describe a bitext-matching process using a content-based similarity score grounded in information theory, and in Section 5 we show how structural and content-based criteria can be combined in order to obtain performance superior to that obtained by either method alone.
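The URL pattern-matching stage these systems share can be sketched as follows. The substitution rules shown are purely illustrative; real rule lists are hand-crafted or, as in Xu's extension, learned per language pair.

```python
from itertools import combinations

# Illustrative substitution rules for an English-Chinese site (hypothetical).
RULES = [("english", "big5"), ("_en", "_ch"), ("/en/", "/zh/")]

def candidate_urls(url, rules=RULES, max_subs=3):
    """Apply every combination of up to max_subs substitution rules.
    The cap matters because the number of rule combinations grows
    exponentially with the number of rules."""
    candidates = set()
    for r in range(1, min(len(rules), max_subs) + 1):
        for combo in combinations(rules, r):
            cand = url
            for src, tgt in combo:
                cand = cand.replace(src, tgt)
            if cand != url:
                candidates.add(cand)
    return candidates

def match_pairs(english_urls, other_urls, rules=RULES):
    """Pair an English URL with a page in the other language whenever
    one of its candidate URLs actually exists on the site."""
    other = set(other_urls)
    return [(e, c) for e in english_urls
            for c in sorted(candidate_urls(e, rules)) if c in other]
```

For instance, applying the first two rules together maps http://mysite.com/english/home_en.html to http://mysite.com/big5/home_ch.html, which is proposed as a candidate pair only if the second URL is actually present on the site.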
[6] Many details of this technique are left unspecified in (Ma and Liberman, 1999), such as the threshold for the similarity score, the distance threshold, and the matching of dictionary entries that are not one-word-to-one-word.

Figure 3: An example of two texts with links shown: X (the English side) is "Maria does n't like fruit" and Y (the French side) is "Maria n' aime pas de fruits". There are seven link tokens, five of which are lexical (non-NULL) in X, six in Y.

3 Content-Based Matching

The approach discussed thus far relies heavily on document structure. However, as Ma and Liberman (1999) point out, not all translators create translated pages that look like the original page. Moreover, structure-based matching is applicable only in corpora that include markup, and there are certainly multilingual collections on the Web and elsewhere that contain parallel text without structural tags. Finally, other applications for translation detection exist, such as sub-document text alignment and cross-lingual duplicate detection (i.e., location of already-existing translations in a multilingual corpus). All these considerations motivate an approach to matching of translations that pays attention to similarity of content, whether or not similarities of structure exist.

We present here a generic score of translational similarity that is based upon any word-to-word translation lexicon (hand-crafted or automatically generated, or a combination, and possibly highly noisy). The technique is shown to perform competitively with the structure-based approach of STRAND on the task of identifying English-French document translations.

3.1 Quantifying Translational Similarity

We define a cross-language similarity score, tsim, for two texts[7] by starting with a generative, symmetric word-to-word model of parallel texts (Melamed's Method A (2000)). Let a link be a pair (x, y) where x is a word in language L1 and y is a word in L2.
The model consists of a bilingual dictionary that gives a probability distribution p over all possible link types. Within a link, one of the words may be NULL, but not both. In the generative process, a sequence of independent link tokens is sampled from that distribution. The model does not account for word order. An example of two texts with links is illustrated in Figure 3.

Next, we desire to compute the probability of the most probable link sequence that could have accounted for the two texts.[8] The probability of a link sequence is simply the product of the p probabilities of the links it contains. As noted by Melamed (2000), this problem of finding the best set of links is the maximum-weighted bipartite matching problem (MWBM): given a weighted bipartite graph G = (V1 ∪ V2, E) with edge weights c_ij (i ∈ V1, j ∈ V2), find a matching M ⊆ E such that each vertex has at most one edge in M and the sum of the weights c_ij over the edges in M is maximized. The fastest known MWBM algorithm runs in O(ve + v² log v) time (Ahuja, Magnati, and Orlin, 1993); applied to this problem, that is O(max(|X|, |Y|)³). To use MWBM to find the most probable link sequence, let the L1 words be V1 and the L2 words be V2. If two words x, y have p(x, y) > 0, an edge exists between them with weight log p(x, y). Hence the sum of the weights of the links in a matching is the log-probability of the (unordered) link sequence, and maximizing that sum maximizes the probability.

The similarity score should be high when many of the link tokens in the best sequence do not involve NULL tokens. Further, it should normalize for text length. Specifically, the score is:

  tsim = log Pr(two-word links in best matching) / log Pr(all links in best matching)    (2)

This score is an application of Lin's (1998) information-theoretic definition of similarity. Starting with a set of axioms, Lin derives the measure

  sim(X, Y) = log Pr(common(X, Y)) / log Pr(description(X, Y))    (3)

where X and Y are any objects generated by a probabilistic model.

In this discussion, we seek to show how multiple linguistic resources can be exploited together to recognize translation. Therefore, the measure is simplified by assuming that all links in a given translation lexicon are equiprobable. This reduces the formula for tsim to

  tsim = (# two-word links in best matching) / (# links in best matching)    (4)

Another reason to compute tsim under the equiprobability assumption is that we need not compute the MWBM, but only find the maximum cardinality bipartite matching (MCBM), since all potential links have the same weight. An O(e√v) (or, for this purpose, O(|X||Y|·√(|X|+|Y|))) algorithm exists for MCBM (Ahuja, Magnati, and Orlin, 1993). For example, if the matching shown in Figure 3 is the MCBM (for some translation lexicon), then tsim(X, Y) = 4/7 under the simplifying assumption.

Melamed (2000) used a greedy approximation to MWBM called competitive linking. Competitive linking iteratively selects the edge with the highest weight, links those two vertices, then removes them from the graph. (Ties are broken at random.) A heap-based implementation of competitive linking runs in O(max(|X|, |Y|) log max(|X|, |Y|)). Under the equiprobability assumption, all the weights are the same, so competitive linking proceeds simply by randomly making links until no more can be made.

If definition (4) is applied to pairs of documents in the same language, with a "translation lexicon" defined by the identity relation, then tsim is a variant of resemblance (r), as defined by Broder et al.

[7] We use the term "text" to refer to a sequence of words of any length.
[8] Of course, all permutations of a given link sequence will have the same probability (since the links are sampled independently from the same distribution), so the order of the sequence is not important.
(1997) for the problem of monolingual duplicate detection, except that tsim has the advantage of being token-based rather than type-based, incorporating word frequency. The key result is that any translation lexicon (or, importantly, a union of translation lexicons) containing a set of word-to-word entries can be applied to the computation of tsim.

We have demonstrated that this score can be used to extract translationally equivalent English-Chinese sentence pairs from even a noisy space with high precision (Smith, 2002). It was also shown that combining multiple sources of word-level translation information (dictionaries, word-to-word translation models, cognates) had positive effects on performance on the sentence-matching task. The competitive linking approximation was also shown to perform nearly as well as MCBM.

This technique of using a translation model to define translational similarity is generic to different sources of lexical translation information. An important feature is that it can be used with any symmetric translation model where events can be divided into those which both sides of a bitext have in common and those which affect only one side.

3.2 Experiment

We now apply our content-based similarity measure to the candidate-pair classification task presented by STRAND. Recall that both the original STRAND classifier and those learned using decision tree methods, described in Section 2.2.2, use only structural features of the documents to determine whether they are translations. Here we apply the tsim score to the same task and compare the results with those of the original STRAND classifier.

3.2.1 Translation Lexicon.

The word-level translation lexicon is derived from several sources. The first is an English-French dictionary (a total of 34,808 entries, 4,021 of which are not one-to-one).[9] Each n-to-m entry was processed by stoplisting and then extracting all word pairs in the remaining cross-product. This technique is liable to introduce noise into the dictionary, so we used only cross-products that
This technique is liable to introduce noise to the dictionary, so we used only cross-products that generated four or fewer new entries. The result is 39,348 word pairs, 9,045 of which contain two words present in the corpora.

[9] This dictionary was generated by Gina Levow, who kindly made it available to us, using a dictionary derived from one available at http://www.freedict.com. It contains morphological variants but does not include character accents.

A word-to-word translation model (Melamed, 2000) was trained on a verse-aligned Bible (15,548 verses, averaging 25.5 English words and 23.4 French words after tokenization). Instead of using competitive linking, we used the maximum weighted bipartite matching algorithm in estimating the parameters of this model. The result includes all word pairs with positive probability (though the probabilities are subsequently ignored) and contains 13,762 word pairs.

The third source comprises English-French cognate pairs, identified using the method of Tiedemann (1999). Tiedemann's approach involved learning language-specific character-to-character weights for the computation of weighted edit distance to measure cognate-ness. He used a list of known cognates to train the weights. We instead used the weighted translation pairs in the translation model lexicon. Hence the resources required to extract cognates in this way are no different from those required for the translation model. The result is 35,513 word pairs. An additional set of 11,264 exact string matches were added. These entries were undoubtedly quite noisy.

The union of these translation lexicons consists of 68,003 unique word pairs. The experiment used only this union translation lexicon; note that the entries in this lexicon are not weighted (they are assumed to be equiprobable).

3.2.2 Results. In order to compare tsim with structural similarity scoring, we applied it to 325 English-French web-document pairs for which human evaluations were carried out in Section 2.[10]
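To make the scoring concrete, the competitive-linking computation of tsim under the equiprobability assumption can be sketched as follows. This is an illustrative reconstruction, not the authors' code: it links words deterministically rather than in random order, and it assumes that each word left unlinked counts as a one-word link in the denominator (an assumption consistent with the 4/7 example given earlier).

```python
def tsim(x_words, y_words, lexicon):
    """Approximate tsim by competitive linking under the equiprobability
    assumption: greedily link word pairs found in the lexicon until no
    more links can be made, then score = (#two-word links) / (#all links).
    Assumption: each unlinked word counts as a one-word link."""
    unmatched_y = list(y_words)
    two_word_links = 0
    one_word_links = 0
    for x in x_words:
        for i, y in enumerate(unmatched_y):
            if (x, y) in lexicon:
                del unmatched_y[i]          # y is now linked to x
                two_word_links += 1
                break
        else:
            one_word_links += 1             # x remains unlinked
    total = two_word_links + one_word_links + len(unmatched_y)
    return two_word_links / total if total else 0.0
```

With a toy lexicon {("dog", "chien"), ("cat", "chat")}, the documents "the dog sees the cat" and "le chien voit le chat" yield 2 two-word links and 6 one-word links, so tsim = 2/8 = 0.25.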
As there is only one feature under consideration (tsim), the classifier must be a threshold on that value. At different thresholds, Cohen's κ score of agreement (with each of Resnik's (1999) two judges and their intersection) may be computed for comparison with STRAND, along with recall and precision against a gold standard (for which we use the intersection of the judges, the set of examples where the judges agreed).

Computing tsim (MCBM on the words in the document pair) is not tractable for very large documents and translation lexicons. However, in preliminary comparisons, we found that representing long documents by as few as their first 500 words results in excellent performance on the κ measure. This allows O(1) estimation of tsim for two documents. Further, the competitive linking algorithm appears to be as reliable as MCBM, and it runs significantly faster in practice. The results reported here approximated tsim in this way.

Of the 325 pairs, 32 were randomly selected as a development set. We manually selected a threshold of τ = 0.15. This value was chosen because it was the one that resulted in the maximum κ score on the development set.[11]

[10] One additional pair was thrown out because it contained compressed data; it is assumed that pair would not pass a language identification filter.

    Comparison                 N    Pr(Agree)  κ      prec    rec     F
    J1, J2                     245  0.98       0.96
    J1, STRAND                 250  0.88       0.70
    J2, STRAND                 284  0.88       0.69
    J1 ∩ J2, STRAND            241  0.90       0.75   0.963   0.684   0.800
    J1, tsim (τ = 0.15)        249  0.92       0.83
    J2, tsim (τ = 0.15)        283  0.92       0.82
    J1 ∩ J2, tsim (τ = 0.15)   240  0.93       0.85   0.680   0.921   0.782

Table 2: Comparison with STRAND. The test set is 293 of the 326 pairs in Resnik's (1999) test set. The 32 development pairs were used to manually select the 0.15 threshold. N is the number of examples for which judgement comparison was possible in each case (human judges were sometimes undecided; those cases are ignored in computing κ).
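The development-set threshold selection can be sketched as follows; the κ computation is the standard two-rater formula for binary judgements, and the function names are ours.

```python
def cohens_kappa(a, b):
    """Cohen's kappa for two parallel lists of binary judgements:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(a)
    agree = sum(x == y for x, y in zip(a, b)) / n
    pa, pb = sum(a) / n, sum(b) / n
    chance = pa * pb + (1 - pa) * (1 - pb)
    return (agree - chance) / (1 - chance) if chance < 1 else 1.0

def best_threshold(scores, gold, candidates):
    """Pick the tsim threshold that maximizes kappa against the gold
    labels on a development set (the paper chooses tau = 0.15 this way)."""
    return max(candidates,
               key=lambda t: cohens_kappa([s >= t for s in scores], gold))
```

For example, with development scores [0.1, 0.2, 0.5, 0.05] and gold labels [0, 1, 1, 0], the threshold 0.15 separates the classes perfectly and is selected.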
κ scores against each judge and their intersection were then computed at that threshold on the test set (the remaining 293 pairs). These are compared to κ scores of the STRAND system, on the same test set, in Table 2. In every case, the tsim classifier agreed more strongly with the human evaluations, and its F score is competitive with that of STRAND. At τ = 0.15, precision was 0.680 and recall was 0.921, with F = 0.782. On the same set, STRAND structural classification achieved 0.963 precision and 0.684 recall, with F = 0.800. A simple combination of the classifiers by disjunction (i.e., "(X, Y) is a translation pair if either classifier says so") yields precision 0.768, recall 0.961, F = 0.854, and κ (with the judges' intersection) of 0.878. Figure 4 shows κ, precision, and recall each plotted against τ.

For this application, the structural and content-based classifiers have complementary strengths; the former is highly precise while the latter gives high recall. The content-based classifier also more reliably predicts human judgements (based on Cohen's κ). Given two high-performing methods that use orthogonal information for identifying good candidate pairs (one using only structure, the other using only content), the natural question is whether the techniques can be combined for even better performance. We answer this question in the affirmative in Section 5. First, however, we describe how we have adapted the STRAND architecture to the Internet Archive in order to generate the candidate pairs on a scale that was previously unattainable.

[11] One could select such a threshold to maximize any objective function over the development set.

[Figure 4: Performance measures as the threshold τ varies (all measures are on the test set): the κ agreement score with the two judges' intersection, precision, and recall. The κ score obtained by STRAND is shown as well.]
4 Exploiting the Internet Archive

One of the difficulties with doing research on Web mining is that large-scale crawling of the Web is a significant enterprise. Chen and Nie's (2000) PTMiner represents one solution, a carefully thought-out architecture for mining on a large scale. Here we present a different solution, taking advantage of an existing large-scale repository of Web pages maintained on an ongoing basis by an organization known as the Internet Archive.

4.1 The Internet Archive

The Internet Archive is a non-profit organization attempting to archive the entire publicly available Web, preserving the content and providing free access to researchers, historians, scholars, and the general public.[12] Data come from crawls done by Alexa Internet, and hence they represent an industry-level resource of the sort not easily constructed within academia. At present, the Archive contains 120T (terabytes) of data, by a conservative estimate, and it is growing at approximately 8T per month. Text on the Archive comprises over 10 billion Web pages, and the estimated duplicate rate is a factor of 2, i.e. two copies of everything.[13]

The Internet Archive provides public access to the data via the "Wayback Machine" Web interface. As of this writing, a search for the ACL home page[14] brings up links to 27 snapshots of that page dating back to June 7, 1997. The reader can get to that page directly on the Wayback Machine using a URL that points to the Internet Archive and provides both the desired page and the timestamp indicating which snapshot to retrieve.[15]

[12] See http://www.archive.org.
[13] We are grateful to Paula Keezer of Alexa for these figures.

The Archive also provides researchers direct access to its data via accounts on their cluster. The data are stored on the disk drives of approximately 300 machines, each running some variety of Unix, creating what is in essence one huge file system.
This provides a researcher the remarkable sensation of having the entire Web on his or her hard drive.

[14] http://www.cs.columbia.edu/acl/home.html
[15] http://web.archive.org/web/19970607032410/http://www.cs.columbia.edu/acl/home.html (a single long URL).

4.2 Properties of the Archive

Mining terabytes on the Archive presents a number of challenges.

- The Archive is a temporal database, but it is not stored in temporal order. Hence a document and its translation may be in files on different machines; a global merge of data is required for any hope of complete extraction.
- Extracting a document for inspection is an expensive operation involving text decompression.
- The Archive's size makes it essential to keep computational complexity low.

On the other hand, aspects of the Archive's architecture turned out to make rapid development remarkably feasible.

- Almost all data are stored in compressed plain text files, rather than in databases.
- The data relevant for our purposes are organized into archive files (arcfiles), which contain the stored pages, and index files, which contain plain text tuples ⟨URL, timestamp, arcfile, offset, ...⟩.
- A suite of tools exists for, e.g., extracting individual pages from archive files.
- The Archive's infrastructure for cluster computing makes it easy to write Unix scripts or programs and run them in parallel across machines.

The last of these, the cluster computing tools, turned out to drastically reduce the time needed to port STRAND to the Archive, and literally halved the size of the STRAND code base. The centerpiece in Archive cluster computing is a parallelization tool called p2, which offers a UNIX command-line interface that allows one to specify (a) a parallelizable task, (b) a way to split it up, (c) a way to combine the results, and (d) a set of processors among which to divide the task. The p2 tool divides up tasks intelligently, invoking each parallel computation on the local machine where the data reside.[16]
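As an illustration of working with the index files just described, URL extraction by pattern matching might look like the following sketch. The tuple fields (URL, timestamp, arcfile, offset) come from the text, but the whitespace-separated line layout and the refined domain pattern are our assumptions, not the Archive's actual format.

```python
import re

# Hypothetical index-line layout: whitespace-separated fields
# URL timestamp arcfile offset ...  (field names from the text;
# the real Archive layout may differ).
DOMAIN = re.compile(r"\.(hk|tw|cn)(/|:|$)")

def extract_urls(index_lines):
    """Pattern-match index lines, keeping the URL and its bookkeeping
    data for URLs in the target national domains."""
    hits = []
    for line in index_lines:
        fields = line.split()
        if len(fields) >= 4 and DOMAIN.search(fields[0]):
            url, timestamp, arcfile, offset = fields[:4]
            hits.append((url, timestamp, arcfile, offset))
    return hits
```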
[16] The Archive intends to release these cluster tools under the GNU Public License.

4.3 STRAND on the Archive

In adapting STRAND's three-stage process to the Internet Archive, the primary challenge was in the first two steps, locating possible translations and matching them up to produce candidate document pairs. Structural filtering remained essentially unchanged. Generating candidate pairs on the Archive involves the following steps:

1. Extracting URLs from index files using simple pattern matching
2. Combining the results from Step 1 into a single huge list
3. Grouping URLs into buckets by handles
4. Generating candidate pairs from buckets

Steps 1 and 2 are done via a parallel search operation plus combination of results, e.g. extracting URLs (and their associated bookkeeping data) using a pattern like /(.hk|.tw|.cn)/ to extract all URLs in the Hong Kong, Taiwan, or China domains.[17]

Step 3 is potentially tricky owing to computational complexity issues. As noted in Section 2.1.2, examining the cross product of a site's page sets in two different languages is potentially very expensive, and matching documents by similarity of URLs can represent a combinatorial process in the general case. We arrived at an algorithmically simple solution that avoids this problem, but is still based on the idea of language-specific substrings (LSSs). The idea is to identify a set of language-specific URL substrings that pertain to the two languages of interest, e.g. based on language names, countries, character code-set labels, abbreviations, etc. For example, a set of LSSs for English-Arabic might be as follows:

    1256, 437, 864, 8859-1, 8859-6, a, ar, ara, arab, arabic, cp1256,
    cp437, cp864, e, en, eng, english, gb, iso, iso-8859-1, iso-8859-6,
    latin, latin-1, latin1, uk, us, usa

For each URL, we form a "handle" by subtracting out any substrings that match (insensitive to case) any item on the LSS pattern list. The subtraction process is
implemented reasonably efficiently: if there are p patterns with maximum length l, and the URL's length in characters is u, then the current implementation will do at most p × u string matches of length no more than l. (Currently we use C strcmp for string matching.) In practice, this is extremely fast: we can generate handles for nearly 5000 URLs per second on a 6-year-old Sun Ultra 1 workstation.[18] Figure 5 illustrates handle generation on two real URLs. As one would hope, these two URLs produce the same handle, and as a result, they wind up in the same bucket in Step 3.[19]

    URLs:
      saudifrenchbank.com.sa/English/English.htm
        (patterns removed: a, e, a, a, english, english)
      saudifrenchbank.com.sa/Arabic/arabic.htm
        (patterns removed: a, e, a, a, arabic, arabic)
    Same handle for both URLs: sudifrchbnk.com.s//.htm

Figure 5: Example of LSS subtraction.

In Step 4, the URLs in each bucket are used to generate candidate pairs by taking the cross product and keeping those URL pairs for which the URL bookkeeping data indicates pages that are in the correct languages. For example, given the bucket containing the two URLs in Figure 5, this step would generate a single pair consisting of the URL for the English page and the URL for the Arabic page, assuming the language ID information associated with each URL confirmed it was in the proper language.[20]

At this point, the candidate generation process is complete. The final step is to apply STRAND's structural filtering step to each candidate pair, an operation that can itself be parallelized since each candidate pair can be processed independently.

[17] We also take advantage of the Archive's list of .com domains paired with the nation in which each is registered, making it possible to include commercial sites in the search without an explosion of irrelevant possibilities.
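Handle generation and bucketing (Steps 3 and 4) can be reconstructed quite directly. The text does not specify the order in which LSS patterns are applied; deleting longer patterns first is our choice, made to keep the result deterministic, and it reproduces the handle shown in Figure 5.

```python
from collections import defaultdict

# English-Arabic LSS list from the text.
LSS = ["1256", "437", "864", "8859-1", "8859-6", "a", "ar", "ara",
       "arab", "arabic", "cp1256", "cp437", "cp864", "e", "en", "eng",
       "english", "gb", "iso", "iso-8859-1", "iso-8859-6", "latin",
       "latin-1", "latin1", "uk", "us", "usa"]

def handle(url, patterns=LSS):
    """Form a handle by case-insensitively deleting every occurrence of
    each language-specific substring. Longer patterns are deleted first
    (an assumption; the paper leaves the order unspecified)."""
    out = url.lower()
    for p in sorted(patterns, key=len, reverse=True):
        out = out.replace(p.lower(), "")
    return out

def buckets(urls):
    """Step 3: group URLs into buckets keyed by their handles."""
    b = defaultdict(list)
    for u in urls:
        b[handle(u)].append(u)
    return b
```

Applied to the two URLs of Figure 5, both reduce to the same handle, sudifrchbnk.com.s//.htm, and thus land in the same bucket.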
It is interesting to note that by taking advantage of the Archive's p2 cluster computing tool, together with its simple flat-text representations, adapting STRAND's candidate generation process resulted in a dramatic reduction in the size of the program, cutting it in half, as measured in lines of code.

[18] We are grateful to Bill Pugh of the University of Maryland for suggesting this algorithm.
[19] Conceptually, they hash to the same bucket in a hash table; in practice on the Archive it turns out to be more efficient to create buckets by doing a parallel sort of the entire URL set using the handle as the key, and then creating buckets based on identical handles being on adjacent lines.
[20] The Internet Archive tags its data for language using standard n-gram language identification techniques.

5 Building an English-Arabic Corpus

In the previous sections, we have described methods and results for structural matching, content-based matching, and for dramatically scaling up the number of candidate pairs that can be generated for any given language pair by using the industrial-strength Web crawls stored on the Internet Archive. In this section we put all these pieces together, describing an experiment in mining the Internet Archive to find English-Arabic parallel text. This language pair is of particular global importance, but resources, particularly bilingual text, are not easy to obtain. Moreover, Arabic text is far behind on the Web's exponential growth curve: Arabic text (as opposed to images) did not really start emerging on the Web until the release of Microsoft Windows 98™, which provided Arabic support in its version of Internet Explorer.

5.1 Finding English-Arabic Candidate Pairs on the Internet Archive

The input resource for this search was a list of Internet domains likely to contain Arabic text.[21]
The list included twenty-four top-level national domains for countries where Arabic is spoken by a significant portion of the population, including Egypt (.eg), Saudi Arabia (.sa), Kuwait (.kw), etc. In addition, we used a list of .com domains known to originate in Arabic-speaking countries. This list provided an additional twenty-one specific domains (e.g., emiratesbank.com, checkpoint.com); note that it is by no means exhaustive.

In the experiments we report here, we mined two crawls from 2001, comprising 8T and 12T (i.e., less than one sixth of the Archive as it exists today), spread over 27 machines. Our list of URLs with relevant domains, obtained by pattern-matching in Archive index files, numbers 19,917,923 pages.[22] The language-specific substrings given earlier were subtracted from these URLs to generate handles, resulting in 786,880 buckets with an average of twenty-five pages per bucket. When all possible English-Arabic page pairs were generated from all buckets, the result was 8,294 candidate pairs.

[21] We are grateful to Nizar Habash for constructing this list.
[22] Pages with the same URL but different timestamps are counted separately; there were 10,701,622 unique URLs.

A random sample of two hundred candidate pairs was given to two human evaluators, bilingual in English and Arabic, who were asked (independently) to answer the question, for each pair, "Is this pair of pages intended to show the same material to two different users, one a reader of English and the other a reader of Arabic?" The judges' answers showed a Cohen's κ agreement of 0.6955, which is generally considered fair to good reliability.

    Fold      precision  recall
    Fold 1    0.9111     0.9229
    Fold 2    0.9302     0.9351
    Fold 3    0.9565     0.8818
    Average   0.9326     0.9133

Table 3: English-Arabic structural classification results. Precision and recall are extrapolated to the entire set of 8,294 candidate pairs.
(Qualitatively, one judge was rather more strict than the other; when the stricter judge identified a page pair as valid translations, the less strict judge virtually always agreed.)

5.2 Evaluating Structure-Based Matching

Taking the set of 149 labeled pairs on which the two judges agreed (134 were marked good, 15 bad), we carried out an evaluation of the full candidate set similar to the one for English-French discussed in Section 2.2.2. As before, this was a threefold cross-validation experiment in which decision tree classifiers were tuned on the features extracted for each candidate pair by structure-based classification. In addition to the four structural scores, we included two language identification confidence scores (one for the English page, one for the Arabic page); these were available as part of the Internet Archive's bookkeeping information for each URL and required no additional computation on our part. Table 3 shows extrapolated precision and recall of each fold's classifier applied to the entire set of 8,294 candidate page pairs.

The value of the parameter-tuning process is dramatically confirmed by comparing the learned parameters with STRAND's default parameters (manually determined by Resnik (1999)). Precision figures for the three folds were similar on average but far less consistent (1.0 for the first two folds, 0.6667 for the third), and none of the three folds achieved recall above 0.12.

Upon inspection, we discovered that nearly 5,000 of the pairs in our candidate set were from a single domain, maktoob.com. This site supports an online marketplace, and many of the pages discovered by our search were dedicated to specific merchandise categories within that service; a large portion of these were simply "no items available" and one or two similar messages.
We ignored this domain completely in order to be conservative about the yield of page pairs (though we note that many of the pages within it are legitimate parallel text that could be extracted if a duplicate filter were applied).[23]

[23] One of our human evaluators confirmed that no other domains appeared to significantly dominate the candidate pairs as did maktoob.com, providing some assurance that the rest of the data are diverse.

In order to construct a final classifier, we trained a decision tree on all 200 of our manually judged examples. This was then applied to the candidate pairs, producing a set of 1,741 HTML document pairs judged to be valid translations of each other. Converting from HTML to plain text and tokenizing, the English documents in this corpus total approximately 818,102 tokens, with an average of 470 tokens per document; the Arabic side contains 1,025,461 tokens, averaging 569 tokens per document.[24]

5.3 Combining Structural and Content-Based Matching

We combined the structural and content-based approaches to detecting translations by adding the tsim score to the set of structural features associated with each candidate pair, and then training a new decision tree classifier.[25]

Because Arabic is a highly inflected language with many surface forms, we found it necessary to use morphological preprocessing in order to make effective use of a dictionary. For English, we tokenized the text and used the WordNet lemmatizer to strip suffixes. The Arabic texts were tokenized at punctuation, then Romanized and converted to root forms using a morphological analysis tool (Darwish, 2002). This approximately halved the vocabulary size for the Arabic texts (from 89,047 types to 48,212 types). The translation lexicon used to compute tsim contained 52,211 entries, each containing one English lemma and one Arabic root. Of these, 16,944 contained two items that were both present in the candidate set of 8,295 Web page pairs.[26] The approximations discussed in Section 3.2.2 were used: competitive linking on the first 500 words in each document was used to compute the score.
Carrying out the same cross-validation experiment (on the same random split of data), the combined structural and content-based classifier produced the results in Table 4; these are extrapolated to the full set of 8,294 candidate pairs. Averaged over the three folds, the classifier achieved 95.06% precision and 98.95% recall (1.8% and 7.62% better than without tsim, respectively). We repeat the average using only STRAND's structural features for reference.

[24] We converted HTML to text using the lynx browser, performed cleanups such as removing references, and tokenized using the tokenizers included with the Egypt statistical MT package (Al-Onaizan et al., 1999). That tokenizer is somewhat aggressive about separating out punctuation, so, being aggressive in the opposite direction, we also tried counting only tokens containing at least one of [A-Za-z] (which excludes punctuation as well as dates, percentages, etc.). Using that very conservative counting method, the size of the English side is approximately 612,673 words. Counting only tokens on the Arabic side which contained at least one non-numeric, non-punctuation character yielded 914,815 words.
[25] We also performed the experiment using content-based matching alone. However, as noted earlier, tsim's strength is recall, and on this task baseline performance (marking all candidate pairs as good translations) gives 90% precision. Not surprisingly, the performance of the tsim-only classifier did not exceed the baseline.
[26] This translation lexicon was used with the kind permission of Kareem Darwish.

    Classifier                  precision  recall
    Average (structure only)    0.9326     0.9133
    Fold 1                      0.9167     1.0000
    Fold 2                      0.9767     0.9686
    Fold 3                      0.9583     1.0000
    Average                     0.9506     0.9895

Table 4: English-Arabic combined structural/content-based classification results. Precision and recall are extrapolated to the entire set of 8,294 candidate pairs.
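The threefold cross-validation protocol behind Tables 3 and 4 can be sketched as follows. This is a simplified illustration: a one-node decision stump stands in for the decision-tree learner actually used (the text does not name the package here), folds are assigned round-robin, and the feature vectors are hypothetical.

```python
def stump_fit(X, y):
    """Learn a one-node decision stump (a stand-in for the decision
    trees used in the paper): pick the feature index and threshold
    that maximize training accuracy."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            acc = sum((row[j] >= t) == bool(lab) for row, lab in zip(X, y))
            if best is None or acc > best[0]:
                best = (acc, j, t)
    _, j, t = best
    return lambda row: row[j] >= t

def cross_validate(X, y, folds=3):
    """Threefold cross-validation: train on two folds, then report
    (precision, recall) on the held-out fold."""
    results = []
    for k in range(folds):
        train = [(r, l) for i, (r, l) in enumerate(zip(X, y)) if i % folds != k]
        test = [(r, l) for i, (r, l) in enumerate(zip(X, y)) if i % folds == k]
        clf = stump_fit([r for r, _ in train], [l for _, l in train])
        tp = sum(clf(r) and l for r, l in test)
        fp = sum(clf(r) and not l for r, l in test)
        fn = sum(not clf(r) and l for r, l in test)
        results.append((tp / (tp + fp) if tp + fp else 1.0,
                        tp / (tp + fn) if tp + fn else 1.0))
    return results
```

In the actual experiments, each feature vector would hold the four structural scores, the two language-ID confidences, and (for Section 5.3) the tsim score.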
    Tokenization method                                English tokens  Arabic tokens
    English lemmas, Arabic roots                       1,097,674       1,107,600
    Egypt tokenizers (Al-Onaizan et al., 1999)         1,170,360       1,512,959
    Egypt tokenizers, ignoring words without letters   922,541         1,388,953

Table 5: Yield: the English-Arabic Internet Archive corpus, tokenized several ways.

After building a single classifier on all 149 test pairs (the set on which both human judges agreed), we re-classified the entire candidate set. Again ignoring pages from the maktoob.com domain, 2,190 pairs were marked as translations. Table 5 shows word counts for various tokenization schemes: the morphological analysis used for computing tsim, the Egypt tokenizer (which is highly aggressive), and counting only tokens with some alphabetic character from the Egypt tokenizer (a conservative approximation).

To summarize the results, using the content-based similarity score as a feature not only improved precision, it increased the size of the corpus (in words) by 43–52%, depending on the tokenization scheme. To our knowledge, no bilingual text corpus of this magnitude is yet available to the general research community for this language pair. Even if such resources existed for this particular pair, our success with English-Arabic represents a proof of concept for locating bilingual text for other poorly represented language pairs, especially considering the fact that only a small portion of the Internet Archive was searched on this run. Moreover, it bodes well for locating very large quantities of Web text for English paired with better-represented languages.

6 Conclusions

Although efforts at discovering parallel text on the Web were first reported in 1998, Web-based parallel corpora appear to have had only a limited impact on the community. Three reasons for this suggest themselves.

Too few languages. Parallel text from the Web has been made available to the community in only a few pairs of languages.
As of this writing, the STRAND Web site,[27] containing URL pairs discovered via STRAND runs, contains collections only for English-French, English-Chinese, and English-Basque, and we are not aware of any other efforts to publicly disseminate Web-based parallel data. Up to this point, it simply has not been easy to search the Web for parallel text in new language pairs. The most difficult part is finding the candidates: a year or two ago, we attempted to apply the original Web-based STRAND to the problem of finding English-Arabic text, and we were simply unable to locate enough search engine hits or sites to yield useful results.

Too little data. Very large Web-based parallel text collections are not available to the community. The largest appear to have been obtained by Chen and Nie (2000), who acquired collections for English-French and English-Chinese on the order of 15,000 document pairs using the PTMiner system. However, these collections have not been made available. In contrast, the STRAND collections, which are available to the community in the form of URL pairs, are modest in size: the English-Chinese collection contains fewer than 3,500 document pairs, and English-French fewer than 2,500.

Difficulty with dissemination. Web-based collections are difficult to distribute. Standard mechanisms of the sort used by the LDC (a CD or a downloadable file) are fraught with difficult legal issues, since, technically speaking, redistributing the actual content of Web pages could require permission from the author of every page. For example, presumably as a risk-reduction strategy, the Web track for TREC-2002 (Text REtrieval Conference) limits its attention to the .gov domain and requires the recipient of the data to sign a form that reduces the distributor's liability.[28]
Similarly, the Google Programming Contest dataset arrives with a limited-use license, indemnification from third-party claims, and a collection limited to the .edu domain, from which, presumably, authors are less likely to bring expensive lawsuits.[29]

A possible fourth reason may have to do with questions about the utility of the data. For example, a Web-based parallel collection may be unpredictable in terms of its coverage, and the community is well aware of the dangers of using training data that is not representative of the test domain. A solution to this problem might be to extract topically relevant subsets of the collection for particular domains or applications, but of course this requires a "more is better" approach in order to obtain subsets that are large enough to be useful.

[27] http://umiacs.umd.edu/resnik/strand/
[28] "The limitation on permitted use ... is intended to reduce the risk of any action being brought by copyright owners, but if this happens the Organisation [recipient] agrees to bear all associated liability" (http://www.ted.cmis.csiro.au/TRECWeb/access_to_data.html).
[29] http://www.google.com/programming-contest/

The work reported in this report addresses each of these major problems. With respect to the number of language pairs, the Internet Archive offers us the largest possible sample of pages on the Web, and our techniques make it easy to explore that collection in an efficient way. Although it is probably impossible to crawl more than a small fraction of the Web, the Internet Archive is storing the results of commercial-scale Web crawling and has as its explicit mission the permanent storage of everything that can be found. The fact that we were able to find a substantial quantity of English-Arabic text (on the order of a million words per side, looking at less than a sixth of today's Archive) offers the hope that it will be possible to find data for the less well-represented language pairs, if and when those data actually exist.
Moreover, the final implementation we described here retains the almost entirely language-independent character of the original STRAND system, adding only the requirement of a reasonable translation lexicon. Therefore success in mining for parallel text in other languages depends primarily on whether the data exist on the Archive.

With regard to corpus size, we demonstrated that the recall of structural matching, and hence its yield, can be significantly improved by simple and automatic classifier construction, requiring only a few hours' work from a bilingual annotator to create the training material. These results are further improved by adding content-based similarity as a feature. The success with English-Arabic, a language that is not one of those usually considered well represented on the Web, encourages us to believe that for other languages of interest we will be similarly successful. We have also done a bit of exploration to gauge the potential of the Archive for better-represented language pairs, using English-Chinese as an example. By way of context, Chen and Nie (2000) reported that PTMiner found around 15,000 English-Chinese document pairs by crawling 185 sites in the .hk (Hong Kong) domain, with the run taking about a week. We did a STRAND search of the two Internet Archive crawls used in the English-Arabic study, seeking English-Chinese parallel text in multiple domains where Chinese is a dominant language (e.g. .hk, .tw, .cn). Our initial candidate pair set was generated in approximately 30 hours, and contains over 70,000 candidate page pairs. It is difficult to generalize from the English-Arabic experiment, but one cannot help but conclude that even with a fraction of the yield we obtained in the previous experiment, especially broadening out to include the other five sixths of the Archive, we should be able to obtain interestingly large quantities of English-Chinese text.
In terms of dissemination, the STRAND distribution mechanism models itself after Web search engines, distributing the URLs rather than the pages themselves. This places the legal burden on individual users, who are safe under "fair use" provisions if they download pages for their individual use. Until recently the difficulty with this solution has been that the collection of URLs deteriorates over time as sites disappear, pages are reorganized, and underlying content changes; for example, in April 2002, we attempted to download the documents in the STRAND English-French, English-Chinese, and English-Basque collections, and we were able to successfully access only around 67%, 43%, and 40% of the URL pairs, respectively. However, the Internet Archive's Wayback Machine provides a way to distribute persistent URLs.

The quality of the data is a question that will need to be addressed by efforts to actually use those data in natural language applications. It is worth pointing out that, because STRAND expects pages to be very similar in structural terms, the resulting document collections are particularly amenable to sentence- or segment-level alignment. Indeed, just using dynamic programming to align the markup, ignoring the text, produces reasonable first-pass alignments of the intervening text as a side effect. We are currently adapting statistical text-based sentence alignment techniques to take advantage of the markup available in Web-based document pairs.

Finally, our subjective impression is that parallel texts mined from the Web tend to contain a reasonable percentage of high-quality translation, especially since pages often come from government or commercial sites. The underlying content, of course, is as rich and diverse as the Web itself, and what we as researchers can do with it is an exciting question that remains to be answered.
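The first-pass markup alignment mentioned above is essentially minimum edit distance over the two pages' tag sequences. A sketch of that idea (our reconstruction, not the authors' implementation; uniform edit costs are an assumption):

```python
def align_tags(a, b):
    """Align two HTML tag sequences by minimum edit distance, returning
    aligned pairs with None marking an insertion or deletion. Text
    between matched tag pairs then falls out as first-pass segments."""
    n, m = len(a), len(b)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # delete from a
                          d[i][j - 1] + 1,          # insert from b
                          d[i - 1][j - 1] + cost)   # match/substitute
    pairs, i, j = [], n, m                          # trace back
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                d[i][j] == d[i - 1][j - 1] + (0 if a[i - 1] == b[j - 1] else 1)):
            pairs.append((a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            pairs.append((a[i - 1], None))
            i -= 1
        else:
            pairs.append((None, b[j - 1]))
            j -= 1
    return pairs[::-1]
```

For structurally similar pages, most tags align one-to-one, with the occasional unmatched tag (e.g. an extra `<b>` on one side) absorbed as an insertion.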
7 Acknowledgements

This work has been supported in part by Department of Defense Contract RD-02-5700, DARPA/ITO Cooperative Agreement N660010028910, and ONR MURI Contract FCPO.810548265. The second author is supported by a Fannie and John Hertz Foundation Fellowship. We would like to thank Nizar Habash, Kareem Darwish, Mona Diab, Ashraf Mohamed, and Usama Soltan for their assistance with Arabic, and Jason Eisner, Rebecca Hwa, and Doug Oard for helpful conversations. We would also like to thank Sang Hong for his assistance with implementation, and Brewster Kahle, Andy Jewell, Jad DeFanti, and Paula Keezer for permitting and facilitating our use of the Internet Archive.

References

Ahuja, Ravindra K., Thomas L. Magnanti, and James B. Orlin. 1993. Network Flows: Theory, Algorithms, and Applications. Prentice Hall, Englewood Cliffs, NJ.

Al-Onaizan, Yaser, Jan Curin, Michael Jahr, Kevin Knight, John Lafferty, I. Dan Melamed, Franz-Josef Och, David Purdy, Noah A. Smith, and David Yarowsky. 1999. Statistical machine translation. Technical report, JHU. citeseer.nj.nec.com/al-onaizan99statistical.html.

Broder, Andrei Z., Steven C. Glassman, Mark S. Manasse, and Geoffrey Zweig. 1997. Syntactic clustering of the Web. In Sixth International World-Wide Web Conference, Santa Clara, CA, April.

Brown, P., J. Cocke, S. Della Pietra, V. Della Pietra, F. Jelinek, R. Mercer, and P. Roossin. 1990. A statistical approach to machine translation. Computational Linguistics, 16(2):79-85.

Cabezas, Clara, Bonnie Dorr, and Philip Resnik. 2001. Spanish language processing at University of Maryland: Building infrastructure for multilingual applications. In Proceedings of the Second International Workshop on Spanish Language Processing and Language Technologies (SLPLT-2).

Chen, Jiang and Jian-Yun Nie. 2000. Web parallel text mining for Chinese-English cross-language information retrieval. In International Conference on Chinese Language Computing, Chicago, Illinois.

Church, Kenneth W. and Robert Mercer.
1993. Introduction to the special issue on computational linguistics using large corpora. Computational Linguistics, 19(1):1-24.

Darwish, Kareem. 2002. Building a shallow Arabic morphological analyser in one day. In Workshop on Computational Approaches to Semitic Languages, Philadelphia, July.

Davis, Mark and Ted Dunning. 1995. A TREC evaluation of query translation methods for multi-lingual text retrieval. In Fourth Text Retrieval Conference (TREC-4). NIST.

Diab, Mona and Philip Resnik. 2002. An unsupervised method for word sense tagging using parallel corpora. In 40th Anniversary Meeting of the Association for Computational Linguistics (ACL-02), Philadelphia, July.

Dunning, Ted. 1994. Statistical identification of language. Computing Research Laboratory Technical Memo MCCS 94-273, New Mexico State University, Las Cruces, New Mexico.

Gale, William A. and Kenneth W. Church. 1991. Identifying word correspondences in parallel texts. In Fourth DARPA Workshop on Speech and Natural Language, Asilomar, California, February.

Hunt, J. W. and M. D. McIlroy. 1975. An algorithm for differential file comparison. Technical Memorandum 75-1271-11, Bell Laboratories, October.

Hwa, Rebecca, Philip Resnik, Amy Weinberg, and Okan Kolak. 2002. Evaluating translational correspondence using annotation projection. In 40th Anniversary Meeting of the Association for Computational Linguistics (ACL-02), Philadelphia, July.

Landauer, Thomas K. and Michael L. Littman. 1990. Fully automatic cross-language document retrieval using latent semantic indexing. In Proceedings of the Sixth Annual Conference of the UW Centre for the New Oxford English Dictionary and Text Research, pages 31-38, UW Centre for the New OED and Text Research, Waterloo, Ontario, October.

Lin, Dekang. 1998. An information-theoretic definition of similarity. In Fifteenth International Conference on Machine Learning, Madison, WI, July.

Lopez, Adam, Michael Nossal, Rebecca Hwa, and Philip Resnik. 2002.
Word-level alignment for multilingual resource acquisition. In Workshop on Linguistic Knowledge Acquisition and Representation: Bootstrapping Annotated Language Data, at the Third International Conference on Language Resources and Evaluation (LREC-2002), Las Palmas, Canary Islands, Spain, June.

Ma, Xiaoyi and Mark Liberman. 1999. BITS: A method for bilingual text search over the Web. In Machine Translation Summit VII, September. http://www.ldc.upenn.edu/Papers/MTSVII1999/BITS.ps.

Melamed, I. Dan. 1997. Automatic discovery of non-compositional compounds in parallel data. In Proceedings of the 2nd Conference on Empirical Methods in Natural Language Processing (EMNLP-97), Brown University, August.

Melamed, I. Dan. 2000. Models of translational equivalence among words. Computational Linguistics, 26(2):221-249, June.

Oard, Douglas W. 1997. Cross-language text retrieval research in the USA. In Third DELOS Workshop. European Research Consortium for Informatics and Mathematics, March.

Resnik, Philip. 1998. Parallel strands: A preliminary investigation into mining the Web for bilingual text. In Proceedings of the Third Conference of the Association for Machine Translation in the Americas, AMTA-98, Lecture Notes in Artificial Intelligence 1529, Langhorne, PA, October 28-31.

Resnik, Philip. 1999. Mining the Web for bilingual text. In Proceedings of the 37th Annual Meeting of the ACL, June.

Resnik, Philip and I. Dan Melamed. 1997. Semi-automatic acquisition of domain-specific translation lexicons. In Fifth Conference on Applied Natural Language Processing, Washington, D.C.

Resnik, Philip, Mari Broman Olsen, and Mona Diab. 1999. The Bible as a parallel corpus: Annotating the `Book of 2000 Tongues'. Computers and the Humanities, 33:129-153.

Resnik, Philip and Noah A. Smith. 2002. Getting serious about "more is better": mining the Internet Archive for bilingual text. In review.

Riloff, Ellen, Charles Schafer, and David Yarowsky. 2002.
Inducing information extraction systems for new languages via cross-language projection. In Nineteenth International Conference on Computational Linguistics (COLING-2002), Taipei, Taiwan, August.

Smith, Noah A. 2001. Detection of translational equivalence. Undergraduate honors thesis, University of Maryland College Park. http://nlp.cs.jhu.edu/nasmith/cmsc-thesis.ps.

Smith, Noah A. 2002. From matching words to matching corpora: recognizing translation. In review.

Tiedemann, Jörg. 1999. Automatic construction of weighted string similarity measures. In Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, College Park, MD, June.

Yarowsky, David and Grace Ngai. 2001. Inducing multilingual POS taggers and NP bracketers via robust projection across aligned corpora. In Second Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL-2001), Pittsburgh, PA, June.

Yarowsky, David, Grace Ngai, and Richard Wicentowski. 2001. Inducing multilingual text analysis tools via robust projection across aligned corpora. In Proceedings of HLT 2001, First International Conference on Human Language Technology Research.