The Web as a Parallel Corpus

Philip Resnik* and Noah A. Smith†
Abstract

Parallel corpora have become an essential resource for work in multilingual
natural language processing. In this report, we describe our work using the
STRAND system for mining parallel text on the World Wide Web, first
reviewing the original algorithm and results and then presenting a set of
significant enhancements. These enhancements include the use of supervised
learning based on structural features of documents to improve classification
performance, a new content-based measure of translational equivalence, and
adaptation of the system to take advantage of the Internet Archive for mining
parallel text from the Web on a large scale. Finally, the value of these
techniques is demonstrated in the construction of a significant parallel corpus
for a low-density language pair.
1 Introduction
Parallel corpora (bodies of text in parallel translation, also known as bitexts)
have taken on an important role in machine translation and multilingual
natural language processing. They represent resources for automatic lexical
acquisition (e.g. Gale and Church (1991); Melamed (1997)), they provide in-
dispensable training data for statistical translation models (e.g. Brown et al.
(1990); Melamed (2000)), and they can provide the connection between vocab-
ularies in cross-language information retrieval (e.g. Davis and Dunning (1995);
Landauer and Littman (1990); see also Oard (1997)). More recently, researchers
at Johns Hopkins University and the University of Maryland have been explor-
ing new ways to exploit parallel corpora in order to develop monolingual re-
sources and tools, using a process of annotation, projection, and training: given
a parallel corpus of English with a less resource-rich language, we project En-
glish annotations across a parallel corpus to a second language, using word-level
alignments as the bridge, and then use robust statistical techniques in learning
from the resulting noisy annotations (Cabezas, Dorr, and Resnik, 2001; Diab
*Department of Linguistics and Institute for Advanced Computer Studies, University of
Maryland, College Park, MD 20742. Email: resnik@umiacs.umd.edu
†Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218.
Email: nasmith@cs.jhu.edu
and Resnik, 2002; Hwa et al., 2002; Lopez et al., 2002; Yarowsky, Ngai, and
Wicentowski, 2001; Yarowsky and Ngai, 2001; Riloff, Schafer, and Yarowsky,
2002).
For these reasons, parallel corpora can be thought of as a critical resource.
Unfortunately, they are not readily available in the necessary quantities. Un-
til very recently, for example, statistical work in machine translation focused
heavily on French-English translation because the Canadian parliamentary pro-
ceedings (Hansards) in English and French were the only large bitext available.
Things have improved somewhat, but it is still fair to say that for all but rela-
tively few language pairs, parallel corpora tend to be accessible only in special-
ized forms such as United Nations proceedings,[1] religious texts (Resnik, Olsen,
and Diab, 1999), localized versions of software manuals (Resnik and Melamed,
1997), and the like. Even for the top handful of majority languages, the available
parallel corpora tend to be unbalanced, representing primarily governmental or
newswire-style texts. In addition, like other language resources, parallel corpora
are often encumbered by fees or licensing restrictions. For all these reasons, it is
difficult to follow the "more data are better data" advice of Church and Mercer
(1993), abandoning balance in favor of volume, with respect to parallel text.
And then there is the Web. People tend to see the Web as a reflection
of their own way of viewing the world: as a huge semantic network, or an
enormous historical archive, or a grand social experiment. We are no different:
as computational linguists working on multilingual issues, we view the Web as
a great big body of text waiting to be mined, a huge fabric of linguistic data
interwoven with parallel threads.
This report describes our techniques for mining the Web in order to extract
the parallel text it contains. It presents, in revised and considerably extended
form, our early work on mining the Web for bilingual text (STRAND; Resnik,
1998; Resnik, 1999), incorporating new work on content-based detection of
translations (Smith, 2001; Smith, 2002) and efficient exploitation of the Inter-
net Archive (Resnik and Smith, 2002). In Section 2 we lay out the STRAND
architecture, which is based on the insight that translated Web pages tend quite
strongly to exhibit parallel structure, permitting their exploitation even without
looking at content; we also show how we have improved STRAND's performance
by training a supervised classifier using structural parameters rather than rely-
ing on manually tuned thresholds. In Section 3 we present an approach to de-
tecting translations that relies entirely on content rather than structure, demon-
strating performance comparable to STRAND's using this orthogonal source of
information. In Section 4 we describe how we have adapted the STRAND ap-
proach to the Internet Archive, dramatically improving our ability to identify
parallel Web pages on a large scale. Section 5 puts all the pieces together, using
structural and combined content-structure matching of pages on the Internet
Archive in order to obtain a sizable corpus of English-Arabic Web document

[1] E.g., via LDC, http://www.ldc.upenn.edu.
Figure 1: Example of a candidate pair.
pairs. In Section 6 we present our conclusions and thoughts on future work.
2 The STRAND Web-Mining Architecture

STRAND (Resnik, 1998; Resnik, 1999) is an architecture for structural translation
recognition, acquiring natural data. Its goal is to identify naturally occurring
pairs of Web pages in parallel translation. In order to do this, it exploits an ob-
servation about the way that Web page authors disseminate information in mul-
tiple languages: when presenting the same content in two different languages,
authors exhibit a very strong tendency to use the same document structure (e.g.
Figure 1). STRAND therefore locates pages that might be translations of each
other, via a number of different strategies, and filters out page pairs where the
page structures diverge by too much.

In this section we describe how STRAND works, and we also discuss several
related Web mining methods, focusing on the overall architecture these sys-
tems have in common and the important system-specific variations. We then
show how tuning STRAND's structural parameters using supervised training
can significantly increase its performance.
2.1 STRAND

Finding parallel text on the Web consists of three main stages:

- Location of pages that might have parallel translations
- Generation of candidate pairs that might be translations
- Structural filtering-out of non-translation candidate pairs

We consider each of these in turn.
2.1.1 Locating Pages.

The original STRAND architecture accomplished the first step by using the
AltaVista search engine (www.av.com) to search for two types of Web pages. A
parent page is one that contains hypertext links to different-language versions
of a document; e.g. if we were looking for English and French bitexts, the page
at the left in Figure 2 would lead us to one such candidate pair.

A sibling page is a page in one language that itself contains a link to a version
of the same page in another language; e.g. the page at the right of Figure 2
contains a link at the top that says "English".
More recent versions of STRAND (unpublished) added a "spider" compo-
nent for locating pages that might have translations. Given a list of Web sites
thought to contain bilingual text for a given language pair, it is possible to
download all the pages on the site, any of which might have a translation on
that site. Although simple to implement, this method of locating pages shifts
the burden of narrowing down the possibilities to the process of generating
candidate document pairs.
2.1.2 Generating Candidate Pairs.

Pairing up potentially translated pages is simple when a search engine has been
used to generate parent or sibling pages: one simply pairs together the two child
pages linked to by the parent, or the sibling page together with the page it links
to.

When all the pages on a site are under consideration, the process is rather
different. The simplest possibility is to separate the pages on a site into the
two languages of interest using automatic language identification (e.g. Dunning
(1994)), throwing away any pages that are not in either language, and then
generate the cross product. This potentially leads to a very large number of
candidate page pairs, and for most of them there is no particular reason to be-
lieve they are parallel translations, other than the fact that they appear on the
same Web site. The "spider" component of STRAND adds a URL-matching
stage, exploiting the fact that the directory structure on many Web sites re-
flects parallel organization when pages are translations of each other. Matching
is done by manually creating a list of substitution rules (e.g. english → big5),
and, for each English URL, applying all possible rules to generate URLs that
might appear on the list of pages for the other language. If such a URL
Figure 2: Examples of a parent page and sibling page used to find candidate
document pairs.
is found, the pair with similar URLs is added to the list of candidate doc-
ument pairs. For example, if an English-Chinese site contains a page with
URL http://mysite.com/english/home_en.html, one combination of substitu-
tions might produce the URL http://mysite.com/big5/home_ch.html, which is
probably worth considering as a likely candidate pair.[2] Owing to the combina-
torics (an exponential number of possible substitutions), only a fixed number
of substitution combinations could be tried per English URL; however, in Sec-
tion 4.3 we describe a more scalable URL-matching algorithm.
Another possible criterion for matching is the use of document lengths. Texts
that are translations of each other tend to be similar in length, and it is rea-
sonable to assume length(E) ≈ f(length(F)) where f is a linear function whose
parameters are tuned for languages E and F. The use of a document-length
filter is described in Smith (2001), where such a filter is shown, at the sentence
level, to reduce the size of the search space exponentially in the confidence (p
in a (1 − p)-confidence interval for a linear regression model) with only linear
loss of good pairs.
[2] Big5 is the name of a commonly used character encoding for Chinese.
2.1.3 Structural Filtering.

The heart of STRAND is a structural filtering process that relies on analysis
of the pages' underlying HTML to determine a set of pair-specific structural
values, and then uses those values to decide whether the pages are translations
of each other.

The first step in this process is to linearize the HTML structure and ignore
the actual linguistic content of the documents. Both documents in the candidate
pair are run through a markup analyzer that acts as a transducer, producing a
linear sequence containing three kinds of token:

[Chunk:length]          e.g. [Chunk:174]
[START:element_label]   e.g. [START:A], [START:LI]
[END:element_label]     e.g. [END:A]
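A minimal version of such a transducer can be written with Python's standard-library HTML parser. This is a sketch, not STRAND's markup analyzer: real Web pages require handling of ill-formed markup, entities, and whitespace conventions that are glossed over here.

```python
from html.parser import HTMLParser

class Linearizer(HTMLParser):
    """Transduce HTML into the [START:...], [END:...], [Chunk:n] token
    sequence described above, discarding the linguistic content itself."""
    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(f"[START:{tag.upper()}]")

    def handle_endtag(self, tag):
        self.tokens.append(f"[END:{tag.upper()}]")

    def handle_data(self, data):
        text = data.strip()
        if text:  # record only the length of non-empty text chunks
            self.tokens.append(f"[Chunk:{len(text)}]")

def linearize(html):
    parser = Linearizer()
    parser.feed(html)
    return parser.tokens
```

For instance, `linearize("<html><title>Emergency Exit</title></html>")` yields `["[START:HTML]", "[START:TITLE]", "[Chunk:14]", "[END:TITLE]", "[END:HTML]"]`.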
The second step is to align the linearized sequences using a standard dynamic
programming technique (Hunt and McIlroy, 1975). For example, consider two
documents that begin as follows:

Emergency Exit               Sortie de Secours
Emergency Exit               Si vous êtes assis à
If seated at an exit and     côté d'une...
...                          ...
The aligned linearized sequence would be as follows:

[START:HTML]     [START:HTML]
[START:TITLE]    [START:TITLE]
[Chunk:12]       [Chunk:15]
[END:TITLE]      [END:TITLE]
[START:BODY]     [START:BODY]
[START:H1]
[Chunk:12]
[END:H1]
[Chunk:112]      [Chunk:122]
Using this alignment, we compute four scalar values that characterize the
quality of the alignment:

dp  The difference percentage, indicating non-shared material, i.e. alignment
    tokens that are in one linearized file but not the other.

n   The number of aligned non-markup text chunks of unequal length.

r   The correlation of lengths of the aligned non-markup chunks.

p   The significance level of the correlation r.
The difference percentage (dp) quantifies the extent to which there are mis-
matches in the alignment, i.e. sequence tokens on one side that have no corre-
sponding token on the other side. In the example above, one document contains
an H1 header that is missing from the second document. Large numbers of such
mismatches can indicate that the two documents do not present the same ma-
terial to a great enough extent to be considered translations. This can happen,
for example, when two documents are translations up to a point, e.g. an intro-
duction, but one document goes on to include a great deal more content than
the other. Even more frequently, the proportion is high when two documents
are prima facie bad candidates for a translation pair.

The number of aligned non-markup text chunks (n) helps characterize the
quality of the alignment. The dynamic programming algorithm tries to optimize
the correspondence of identical tokens, which represent markup.[3] As a side-
effect, the non-markup text chunks are placed in correspondence with each
other (e.g. the "Emergency Exit" and "Sortie de Secours" chunks in the above
example). The more such pairings are found, the more likely the candidate
documents are to represent a valid translation pair.
The remaining two parameters (r and p) quantify the extent to which the
corresponding non-markup chunks are correlated in length. When two docu-
ments are aligned with each other and are valid translations, there is a reliably
linear relationship in the length of translated chunks of text: short pieces
correspond with short pieces, medium with medium, and long with long. The
Pearson correlation coefficient r for the lengths will be closer to 1 when the
alignment has succeeded in lining up translated pieces of text, and the p value
quantifies the reliability of the correlation; e.g. the standard threshold of p < .05
indicates 95% confidence that the correlation was not obtained by chance.

In our original work, we used fixed thresholds, determined manually by in-
spection of development (non-test) data for English-Spanish, to decide whether
a candidate pair should be kept or filtered out. Thresholds of dp < 20% and
p < 0.05 were used.
2.2 STRAND Results

Like most search tasks, performance at finding parallel Web pages can be eval-
uated using standard measures of precision and recall, and by combining those
figures using the F-measure. It is not possible for us to measure recall relative
to the entire set of document pairs that should have been found; this would
require exhaustive evaluation using the entire Web, or pooling of results from
a large number of different systems as done in the TREC information retrieval
evaluations. Therefore, recall in this setting is measured relative to the set of
candidate pairs that was generated.

[3] "Non-markup" tokens with exactly the same length almost always turn out to be pieces
of identical markup, e.g. key=value pairs within HTML tags.
Since the "truth" in this task is a matter for human judgment, we rely
on bilingual speakers to independently judge whether page pairs are actually
translations of each other for any given test set. In our experience no bilingual
speaker is 100% comfortable saying that another person's translation is a good
translation, so in creating the gold standard we instead ask, "Was this pair of
pages intended to provide the same content in the two different languages?"
Asking the question in this way leads to high rates of inter-judge agreement, as
measured using Cohen's κ statistic.
2.2.1 Using Manually-Set Parameters.

Using the manually set thresholds for dp and p, we have obtained 100% preci-
sion and 64.1% recall (extrapolated to include a set of unjudged items) in an
experiment using STRAND to find English-French Web pages (Resnik, 1999).
A modified version of STRAND was used to obtain English-Chinese pairs (see
related work, below), and in a similar formal evaluation, we found that the re-
sulting set had 98% precision and 61% recall for Chinese.[4] Both these results are
consistent with our preliminary findings for English-Spanish using a less rigor-
ous evaluation (using the judgments of the first author rather than independent
bilingual evaluators) and a very small test set, where precision was near ceiling
and recall was in the vicinity of 60% (Resnik, 1998).
2.2.2 Optimizing Parameters using Machine Learning.

Based on experiments with several language pairs, it appears that STRAND's
structure-based filter consistently throws out a bit more than one third of the
candidate document pairs it has found in order to maintain its precision in
the 98-100% range. It does so by respecting parameter thresholds that were
determined manually using English-Spanish development data; the same pa-
rameters seem to have worked reasonably well not only for English-Spanish,
but also for English-French and English-Chinese pairs. It is possible, however,
that classification can be tuned for better performance. In order to investigate
this possibility, we took a machine-learning approach: we used the four struc-
tural values (dp, n, r, and p) as features characterizing each document pair,
and treated the problem as a binary decision task, using supervised learning to
make an attempt at better predicting human judgments.

Using the English-French data, we constructed a threefold cross-validation
experiment using decision tree induction to predict the class assigned by the
human judges. The decision tree software was C5.0.[5] Each fold had 87 test items
and 174 training items. Precision and recall computations include extrapolation
from the judged set to an additional 16,328 unjudged items; the results are

[4] http://umiacs.umd.edu/resnik/strand/
[5] Available at http://www.rulequest.com/demoeula.html.
            precision   recall
Untuned     1.000       0.641
Fold 1      0.913       0.834
Fold 2      1.000       0.943
Fold 3      1.000       0.724
Average     0.971       0.834

Table 1: Effects of parameter tuning.
reported in Table 1 together with baseline results from STRAND's untuned
classifier as reported above.

Without tuning, the manually-set parameters result in good document pairs
being discarded 36% of the time. Our cross-validation results indicate that
tuning the parameters cuts that figure by more than half: only 17% of the good
pairs will be discarded, at a cost of admitting 3 false positives from every 100
candidate pairs.
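The supervised-tuning idea can be illustrated without C5.0 itself. The sketch below learns a single-feature threshold classifier (a "decision stump") over the four structural features; a real decision tree composes many such splits, but the stump shows how a threshold like dp < 20% can be learned from labeled pairs rather than set by hand. The feature values in the test are invented for illustration.

```python
def best_stump(examples):
    """Learn a one-feature threshold classifier from labeled examples.
    Each example is (features, is_translation), where features is a dict
    with keys 'dp', 'n', 'r', 'p'. Returns the most accurate
    (accuracy, feature, threshold, positive_if_below) stump."""
    best = None
    for feat in ("dp", "n", "r", "p"):
        for thresh in sorted({feats[feat] for feats, _ in examples}):
            for below in (True, False):
                # Predict True when (value <= thresh) matches `below`.
                correct = sum(
                    1 for feats, label in examples
                    if ((feats[feat] <= thresh) == below) == label
                )
                acc = correct / len(examples)
                if best is None or acc > best[0]:
                    best = (acc, feat, thresh, below)
    return best
```

Trained on judged candidate pairs, the stump recovers a data-driven cutoff on whichever feature best separates translations from non-translations.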
2.3 Related Work

Several other systems for discovering parallel text, developed independently, can
be described as operating within the same three-stage framework as STRAND.

Parallel Text Miner (PTMiner; Chen and Nie (2000)) exploits already-existing
Web search engines to locate pages by querying for pages in a given language
that contain links to pages that are likely to be in the other language of in-
terest. Once bilingual sites are located, they are crawled exhaustively. In or-
der to generate candidate pairs, PTMiner uses a URL-matching process simi-
lar to the one described above; for example, the French translation of a URL
like http://www.foo.ca/english-index.html might be http://www.foo.ca/french-
index.html. Their matching process uses a mapping of language-specific prefixes
and suffixes, and does not handle the case where URL-matching requires multi-
ple substitutions. PTMiner also applies a length filter and automatic language
identification to verify that the pages are in the appropriate languages. Chen
and Nie report a 95%-precise English-French corpus of 118MB/135MB of text
and a 90%-precise English-Chinese corpus of 137MB/117MB of text, based on
inspection.
Bilingual Internet Text Search (BITS; Ma and Liberman (1999)) starts with
a given list of domains to search for parallel text. It operates by sampling pages
from each domain and identifying their languages; if a domain is deemed to be
multilingual, all pages on the site are crawled exhaustively. BITS appears to
consider all possible combinations of Web page pairs in the two languages (i.e.
the full cross product within each site), and filters out bad pairs by using a large
bilingual dictionary to compute a content-based similarity score and comparing
that score to a threshold. For each page pair, the similarity score is

    similarity(A, B) = (# translation token pairs) / (# tokens in A)    (1)

Translation token pairs are considered within a fixed window (i.e., a distance-
based measure of cooccurrence is used).[6] In addition to cross-lingual lexical
matching, BITS filters out candidate pairs that do not match well in terms of
file size, anchors (numbers, acronyms, and some named entities), or paragraph
counts. Using an English-German bilingual lexicon of 117,793 entries, Ma and
Liberman report 99.1% precision and 97.1% recall on a hand-picked set of 600
documents (half in each language) containing 240 translation pairs (as judged
by humans). This technique yielded a 63MB parallel corpus of English-German.
Other work on Web mining has been done by Jinxi Xu of BBN (personal com-
munication), who began with our STRAND implementation and added a mod-
ule for automatically learning string-substitution patterns for URLs, and also
implemented a different dynamic programming algorithm for assessing struc-
tural alignment. Xu used the modified STRAND to obtain 3376 Chinese-English
document pairs, which we evaluated formally (see above), determining that the
set has 98% precision and 61% recall.

In addition, STRAND has been re-implemented by David Martinez and
colleagues at Informatika Fakultatea in the Basque Country (personal commu-
nication), in order to perform exploratory experiments for discovering English-
Basque document pairs.

It is worth noting that STRAND, PTMiner, and BITS are all largely inde-
pendent of linguistic knowledge about the particular languages, and therefore
very easily ported to new language pairs. With the exception of the use of
a bilingual dictionary in BITS, these systems require, at most, a set of URL
substring patterns for the URL pattern-matching stage (e.g. big5 ↔ english in
the example above; see further discussion in Section 4.3), and a modest amount
of monolingual data for training n-gram based language identifiers (typically
50,000 to 100,000 characters of text per language).
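Such a character n-gram language identifier can be sketched in a few lines. This simplification compares profiles by raw overlap of their top n-grams rather than the rank-based "out-of-place" scoring used in standard n-gram identifiers; the training strings in the test are toy stand-ins for the 50,000-100,000 characters mentioned above.

```python
from collections import Counter

def ngram_profile(text, n=3, top=300):
    """Rank the most frequent character n-grams of a training text,
    the classic profile used for language identification."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(top)]

def identify(text, profiles):
    """Pick the language whose training profile overlaps most with the
    n-gram profile of the input text (a rank-free simplification)."""
    target = set(ngram_profile(text))
    return max(profiles, key=lambda lang: len(target & set(profiles[lang])))
```

Pages whose identified language is neither member of the pair of interest are discarded before candidate generation.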
Word-level translations are worth exploiting when they are available, as was
the case for the BITS system. In Section 3 we describe a bitext-matching process
using a content-based similarity score grounded in information theory, and in
Section 5 we show how structural and content-based criteria can be combined in
order to obtain performance superior to that obtained by either method alone.

[6] Many details of this technique are left unspecified in (Ma and Liberman, 1999), such as
the threshold for the similarity score, the distance threshold, and matching of non-one-word-
to-one-word entries in the dictionary.
X:  Maria  does  n't  like  fruit
Y:  Maria  n'  aime  pas  de  fruits

Figure 3: An example of two texts with links shown. There are seven link
tokens, five of which are lexical (non-NULL) in X (the English side), six in Y
(French).
3 Content-Based Matching

The approach discussed thus far relies heavily on document structure. However,
as Ma and Liberman (1999) point out, not all translators create translated pages
that look like the original page. Moreover, structure-based matching is appli-
cable only in corpora that include markup, and there are certainly multilingual
collections on the Web and elsewhere that contain parallel text without struc-
tural tags. Finally, other applications for translation detection exist, such as
sub-document text alignment and cross-lingual duplicate detection (i.e., loca-
tion of already-existing translations in a multilingual corpus). All these consid-
erations motivate an approach to matching of translations that pays attention
to similarity of content, whether or not similarities of structure exist.

We present here a generic score of translational similarity that is based upon
any word-to-word translation lexicon (hand-crafted or automatically generated,
or a combination, and possibly highly noisy). The technique is shown to perform
competitively with the structure-based approach of STRAND on the task of
identifying English-French document translations.
3.1 Quantifying Translational Similarity

We define a cross-language similarity score, tsim, for two texts by starting with a
generative, symmetric word-to-word model of parallel texts (Melamed's Method
A (2000)).[7] Let a link be a pair (x, y) where x is a word in language L1 and
y is a word in L2. The model consists of a bilingual dictionary that gives a
probability distribution p over all possible link types. Within a link, one of the
words may be NULL, but not both. In the generative process, a sequence of
independent link tokens is sampled from that distribution. The model does not
account for word order. An example of two texts with links is illustrated in
Figure 3.

Next, we desire to compute the probability of the most probable link se-
quence that could have accounted for the two texts.[8] The probability of a

[7] We use the term "text" to refer to a sequence of words of any length.
[8] Of course, all permutations of a given link sequence will have the same probability (since
the links are sampled independently from the same distribution), so the order of the sequence
is not important.
link sequence is simply the product of the p probabilities of the links it con-
tains. As noted by Melamed (2000), this problem of finding the best set of
links is the maximum-weighted bipartite matching problem (MWBM): given a
weighted bipartite graph G = (V1 ∪ V2, E) with edge weights c_{i,j} (i ∈ V1, j ∈ V2),
find a matching M ⊆ E such that each vertex has at most one edge in M and
Σ_{e ∈ M} c_{i,j} is maximized. The fastest known MWBM algorithm runs in
O(ve + v² log v) time (Ahuja, Magnanti, and Orlin, 1993). Applied to this prob-
lem, that is O(max(|X|, |Y|)³).
To use MWBM to find the most probable link sequence, let the L1 words
be V1 and the L2 words be V2. If two words x, y have p(x, y) > 0, an edge
exists between them with weight log p(x, y). Hence a sum of weights of links
in a matching will be the log-probability of the (unordered) link sequence, and
maximizing that sum maximizes the probability.
The similarity score should be high when many of the link tokens in the best
sequence do not involve NULL tokens. Further, it should normalize for text
length. Specifically, the score is:

    tsim = log Pr(two-word links in best matching) / log Pr(all links in best matching)    (2)

This score is an application of Lin's (1998) information-theoretic definition of
similarity. Starting with a set of axioms, Lin derives the measure

    sim(X, Y) = log Pr(common(X, Y)) / log Pr(description(X, Y))    (3)

where X and Y are any objects generated by a probabilistic model.

In this discussion, we seek to show how multiple linguistic resources can be
exploited together to recognize translation. Therefore, the measure is simplified
by assuming that all links in a given translation lexicon are equiprobable. This
reduces the formula for tsim to

    tsim = (# two-word links in best matching) / (# links in best matching)    (4)
Another reason to compute tsim under the equiprobability assumption is that we
need not compute the MWBM, but only find the maximum cardinality bipartite
matching (MCBM), since all potential links have the same weight. An O(e√v)
(or, for this purpose, O(|X||Y|√(|X|+|Y|))) algorithm exists for MCBM (Ahuja,
Magnanti, and Orlin, 1993). For example, if the matching shown in Figure 3 is
the MCBM (for some translation lexicon), then tsim(X, Y) = 4/7 under the
simplifying assumption.
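The computation of tsim under the equiprobability assumption can be sketched as below. For brevity this uses a simple augmenting-path matching algorithm (Kuhn's method, O(ve)) rather than the faster O(e√v) algorithm cited above; the lexicon is assumed to be a set of (x, y) word pairs, and every unmatched word on either side counts as a NULL link.

```python
def tsim(words_x, words_y, lexicon):
    """tsim (equation 4): two-word links in a maximum cardinality bipartite
    matching, divided by all links (matched pairs plus NULL links)."""
    adj = [[j for j, y in enumerate(words_y) if (x, y) in lexicon]
           for x in words_x]
    match_y = [-1] * len(words_y)  # match_y[j] = matched index in words_x

    def augment(i, seen):
        """Try to match word i of X, reassigning earlier links if needed."""
        for j in adj[i]:
            if j not in seen:
                seen.add(j)
                if match_y[j] == -1 or augment(match_y[j], seen):
                    match_y[j] = i
                    return True
        return False

    two_word = sum(1 for i in range(len(words_x)) if augment(i, set()))
    # Unmatched words on each side each contribute one NULL link.
    total = two_word + (len(words_x) - two_word) + (len(words_y) - two_word)
    return two_word / total
```

On the Figure 3 example, four lexical links over five English and six French words give tsim = 4 / (5 + 6 − 4) = 4/7.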
Melamed (2000) used a greedy approximation to MWBM called competi-
tive linking. Competitive linking iteratively selects the edge with the highest
weight, links those two vertices, then removes them from the graph. (Ties are
broken at random.) A heap-based implementation of competitive linking runs
in O(max(|X|, |Y|) log max(|X|, |Y|)). Under the equiprobability assumption,
all the weights are the same, so that competitive linking proceeds simply by
randomly making links until no more can be made.
If definition (4) is applied to pairs of documents in the same language, with
a "translation lexicon" defined by the identity relation, then tsim is a vari-
ant of resemblance (r), as defined by Broder et al. (1997) for the problem of
monolingual duplicate detection, except that tsim has the advantage of being
token-based rather than type-based, incorporating word frequency. The key re-
sult is that any translation lexicon (or, importantly, a union thereof) containing
a set of word-to-word entries can be applied to the computation of tsim.
We have demonstrated that this score can be used to extract translationally
equivalent English-Chinese sentence pairs from even a noisy space with high
precision (Smith, 2002). It was also shown that combining multiple sources of
word-level translation information (dictionaries, word-to-word translation mod-
els, cognates) had positive effects on performance on the sentence-matching
task. The competitive linking approximation was also shown to perform nearly
as well as MCBM.

This technique of using a translation model to define translational similarity
is generic to different sources of lexical translation information. An important
feature is that it can be used with any symmetric translation model where
events can be divided into those which both sides of a bitext have in common
and those which affect only one side.
3.2 Experiment

We now apply our content-based similarity measure to the candidate pair clas-
sification task presented by STRAND. Recall that both the original STRAND
classifier and those learned using decision tree methods, described in Section
2.2.2, use only structural features of the documents to determine whether they
are translations. Here we apply the tsim score to the same task and compare
the results with those of the original STRAND classifier.
3.2.1 Translation Lexicon.
The word-level translation lexicon is derived from several sources.

The first is an English-French dictionary (a total of 34,808 entries, 4,021 of
which are not one-to-one).[9] Each n-to-m entry was processed by stoplisting and
then extracting all word pairs in the remaining cross-product. This technique is
liable to introduce noise to the dictionary, so we used only cross-products that
generated four or fewer new entries. The result is 39,348 word pairs, 9,045 of
which contain two words present in the corpora.

[9] This dictionary was generated by Gina Levow, who kindly made it available to
us, using a dictionary derived from one available at http://www.freedict.com. It
contains morphological variants but does not include character accents.
A word-to-word translation model (Melamed, 2000) was trained on a verse-aligned
Bible (15,548 verses, averaging 25.5 English words, 23.4 French words
after tokenization). Instead of using competitive linking, we used the maximum
weighted bipartite matching algorithm in estimating the parameters of
this model. The result includes all word pairs with positive probability (though
the probabilities are subsequently ignored) and contains 13,762 word pairs.
The third source comprises English-French cognate pairs, identified using
the method of Tiedemann (1999). Tiedemann's approach involved learning
language-specific character-to-character weights for the computation of weighted
edit distance to measure cognateness. He used a list of known cognates to train
the weights. We instead used the weighted translation pairs in the translation
model lexicon. Hence the resources required to extract cognates in this way are
no different from those required for the translation model. The result is 35,513
word pairs. An additional set of 11,264 exact string matches were added. These
entries were undoubtedly quite noisy.
The union of these translation lexicons consists of 68,003 unique word pairs.
The experiment used only this union translation lexicon; note that the entries
in this lexicon are not weighted (they are assumed to be equiprobable).
3.2.2 Results.
In order to compare tsim with structural similarity scoring, we applied it to 325
English-French web-document pairs for which human evaluations were carried
out in Section 2.[10] As there is only one feature under consideration (tsim), the
classifier must be a threshold on that value. At different thresholds, Cohen's
κ score of agreement (with each of Resnik's (1999) two judges and their
intersection) may be computed for comparison with STRAND, along with recall and
precision against a gold standard (for which we use the intersection of the
judges: the set of examples where the judges agreed).
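Since Cohen's κ is the yardstick throughout this comparison, a minimal reference implementation may be useful. This is the standard textbook formulation with illustrative 0/1 judge labels, not code from STRAND.

```python
def cohens_kappa(judge_a, judge_b):
    """Cohen's kappa: observed agreement between two judges, corrected
    for the agreement expected by chance from each judge's label
    frequencies. judge_a, judge_b: equal-length lists of labels."""
    assert len(judge_a) == len(judge_b) and judge_a
    n = len(judge_a)
    p_obs = sum(a == b for a, b in zip(judge_a, judge_b)) / n
    p_chance = sum(
        (judge_a.count(lab) / n) * (judge_b.count(lab) / n)
        for lab in set(judge_a) | set(judge_b)
    )
    if p_chance == 1.0:  # degenerate case: both judges use one label
        return 1.0
    return (p_obs - p_chance) / (1.0 - p_chance)
```

Undecided judgements would simply be dropped from both lists before the computation, as described above.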
Computing tsim (MCBM on the words in the document pair) is not tractable
for very large documents and translation lexicons. However, in preliminary
comparisons, we found that representing long documents by as few as their
first 500 words results in excellent performance on the measure. This allows
O(1) estimation of tsim for two documents. Further, the competitive linking
algorithm appears to be as reliable as MCBM, and it runs significantly faster in
practice. The results reported here approximated tsim in this way.
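The approximation can be sketched as follows. This is our illustrative reconstruction, assuming tsim is scored as the fraction of tokens participating in links (the exact definition (4) appears earlier in the report); the greedy inner loop stands in for competitive linking over equiprobable lexicon entries, and the 500-word truncation follows the text.

```python
def tsim_approx(doc_x, doc_y, lexicon, max_words=500):
    """Approximate translational similarity of two token lists.
    lexicon: set of (word_x, word_y) translation pairs, unweighted."""
    x, y = doc_x[:max_words], doc_y[:max_words]
    linked_y = [False] * len(y)
    links = 0
    for tok in x:  # greedy one-to-one linking; weights equiprobable
        for j, cand in enumerate(y):
            if not linked_y[j] and (tok, cand) in lexicon:
                linked_y[j] = True
                links += 1
                break
    # fraction of tokens (on both sides) that participate in links
    return 2.0 * links / (len(x) + len(y)) if (x or y) else 0.0
```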
Of the 325 pairs, 32 were randomly selected as a development set. We manually
selected a threshold of τ = 0.15; this value was chosen because it was the one
that resulted in the maximum κ score on the development set.[11] κ scores
against each judge and their intersection were then computed at that threshold
on the test set (the remaining 293 pairs). These are compared to κ scores of
the STRAND system, on the same test set, in Table 2. In every case, the tsim
classifier agreed more strongly with the human evaluations, and its F score is
competitive with that of STRAND. A simple combination of the classifiers by
disjunction (i.e., "(X, Y) is a translation pair if either classifier says so")
yields precision 0.768, recall 0.961, F = 0.854, and κ (with the judges'
intersection) at 0.878.

At τ = 0.15, precision was 0.680 and recall was 0.921, F = 0.782. On the
same set, STRAND structural classification achieved 0.963 precision and 0.684
recall, F = 0.800. Figure 4 shows κ, precision, and recall each plotted
against τ.

Comparison                 N    Pr(Agree)  κ     prec   rec    F
J1, J2                     245  0.98       0.96
J1, STRAND                 250  0.88       0.70
J2, STRAND                 284  0.88       0.69
J1 ∩ J2, STRAND            241  0.90       0.75  0.963  0.684  0.800
J1, tsim (τ = 0.15)        249  0.92       0.83
J2, tsim (τ = 0.15)        283  0.92       0.82
J1 ∩ J2, tsim (τ = 0.15)   240  0.93       0.85  0.680  0.921  0.782

Table 2: Comparison with STRAND. The test set is 293 of the 326 pairs in
Resnik's (1999) test set. The 32 development pairs were used to manually
select the 0.15 threshold. N is the number of examples for which judgement
comparison was possible in each case (human judges were sometimes undecided;
those cases are ignored in computing κ).

[10] One additional pair was thrown out because it contained compressed data; it
is assumed that pair would not pass a language identification filter.
For this application, the structural and content-based classifiers have
complementary strengths; the former is highly precise while the latter gives
high recall. The content-based classifier also more reliably predicts human
judgements (based on Cohen's κ).
Given two high-performing methods that use orthogonal information for
identifying good candidate pairs (one using only structure, the other using only
content), the natural question is whether the techniques can be combined for
even better performance. We answer this question in the affirmative in Section 5.
First, however, we describe how we have adapted the STRAND architecture to
the Internet Archive in order to generate the candidate pairs on a scale that
was previously unattainable.
[11] One could select such a threshold to maximize any objective function over
the development set.
[Figure 4 appears here: curves for Cohen's κ, precision, and recall plotted
against the threshold τ, with STRAND's best κ marked for reference.]

Figure 4: Performance measures as the threshold τ varies (all measures are on
the test set): the κ agreement score with the two judges' intersection,
precision, and recall. The κ score obtained by STRAND is shown as well.
4 Exploiting the Internet Archive
One of the diculties with doing researchonWeb mining is that large-scale
crawling of the Web is a signicantenterprise. Chen and Nie's (2000) PTMiner
represents one solution, a carefully thought-out architecture for mining on a
large scale. Here we present a dierent solution, taking advantage of an exist-
ing large-scale repository of Web pages maintained on an ongoing basis byan
organization known as the Internet Archive.
4.1 The Internet Archive
The Internet Archive is a non-profit organization attempting to archive the
entire publicly available Web, preserving the content and providing free access
to researchers, historians, scholars, and the general public.[12] Data come from
crawls done by Alexa Internet, and hence they represent an industry-level
resource of the sort not easily constructed within academia. At present, the
Archive contains 120T (terabytes) of data, by a conservative estimate, and it is
growing at approximately 8T per month. Text on the archive comprises over 10
billion Web pages, and the estimated duplicate rate is a factor of 2, i.e. two
copies of everything.[13]

[12] See http://www.archive.org.

[13] We are grateful to Paula Keezer of Alexa for these figures.

The Internet Archive provides public access to the data via the "Wayback
Machine" Web interface. As of this writing, a search for the ACL home page
brings up links to 27 snapshots of that page dating back to June 7, 1997.[14]
The reader can get to that page directly on the Wayback Machine using a URL
that points to the Internet Archive and provides both the desired page and the
timestamp indicating which snapshot to retrieve.[15]
The Archive also provides researchers direct access to its data via accounts
on their cluster. The data are stored on the disk drives of approximately 300
machines, each running some variety of Unix, creating what is in essence one
huge file system. This provides a researcher the remarkable sensation of having
the entire Web on his or her hard drive.
4.2 Properties of the Archive

Mining terabytes on the Archive presents a number of challenges:

- The Archive is a temporal database, but it is not stored in temporal order.
  Hence a document and its translation may be in files on different machines;
  a global merge of data is required for any hope of complete extraction.

- Extracting a document for inspection is an expensive operation involving
  text decompression.

- The Archive's size makes it essential to keep computational complexity low.

On the other hand, aspects of the Archive's architecture turned out to make
rapid development remarkably feasible:

- Almost all data are stored in compressed plain text files, rather than in
  databases.

- The data relevant for our purposes are organized into archive files
  (arcfiles), which contain the stored pages, and index files, which contain
  plain text tuples <URL, timestamp, arcfile, offset, ...>.

- A suite of tools exists for, e.g., extracting individual pages from archive
  files.

- The Archive's infrastructure for cluster computing makes it easy to write
  Unix scripts or programs and run them in parallel across machines.
The last of these, the cluster computing tools, turned out to drastically
reduce the time needed to port STRAND to the Archive, and literally halved the
size of the STRAND code base. The centerpiece in Archive cluster computing
is a parallelization tool called p2, which offers a UNIX command-line interface
that allows one to specify (a) a parallelizable task, (b) a way to split it up,
(c) a way to combine the results, and (d) a set of processors among which to
divide the task. The p2 tool divides up tasks intelligently, invoking each
parallel computation on the local machine where the data reside.[16]

[14] http://www.cs.columbia.edu/acl/home.html

[15] http://web.archive.org/web/19970607032410/http://www.cs.columbia.edu/acl/home.html.
This is a single long URL.
4.3 STRAND on the Archive
In adapting STRAND's three-stage process to the Internet Archive, the primary
challenge was in the first two steps, locating possible translations and
matching them up to produce candidate document pairs. Structural filtering
remained essentially unchanged.
Generating candidate pairs on the Archive involves the following steps:

1. Extracting URLs from index files using simple pattern matching

2. Combining the results from Step 1 into a single huge list

3. Grouping URLs into buckets by handles

4. Generating candidate pairs from buckets

Steps 1 and 2 are done via a parallel search operation plus combination of
results, e.g. extracting URLs (and their associated bookkeeping data) using a
pattern like /(.hk|.tw|.cn)/ to extract all URLs in the Hong Kong, Taiwan,
or China domains.[17]
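Steps 1 and 2 can be pictured as a scan over index lines, as in the sketch below. The whitespace field separator and the regular expression are our assumptions about the index format, which the report describes only as plain-text tuples.

```python
import re

# Matches URLs in the .hk, .tw, or .cn national domains.
DOMAIN_RE = re.compile(r"\.(hk|tw|cn)(/|:|$)")

def scan_index_lines(lines):
    """Return bookkeeping tuples <URL, timestamp, arcfile, offset, ...>
    (here assumed whitespace-separated) whose URL falls in one of the
    target domains."""
    hits = []
    for line in lines:
        fields = line.split()
        if fields and DOMAIN_RE.search(fields[0]):
            hits.append(tuple(fields))
    return hits
```

In an actual run, each machine would scan its local index files in parallel and the per-machine hit lists would then be concatenated into the single huge list of Step 2.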
Step 3 is potentially tricky owing to computational complexity issues. As
noted in Section 2.1.2, examining the cross product of a site's page sets in two
different languages is potentially very expensive, and matching documents by
similarity of URLs can represent a combinatoric process in the general case.

We arrived at an algorithmically simple solution that avoids this problem,
but is still based on the idea of language-specific substrings (LSSs). The idea
is to identify a set of language-specific URL substrings that pertain to the two
languages of interest, e.g. based on language names, countries, character
code-set labels, abbreviations, etc. For example, a set of LSSs for
English-Arabic might be as follows:

1256, 437, 864, 8859-1, 8859-6, a, ar, ara, arab, arabic, cp1256,
cp437, cp864, e, en, eng, english, gb, iso, iso-8859-1, iso-8859-6,
latin, latin-1, latin1, uk, us, usa
For each URL, we form a "handle" by subtracting out any substrings that match
(insensitive to case) any item on the LSS pattern list.[16][17] The subtraction
process is implemented reasonably efficiently: if there are p patterns with
maximum length l, and the URL's length in characters is u, then the current
implementation will do at most p × u string matches of length no more than l.
(Currently we use C strcmp for string match.) In practice, this is extremely
fast: we can generate handles for nearly 5000 URLs per second on a 6-year-old
Sun Ultra 1 workstation.[18]

[16] The Archive intends to release these cluster tools under the GNU Public
License.

[17] We also take advantage of the Archive's list of .com domains paired with
the nation in which each is registered, making it possible to include commercial
sites in the search without an explosion of irrelevant possibilities.

URLs:
  saudifrenchbank.com.sa/English/English.htm
    (patterns removed: a e a a english english)
  saudifrenchbank.com.sa/Arabic/arabic.htm
    (patterns removed: a e a a arabic arabic)
Same handle for both URLs:
  sudifrchbnk.com.s//.htm

Figure 5: Example of LSS subtraction.
Figure 5 illustrates handle generation on two real URLs. As one would hope,
these two URLs produce the same handle, and as a result, they wind up in the
same bucket in Step 3.[19]
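Handle formation can be sketched as below. The longest-pattern-first removal order is our assumption (the report specifies only that matching is case-insensitive), but it does reproduce the Figure 5 example.

```python
import re

def make_handle(url, lss_patterns):
    """Form a URL 'handle' by deleting, case-insensitively, every
    occurrence of each language-specific substring (LSS). Longer
    patterns are removed first so that e.g. 'english' is deleted
    whole before the single letters 'e' and 'a' eat into it."""
    handle = url
    for pat in sorted(lss_patterns, key=len, reverse=True):
        handle = re.sub(re.escape(pat), "", handle, flags=re.IGNORECASE)
    return handle
```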
In Step 4, the URLs in each bucket are used to generate candidate pairs
by taking the cross product and keeping those URL pairs for which the URL
bookkeeping data indicates pages that are in the correct languages. For example,
given the bucket containing the two URLs in Figure 5, this step would generate
a single pair consisting of the URL for the English page and the URL for the
Arabic page, assuming the language ID information associated with each URL
confirmed it was in the proper language.[20]
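Steps 3 and 4 together can be sketched as follows. The (url, language) record format stands in for the Archive's per-URL language-ID bookkeeping, and the handle argument is any function such as the LSS subtraction just described; both are illustrative simplifications.

```python
from collections import defaultdict

def candidate_pairs(url_records, handle, lang_x="english", lang_y="arabic"):
    """Bucket URLs by handle (Step 3), then emit cross-product pairs
    whose language IDs match the two target languages (Step 4).
    url_records: iterable of (url, language) tuples."""
    buckets = defaultdict(list)
    for url, lang in url_records:
        buckets[handle(url)].append((url, lang))
    pairs = []
    for members in buckets.values():
        xs = [u for u, l in members if l == lang_x]
        ys = [u for u, l in members if l == lang_y]
        pairs.extend((x, y) for x in xs for y in ys)
    return pairs
```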
At this point, the candidate generation process is complete. The final step
is to apply STRAND's structural filtering step to each candidate pair, an
operation that can itself be parallelized since each candidate pair can be
processed independently.
It is interesting to note that by taking advantage of the Archive's p2 cluster
computing tool, together with its simple flat-text representations, adapting
STRAND's candidate generation process resulted in a dramatic reduction in the
size of the program, cutting it in half, as measured in lines of code.

[18] We are grateful to Bill Pugh of the University of Maryland for suggesting
this algorithm.

[19] Conceptually, they hash to the same bucket in a hash table; in practice on
the Archive it turns out to be more efficient to create buckets by doing a
parallel sort of the entire URL set using the handle as the key, and then
creating buckets based on identical handles being on adjacent lines.

[20] The Internet Archive tags its data for language using standard n-gram
language identification techniques.
5 Building an English-Arabic Corpus

In the previous sections, we have described methods and results for structural
matching, content-based matching, and for dramatically scaling up the number
of candidate pairs that can be generated for any given language pair by using
the industrial-strength Web crawls stored on the Internet Archive. In this
section we put all these pieces together, describing an experiment in mining the
Internet Archive to find English-Arabic parallel text. This language pair is of
particular global importance, but resources, particularly bilingual text, are
not easy to obtain. Moreover, Arabic text is far behind on the Web's exponential
growth curve: Arabic text (as opposed to images) did not really start emerging
on the Web until the release of Microsoft Windows 98(TM), which provided Arabic
support in its version of Internet Explorer.
5.1 Finding English-Arabic Candidate Pairs on the Internet Archive

The input resources for this search consisted of a list of Internet domains
likely to contain Arabic text.[21] The list included twenty-four top-level
national domains for countries where Arabic is spoken by a significant portion
of the population, including Egypt (.eg), Saudi Arabia (.sa), Kuwait (.kw), etc.
In addition, we used a list of .com domains known to originate in
Arabic-speaking countries. This list provided an additional twenty-one specific
domains (e.g., emiratesbank.com, checkpoint.com); note that it is by no means
exhaustive.

In the experiments we report here, we mined two crawls from 2001, comprising
8T and 12T (i.e. less than one sixth of the Archive as it exists today), spread
over 27 machines. Our list of URLs with relevant domains, obtained by
pattern-matching in Archive index files, numbers 19,917,923 pages.[22] The
language-specific substrings given earlier were subtracted from these URLs to
generate handles, resulting in 786,880 buckets with an average of twenty-five
pages per bucket. When all possible English-Arabic page pairs were generated
from all buckets, the result was 8,294 candidate pairs.
A random sample of two hundred candidate pairs was given to two human
evaluators, bilingual in English and Arabic, who were asked (independently)
to answer the question, for each pair, "Is this pair of pages intended to show
the same material to two different users, one a reader of English and the other
a reader of Arabic?" The judges' answers showed a Cohen's κ agreement of
0.6955, which is generally considered fair to good reliability. (Qualitatively,
one judge was rather more strict than the other; when the stricter judge
identified a page pair as valid translations, the less strict judge virtually
always agreed.)

[21] We are grateful to Nizar Habash for constructing this list.

[22] Pages with the same URL but different timestamps are counted separately;
there were 10,701,622 unique URLs.

         precision  recall
Fold 1   0.9111     0.9229
Fold 2   0.9302     0.9351
Fold 3   0.9565     0.8818
Average  0.9326     0.9133

Table 3: English-Arabic structural classification results. Precision and recall
are extrapolated to the entire set of 8,294 candidate pairs.
5.2 Evaluating Structure-Based Matching

Taking the set of 149 labeled pairs on which the two judges agreed (134 were
marked good, 15 bad), we carried out an evaluation of the full candidate set
similar to the one for English-French discussed in Section 2.2.2. As before,
this was a threefold cross-validation experiment in which decision tree
classifiers were tuned on the features extracted for each candidate pair by
structure-based classification. In addition to the four structural scores, we
included two language identification confidence scores (one for the English
page, one for the Arabic page); these were available as part of the Internet
Archive's bookkeeping information for each URL and required no additional
computation on our part. Table 3 shows extrapolated precision and recall of
each fold's classifier applied to the entire set of 8,294 candidate page pairs.

The value of the parameter-tuning process is dramatically confirmed by
comparing the learned parameters with STRAND's default parameters (manually
determined by Resnik (1999)). Precision figures for the three folds were similar
on average but far less consistent (1.0 for the first two folds, 0.6667 for the
third), and none of the three folds achieved recall above 0.12.
Upon inspection, we discovered that nearly 5,000 of the pairs in our candidate
set were from a single domain, maktoob.com. This site supports an online
marketplace, and many of the pages discovered by our search were dedicated to
specific merchandise categories within that service; a large portion of these
were simply "no items available" and one or two similar messages. We ignored
this domain completely in order to be conservative about the yield of page
pairs, though we note that many of the pages within it are legitimate parallel
text that could be extracted if a duplicates filter were applied.[23]

[23] One of our human evaluators confirmed that no other domains appeared to
significantly dominate the candidate pairs as did maktoob.com, providing some
assurance that the rest of the data are diverse.

In order to construct a final classifier, we trained a decision tree on all 200
of our manually judged examples. This was then applied to the candidate pairs,
producing a set of 1,741 HTML document pairs judged to be valid translations
of each other. Converting from HTML to plain text and tokenizing, the English
documents in this corpus total approximately 818,102 tokens, with an average of
470 tokens per document; the Arabic side contains 1,025,461 tokens, averaging
569 tokens per document.[24]

5.3 Combining Structural and Content-Based Matching

We combined the structural and content-based approaches to detecting
translations by adding the tsim score to the set of structural features
associated with each candidate pair, and then training a new decision tree
classifier.[25]

Because Arabic is a highly inflected language with many surface forms, we
found it necessary to use morphological preprocessing in order to make effective
use of a dictionary. For English, we tokenized the text and used the WordNet
lemmatizer to strip suffixes. The Arabic texts were tokenized at punctuation,
then Romanized and converted to root forms using a morphological analysis
tool (Darwish, 2002). This approximately halved the vocabulary size for the
Arabic texts (from 89,047 types to 48,212 types).

The translation lexicon used to compute tsim contained 52,211 entries, each
containing one English lemma and one Arabic root. Of these, 16,944 contained
two items that were both present in the candidate set of 8,295 Web page
pairs.[26] The approximations discussed in Section 3.2.2 were used: competitive
linking on the first 500 words in each document was used to compute the score.

Carrying out the same cross-validation experiment (on the same random split of
data), the combined structural and content-based classifier produced the results
in Table 4; these are extrapolated to the full set of 8,294 candidate pairs.
Averaged over three folds, the classifier achieved 95.06% precision and 98.95%
recall (1.8% and 7.62% better than without tsim, respectively). We repeat the
average using only STRAND's structural features for reference.
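As a minimal illustration of how tsim is folded in as one more feature, the following trains a one-level decision stump; this is our stand-in for the full decision-tree learner, whose package the report does not specify, and the feature layout is illustrative.

```python
def add_tsim_feature(structural, tsim_scores):
    """Append the content-based tsim score to each structural vector."""
    return [feats + [ts] for feats, ts in zip(structural, tsim_scores)]

def train_stump(X, y):
    """Pick the single (feature, threshold) rule with the highest
    training accuracy, predicting 1 iff x[feature] >= threshold.
    X: feature vectors (structural scores plus tsim); y: 0/1 labels."""
    best_acc, best_rule = -1.0, (0, 0.0)
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            acc = sum(
                (row[f] >= t) == bool(lab) for row, lab in zip(X, y)
            ) / len(y)
            if acc > best_acc:
                best_acc, best_rule = acc, (f, t)
    return best_rule
```

A full decision tree applies the same split search recursively; the point here is only that tsim enters the learner as an ordinary additional column.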
[24] We converted HTML to text using the lynx browser, performed cleanups such
as removing references, and tokenized using the tokenizers included with the
Egypt statistical MT package (Al-Onaizan et al., 1999). That tokenizer is
somewhat aggressive about separating out punctuation, so, being aggressive in
the opposite direction, we also tried counting only tokens containing at least
one of [A-Za-z] (which excludes punctuation as well as dates, percentages,
etc.). Using that very conservative counting method, the size of the English
side is approximately 612,673 words. Counting only tokens on the Arabic side
which contained at least one non-numeric, non-punctuation character yielded
914,815 words.

[25] We also performed the experiment using content-based matching alone.
However, as noted earlier, tsim's strength is recall, and on this task baseline
performance (marking all candidate pairs as good translations) gives 90%
precision. Not surprisingly, the performance of the tsim-only classifier did not
exceed the baseline.

[26] This translation lexicon was used with the kind permission of Kareem
Darwish.
                          precision  recall
Average (structure only)  0.9326     0.9133
Fold 1                    0.9167     1.0000
Fold 2                    0.9767     0.9686
Fold 3                    0.9583     1.0000
Average                   0.9506     0.9895

Table 4: English-Arabic combined structural/content-based classification
results. Precision and recall are extrapolated to the entire set of 8,294
candidate pairs.

Tokenization method                               English tokens  Arabic tokens
English lemmas, Arabic roots                      1,097,674       1,107,600
Egypt tokenizers (Al-Onaizan et al., 1999)        1,170,360       1,512,959
Egypt tokenizers, ignoring words without letters  922,541         1,388,953

Table 5: Yield: the English-Arabic Internet Archive corpus, tokenized several
ways.
After building a single classifier on all 149 test pairs (the set on which both
human judges agreed), we re-classified the entire candidate set. Ignoring again
pages from the maktoob.com domain, 2,190 pairs were marked as translations.
Table 5 shows word counts for various tokenization schemes: the morphological
analysis used for computing tsim, the Egypt tokenizer (which is highly
aggressive), and counting only tokens with some alphabetic character from the
Egypt tokenizer (a conservative approximation).

To summarize the results, using the content-based similarity score as a feature
not only improved precision, it increased the size of the corpus (in words)
by 43-52%, depending on the tokenization scheme.
To our knowledge, no bilingual text corpus of this magnitude is yet available
to the general research community for this language pair. Even if such resources
existed for this particular pair, our success with English-Arabic represents a
proof of concept for locating bilingual text for other poorly represented
language pairs, especially considering the fact that only a small portion of the
Internet Archive was searched on this run. Moreover, it bodes well for locating
very large quantities of Web text for English paired with better-represented
languages.
6 Conclusions

Although efforts at discovering parallel text on the Web were first reported in
1998, Web-based parallel corpora appear to have had only a limited impact on
the community. Three reasons for this suggest themselves.

Too few languages. Parallel text from the Web has been made available
to the community in only a few pairs of languages. As of this writing, the
STRAND Web site,[27] containing URL pairs discovered via STRAND runs, contains
collections only for English-French, English-Chinese, and English-Basque,
and we are not aware of any other efforts to publicly disseminate Web-based
parallel data. Up to this point, it simply has not been easy to search the Web
for parallel text in new language pairs. The most difficult part is finding the
candidates: a year or two ago, we attempted to apply the original Web-based
STRAND to the problem of finding English-Arabic text and we simply were
unable to locate enough search engine hits or sites to yield useful results.
Too little data. Very large Web-based parallel text collections are not
available to the community. The largest appear to have been obtained by Chen and
Nie (2000), who acquired collections for English-French and English-Chinese on
the order of 15,000 document pairs using the PTMiner system. However, these
collections have not been made available. In contrast, the STRAND collections,
which are available to the community in the form of URL pairs, are modest in
size: the English-Chinese collection contains fewer than 3500 document pairs,
and English-French fewer than 2500.
Difficulty with dissemination. Web-based collections are difficult to
distribute. Standard mechanisms of the sort used by the LDC (a CD or
downloadable file) are fraught with difficult legal issues, since, technically
speaking, redistributing the actual content of Web pages could require
permission from the author of every page. For example, presumably as a risk
reduction strategy, the Web track for TREC-2002 (Text REtrieval Conference)
limits its attention to the .gov domain and requires the recipient of the data
to sign a form that reduces the distributor's liability.[28] Similarly, the
Google Programming Contest dataset arrives with a limited-use license,
indemnification from third-party claims, and a collection limited to the .edu
domain, from which, presumably, authors are less likely to bring expensive
lawsuits.[29]
A possible fourth reason may have to do with questions about the utility of
the data. For example, a Web-based parallel collection may be unpredictable
in terms of its coverage, and the community is well aware of the dangers of
using training data that is not representative of the test domain. A solution
to this problem might be to extract topically relevant subsets of the collection
for particular domains or applications, but of course this requires a "more is
better" approach in order to obtain subsets that are large enough to be useful.

[27] http://umiacs.umd.edu/resnik/strand/

[28] "The limitation on permitted use ... is intended to reduce the risk of any
action being brought by copyright owners, but if this happens the Organisation
[recipient] agrees to bear all associated liability"
(http://www.ted.cmis.csiro.au/TRECWeb/access to data.html).

[29] http://www.google.com/programming-contest/
The work reported in this report addresses each of these major problems.

With respect to the number of language pairs, the Internet Archive offers
us the largest possible sample of pages on the Web, and our techniques make
it easy to explore that collection in an efficient way. Although it is probably
impossible to crawl more than a small fraction of the Web, the Internet Archive
is storing the results of commercial-scale Web crawling and has as its explicit
mission the permanent storage of everything that can be found. The fact that
we were able to find a substantial quantity of English-Arabic text, on the
order of a million words per side, looking at less than a sixth of today's
Archive, offers the hope that it will be possible to find data for the less
well-represented language pairs, if and when those data actually exist.
Moreover, the final implementation we described here retains the almost entirely
language-independent character of the original STRAND system, adding only the
requirement of a reasonable translation lexicon. Therefore success in mining for
parallel text in other languages depends primarily on whether the data exist on
the Archive.
With regard to corpus size, we demonstrated that the recall of structural
matching, and hence its yield, can be significantly improved by simple and
automatic classifier construction, requiring only a few hours' work from a
bilingual annotator to create the training material. These results are further
improved by adding content-based similarity as a feature. The success with
English-Arabic, a language that is not one of those usually considered well
represented on the Web, encourages us to believe that for other languages of
interest we will be similarly successful. We have also done a bit of exploration
to gauge the potential of the Archive for better-represented language pairs,
using English-Chinese as an example. By way of context, Chen and Nie (2000)
reported that PTMiner found around 15,000 English-Chinese document pairs by
crawling 185 sites in the .hk (Hong Kong) domain, with the run taking about a
week. We did a STRAND search of the two Internet Archive crawls used in the
English-Arabic study, seeking English-Chinese parallel text in multiple domains
where Chinese is a dominant language (e.g. .hk, .tw, .cn). Our initial candidate
pair set was generated in approximately 30 hours, and contains over 70,000
candidate page pairs. It is difficult to generalize from the English-Arabic
experiment, but one cannot help but conclude that with even a fraction of the
yield we obtained on the previous experiment, especially broadening out to
include the other five sixths of the Archive, we should be able to obtain
interestingly large quantities of English-Chinese text.
In terms of dissemination, the STRAND distribution mechanism models
itself after Web search engines, distributing the URLs rather than the pages
themselves. This places the legal burden on individual users, who are safe under
"fair use" provisions if they download pages for their individual use. Until
recently the difficulty with this solution has been that the collection of URLs
deteriorates over time as sites disappear, pages are reorganized, and underlying
content changes; for example, in April 2002, we attempted to download the
documents in the STRAND English-French, English-Chinese, and English-Basque
collections, and we were able to successfully access only around 67%, 43%, and
40% of the URL pairs, respectively. However, the Internet Archive's Wayback
Machine provides a way to distribute persistent URLs.
The quality of the data is a question that will need to be addressed byeorts
to actually use those data in natural language applications. It is worth pointing
out that, because STRAND expects pages to be very similarin structural terms,
the resulting document collections are particularly amenable to sentence- or
segment-level alignment. Indeed, just using dynamic programming to align
the markup, ignoring the text, produces reasonable first-pass alignments of the
intervening text as a side effect. We are currently adapting statistical text-
based sentence alignment techniques to take advantage of the markup available
in Web-based document pairs.
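A minimal sketch of this markup-alignment idea follows, using Python's `difflib` as a stand-in for the dynamic-programming aligner. Both pages are tokenized into tag and text chunks, the tag sequences are aligned with every text chunk collapsed to a single placeholder, and text chunks at corresponding positions are paired as candidate translation segments. The function names and tokenizer here are illustrative, not part of STRAND.

```python
import difflib
import re

def markup_tokens(html):
    """Split a page into ('tag', value) and ('text', value) tokens."""
    tokens = []
    for part in re.split(r'(<[^>]+>)', html):  # capturing group keeps the tags
        part = part.strip()
        if part:
            kind = 'tag' if part.startswith('<') else 'text'
            tokens.append((kind, part))
    return tokens

def align_by_markup(html_e, html_f):
    """Pair text chunks whose surrounding markup aligns; the aligner
    sees only the tags, with each text chunk collapsed to <TEXT>."""
    toks_e, toks_f = markup_tokens(html_e), markup_tokens(html_f)
    key = lambda t: t[1].lower() if t[0] == 'tag' else '<TEXT>'
    matcher = difflib.SequenceMatcher(
        a=[key(t) for t in toks_e],
        b=[key(t) for t in toks_f],
        autojunk=False)
    pairs = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != 'equal':
            continue
        # Positions inside an 'equal' block correspond one-to-one.
        for d in range(i2 - i1):
            te, tf = toks_e[i1 + d], toks_f[j1 + d]
            if te[0] == 'text' and tf[0] == 'text':
                pairs.append((te[1], tf[1]))
    return pairs
```

On two structurally parallel pages, the paired chunks come out already segment-aligned, which is exactly the first-pass alignment described above.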
Finally, our subjective impression is that parallel texts mined from the Web
tend to contain a reasonable percentage of high quality translation, especially
since pages often come from government or commercial sites. The underlying
content, of course, is as rich and diverse as the Web itself, and what we as
researchers can do with it is an exciting question that remains to be answered.
7 Acknowledgements
This work has been supported in part by Department of Defense Contract RD-
02-5700, DARPA/ITO Cooperative Agreement N660010028910, and ONR
MURI Contract FCPO.810548265. The second author is supported by a Fannie
and John Hertz Foundation Fellowship. We would like to thank Nizar Habash,
Kareem Darwish, Mona Diab, Ashraf Mohamed, and Usama Soltan for their
assistance with Arabic, and Jason Eisner, Rebecca Hwa, and Doug Oard for
helpful conversations. We would also like to thank Sang Hong for his assistance
with implementation, and Brewster Kahle, Andy Jewell, Jad DeFanti, and Paula
Keezer for permitting and facilitating our use of the Internet Archive.
References
Ahuja, Ravindra K., Thomas L. Magnati, and James B. Orlin. 1993. Network
Flows: Theory, Algorithms, and Applications. Prentice Hall, Englewood
Cliffs, NJ.
Al-Onaizan, Yaser, Jan Curin, Michael Jahr, Kevin Knight, John Lafferty,
I. Dan Melamed, Franz-Josef Och, David Purdy, Noah A. Smith, and David
Yarowsky. 1999. Statistical machine translation. Technical report, JHU.
citeseer.nj.nec.com/al-onaizan99statistical.html.
Broder, Andrei Z., Steven C. Glassman, Mark S. Manasse, and Geoffrey Zweig.
1997. Syntactic clustering of the Web. In Sixth International World-Wide
Web Conference, Santa Clara, CA, April.
Brown, P., J. Cocke, S. Della Pietra, V. Della Pietra, F. Jelinek, R. Mercer, and
P. Roossin. 1990. A statistical approach to machine translation. Computa-
tional Linguistics, 16(2):79–85.
Cabezas, Clara, Bonnie Dorr, and Philip Resnik. 2001. Spanish language pro-
cessing at University of Maryland: Building infrastructure for multilingual
applications. In Proceedings of the Second International Workshop on Span-
ish Language Processing and Language Technologies (SLPLT-2).
Chen, Jiang and Jian-Yun Nie. 2000. Web parallel text mining for Chinese-
English cross-language information retrieval. In International Conference on
Chinese Language Computing, Chicago, Illinois.
Church, Kenneth W. and Robert Mercer. 1993. Introduction to the special issue
on computational linguistics using large corpora. Computational Linguistics,
19(1):1–24.
Darwish, Kareem. 2002. Building a shallow Arabic morphological analyser in
one day. In Workshop on Computational Approaches to Semitic Languages,
Philadelphia, July.
Davis, Mark and Ted Dunning. 1995. A TREC evaluation of query translation
methods for multi-lingualtext retrieval. In Fourth Text Retrieval Conference
(TREC-4). NIST.
Diab, Mona and Philip Resnik. 2002. An unsupervised method for word sense
tagging using parallel corpora. In 40th Anniversary Meeting of the Associ-
ation for Computational Linguistics (ACL-02), Philadelphia, July.
Dunning, Ted. 1994. Statistical identification of language. Computing Research
Laboratory Technical Memo MCCS 94-273, New Mexico State University,
Las Cruces, New Mexico.
Gale, William A. and Kenneth W. Church. 1991. Identifying word correspon-
dences in parallel texts. In Fourth DARPA Workshop on Speech and Natural
Language, Asilomar, California, February.
Hunt, J. W. and M. D. McIlroy. 1975. An algorithm for differential file com-
parison. Technical Memorandum 75-1271-11, Bell Laboratories, October.
Hwa, Rebecca, Philip Resnik, Amy Weinberg, and Okan Kolak. 2002. Evaluat-
ing translational correspondence using annotation projection. In 40th An-
niversary Meeting of the Association for Computational Linguistics (ACL-
02), Philadelphia, July.
Landauer, Thomas K. and Michael L. Littman. 1990. Fully automatic cross-
language document retrieval using latent semantic indexing. In Proceedings
of the Sixth Annual Conference of the UW Centre for the New Oxford English
Dictionary and Text Research, pages 31–38, UW Centre for the New
OED and Text Research, Waterloo, Ontario, October.
Lin, Dekang. 1998. An information-theoretic definition of similarity. In Fif-
teenth International Conference on Machine Learning, Madison, WI, July.
Lopez, Adam, Michael Nossal, Rebecca Hwa, and Philip Resnik. 2002. Word-
level alignment for multilingual resource acquisition. In Workshop on Lin-
guistic Knowledge Acquisition and Representation: Bootstrapping Annotated
Language Data, at the Third International Conference on Language Re-
sources and Evaluation (LREC-2000), Las Palmas, Canary Islands, Spain,
June.
Ma, Xiaoyi and Mark Liberman. 1999. BITS: A method for bilingual text
search over the web. In Machine Translation Summit VII, September.
http://www.ldc.upenn.edu/Papers/MTSVII1999/BITS.ps.
Melamed, I. Dan. 1997. Automatic discovery of non-compositional compounds
in parallel data. In Proceedings of the 2nd Conference on Empirical Methods
in Natural Language Processing (EMNLP-97), Brown University, August.
Melamed, I. Dan. 2000. Models of translational equivalence among words.
Computational Linguistics, 26(2):221–249, June.
Oard, Douglas W. 1997. Cross-language text retrieval research in the USA.
In Third DELOS Workshop. European Research Consortium for Informatics
and Mathematics, March.
Resnik, Philip. 1998. Parallel strands: A preliminary investigation into mining
the Web for bilingual text. In Proceedings of the Third Conference of the
Association for Machine Translation in the Americas, AMTA-98, in Lecture
Notes in Artificial Intelligence, 1529, Langhorne, PA, October 28-31.
Resnik, Philip. 1999. Mining the Web for bilingual text. In Proceedings of the
37th Annual Meeting of the ACL, June.
Resnik, Philip and I. Dan Melamed. 1997. Semi-automatic acquisition of
domain-specific translation lexicons. In Fifth Conference on Applied Nat-
ural Language Processing, Washington, D.C.
Resnik, Philip, Mari Broman Olsen, and Mona Diab. 1999. The Bible as a
parallel corpus: Annotating the `Book of 2000 Tongues'. Computers and the
Humanities, 33:129–153.
Resnik, Philip and Noah A. Smith. 2002. Getting serious about "more is
better": mining the Internet Archive for bilingual text. In review.
Riloff, Ellen, Charles Schafer, and David Yarowsky. 2002. Inducing information
extraction systems for new languages via cross-language projection. In Nine-
teenth International Conference on Computational Linguistics (COLING-
2002), Taipei, Taiwan, August.
Smith, Noah A. 2001. Detection of translational equivalence. Un-
dergraduate honors thesis, University of Maryland College Park.
http://nlp.cs.jhu.edu/nasmith/cmsc-thesis.ps.
Smith, Noah A. 2002. From matching words to matching corpora: recognizing
translation. In review.
Tiedemann, Jörg. 1999. Automatic construction of weighted string similarity
measures. In Joint SIGDAT Conference on Empirical Methods in Natural
Language Processing and Very Large Corpora, College Park, MD, June.
Yarowsky, David and Grace Ngai. 2001. Inducing multilingual POS taggers
and NP bracketers via robust projection across aligned corpora. In Second
Meeting of the North American Chapter of the Association for Computational
Linguistics (NAACL-2001), Pittsburgh, PA, June.
Yarowsky, David, Grace Ngai, and Richard Wicentowski. 2001. Inducing mul-
tilingual text analysis tools via robust projection across aligned corpora. In
Proceedings of HLT 2001, First International Conference on Human Lan-
guage Technology Research.