Enhancing Automatic Acquisition of Thematic Structure in a
Large-Scale Lexicon for Mandarin Chinese
Mari Broman Olsen, Bonnie Dorr, and Scott Thomas
UMIACS
University of Maryland
College Park, MD 20742
phone: +1 (301) 405-6754
fax: +1 (301) 314-9658
{molsen,dorr,scthmas}@umiacs.umd.edu
Abstract
This paper describes a refinement to our procedure for porting lexical conceptual structure
into new languages. Specifically, we describe a two-step process for creating candidate thematic
grids for Mandarin Chinese verbs, using the English verb heading the VP in the subdefinitions
to separate senses, and roughly parsing the verb complement structure to match to our thematic
structure templates. The procedure is part of a larger process of creating a usable lexicon for
interlingual machine translation from a large on-line resource with both too much and too little
information necessary for our system.
Keywords: lexical conceptual structure, lexicon acquisition, machine translation, polysemy
1 Introduction
In previous work on Spanish and Arabic [Dorr et al., 1997, Dorr, 1997a], we reported the results of
an acquisition process for verb databases in new languages, using automatic assignment of candidate
thematic structure templates ("theta grids") and hand verification of the output. A substantial
number (39%) of the hand modifications to the output of the acquisition program were eliminations of
theta grids that were assigned to target language verbs because the English glosses were polysemous.
This paper reports on acquisition of a Mandarin Chinese verb database from an on-line resource
ten times as large as those we had worked with for Spanish and Arabic (600k, rather than 60k). The
procedure is part of a larger process of creating a usable lexicon for our interlingual machine
translation system [Dorr, 1997b, Hogan and Levin, 1994] from a large on-line resource that contains
both too much and too little of the information the system requires.
We attempt to reduce the amount of labor involved in the hand-correction scheme by
creating separate entries for polysemous verbs and parsing their English definitions with respect to
our database of thematic grids. Specifically, we find the head verb of each subgloss (i.e. 'run' in 'run
away calling for help') and assign the candidate theta grids associated with run in English. Furthermore,
we eliminate some candidate grids in favor of those matching the complement pattern in the gloss
(the prepositional phrase 'away', above). This procedure results in a reduction of 11%, or 15,565
possible candidate thematic grids that will not have to be evaluated by hand. In addition, the
separation of senses allows candidate grids to be evaluated with respect to a particular verb sense.
Noise that is introduced by polysemy in the English glosses (e.g. 'run a machine' and 'run a
race') may therefore be limited to the relevant verb sense, rather than combined in a "bag of grids"
attached to a verb.
This reduction has been accomplished without substantive loss of relevant definitions, as evaluated
in a preliminary task, assigning theta grids to verbs in 10 sentences from a corpus of 10 Xinhua
articles. Further evaluation will be possible on the 263 verbs from this corpus, which appear
in 423 different senses (an average of 1-2 senses per verb), as well as on the full set of verbs, as the
hand-checking progresses.
We see this work as a step in the direction suggested by Dorr, Marti and Castellon [1997], who
observe that "The amount of time required for the hand-verification process [for Spanish] would be
greatly reduced if the issue of polysemy had been addressed earlier in the process." That research
generated 18353 candidate theta grids, representing 3623 verbs in the initial Spanish-English lexicon.
Of these, 3025 entries were verified as correct (16.5%), and 15328 (83.5%) had to be modified in some
way. There were 6082 deletions of entries, 334 reclassifications (resulting in changes of entries) and
6295 refinements of entries. The refinements included 3648 deletions of non-applicable entries and 2747
changes to prepositions, optional roles made obligatory, etc. A further 2617 entries (955 verbs) were
deleted due to rarity of usage and/or disjointness with respect to WordNet concepts, and 1213 new
entries were added (1092 verbs not in the initial database). That is, there were a total of 9730
deletions (6082 + 3648), representing 63.5% of the required modifications, and 53% of the total
number of candidate grids.
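The percentages quoted above can be rechecked from the raw counts. The short Python sketch below is only a sanity check on the arithmetic; the helper name share is ours and is not part of the original study.

```python
def share(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


candidates = 18353        # candidate theta grids generated for Spanish
verified = 3025           # entries verified as correct
modified = 15328          # entries that had to be modified in some way
deletions = 6082 + 3648   # entry deletions plus deletions made during refinement

print(share(verified, candidates))    # share of candidates verified as correct
print(share(modified, candidates))    # share of candidates that were modified
print(share(deletions, modified))     # deletions as a share of all modifications
print(share(deletions, candidates))   # deletions as a share of all candidates
```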
Thus, an automatic process that reduces the number of deletions in a principled way would
substantially reduce the hand-correction process. We first describe the role of the theta grids in our
system. We then describe our lexicon acquisition procedure, with respect to the verbs, detailing, at
every step, how we attempted to deal with polysemy and overgeneration of theta grids.
2 Thematic Structure: Theta Grids
Thematic structure serves as the interface between the syntactic component (parsing) and the lexical-
semantic component, the Lexical Conceptual Structure (LCS). In particular, the assignment of a set
of thematic roles to a structure allows a unique interlingual LCS structure to be created, as in the
following pair, representing a sentence like Derek filled the bucket with water. The thematic structure
and LCS gloss roughly as 'Agent causes something to be VERBed with something.'
:THETA_ROLES "_ag_th,mod-poss(with)"
:LCS (cause (* thing 1) (be ident (* thing 2) (at ident (thing 2) (!!-ed 9)))
((* with 15) poss (*head*) (thing 16)))
The theta grids therefore map directly into LCS variables, including ag(ent), th(eme), exp(eriencer),
and goal. In our grid scheme, obligatory theta roles are preceded by an underscore, and optional
roles by a comma, e.g. _ag,th for John ate (lunch).
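To make the notation concrete, the following Python sketch splits a grid string into its roles. It assumes only what is stated above (underscore for obligatory, comma for optional) plus the parenthesized preposition seen in mod-poss(with); the names Role and parse_grid are hypothetical and do not come from our system.

```python
import re
from typing import NamedTuple, Optional


class Role(NamedTuple):
    name: str             # e.g. "ag", "th", "mod-poss"
    obligatory: bool      # True if introduced by "_", False if introduced by ","
    prep: Optional[str]   # None: no parentheses; "": unspecified "()"; else the preposition


# One role: a marker, a role name, and an optional parenthesized preposition.
_ROLE = re.compile(r"([_,])([a-z-]+)(?:\(([a-z]*)\))?")


def parse_grid(grid):
    """Parse a theta grid such as "_ag_th,mod-poss(with)" into a list of Roles."""
    matches = list(_ROLE.finditer(grid))
    # Reject strings that contain anything other than well-formed roles.
    if "".join(m.group(0) for m in matches) != grid:
        raise ValueError("malformed theta grid: %r" % grid)
    return [Role(m.group(2), m.group(1) == "_", m.group(3)) for m in matches]


for role in parse_grid("_ag_th,mod-poss(with)"):
    print(role)
```

Keeping the preposition as None, an empty string, or a word preserves the three-way distinction between a bare role, a role with an unspecified preposition, and a role with a fixed preposition.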
Verbs in a language can take more than one theta grid. The set of theta grids permitted by
a verb allows it to be grouped with others in a class taking the same semantic structures, represented
by the LCS [Levin, 1993]. For example, verbs like fill, carpet, cloak and plug allow the grid
_ag_th,mod-poss(with), as in Derek filled the bucket with water, and not Derek filled the water
into the bucket. Verbs like pour, drip, dribble, and slosh allow the latter pattern, but not the
former, with the grid ,ag_th,src(),goal(), as in Derek poured the water into the bucket, but not
Derek poured the bucket with water. The grids therefore group verbs by "semantic structure"
[Levin and Hovav, 1995]. In contrast to "semantic content," a term used to label idiosyncratic
aspects of verb meaning, semantic structure is important in determining syntactic patterning within
and across languages [Dorr and Oard, 1998, Grimshaw, 1993, Pinker, 1984, Pinker, 1989].
3 Verb Selection
The assignment of thematic grids is one step in the creation of a lexicon from a large (600k entries)
machine readable Chinese-English dictionary. The dictionary was compiled by hand by the Chinese-English
Translation Assistance (CETA) group from some 250 dictionaries, some general purpose,
others domain-specific or bilingual (Russian-Chinese, English-Chinese, etc.). The CETA group,
started in 1965 and continuing into the present decade, was a joint government-academic project. The
machine readable version of the resulting Optilex dictionary is licensed from the MRM corporation,
Kensington, MD.
As is often the case, the information required by our machine translation system, including part of
speech tagging, is not directly encoded in Optilex. We identified the verbs by a simple process.
We parsed the DEF: field in each optilex.dict entry derived from 20 of the more general of those
dictionaries, using regular expressions to find English glosses beginning with the infinitival 'to'.
When such glosses are found, the whole entry is used to generate as many new verb entries as there
are verbs in the entry's DEF: field. As an example, the following excerpt from an optilex.dict entry
has 4 definitions in its DEF: field. PY gives the Pinyin representation, and the DEF: field is the
English gloss. Optilex includes other fields not listed, including HWT and HWS fields for Chinese
pictograms, STC for the Standard Telegram Code, and REF: for the dictionaries the entry came from.
... PY : bian1 ta4
DEF: 1. to whip, to flog 2. <fig> to chastise, to castigate ...
After processing, each definition has a single entry in the verb-entries.txt file:
(1) ... DEF: to castigate ...
(2) ... DEF: to chastise ...
(3) ... DEF: to flog ...
(4) ... DEF: to whip
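The splitting step can be sketched in Python as follows. The regular expressions actually used on Optilex are not reproduced in this paper, so the patterns and the function name verb_glosses below are assumptions chosen to handle the bian1 ta4 example; real DEF: fields are likely to need more cases (for instance, glosses that contain commas).

```python
import re

# Hypothetical patterns: sense numbers such as "1." and usage tags such as "<fig>".
SENSE_NUMBER = re.compile(r"\b\d+\.\s*")
USAGE_TAG = re.compile(r"<[^>]*>\s*")


def verb_glosses(def_field):
    """Return the sorted glosses in a DEF: field that begin with infinitival 'to'."""
    text = SENSE_NUMBER.sub(", ", def_field)   # treat sense numbers as separators
    text = USAGE_TAG.sub("", text)             # drop usage tags
    glosses = [gloss.strip() for gloss in text.split(",")]
    return sorted(gloss for gloss in glosses if gloss.startswith("to "))


for gloss in verb_glosses("1. to whip, to flog 2. <fig> to chastise, to castigate"):
    print("DEF:", gloss)
```

Sorting is used here only to reproduce the alphabetical order of entries (1) through (4).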
Besides missing information, the dictionary contained additional information that needed to be
eliminated. Some of the 250 resources used to create the dictionary were very domain-specific,
including, for example, Collier's North China Colloquial Collection, a publication listing many
regionalisms not observed anywhere else in China, and the Faxue Cidian, a dictionary of legal terms
from Shanghai. We eliminated many archaic and technical verbs by eliminating verbs identified by
Optilex as derived from these sources (see footnote 1).
Nevertheless, entries varied widely in specificity, from the general verbs (and other words) to the
extremely specific, as in the examples below.
(5) po4 shi3 compel
(6) po4 shi3 force
(7) ben1 pao3 run
(8) zou3 walk
(9) chu1 kou3 speak (see footnote 2)
(10) bi1 gong1 force the sovereign to abdicate (th = sovereign) (prop = to abdicate)
(11) ben1 zou3 xiang1 gao4 run around spreading the news (mod-loc = spreading news)
(12) ci3 walk on the ball of the foot (see footnote 3)
(13) chui1 xu1 speak in favor of somebody in exaggerated terms
4 Pairing Verbs and Candidate Thematic Grids
4.1 English glosses
Next, we needed to assign thematic grids to the verbs. We estimated that creating thematic grids
by hand would take 6 person months for a lexicon on the order of 60k entries [Dorr et al., 1997].
As mentioned above, for the Arabic and Spanish lexicons, we created candidate thematic grids
by pairing target language words with the thematic grids associated with their English gloss, followed
by hand correction over a period of two weeks.
Footnote 1: We included verbs from the following sources: BF Chinese-English Dictionary 1978; BE same
as BF but Chinese-Chinese 1978; AR Atlas of the PRC 1977 (for Chinese placenames); AO Gazetteer of
the PRC (also for Chinese placenames); BQ extra new entries from the first two above (BE and BF);
CJ standardized FBIS translations of Chinese communist terms; CM specialized terms extracted from
Mao's works; CU Hong Kong glossary of Chinese communist terms; EJ 1981 idiom dictionary; EK 1982
idiom dictionary; FA Foreign Exchange terms 1963; IP International political economics glossary 1980;
IQ Beijing social sciences academy economics terms 1983; NA world place names 1981; PP primary
political economics glossary 1956; TM McGraw-Hill general scientific and technical dictionary 1963;
VF Lin Yutang's dictionary 1972; VT 1973 Beijing foreign exchange glossary; WB Liang Shih-ch'iu's
traditional dictionary 1973; YG Stanford's dictionary of Chinese communist terms 1973.
Footnote 2: As in 'to speak ill of someone.' This meaning is the first listed, although John Kovarik
(p.c.) claims it is less popular than at least two others, including 'exit', as in exit signs.
Footnote 3: The diamond means no glyph mapping is available for the character code.
We did the same initial step for Chinese as well. However, as described above, the senses had
already been separated into different entries. We thus had a candidate theta grid set paired with a
specific sense of a verb. The file Chinese-grids was created by first matching the main verb of the
English glosses to one or more entries in the English-grids file.
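As a rough illustration of this pairing step, the Python sketch below looks up the head verb of each gloss in a small table of English grids. The table is a toy excerpt transcribed from the examples in this section, and the names ENGLISH_GRIDS, head_verb and candidate_grids are hypothetical; taking the first word after 'to' as the head verb is a simplifying assumption, not a description of our parser.

```python
# Toy excerpt of an English-grids table: head verb -> (Levin class, theta grid) pairs.
ENGLISH_GRIDS = {
    "run": [("26.3", "_ag"), ("47.7", "_th,goal()"), ("51.3.2", "_ag")],
    "support": [("31.2", "_exp_perc,mod-poss(in)"), ("47.8", "_th_loc")],
}


def head_verb(gloss):
    """Return the first word of the gloss after an optional infinitival 'to'."""
    words = gloss.split()
    if words and words[0] == "to":
        words = words[1:]
    return words[0] if words else None


def candidate_grids(senses):
    """Pair each (pinyin, gloss) sense with the grids of its English head verb."""
    rows = []
    for pinyin, gloss in senses:
        for levin_class, grid in ENGLISH_GRIDS.get(head_verb(gloss), []):
            rows.append((levin_class, grid, pinyin, gloss))
    return rows


for row in candidate_grids([("chi2", "to run"), ("chi2", "to support")]):
    print(*row)
```

Because each sense is a separate input row, the grids derived from run and those derived from support stay attached to their own gloss rather than being pooled under the character.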
Separating polysemous entries helps us here, since not all grids are associated with all meanings
of the verb. For example, a wide range of grids is available for the run verbs.
(14) 26.3 _ag chi2 run
(15) 26.3 _ag_ben_th chi2 run
(16) 26.3 _ag_th,ben(for) chi2 run
(17) 47.5.1 _ag,mod-loc() chi2 run
(18) 47.5.1 _loc_th chi2 run
(19) 47.5.1 _th_loc() chi2 run
(20) 47.7 _th_goal() chi2 run
(21) 47.7 _th_src(from)_goal(to) chi2 run
(22) 51.3.2 _ag chi2 run
(23) 51.3.2 _th,src(),goal() chi2 run
In contrast, a relatively small number is available for other meanings of this character. In previous
work, all grids were associated with the single entry. Senses had to be separated by hand, and the
opportunity for human error was great, with humans deleting or retaining grids depending on which
sense of the verb they had in mind.
(24) 31.2 _exp_perc,mod-poss(in) chi2 support
(25) 47.8 _th_loc chi2 support
In the case at hand, it turns out that chi2 means 'run', as in 'run a business' or 'run a machine',
whereas the theta grids were derived from the motion verb run in English. Should the grids prove
inappropriate in the hand-verification stage, they can be deleted without affecting the entries glossed
'support'. In previous work, the checker was presented with a "bag of grids", without a link to a
specific meaning.
4.2 Parsing the Definition
After assigning a candidate set to each gloss, we matched various phrases in each gloss to the theta
roles in the grid entries. This permits some automatic modification of the grids that in earlier work
had been done by hand. For instance, the gloss 'to force the sovereign to abdicate' matches the
English-verb database entry with roles _ag_th,prop(to) for the verb to force; that is, the English
verb takes an agent, a theme, and optionally a propositional element, the latter matching "to V ...".
Thus an entry in the Chinese-grids file is created that reads, in part, ... _ag ... (th =
sovereign) (prop = to abdicate). That is, the matched theta roles are added, with their thematic
assignments, to the end of the entry, resulting in 11,360 distinct theta role assignments. In
some cases the original theta-roles list becomes empty, in which case it appears as 0, which is the
theta grid for verbs with no semantic arguments, such as rain in English It's raining.
Similarly, PPs in a gloss are matched to a grid element, if possible, and that grid element is
removed from the grid. For roles with an unspecified preposition, we heuristically assigned certain
roles to certain prepositions, namely:
from: src or instr
for: purp
with: instr or mod-poss
without: mod-poss
into: goal
to: goal
against: goal
under: mod-loc
around: mod-loc
along: mod-loc
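The table above can be read as a lookup from preposition to admissible roles. The Python sketch below shows one way such a lookup might drive PP matching; the function match_pp and its policy of trying explicitly specified prepositions before unspecified ones are our assumptions for illustration, not a description of the actual implementation.

```python
# Preposition -> roles it may fill when the grid leaves the preposition unspecified.
PREP_ROLES = {
    "from": ("src", "instr"),
    "for": ("purp",),
    "with": ("instr", "mod-poss"),
    "without": ("mod-poss",),
    "into": ("goal",),
    "to": ("goal",),
    "against": ("goal",),
    "under": ("mod-loc",),
    "around": ("mod-loc",),
    "along": ("mod-loc",),
}


def match_pp(prep, grid_roles):
    """Return the name of the grid role that a PP headed by prep can fill, or None.

    grid_roles is a list of (role_name, role_prep) pairs for the parenthesized
    roles of a grid, where role_prep is "" when the preposition is unspecified.
    """
    # A role that names this preposition explicitly takes priority.
    for name, role_prep in grid_roles:
        if role_prep == prep:
            return name
    # Otherwise fall back on the heuristic table for unspecified prepositions.
    for name, role_prep in grid_roles:
        if role_prep == "" and name in PREP_ROLES.get(prep, ()):
            return name
    return None


print(match_pp("with", [("mod-poss", "with")]))
print(match_pp("into", [("src", ""), ("goal", "")]))
```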
We observed that prepositional phrases with prep = 'of' seem to almost always attach to the
preceding phrase, rather than appear as an argument, at least in these glosses. Thus we ignored
the possibility of an of-PP as an argument, always making it a part of the argument that precedes it.
Adverbs in positions where they typically modify the verb (rather than an adjective), that is,
near the verb or at the end of the gloss, become MANNER components. ('Typical' was determined
by looking at the results for these particular glosses.)
A gloss that ends with a dangling preposition was taken as a sign that, where the English verb
takes a PP, the Chinese verb fills the same role with a bare NP argument. Thus the parentheses are
removed from the grid for that role.
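The dangling-preposition rule can be sketched as a small string rewrite. In the Python example below, the function name bare_np_role, the grid _ag_th,goal() and the gloss 'to put into' are all hypothetical and serve only to illustrate the rule; the caller is assumed to know which role the preposition corresponds to.

```python
import re

# Prepositions covered by the heuristic table in this section.
PREPOSITIONS = {
    "from", "for", "with", "without", "into",
    "to", "against", "under", "around", "along",
}


def bare_np_role(grid, gloss, role):
    """If gloss ends in a preposition, strip the parentheses from role in grid."""
    words = gloss.split()
    if not words or words[-1] not in PREPOSITIONS:
        return grid
    # The lookbehind keeps "loc" from matching inside "mod-loc".
    pattern = r"(?<![a-z-])" + re.escape(role) + r"\([a-z]*\)"
    return re.sub(pattern, role, grid, count=1)


print(bare_np_role("_ag_th,goal()", "to put into", "goal"))
```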
These actions all have the effect of matching elements in the gloss to elements in the grid, and
eliminating these elements from the grid, thus reducing the number of grids that have to be
hand-checked. We also deleted certain entries entirely, prior to hand checking. In particular, if the
set of theta roles lexicalized in the Chinese verb according to an English gloss (which may be the
empty set) for one entry of the (polysemous) verb is a proper subset of that for another entry of
that verb, the corresponding grid is discarded. This allowed an 11% reduction in the number of
entries that needed to be hand checked.
For example, a Chinese verb glossed as fill would pull in the theta grid discussed above, shown in
the following entry from our theta grid database. (The initial number is the verb class, from Levin
[1993].)
(26) 9.8 . _ag_th,mod-poss(with) mi2 man4 fill
But a verb which incorporates the with element in its meaning would not allow that element
to appear as another argument in the complement structure, so we automatically eliminate the
mod-poss role from the grid.
(27) 9.8 . _ag_th tian2 tu3 fill in with earth (mod-poss = earth)
Similarly, this heuristic allowed four potential candidates to be reduced to one, for the following
verb.
(28) Remaining:
29.3 _ag_th bao4 gao4 make known (pred = known)
(29) Suppressed:
26.1 _ag_ben_th,instr() bao4 gao4 make known
26.1 _ag_th,instr(),ben(for) bao4 gao4 make known
26.1 _ag_th_pred(into),ben(for) bao4 gao4 make known
That is, in class 29.3 PRED was incorporated in the verb meaning, whereas nothing was incorporated
for the other grids. So for this gloss, only the 29.3 grid is used, the empty set being a proper subset
of a non-empty set.
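The suppression rule amounts to a proper-subset test over the sets of incorporated roles. The Python sketch below applies it to the bao4 gao4 example; the Entry type and the suppress function are hypothetical names, and the sketch assumes that all entries passed in belong to the same gloss.

```python
from typing import FrozenSet, NamedTuple


class Entry(NamedTuple):
    levin_class: str
    grid: str                       # roles left in the grid after matching
    incorporated: FrozenSet[str]    # roles matched to material in the gloss


def suppress(entries):
    """Split entries for one gloss into (kept, suppressed).

    An entry is suppressed when its set of incorporated roles is a proper
    subset of the incorporated roles of some other entry for the same gloss.
    """
    kept, suppressed = [], []
    for entry in entries:
        # For frozensets, "<" is the proper-subset test.
        if any(entry.incorporated < other.incorporated for other in entries):
            suppressed.append(entry)
        else:
            kept.append(entry)
    return kept, suppressed


make_known = [
    Entry("29.3", "_ag_th", frozenset({"pred"})),
    Entry("26.1", "_ag_ben_th,instr()", frozenset()),
    Entry("26.1", "_ag_th,instr(),ben(for)", frozenset()),
    Entry("26.1", "_ag_th_pred(into),ben(for)", frozenset()),
]
kept, suppressed = suppress(make_known)
print([entry.levin_class for entry in kept])
print(len(suppressed))
```

Entries whose incorporated sets are equal or incomparable are all kept, so the rule never removes the last remaining grid for a gloss.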
This is comparable to what was done by hand for Spanish: In the case of escribir, two theta grids
were automatically assigned, _ag_th,goal (as in He wrote his name on the photo) and _ag_th (as in
He wrote his name). The latter was left in the database since it provides the most basic thematic
requirements for the verb.
5 Results
Using this heuristic, 15565 thematic grids were eliminated, representing 11% of the total number
of candidate thematic grids. We are beginning the process of hand-evaluation of the theta grids,
starting with the verbs in 10 articles from Xinhua, comparable to the Wall Street Journal in
content. For these verbs, 124 grids were suppressed for 47 verbs (29 classes),
leaving 3041 grids for 263 verbs (characters, rather than definitions). The number of grids assigned to
a given verb (a character set) averages 11.6, and ranges from 0 (for verbs for which a grid cannot be
found in our current database) to 183. A set of 51 theta grids was generated for the 13 verbs in ten
sentences from these articles. Chinese speakers deleted 17 grids, or 33.3%. Although these results
cover a tiny subset of the full verb lexicon, this figure compares favorably to the 53% deletion required
for the Spanish data. Importantly, none of the relevant grids had been discarded by our algorithm
(see footnote 4).
As the hand-verification work progresses, we will evaluate the results on a broader scale, tracking
necessary modifications to the full set of theta grids for the 120k verb senses. We predict, for example,
that the number of deletions should be less than that for the Spanish.
6 Conclusions and Future Work
We have described a procedure for automatically reducing the amount of hand-checking necessary
for building the thematic grid structure for verbs in Chinese. We anticipate that this procedure
will save us time over our original checking procedure. The latter, in turn, reduced the amount of
time required to create thematic structure from 6 person months (for a lexicon with 60k entries and
3-4k verbs) to approximately two weeks of hand verification. The time savings for our project are
even more imperative, since we have some 150k verb-sense entries. This procedure provides further
streamlining for the process of acquiring large-scale lexica for NLP applications with non-optimal
on-line resources.
Footnote 4: The copular grid for the verb shi4 was added to the set, using a grid assigned to other
copular verbs, namely wei2 and zuo4. Somewhat surprisingly, the absence of the copular grid in our
candidate set resulted from an absence in Optilex of the copular meaning for that verb.
In addressing the polysemy problem in this context, we have, as a side-product, produced a
sense-to-syntax mapping, tying verb senses of a particular character to a set of grids representing
syntactic as well as semantic structure. This resource, in turn, could be used not only for machine
translation, but for testing and applying word sense disambiguation algorithms for Chinese.
Acknowledgments
This work has been supported, in part, by NSA Contract MDA904-96-C-1250. The second author is
also supported by DARPA/ITO Contract N66001-97-C-8540, Army Research Laboratory contract
DAAL01-97-C-0042, NSF PFF IRI-9629108 and Logos Corporation, NSF CNRS INT-9314583, and
Alfred P. Sloan Research Fellowship Award BR3336. We would like to thank members of the following
lab groups at Maryland: Computational Linguistics and Information Processing (CLIP), and
Language And Media Processing (LAMP), particularly Galen Wilkerson for his implementation and
description of verb selection, and John Kovarik, on loan from the Department of Defense.
References
[Dorr and Oard, 1998] Bonnie J. Dorr and Douglas W. Oard. Evaluating resources for query
translation in cross-language information retrieval. In Proceedings of the First International
Conference on Language Resources and Evaluation, Granada, Spain, 1998.
[Dorr et al., 1997] Bonnie J. Dorr, Antonia Marti, and Irene Castellon. Spanish EuroWordNet and
LCS-Based Interlingual MT. In Proceedings of the MT Summit Workshop on Interlinguas in MT,
San Diego, CA, October 1997.
[Dorr, 1997a] Bonnie J. Dorr. Large-Scale Acquisition of LCS-Based Lexicons for Foreign Language
Tutoring. In Proceedings of the ACL Fifth Conference on Applied Natural Language Processing
(ANLP), pages 139-146, Washington, DC, 1997.
[Dorr, 1997b] Bonnie J. Dorr. Large-Scale Dictionary Construction for Foreign Language Tutoring
and Interlingual Machine Translation. Machine Translation, 12(4):271-322, 1997.
[Grimshaw, 1993] Jane Grimshaw. Semantic Structure and Semantic Content in Lexical
Representation. Unpublished ms., Rutgers University, New Brunswick, NJ, 1993.
[Hogan and Levin, 1994] Chris Hogan and Lori Levin. Data Sparseness in the Acquisition of
Syntax-Semantics Mappings. In Proceedings of the Post-COLING94 International Workshop on Directions
of Lexical Research, pages 153-159, Nicoletta Calzolari and Chengming Guo (co-chairs), Tsinghua
University, Beijing, 1994.
[Levin and Hovav, 1995] Beth Levin and Malka Rappaport Hovav. The Elasticity of Verb Meaning.
In Proceedings of the Tenth Annual Conference of the Israel Association for Theoretical Linguistics
and the Workshop on the Syntax-Semantics Interface, Bar Ilan University, Israel, 1995.
[Levin, 1993] Beth Levin. English Verb Classes and Alternations: A Preliminary Investigation.
University of Chicago Press, Chicago, IL, 1993.
[Pinker, 1984] Steven Pinker. Language Learnability and Language Development. MIT Press,
Cambridge, MA, 1984.
[Pinker, 1989] Steven Pinker. Learnability and Cognition: The Acquisition of Argument Structure.
The MIT Press, Cambridge, MA, 1989.