0 and j = λ(q) − 1 and t = right then also add (q, 0, , right, ε) to Sm+1. [1.2] If i = m then [1.2.1] if j > 0 then for each (q', j', k', right, β) ∈ Sk such that rightj'+1(q') = left(q) and j' = λ(q') − 1, for each rule ξ such that left(q') = left(ξ) and right(q') = ρX and right(ξ) = ρY, for each rule r such that left(r) = Y, add
56
M. CORI, M. DE FORNEL & J.-M. MARANDIN
[2] Scan: if t ≠ cut then: [2.1] If i ≠ m and i < m + p + 1 and j < λ(q) and rightj+1(q) ∈ cat(ui+1) then [2.1.1] add (q, j + 1, k, t, α) to Si+1. [2.1.2] If i = m − 1 and t = right and j = λ(q) − 1 and j ≠ 0 then for each rule ξ such that left(q) = left(ξ) and right(q) = ρX and right(ξ) = ρY, if Y ∈ cat(um+2) then add (ξ, j + 1, k, t, α) to Sm+2. [2.2] If i = m and t = right and j = λ(q) − 1 and j ≠ 0 and rightj+1(q) ∈ cat(um+2) then add (q, j + 1, k, t, α) to Sm+2. [2.3] If i = m and j ≠ 0 then for each (q', j', k', right, β) ∈ Sk such that rightj'+1(q') = left(q) and j' = λ(q') − 1, for each rule ξ such that left(q') = left(ξ) and right(q') = ρX and right(ξ) = ρY, if Y ∈ cat(um+2) then add (ξ, j' + 1, k', right, β) to Sm+2.
[3] Complete: [3.1] If j = λ(q) and t ≠ cut then for each (q', j', k', t', β) ∈ Sk such that rightj'+1(q') = left(q), add (q', j' + 1, k', t', βqα) to Si. [3.2] If i = m and λ(q) > j > 0 then for each (q', j', k', t', β) ∈ Sk such that rightj'+1(q') = left(q), [3.2.1] if t ≠ cut and rightj(q) ∈ VT then add (q', j' + 1, k', cut, βq[j]α) to Sm. [3.2.2] If t = cut then add (q', j' + 1, k', cut, βqα) to Sm.
If (0, 1, 0, right, α) belongs to Sm+1+p then a new tree is given as α. If (0, j, 0, cut, β) belongs to Sm then an interrupted tree is given as β. The tree β represents the substring u1u2...um. Note that if the grammar is left-recursive, it may happen that there are an infinite number of interrupted trees.
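The repair extension above is built over Earley's (1970) algorithm. For reference, a minimal sketch of the standard recognizer (without the repair states, the cut value, or tree construction) might look like the following; the toy grammar, category table and sentence are illustrative and not taken from the paper.

```python
# Minimal Earley recognizer: items are (lhs, rhs, dot, origin), held in sets S[0..n].
GRAMMAR = {                       # hypothetical toy grammar
    "S":  [["NP", "VP"]],
    "NP": [["det", "n"]],
    "VP": [["v", "NP"]],
}

def earley_recognize(words, cats, start="S"):
    n = len(words)
    S = [set() for _ in range(n + 1)]
    for rhs in GRAMMAR[start]:
        S[0].add((start, tuple(rhs), 0, 0))
    for i in range(n + 1):
        agenda = list(S[i])
        while agenda:
            lhs, rhs, dot, k = agenda.pop()
            if dot < len(rhs):
                nxt = rhs[dot]
                if nxt in GRAMMAR:                      # Predict
                    for alt in GRAMMAR[nxt]:
                        item = (nxt, tuple(alt), 0, i)
                        if item not in S[i]:
                            S[i].add(item); agenda.append(item)
                elif i < n and nxt in cats[words[i]]:   # Scan
                    S[i + 1].add((lhs, rhs, dot + 1, k))
            else:                                       # Complete
                for lhs2, rhs2, dot2, k2 in list(S[k]):
                    if dot2 < len(rhs2) and rhs2[dot2] == lhs:
                        item = (lhs2, rhs2, dot2 + 1, k2)
                        if item not in S[i]:
                            S[i].add(item); agenda.append(item)
    return any(item == (start, tuple(rhs), len(rhs), 0)
               for rhs in GRAMMAR[start]
               for item in S[n])

cats = {"the": {"det"}, "dog": {"n"}, "saw": {"v"}, "cat": {"n"}}
print(earley_recognize(["the", "dog", "saw", "the", "cat"], cats))  # True
```

The repair algorithm of the paper adds to this skeleton the fourth and fifth item components (t and the tree α), the special position m of the interruption, and steps [1.2], [2.2], [2.3] and [3.2] above.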
PARSING REPAIRS

5.4 Remarks
The algorithm can be easily extended to handle cascaded repairs. A simpler version13 should be used to provide the inputs for the understanding component: it only yields the repaired trees. The repaired trees are the relevant inputs for building discourse units, i.e., sentential turn-constructional units in conversation (see Schegloff 1979). For example, they allow the possible completion of the current turn and make transition to the next turn possible. 6
Conclusion
The main result of the study is the following: even though the configuration "interrupted utterance + repair(s)" does not belong to the syntactic repertoire of French, it is submitted to a syntactic well-formedness condition. The REP is a simple and unified account of the regularity of self-repair. It comes into line with Schegloff's observation (1979:277): "the effect [of successful repair] is the resumption of the turn-unit before the repair initiation or, if the repair operation involves reconstruction of the whole turn-unit, production of the turn-unit to completion". It gives a more adequate content to Levelt's claim: "speakers repair in a linguistically principled way". Thanks to the augmentation of the Earley algorithm, we claim that parsers can parse repairs in a syntactically principled way.

REFERENCES

Blanche-Benveniste, Claire. 1987. "Syntaxe, choix de lexique, et lieux de bafouillage". DRLAV 36-37.123-157. Paris: Université de Paris-VIII.

Cori, Marcel, Michel de Fornel & J.-M. Marandin. 1995. "Analyse syntaxique de l'auto-réparation". Colloque 'Le traitement automatique du langage naturel', 209-219. Marseille.

De Smedt, Koenraad & Gerard Kempen. 1987. "Incremental sentence production, self-correction and coordination". Natural Language Generation ed. by G. Kempen, 365-376. Dordrecht: Kluwer Academic.

Dowty, David. 1988. "Type raising, functional composition and non-constituent conjunction". Categorial Grammars and Natural Language Structures ed. by Richard Oehrle et al., 153-197. Dordrecht: D. Reidel.

13
Without cut as a value of type in the definition of a state and without [3.2] in the definition of Complete.
Earley, Jay. 1970. "An Efficient Context-Free Parsing Algorithm". Communications of the Association for Computing Machinery 13:2.94-102.

Fornel, Michel de. 1992a. "The Return Gesture: Some Remarks on Context, Inference, and Iconic Gesture". The Contextualization of Language ed. by P. Auer & A. Di Luzio, 159-176. Amsterdam: John Benjamins.

Fornel, Michel de. 1992b. "De la pertinence du geste dans les séquences de réparation". Les formes de la conversation ed. by Conein et al., 119-154. Paris: CNET/CNRS.

Fornel, Michel de & J.-M. Marandin. Forthcoming. "L'analyse grammaticale des auto-réparations".

Frederking, Robert. 1988. Integrated Natural Language Dialog. Dordrecht: Kluwer.

Fromkin, Victoria. 1973. Speech Errors as Linguistic Evidence. The Hague: Mouton.

Gardent, Claire. 1991. Gapping and VP Ellipsis in a Unification-Based Grammar. Ph.D. dissertation, University of Edinburgh, Edinburgh, Scotland.

Hankamer, Jorge. 1973. "Unacceptable Ambiguity". Linguistic Inquiry 4.17-68.

Hindle, Donald. 1983. "Deterministic Parsing of Syntactic Non-fluencies". Proceedings of the 21st Meeting of the Association for Computational Linguistics (ACL'83), 123-128.

Kay, Martin, M. Gawron & P. Norvig. 1993. Verbmobil: A Translation System for Face-to-Face Dialog (= CSLI Lecture Notes, 33). Chicago: Chicago University Press.

Kroch, Anthony & D. Hindle. 1982. "On the Linguistic Character of Non-Standard Input". Proceedings of the 20th Meeting of the Association for Computational Linguistics (ACL'82), 161-163.

Levelt, William. 1983. "Monitoring and Self-Repair in Speech". Cognition 14.41-104.

Levelt, William. 1989. Speaking: From Intention to Articulation. Cambridge, Mass.: MIT Press.

Prüst, Hub. 1993. On Discourse Structuring, VP Anaphora and Gapping. Ph.D. dissertation, University of Amsterdam.
Sag, Ivan, Th. Wasow, G. Gazdar & S. Weisler. 1985. "Coordination and how to distinguish categories". Natural Language and Linguistic Theory 3:2.117-171.

Schegloff, Emanuel. 1979. "The relevance of repair to syntax-for-conversation". Syntax and Semantics 12 ed. by T. Givón, 261-286. New York: Academic Press.

Schegloff, Emanuel, G. Jefferson & H. Sacks. 1977. "The preference for self-correction in the organisation of repair in conversation". Language 53:2.361-382.
Parsing for Targeted Errors in Controlled Languages

MATTHEW F. HURST
University of Edinburgh

Abstract

The use of Controlled Languages in technical documentation is becoming a large concern for many organisations. Authoring texts which conform to these specifications is a problematic process. Technological support for the writing process may offer a number of aids, including style, or grammar, checkers. The ability to recognise variations to the prescribed grammar is at the heart of such systems. This paper presents a variation on the chart parsing method which encodes the grammar as finite state automata productions instead of a linear description of constituents. The system allows the grammar writer to define a number of variations to a grammar rule which are represented as transformations to the automata. 1
Introduction
The SEATS (Specialised English Author Training System) project aims to create technology capable of supporting the process of writing technical documentation according to the stylistic requirements of a Controlled Language. Central to this support is a style checker based on a flexible parsing mechanism. This paper introduces the notion of Controlled Language, overviews some relevant previous work in the area of robust parsing and describes a novel parser which uses finite state automata as a rule system in the chart parsing paradigm. 2
Controlled language and grammar checking
A controlled language (CL) is a restricted variation on some natural language. The purpose of defining a CL for some domain is to control aspects of the language used to describe a task in that domain. The control is designed to reduce the ambiguity inherent in natural languages, making the text easier to understand, and less prone to incorrect interpretation. A typical application area is one in which the correct execution of a procedure manipulating objects in the domain is safety (or legally) critical. Any aspect of a natural language may be controlled by more or less formal rules. These rules may be specific (e.g., preventing the use of a
particular word) or general (using a model of that linguistic component, be it lexical, grammatical or discourse level). Lexical control says something about the use of lexical items, typically providing a dictionary of approved words. Grammatical control endorses the use of a set of constructions. Discourse level control stipulates the introduction of topics, the structure of information introduction and so on. Controlled Languages offer an ideal application field for language checking technology. Whereas the task of free text checking can never offer complete coverage of the language, controlled languages can generate grammatical models very close, if not identical, to the intended coverage. Additionally, free text may contain unseen lexical items, whereas controlled languages have a finite lexicon which forms part of the language definition.
3
Robust parsing
The field of robust parsing, or robust analysis, provides a useful set of techniques which can be applied to the task of detecting and reporting errors in text. The goals of robust parsing differ slightly from those of error detection. • Robust Parsing aims to provide an analysis of ill-formed text. A grammar and lexicon are used together with some set of techniques to align the text with the grammar. • Error Detection aims to detect the cause of failed analysis of ill-formed text, and report the error. In general, any technique for robust analysis can be applied to the task of error detection by augmenting certain data structures with the appropriate record of the ill-formedness consumed.
3.1
Positive and negative detection
The first consideration in classifying techniques for error checking is the distinction between positive and negative detection. 'Positive detection' is concerned with writing rules which describe the errors, i.e., ungrammatical rules. These rules are then used in the general analysis strategy, e.g., parsing. If they complete, then an error may have been found, at which point further analysis may be done. 'Negative detection' classifies methods which provide a model of the correct language and employ techniques to compare this model with the input.
3.2 Targeted and untargeted detection
Another dimension of technique classification is that of the mode of detection. 'Targeted detection' employs some declaration of the flexibility required in order to detect errors. This declaration is expressed as some form of annotation to the language model. 'Untargeted detection' techniques are those which use some general principle to align the model with the input (or the input with the model). The difference between targeted detection and positive detection is that in targeted detection, the core model is the correct grammar rule; the annotations to this rule describe the required flexibility. Positive detection, on the other hand, uses grammar rules which centre on the error as the key concept. Untargeted detection usually appears as an algorithmic component which provides some form of relaxation to the grammatical model. 3.3
Single phase and multiple phase
This classification of techniques refers to the time at which the error rules are considered. A 'Single Phase' approach would incorporate the rule system at the same time as parsing. This approach would be appropriate to positive detection strategies, as they are identical in implementation to a normal parsing of text. Extending this approach to negative strategies introduces interesting computational problems due to the multiplicity of possibilities. A 'Multiple Phase' approach would incorporate the detection of errors by first analysing the text as if it were well formed, and then reworking this analysis, incorporating the error mechanisms allowed by the definition of the error technique. 3.4
Current methods
Methods which can be classified to some degree in the above manner can be found in the literature on robust analysis. Mellish (1989) describes an example of a negative, untargeted, multiple phase approach to robust analysis. The method uses a grammar of English (hence negative) which it uses to construct a well-formed substring table (chart) employing a bottom-up parsing algorithm. Following this, it uses a modified top-down parser (hence, multiple phase) to attempt to complete the parse with the minimum errors. The use of this general, grammar-independent technique is an example of an untargeted approach.
Compare this with the negative, untargeted, single phase approach of Goeser (1992). Ballim & Russell (1994) describe a single phase approach which offers the grammar developer a weakly targeted, negative grammar environment in which to construct and experiment with rules. Here, grammar rules are annotated with bounds on the relaxations that may provide flexibility for certain constituents. Another single phase approach is described by Wang (1992). This method differs from the others mentioned here in that it employs a novel view of the parsing process, not relying on conventional grammar rules. Its flexibility is derived from a mechanism capable of only three simple actions. Consequently, as it has no traditional grammar model used in its analysis, it cannot strictly be classified with the other systems; an approximation, however, is as an untargeted approach (it presents a general mechanism capable of producing parses of ill-formed input). The positive/negative distinction doesn't apply as there is no grammar. Strzalkowski (1992) describes a parsing system built for speed. Its robust capabilities are untargeted and work through a mechanism which skips ungrammatical input. The paper mentions that, through the use of a time-out facility, no distinction is made between ungrammatical and simply expensive input. Skipped input can later be attached to the analysis, so the method is multiphase. Statistical approaches to undergeneration exist (e.g., Briscoe & Waegner 1994). This technique approaches the problem by assigning probabilities to all possible rules (modulo certain constraints described) over a terminal/nonterminal set in CNF. This approach is designed to be a single phase approach. However, its robust capabilities are captured during a stochastic training phase; consequently the normal model of a 'correct' grammar and an 'incorrect' input is less appropriate to this type of analysis. Work specifically in the area of error detection is less plentiful.
Douglas & Dale (1992) describe a system capable of relaxing constraints at a different level to those with which we are concerned here. The robust PATR model can be used to relax constraints represented as the feature structures of PATR rules in order to accept ill-formed input. The sort of ill-formedness which this method handles is the normal constraints of PATR notation. The implementation of the parser described below allows for variation between single phase and multiple phase parsing, though it is currently implemented as a single phase process. It uses targeted negative detection (note that it is always possible to add positive detection to any parsing mechanism simply by adding grammatically ill-formed rules). It was decided
to use targeted detection for purposes of speed. Mechanisms for arbitrary insertion and deletion, for example, are typically complex; Mellish (1989) reports a (worst case) 10 times increase in time taken when one error is introduced into a sentence. Additionally, the target controlled language (AECMA 1989) has many descriptions of variations to the correct grammar which are not permitted. Writing targeted negative grammars fits this type of language definition. Finally, using a similar model for encoding the language and possible errors as that of the manual will provide a consistent view of the language, a factor which we think will aid the learning of the language as well as the construction of a complete grammar. 4
Chart parsing with finite state automata
The operations required to perform parsing using a well-formed substring table are typically described as follows. 1. Rule invocation. An inactive edge is entered and rules are found for which this edge represents the initial constituent. 2. Combining with active edges. An inactive edge is entered and active edges are looked for with which it may combine. 3. Extension of active edges (usually termed the fundamental rule: Gazdar & Mellish 1989:193). An active edge is entered and inactive edges are looked for to complete or extend the span of the edge. A number of primitive operations are required to support these general operations. • matching: matching must be carried out between the constituents of rules. • addition of information: the creation of a new edge through step 2 or 3 can be viewed as the addition of information. This addition may be a simple update of a dotted rule, e.g., when using atomic categories, or may require more sophisticated operations like the unification of graphs in the case of a feature structure representation. The use of dotted rules (Earley 1986; Kay 1986) is the key behind the efficiency of the paradigm. Traditionally, the grammars used in such parsing schemes have been straight-forward context free grammars. These grammars may be implemented as simple atomic category rules (Andrews & Brown 1993) or more complex information representations such as unification formalisms. In both cases, it is important to ensure that the primitive operations of matching and addition of information can be carried out in efficient ways. The efficiency of the matching process can be increased by the use of indexing systems, both for the rule look-up, in which case rules are stored according to the index value of their initial daughter, and for the storage of edges in the chart, active edges being indexed on their next required daughter, and inactive edges being indexed on their mother. For the edges in the chart, an index is a vector over a vertex. Entering edges into the chart means entering them under the appropriate index at the delimiting vertices. An example of indexing is described by Andrews & Brown (1993). The use of finite state automata (FSA) as a grammar description has a similar form to the standard production described by a context free rule. Instead of a simple series of daughters, the right hand side consists of an FSA (Figure 1). The use of the language of regular expressions to describe finite state automata is well documented (Aho, Sethi & Ullman 1986:83; Gazdar & Mellish 1989:134), as are algorithms for constructing the machines from these descriptions.
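As a concrete illustration of such a construction, the following is a sketch in the style of the Thompson construction described by Aho, Sethi & Ullman: each constituent and each operator yields a small automaton fragment with one entry and one exit state, glued together by epsilon arcs. The class name, state numbering and representation are our own, not the paper's.

```python
# Thompson-style construction of an epsilon-NFA from concatenation and
# disjunction. Each fragment is a (start, accept) pair of states.
class NFA:
    def __init__(self):
        self.trans = []          # list of (src, label, dst); label None = epsilon
        self.n = 0               # next fresh state number
    def state(self):
        self.n += 1
        return self.n - 1
    def sym(self, a):            # fragment consuming a single constituent
        s, t = self.state(), self.state()
        self.trans.append((s, a, t))
        return s, t
    def concat(self, f1, f2):    # f1 then f2, joined by an epsilon arc
        self.trans.append((f1[1], None, f2[0]))
        return f1[0], f2[1]
    def union(self, f1, f2):     # f1 | f2, via new start/accept states
        s, t = self.state(), self.state()
        for f in (f1, f2):
            self.trans.append((s, None, f[0]))
            self.trans.append((f[1], None, t))
        return s, t

# Build B (C | D): the epsilon arcs introduced here are exactly the kind of
# 'extra' expansion states the text discusses for Figure 4.
nfa = NFA()
frag = nfa.concat(nfa.sym("B"), nfa.union(nfa.sym("C"), nfa.sym("D")))
print(sum(1 for (_, lab, _) in nfa.trans if lab is None))   # 5 epsilon arcs
```

A real implementation would follow this with epsilon-closure computation (done off-line, as the text notes below) so that the parser never has to traverse the empty arcs at run time.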
Fig. 1: A finite state production

The processes required to form well-formed substring tables must be modified to accommodate the more complex rule description system of the FSA. In fact this alteration is not at all complex and is really a transfer of the notion of the 'dot' in the dotted rule from marking the next constituent to be consumed to marking the state which the FSA is in. Again, efficiency is maintained by the use of indexing mechanisms, and it is these mechanisms which allow the operations 1, 2 and 3 to remain unaltered. Active edges, and the set of rules, are indexed by the set of possible matches available at any given state, i.e., the arcs representing transitions between states via the consumption of those constituents. In this way, no extra complexity in computation occurs as the indexing mechanism acts as an abstract interface between the representation and the algorithm. The indexing of edges as they enter the chart can be carried out in constant time as it can be accomplished by the simple addition of a precomputed index of arcs of a state and the index for the vertex. So in Figure 1, if the automaton is in state 2, the edge is indexed by C and D. Inactive edges are indexed by their mother categories as before.

5
Encoding grammatical variation with finite state automata
A set of transformations of the FSA allows for the encoding of targeted errors, grammatical variation, to be held in the rule as extensions to the core (correct) production. 5.1
Deletion
A deleted constituent is simply encoded by the use of an epsilon arc. Deleting a constituent from Figure 1 results in the FSA in Figure 2.
Fig. 2: A finite state production with deletion
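A sketch of this transformation, assuming an FSA production encoded as a transition table (the representation, state numbers and labels are illustrative, not the paper's): an epsilon arc is added in parallel with the arc for the deletable constituent, and marked as a deletion arc so that the error can later be reported.

```python
# Transitions: (state, label) -> next state; "eps" stands for an epsilon arc.
# Core (correct) production A -> B C D as a transition table.
trans = {(0, "B"): 1, (1, "C"): 2, (2, "D"): 3}

def allow_deletion(trans, label):
    """Add an epsilon arc parallel to every arc carrying `label`,
    recording each added arc as a targeted 'deletion' error arc."""
    new = dict(trans)
    marks = {}
    for (state, lab), dest in trans.items():
        if lab == label:
            new[(state, "eps")] = dest
            marks[(state, "eps")] = ("deleted", label)
    return new, marks

t2, marks = allow_deletion(trans, "C")
print(t2[(1, "eps")], marks[(1, "eps")])   # 2 ('deleted', 'C')
```

When the parser traverses a marked arc, the edge it builds carries a record of the ill-formedness consumed, as section 3 requires of an error-detection technique.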
5.2
Insertion
Inserting a constituent is not as straight-forward. There are three possible ways to describe an insertion. 1. By describing the material which precedes it, i.e., the preceding constituent. 2. By describing the material which follows it, i.e., the following constituent. 3. By describing the complete context, i.e., the preceding and following constituents. The first two cases are trivial, and are achieved by placing an extra state in the FSA either before or after the appropriate constituent and adding arcs for the new constituent and an epsilon arc for optionality. Figure 3 shows an insertion before C. The third case, however, requires a little more work. For example, inserting a constituent between two others cannot be achieved by placing an extra state between
Fig. 3: A finite state production with insertion

states 1 and 2, as this would also represent an insertion between other adjacent constituents, for example before D. In fact, to describe the implementation of the third type of description, it is necessary to complete the brief description of the construction of finite state automata from regular expressions presented above. As described by Aho, Sethi & Ullman (1986:122-123), the generation of finite state automata from regular expressions is a three-case algorithm. The third case, describing the construction of automata from the disjunction (|) and zero-or-more repetition (*) operators, requires the insertion of epsilon productions. The full description of states 1 and 2 in Figure 1 would be that in Figure 4.

Fig. 4: A finite state production in full

The algorithm for constructing the FSA guarantees that a state will have at most one exiting arc that is not empty (i.e., not an epsilon arc). From any state, it is straightforward to compute the set of states which are reachable through epsilon arcs (this can be done off-line to avoid any addition of complexity to the process). Inserting an extra constituent with a full description of context (preceding and following constituents) can then be achieved by using the set of reachable states from the entry state of the preceding context arc (in this example, B), and checking for a match with the following context, in this case, C. The entry state for the relevant arc is 2, and both 2a and 2b are reachable from 2. The transformation then produces the FSA appearing in Figure 5.
Fig. 5: A finite state production with insertion

A check has to be made to ensure that the preceding context and the following context cannot be ignored through the traversal of epsilon arcs. This can occur if optionality has been defined for those constituents, either through the rule definition, or through the inclusion of some deletion arcs. Insertion and deletion arcs are marked as such to distinguish them from the arcs present in the rule prior to transformation. 6
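The full-context insertion can be sketched as follows, assuming the same transition-table encoding used above (representation, state numbers and the inserted label "X" are our illustrative choices): compute the epsilon-closure of the state entered after the preceding constituent, and splice in an optional detour consuming the inserted material wherever the following constituent can be matched.

```python
def eps_closure(trans, state):
    """States reachable from `state` through epsilon arcs only."""
    seen, stack = {state}, [state]
    while stack:
        s = stack.pop()
        for (src, lab), dst in trans.items():
            if src == s and lab == "eps" and dst not in seen:
                seen.add(dst); stack.append(dst)
    return seen

def allow_insertion(trans, inserted, preceding, following):
    """Permit `inserted` between `preceding` and `following`:
    from the state after `preceding`, find (via epsilon closure)
    states with an arc for `following`, and add an optional detour
    state consuming `inserted` before that arc."""
    new = dict(trans)
    fresh = max(max(src, dst) for (src, _), dst in trans.items()) + 1
    for (src, lab), dst in trans.items():
        if lab != preceding:
            continue
        for s in eps_closure(trans, dst):
            if (s, following) in trans:
                # detour: s --inserted--> fresh --following--> original dest
                new[(s, inserted)] = fresh
                new[(fresh, following)] = trans[(s, following)]
                fresh += 1
    return new

# A -> B C D, with an epsilon expansion state (a '2a'-style state, here 10):
trans = {(0, "B"): 1, (1, "eps"): 10, (10, "C"): 2, (2, "D"): 3}
t2 = allow_insertion(trans, inserted="X", preceding="B", following="C")
print((10, "X") in t2)   # True: X may now appear between B and C
```

The check described in the text (that the context arcs themselves cannot be skipped through epsilon or deletion arcs) would be an additional guard before the detour is added; it is omitted here for brevity.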
Complexity
The complexity of the parsing mechanism should be considered as an extension to the expression for the normal representation of CFG rules. In fact the complexity of the algorithm is not the issue; rather it is the complexity of the grammar. This is because the processes of the algorithm are of the same complexity, locally, as those of the normal version. Rule access is the same (rules are indexed on the possible first daughters as before); the fundamental rule is the same, modulo the number of reachable arcs matching inactive edges in the chart that an edge has emitting from its current FSA state. The factor of the number of reachable arcs is an attribute of the grammar, and not of the algorithm itself. The entry of an inactive edge, resulting in looking backwards in the chart for possible active edges with which to combine, is unchanged, again due to the indexing of edges by the set of reachable constituent arcs. Consequently, use of FSAs should be thought of as an encoding technique which, in effect, reduces the number of rules required. For example, the two rules: 1. A → B C D 2. A → B C E may be represented as one rule: 1. A → B C (D|E)
In terms of the edges generated in the construction of the chart, up to the point of recognising B and C, there is only one set of analyses. In the case with the full rule representation, the two rules represent parallel analyses of the shared component of the production. 7
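The compaction can be sketched directly: the two rules become one automaton sharing the B C prefix, so an edge walking the prefix is a single analysis. The dictionary encoding and the helper `accepts` are our illustrative choices, not the paper's implementation.

```python
# Two CFG rules sharing a prefix:
#   A -> B C D      A -> B C E
# merged into one finite-state production sharing states 0 -> 1 -> 2.
merged = {
    "mother": "A", "start": 0, "final": {3},
    "trans": {(0, "B"): 1, (1, "C"): 2, (2, "D"): 3, (2, "E"): 3},
}

def accepts(rule, seq):
    """Walk the dot-state through the automaton, consuming categories."""
    state = rule["start"]
    for cat in seq:
        state = rule["trans"].get((state, cat))
        if state is None:
            return False
    return state in rule["final"]

for seq in (["B", "C", "D"], ["B", "C", "E"]):
    assert accepts(merged, seq)
assert not accepts(merged, ["B", "D"])
print("one rule, one shared analysis of the B C prefix")
```

With separate CFG rules, a chart parser would carry two active edges through B and C; here a single edge in state 2 is indexed by both D and E, exactly the indexing scheme described in section 4.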
Further work
The framework for parsing errorful text requires the addition of feature structure relaxation techniques to make a complete system. There are many techniques, both targeted and untargeted, for dealing with inconsistencies in unification formalisms (Douglas & Dale 1992; Vogel & Cooper 1995 and others in Schöter & Vogel 1995). It is intended to use a simple model of unification relaxation, using feature structure paths as a description of the targeted point in the structure at which conflict is expected (as with parsing techniques, untargeted relaxation methods for unification are computationally more expensive than targeted ones). Some grammar fragments have already been written for the target controlled language (AECMA 1989); however, the proof of the value of the project will require a full grammar for this domain. 1 8
Conclusions
The field of Controlled Languages is particularly attractive to language technology developers as it provides a domain specific area in which a specific grammar is used as well as a finite set of lexical items. Consequently, checking systems can be implemented which are more reliable than those used for unrestricted text. This paper has presented a method for parsing with a view to detecting errors. The parsing technology uses standard chart parsing augmented with rules expressed as finite state automata. As implemented, the targeted errors are represented declaratively as variations to the underlying finite state automaton of a rule. 1
The parsing system is also being put to use in a related project in the Language Technology Group in Edinburgh: the Construction Industry Specification, Analysis and Understanding project, which deals with sublanguage documents of the construction industry. In this project it is flexibility, and not error detection, that is required. Consequently, a grammar has been developed for the domain which uses the regular expressions with which rules are defined to incorporate variation, much as the insertion and deletion declarations do for the checking task.
REFERENCES

Aho, Alfred V., Ravi Sethi & Jeffrey D. Ullman. 1986. Compilers: Principles, Techniques and Tools. Reading, Mass.: Addison-Wesley.
Andrews, N. A. & J. Brown. 1993. "A High-Speed Natural-Language Parser". AISB Quarterly, Winter. UK: AISB.

Association Européenne des Constructeurs de Matériel Aérospatial. 1989. A Guide for the Preparation of Aircraft Maintenance Documentation in the International Aerospace Maintenance Language, 5th edition. Paris: AECMA.

Ballim, Afzal & Graham Russell. 1994. "LHIP: Extended DCGs for Configurable Robust Parsing". Proceedings of the 15th International Conference on Computational Linguistics (COLING-94), 501-507. Kyoto, Japan.

Briscoe, Ted & Nick Waegner. 1994. "Robust Stochastic Parsing Using the Inside-Outside Algorithm". Reader for European Summer School on Logic, Language & Information '94, Advanced Course CA4: Robust Parsing.

Douglas, Shona & Robert Dale. 1992. "Towards Robust PATR". Proceedings of the 14th International Conference on Computational Linguistics (COLING-92), 468-474. Nantes, France.

Earley, Jay. 1986. "An Efficient Context-Free Parsing Algorithm". Readings in Natural Language Processing ed. by Barbara J. Grosz, Karen Sparck Jones & Bonnie Lynn Webber, 25-33. Calif.: Morgan Kaufmann.

Gazdar, Gerald & Chris Mellish. 1989. Natural Language Processing in PROLOG. Wokingham, U.K.: Addison-Wesley.

Goeser, Sebastian. 1992. "Chart Parsing of Robust Grammars". Proceedings of the 14th International Conference on Computational Linguistics (COLING-92) ed. by Christian Boitet, 120-126. Nantes, France.

Kay, Martin. 1986. "Algorithm Schemata and Data Structures in Syntactic Processing". Readings in Natural Language Processing ed. by Barbara J. Grosz, Karen Sparck Jones & Bonnie Lynn Webber, 35-70. Calif.: Morgan Kaufmann.

Mellish, Chris S. 1989. "Some Chart-Based Techniques for Parsing Ill-Formed Input". Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics, 102-109.

Schöter, Andreas & Carl Vogel, eds. 1995. Edinburgh Working Papers in Cognitive Science, vol. 10: Nonclassical Feature Systems. Edinburgh: University of Edinburgh.
Strzalkowski, Tomek. 1992. "TTP: A Fast and Robust Parser for Natural Language". Proceedings of the 14th International Conference on Computational Linguistics (COLING-92), 198-204. Nantes, France.
Vogel, Carl & Robin Cooper. 1995. "Robust Chart Parsing with Mildly Inconsistent Feature Structures". Edinburgh Working Papers in Cognitive Science, vol. 10: Nonclassical Feature Systems ed. by Andreas Schöter & Carl Vogel, 197-216. Edinburgh: University of Edinburgh.

Wang, Jin. 1992. "Syntactic Preferences for Robust Parsing with Semantic Preferences". Proceedings of the 14th International Conference on Computational Linguistics (COLING-92), 239-245. Nantes, France.
Applicative and Combinatory Categorial Grammar (from syntax to functional semantics)

ISMAIL BISKRI & JEAN-PIERRE DESCLÈS
ISHA - LALIC, France

Abstract

Applicative and Combinatory Categorial Grammar is an extension of Steedman's Combinatory Categorial Grammar by a canonical association between rules and Curry's combinators on the one hand and meta-rules which control type-raising operations on the other hand. This model is included in the general framework of Applicative and Cognitive Grammar (Desclès) with three levels of representation: (i) phenotype (concatenated expressions); (ii) genotype (applicative expressions); (iii) the cognitive representations (meaning of linguistic predicates). The aim of the paper is: (i) an automatic parsing of the phenotype expressions that underlie sentences; (ii) the construction of applicative expressions. The theoretical analysis is applied to spurious ambiguity and coordination. 1
Model of Applicative and Cognitive Grammar
Applicative and Cognitive Grammar (Desclès 1990) is an extension of the Universal Applicative Grammar (Shaumyan 1987). It postulates three levels of representations of languages: (i) The phenotype level (or phenotype) where the particular characteristics of natural languages are described (for example word order, morphological cases, etc.). The linguistic expressions of this level are concatenated linguistic units, the concatenation being noted by: u1 — u2 — ... — un; (ii) The genotype level (or genotype) where the grammatical invariants and structures underlying the sentences of the phenotype level are expressed. The genotype level is structured like a formal language called the genotype language; it is described by a grammar called applicative grammar; (iii) The cognitive level where the meanings of lexical predicates are represented by semantic cognitive schemes. Representations of levels two and three are expressions of typed combinatory logic (Curry & Feys 1958; Shaumyan 1987). We abstract operators associated with elimination and introduction inference rules as in a Gentzen calculus. For instance, we present the combinators B, C*, S, Φ, with the following rules (U1, U2, U3 are typed applicative expressions):
ISMAIL BISKRI & JEAN-PIERRE DESCLÈS
introduction rules / elimination rules

These rules lead to β-reduction or β-expansion:

((B U1 U2) U3) ≥ (U1 (U2 U3))
((C* U1) U2) ≥ (U2 U1)
((S U1 U2) U3) ≥ ((U1 U3) (U2 U3))
((Φ U1 U2 U3) U4) ≥ (U1 (U2 U4) (U3 U4))
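To make the reduction behaviour concrete, here is a minimal sketch (our own encoding, not the authors' implementation): an applicative expression (U1 U2) is represented as the Python tuple (U1, U2), and the combinators as the strings "B", "C*", "S" and "Phi" (for Φ).

```python
# Minimal sketch of the four reduction rules above. Application (U1 U2)
# is the tuple (U1, U2); combinators are the strings "B", "C*", "S", "Phi".

RULES = {
    # combinator: (number of arguments consumed, contractum builder)
    "B":   (3, lambda u1, u2, u3: (u1, (u2, u3))),
    "C*":  (2, lambda u1, u2: (u2, u1)),
    "S":   (3, lambda u1, u2, u3: ((u1, u3), (u2, u3))),
    "Phi": (4, lambda u1, u2, u3, u4: ((u1, (u2, u4)), (u3, u4))),
}

def spine(expr):
    """Unwind the application spine: ((f, a), b) -> ('f', [a, b])."""
    args = []
    while isinstance(expr, tuple):
        expr, arg = expr
        args.append(arg)
    return expr, args[::-1]

def reduce_head(expr):
    """Contract the head redex once; return expr unchanged if there is none."""
    head, args = spine(expr)
    if head in RULES:
        arity, contract = RULES[head]
        if len(args) >= arity:
            result = contract(*args[:arity])
            for extra in args[arity:]:      # reapply any leftover operands
                result = (result, extra)
            return result
    return expr

def normal_form(expr, limit=100):
    """Iterate head reduction until no head redex remains."""
    for _ in range(limit):
        nxt = reduce_head(expr)
        if nxt == expr:
            return expr
        expr = nxt
    return expr
```

For the expression ((B (C* John) loves) Mary) discussed below, this procedure yields ((loves Mary) John), the normal form computed in the paper's derivations.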
In what follows, we are interested in the relations between the first two levels (phenotype and genotype), implementing a system of formal analysis called Applicative and Combinatory Categorial Grammar (ACCG) which explicitly connects phenotype expressions to their underlying representations in the genotype.1 This system consists of: 1. the syntactic analysis of the concatenated expressions of the phenotype by means of a Combinatory Categorial Grammar; 2. the construction, from the result of the syntactic analysis, of the functional semantic interpretation of the phenotype expressions.

1.1  Categorial grammars
Categorial Grammars assign syntactic categories to each linguistic unit. Syntactic categories are orientated types built from basic types with two constructive operators / and \:
(i) N (nominal syntagm) and S (sentence) are basic types.
(ii) If X and Y are orientated types, then X/Y and X\Y are orientated types.2

1 In the phenotype, linguistic expressions are concatenated according to the syntagmatic rules of French. In the genotype, expressions are arranged according to the applicative order.
2 Here we adopt Steedman's notation (1989): X/Y and X\Y are functional orientated types. A linguistic unit u with the type X/Y (respectively X\Y) is considered as an operator (or function) whose typed operand Y is positioned on the right (respectively on the left) of the operator.
APPLICATIVE AND COMBINATORY CATEGORIAL GRAMMAR

A linguistic unit u with orientated type X will be designated by [X : u]. Both rules of application (forward and backward) are noted:
The premises of each rule are concatenations of linguistic units with orientated types, considered as operators or operands; the consequence of each rule is an applicative expression with an orientated type. Combinatory Categorial Grammar (Steedman 1989) generalises classical Categorial Grammars by introducing the operations of type-raising and of composition on functional types. The new rules aim at a quasi-incremental analysis (from left to right) in order to eliminate the problem of spurious ambiguity (Haddock 1987; Pareschi & Steedman 1987).
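The forward and backward application rules can be sketched as follows (a toy encoding of our own, not part of the paper): orientated types are built with two constructors, and each rule checks that the operand type appears on the expected side.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Slash:        # X/Y: an operator looking for its operand Y on the right
    result: object
    arg: object

@dataclass(frozen=True)
class Backslash:    # X\Y: an operator looking for its operand Y on the left
    result: object
    arg: object

def forward(left, right):
    """(>): X/Y . Y  =>  X"""
    if isinstance(left, Slash) and left.arg == right:
        return left.result
    return None      # rule not applicable

def backward(left, right):
    """(<): Y . X\\Y  =>  X"""
    if isinstance(right, Backslash) and right.arg == left:
        return right.result
    return None
```

With loves typed (S\N)/N, forward application to the object N yields S\N, and backward application of the subject N to S\N yields S.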
1.2  Applicative and Combinatory Categorial Grammar
In ACCG, we consider that the rules of Steedman's Combinatory Categorial Grammar introduce the combinators B, C*, S into the syntagmatic sequence. This introduction makes it possible to turn a concatenated structure into an applicative structure. The rules of ACCG are:

Type-raising rules:

The premises of the rules are typed concatenated expressions; the results are typed applicative expressions, with the eventual introduction of a combinator. The type-raising of a unit u introduces the combinator C*; the composition
of two concatenated units introduces the combinators B and S. With such rules we can analyse a sentence by means of a quasi-incremental strategy from left to right. The choice of such a strategy is motivated by: 1. our comprehension of sentences, which we believe to be incremental; that is to say, each term contributes to the gradual construction of meaning (Haddock 1987; Steedman 1989); 2. the control of the spurious ambiguity problem (Pareschi & Steedman 1987; Steedman 1989).

Example: John loves
The first rule (>T), applied to the typed unit [N : John], turns an operand into an operator. It constructs an applicative structure (C* John) whose type is S/(S\N). The introduction of the combinator C* reflects the type-raising in the applicative representation: (C* John) works as an operator with its functional type. The rule (>B) combines the typed linguistic units [S/(S\N) : (C* John)] and [(S\N)/N : loves] with the combinator B in order to compose the two functional units (C* John) and loves. A full processing based upon Applicative and Combinatory Categorial Grammar is carried out in two main steps: 1. The first step consists in checking the proper syntactic connection and in constructing predicative structures, with some combinators introduced at certain positions of the syntagmatic structure. 2. The second step consists in using the β-reduction rules of combinators in order to create the predicative structure underlying the phenotype expression. The expression obtained is an applicative one and belongs to the genotype language. ACCG thus generates processes that associate an applicative structure with a concatenated expression of the phenotype. What remains is to eliminate the combinators of the obtained expression in order to construct the normal form (in the technical sense of β-reduction) that expresses the functional semantic interpretation. This calculus is done entirely in the genotype.
Therefore, the process that we propose takes the shape of a compilation whose steps are summed up in Figure 1:
Let us deal with a simple example: John loves Mary.
1 [N:John]-[(S\N)/N:loves]-[N:Mary]               Typed concatenated structure of phenotype
2 [S/(S\N):(C* John)]-[(S\N)/N:loves]-[N:Mary]    (>T)
3 [S/N:(B (C* John) loves)]-[N:Mary]              (>B)
4 [S:((B (C* John) loves) Mary)]                  (>)
5 [S:((B (C* John) loves) Mary)]                  Typed applicative structure of genotype
6 [S:((C* John)(loves Mary))]                     (B)
7 [S:((loves Mary) John)]                         (C*)   Normal form of genotype
The type-raising (>T), applied to the operand John, makes it possible to generate the operator (C* John), which the functional rule (>B) composes with the operator loves. The complex operator (B (C* John) loves) is applied to the operand Mary in order to form the applicative expression of the genotype ((B (C* John) loves) Mary). The reduction of the combinators in the genotype constructs the functional semantic interpretation underlying the phenotype expression (the input).
2  Structural reorganisation
The syntactic analysis from left to right raises the problem of non-determinism introduced by the presence in the language of backward modifiers, which stand as operators applied to the whole or to a part of a previously constructed structure. If, in the first case, the use of a rule of application allows the analysis to be carried on,3 it is quite different in the second case, where the analysis blocks. For a sentence like John loves Mary madly, the parser at first creates the constituent [S : ((B (C* John) loves) Mary)]. This constituent is not combinable with madly, of type (S\N)\(S\N). As a matter of fact, madly is an operator whose operand (loves Mary) stands on its left. A quasi-incremental analysis from left to right favours the application of a combinatory rule as soon as possible. A direct consequence is that loves and Mary are absorbed4 into ((B (C* John) loves) Mary), which obviously does not allow us to construct (loves Mary) directly. The problem raised comes down to the possibility of backtracking. But such backtracking tends to increase the computational cost (memory and execution time) of a syntactic analysis. However, an intelligent backtracking (which we propose below) can reduce this cost considerably, while at the same time constructing proper semantic analyses and eliminating spurious ambiguities. Such a backtracking decomposes the constituent already constructed into two components, one of which may be combined with the backward modifier. Formally, this operation of structural reorganisation is realised by the two following successive steps: (a) the reorganisation of the constituent already constructed isolates two sub-categories at each step, and tests whether the backward modifier may be combined on the left5 with one of these two sub-categories. We proceed with the reduction of combinators until the test gives a positive value.
At the end of the process we recover a new typed applicative structure equivalent to the first one.

3 Take the sentence John hit Mary yesterday, where the backward modifier yesterday operates on the whole sentence John hit Mary; since the syntactic type of yesterday is S\S, in order to continue the analysis it is enough to apply yesterday to John hit Mary by the rule (<).
4 That is to say, Mary does not appear clearly as the operand of the operator to love.
5 In our terminology, u1 may be combined on the left with u2 if one of these two cases is possible: — one of the following rules <,
Example: In the case of the statement John loves Mary madly, the steps of the reorganisation are:
Constituent constructed: [S : ((B (C* John) loves) Mary)]
The two sub-categories are: [S/N : (B (C* John) loves)] and [N : Mary]
Test: [S/N : (B (C* John) loves)] may not be combined on the left with [(S\N)\(S\N) : madly]
[N : Mary] may not be combined on the left with [(S\N)\(S\N) : madly]
Reduction of the combinator B: [S : ((C* John)(loves Mary))]
The two sub-categories are: [S/(S\N) : (C* John)] and [S\N : (loves Mary)]
Test: [S/(S\N) : (C* John)] may not be combined on the left with [(S\N)\(S\N) : madly]
[S\N : (loves Mary)] may be combined on the left with [(S\N)\(S\N) : madly]
The combinator reduction process stops. We recover the category in output: [S : ((C* John)(loves Mary))].
(b) decomposition, realised by means of the two rules:
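The test "may be combined on the left" used in the example can be sketched for the simple case of backward application (the function names and the string encoding of types are ours, not the authors'):

```python
def strip_outer(t):
    """Remove one pair of redundant outer parentheses, if present."""
    if t.startswith("(") and t.endswith(")"):
        depth = 0
        for i, ch in enumerate(t):
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
                if depth == 0 and i < len(t) - 1:
                    return t        # the ')' closes early: parens are not outer
        return t[1:-1]
    return t

def combinable_on_left(left_type, mod_type):
    """Toy version of the paper's test (footnote 5), restricted to backward
    application: true when  Y . X\\Y => X  is licensed, with types written
    as strings such as 'S\\N' or '(S\\N)\\(S\\N)'."""
    depth = 0
    for i, ch in enumerate(mod_type):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "\\" and depth == 0:     # top-level backslash: X\Y found
            return strip_outer(mod_type[i + 1:]) == strip_outer(left_type)
    return False
```

On the example above, (loves Mary) of type S\N passes the test against madly of type (S\N)\(S\N), while Mary of type N and (C* John) of type S/(S\N) fail it, as in the reorganisation trace.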
We read these rules as follows:
• For (>dec): if we have an applicative structure (u1 u2) with type X, u1 of type X/Y and u2 of type Y, then we can construct a new concatenated expression formed by the two categories [X/Y:u1] and [Y:u2].
• For (<dec): if we have an applicative structure (u1 u2) with type X, u1 of type X\Y and u2 of type Y, then we can construct a new concatenated expression formed by the two categories [Y:u2] and [X\Y:u1].
Let us notice that the two rules (>dec) and (<dec) are respectively the inverses of the rules of functional application (>) and (<). Both rules allow us to construct a new concatenated ordering of the operator/operand structure coming from the reorganisation. For the sentence John loves Mary madly, the decomposition is applied to the structure that arises from the reorganisation: [S : ((C* John)(loves Mary))]. With the rule (>dec), we produce the concatenated ordering: [S/(S\N) : (C* John)] - [S\N : (loves Mary)]. These two steps enter the complete analysis of the sentence John loves Mary madly as follows (step 5 for the reorganisation and step 6 for the decomposition):

Phenotype (1-8)
1 [N:John]-[(S\N)/N:loves]-[N:Mary]-[(S\N)\(S\N):madly]       Typed concatenated structure of phenotype
4 [S:((B (C* John) loves) Mary)]-[(S\N)\(S\N):madly]
5 [S:((C* John)(loves Mary))]-[(S\N)\(S\N):madly]             (B)
6 [S/(S\N):(C* John)]-[S\N:(loves Mary)]-[(S\N)\(S\N):madly]  (>dec)
7 [S/(S\N):(C* John)]-[S\N:(madly (loves Mary))]              (<)
8 [S:((C* John)(madly (loves Mary)))]                         (>)

Genotype (9-10)
9 [S:((C* John)(madly (loves Mary)))]                         Typed applicative structure of genotype
10 [S:((madly (loves Mary)) John)]                            (C*)   Normal form of genotype
3  Coordination
Coordination is the action of joining two words or two expressions of the same kind or having the same function. Within the framework of Categorial Grammars, Steedman (1989) and Barry & Pickering (1990) consider that two linguistic units may be coordinated to give one linguistic unit of type X if and only if each unit has type X. Even if this definition remains incomplete, given that coordination presents itself under different shapes, it points out the way to follow towards a reliable solution. We present four types of examples of coordination with AND. We may coordinate:6
1. Two segments of the same kind, with the same structure and contiguous to AND: [John loves]S/N and [William hates]S/N these pictures
2. Two segments in an elliptic construction: John loves [Mary madly] and [Jenny wildly]; [John] loves [Mary] and [William Jenny]
3. Two segments of different structures: Mary walks [slowly] and [with happiness]; John [sings] and [plays the violin].
4. Two segments without distributivity: The flag is [white] and [red] (≠ The flag is white and the flag is red).
6 The categories to be coordinated are between square brackets.
To the conjunction AND we associate the morphological type (X\X)/X. However, the context gives more specifications for assigning a type to AND. Hypotheses 1 and 2 make it possible to assign a type to AND by taking the context into account.
Hypothesis 1: The constructed category that immediately follows the conjunction AND determines the type of the coordination.
This hypothesis leads us to indirectly introduce an interruption into the quasi-incremental analysis: as soon as we encounter the conjunction AND, we temporarily interrupt the quasi-incremental analysis in order to construct the second member of the coordination. We propose a second hypothesis:
Hypothesis 2: When we have a coordination typed X as defined by Hypothesis 1, the first member of the coordination is the typed category X which immediately precedes the conjunction.
The rules that we have to bring out through these two hypotheses consequently emanate from the idea that both members of a coordination have the same syntactic type X, corresponding to different functional semantic interpretations. The result of the application of the rules keeps the same syntactic type X. We set up two abstract types for the conjunction. The first one concerns the distributive conjunction; we note it CONJD. The second concerns the non-distributive conjunction; we note it CONJN.
We apply the rule to the cases of distributive coordination. In order to take the distributivity into account at the level of the applicative structure, we use the combinator Φ. We apply the rule to the cases of non-distributive coordination (see example E3). With the quasi-incremental analysis, during the application of Hypothesis 2, two typical cases occur:
1. the constituent produced before encountering the conjunction is of the same type as the constituent determined by the coordination. This constituent is then the first member of the coordination. For instance, the analysis of the sentence [John loves]S/N and [William hates]S/N these pictures constructs [S/N:(B (C* John) loves)] before encountering the conjunction. This constituent has the same type as the second member [S/N:(B (C* William) hates)], the constituent determined by the first hypothesis. The constituent [S/N:(B (C* John) loves)] is then the first member of the coordination.
2. the constituent determined before encountering the conjunction does not have the same type as the constituent determined by the coordination. It is then necessary to modify the structure of this constituent. For instance, the analysis of the sentence John loves [Mary madly] and [Jenny wildly] constructs [S : ((C* John) (madly (loves Mary)))] before the analysis of the conjunction. The second member of the coordination is [(S\N)\((S\N)/N) : (B wildly (C* Jenny))].7 In this second case, the process of structural reorganisation allows us:
• either to directly isolate the first member of the coordination (see steps 6 and 7 of example E1),
• or to isolate the binary operator/operand structure which contains the first member of the coordination. In this case, it is necessary to associate the structural reorganisation with the use of the logical equivalences (a, b, c, d) of combinatory logic, which are direct consequences of the introduction and elimination of the combinators B and C* (see step 8 of example E2):
(a) (u1 (u2 u3)) ⇔ ((B u1 u2) u3)
(b) ((u1 u2) u3) ⇔ ((B (C* u3) u1) u2)
(c) (u1 (u2 u3)) ⇔ ((B u1 (C* u3)) u2)
(d) ((u1 u2) u3) ⇔ ((B (C* u3) (C* u2)) u1)
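These equivalences can be checked mechanically: reducing each right-hand side with the B and C* rules of Section 1 gives back the left-hand side. A small self-contained sketch (our encoding: application (U1 U2) is the tuple (U1, U2), combinators are the strings "B" and "C*"):

```python
def step(e):
    """One leftmost-outermost B/C* contraction; None if e is in normal form."""
    if not isinstance(e, tuple):
        return None
    head, args = e, []
    while isinstance(head, tuple):          # unwind the application spine
        head, a = head
        args.insert(0, a)
    if head == "B" and len(args) >= 3:      # ((B u1 u2) u3) -> (u1 (u2 u3))
        new, rest = (args[0], (args[1], args[2])), args[3:]
    elif head == "C*" and len(args) >= 2:   # ((C* u1) u2) -> (u2 u1)
        new, rest = (args[1], args[0]), args[2:]
    else:
        for i, a in enumerate(args):        # no head redex: try the subterms
            s = step(a)
            if s is not None:
                args[i] = s
                new, rest = head, args
                break
        else:
            return None
    for r in rest:                          # reapply the remaining operands
        new = (new, r)
    return new

def nf(e):
    """Normal form under repeated contraction."""
    while (s := step(e)) is not None:
        e = s
    return e
```

For instance, the right-hand side of (d), ((B (C* u3) (C* u2)) u1), reduces in three steps to ((u1 u2) u3), its left-hand side.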
4  Meta-rules
We add to our formalism different meta-rules that control type-raising. These meta-rules, on the one hand, indicate that a type-raising rule has to be applied and, on the other hand, choose the particular type-raising to be realised. We do not consider these meta-rules as an absolute computational tool; we attribute a linguistic and logical pertinence to them. They may receive an interpretation if we take some prosodic factors into account. In what follows, we present three meta-rules among the ten that we have conceived (Biskri 1995). Let us take u1 and u2 in the concatenated expression u1-u2:
7 The sentence John loves Mary madly and Jenny wildly is ambiguous. In our example, we consider Jenny as an object.
Meta-rule 1: If u1 has type N and u2 has type (Y\N)/Z, then we apply the type-raising (>T) to u1: [N : u1] ⇒ [Y/(Y\N) : (C* u1)]
Example: John eats the apple
[N : John] - [(S\N)/N : eats] - [N/N : the] - [N : apple]
[S/(S\N) : (C* John)] - [(S\N)/N : eats] - ...
In this case Y = S; Z = N.
Meta-rule 2: If u1 has type N (u1 preceded by and) and u2 has type N, then we apply the type-raising (>T) to u1: [N : u1] ⇒ [S/(S\N) : (C* u1)]
Example: John loves Mary and William Jenny
... - [CONJD : and] - [N : William] - [N : Jenny]
... - [S/(S\N) : (C* William)] - [N : Jenny]
Meta-rule 3: If u2 has type N and u1 has type Y/X (u1 preceded by and), then we apply the backward type-raising (<T) to u2.
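As an illustration, Meta-rule 1 can be sketched as a pattern match over type structures (our own encoding, not the authors' implementation: X/Y is the tuple ('/', X, Y) and X\Y is ('\\', X, Y)):

```python
def meta_rule_1(t1, t2):
    """M1: if u1 : N and u2 : (Y\\N)/Z, raise u1 to Y/(Y\\N).
    Types: X/Y is ('/', X, Y), X\\Y is ('\\', X, Y). Returns the raised
    type of u1, or None when the meta-rule does not apply."""
    if t1 != "N":
        return None
    if isinstance(t2, tuple) and t2[0] == "/":
        functor = t2[1]                      # the Y\N part of (Y\N)/Z
        if isinstance(functor, tuple) and functor[0] == "\\" and functor[2] == "N":
            y = functor[1]
            return ("/", y, ("\\", y, "N"))  # Y/(Y\N)
    return None
```

For John eats (eats : (S\N)/N, i.e. Y = S and Z = N) this yields S/(S\N), the type assigned to (C* John) in the example above.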
5  Examples
E1: John loves Mary and hates Jenny

Phenotype (1-9)
1 [N:John]-[(S\N)/N:loves]-[N:Mary]-[CONJD:and]-[(S\N)/N:hates]-[N:Jenny]
4 [S:((B (C* John) loves) Mary)]-[CONJD:and]-[(S\N)/N:hates]-[N:Jenny]
5 [S:((B (C* John) loves) Mary)]-[CONJD:and]-[S\N:(hates Jenny)]            (>)
6 [S:((C* John) (loves Mary))]-[CONJD:and]-[S\N:(hates Jenny)]              (B)
7 [S/(S\N):(C* John)]-[S\N:(loves Mary)]-[CONJD:and]-[S\N:(hates Jenny)]    (>dec)
8 [S/(S\N):(C* John)]-[S\N:(Φ and (loves Mary) (hates Jenny))]
9 [S:((C* John)(Φ and (loves Mary) (hates Jenny)))]                         (>)

Genotype (10-12)
10 [S:((C* John)(Φ and (loves Mary) (hates Jenny)))]
11 [S:((Φ and (loves Mary) (hates Jenny)) John)]                            (C*)
12 [S:(and ((loves Mary) John)((hates Jenny) John))]                        (Φ)
E2: John loves Mary and William Jenny
Phenotype (1-11)
1 [N:John]-[(S\N)/N:loves]-[N:Mary]-[CONJD:and]-[N:William]-[N:Jenny]
4 [S:((B (C* John) loves) Mary)]-[CONJD:and]-[N:William]-[N:Jenny]
5 ... -[CONJD:and]-[S/(S\N):(C* William)]-[N:Jenny]                          (>T), M2
6 ... -[CONJD:and]-[S/(S\N):(C* William)]-[(S\N)\(S/(S\N)):(C* Jenny)]       (<T), M3
7 ... -[CONJD:and]-[S\(S/(S\N)):(B (C* William)(C* Jenny))]                  (>Bx)
8 [S:((B (C* John)(C* Mary)) loves)]-[CONJD:and]- ...                        (d)
9 [(S\N)/N:loves]-[S\(S/(S\N)):(B (C* John)(C* Mary))]-[CONJD:and]- ...      (<dec)
10 [(S\N)/N:loves]-[S\(S/(S\N)):(Φ and (B (C* John)(C* Mary))(B (C* William)(C* Jenny)))]
11 [S:((Φ and (B (C* John)(C* Mary))(B (C* William)(C* Jenny))) loves)]      (<)

Genotype (12-19)
12 [S:((Φ and (B (C* John)(C* Mary))(B (C* William)(C* Jenny))) loves)]
13 [S:(and ((B (C* John)(C* Mary)) loves)((B (C* William)(C* Jenny)) loves))]   (Φ)
14 [S:(and ((C* John)((C* Mary) loves))((B (C* William)(C* Jenny)) loves))]     (B)
15 [S:(and (((C* Mary) loves) John)((B (C* William)(C* Jenny)) loves))]         (C*)
16 [S:(and ((loves Mary) John)((B (C* William)(C* Jenny)) loves))]              (C*)
17 [S:(and ((loves Mary) John)((C* William)((C* Jenny) loves)))]                (B)
18 [S:(and ((loves Mary) John)(((C* Jenny) loves) William))]                    (C*)
19 [S:(and ((loves Mary) John)((loves Jenny) William))]                         (C*)
E3: the flag is white and red (≠ the flag is white and the flag is red)
Phenotype (1-8)
1 [N/N:the]-[N:flag]-[(S\N)/(N\N):is]-[N\N:white]-[CONJN:and]-[N\N:red]
2 [N:(the flag)]-[(S\N)/(N\N):is]-[N\N:white]-[CONJN:and]-[N\N:red]              (>)
3 [S/(S\N):(C* (the flag))]-[(S\N)/(N\N):is]-[N\N:white]-[CONJN:and]-[N\N:red]   (>T), M1
4 [S/(N\N):(B (C* (the flag)) is)]-[N\N:white]-[CONJN:and]-[N\N:red]             (>B)
5 [S:((B (C* (the flag)) is) white)]-[CONJN:and]-[N\N:red]                       (>)
6 [S/(N\N):(B (C* (the flag)) is)]-[N\N:white]-[CONJN:and]-[N\N:red]             (>dec)
7 [S/(N\N):(B (C* (the flag)) is)]-[N\N:(and white red)]
8 [S:((B (C* (the flag)) is)(and white red))]                                    (>)

Genotype (9-11)
9 [S:((B (C* (the flag)) is)(and white red))]
10 [S:((C* (the flag))(is (and white red)))]                                     (B)
11 [S:((is (and white red))(the flag))]                                          (C*)
Other examples and more details are provided in Biskri (1995). The analyses have been implemented; we do not give the details of the algorithm here.
6  Conclusion
We have presented a model of analysis within the framework of Applicative and Cognitive Grammar that realises the interface between syntax and semantics. For many French examples this model is able to fulfil the following aims:
• to produce an analysis which verifies the syntactic correctness of statements;
• to develop automatically the predicative structures that yield the functional semantic interpretation of statements.
Moreover, this model has the following characteristics:
1. We do not make any calculus parallel to the syntactic calculus, as in Montague's approach (1974). A first calculus verifies the syntactic correctness; this calculus is followed by the construction of the functional semantic interpretation. This has been made possible by the introduction of combinators at specific positions of the syntagmatic order.
2. We introduce some components of the functional semantics by applicative syntactic tools (combinators).
3. We calculate the functional semantic interpretation by applicative syntactic methods (combinator reduction).
To sum up, we interpret by purely syntactic techniques. The syntax/semantics distinction should then be rethought from another perspective.

REFERENCES

Ades, Anthony & Mark Steedman. 1982. "On the Order of Words". Linguistics and Philosophy 4.517-558.
Barry, Guy & Martin Pickering. 1992. "Dependency and Constituency in Categorial Grammar". Word Order in Categorial Grammar / L'ordre des mots dans les grammaires catégorielles ed. by Alain Lecomte, 38-57. Clermont-Ferrand: Adosa.
Biskri, Ismail. 1995. La Grammaire catégorielle combinatoire applicative dans le cadre de la grammaire applicative et cognitive. Ph.D. dissertation, EHESS, Paris.
Buszkowski, Wojciech, Witold Marciszewski & Johan van Benthem. 1988. Categorial Grammar. Amsterdam & Philadelphia: John Benjamins.
Curry, Haskell B. & Robert Feys. 1958. Combinatory Logic, vol. I. Amsterdam: North-Holland.
Desclès, Jean-Pierre. 1990.
Langages applicatifs, langues naturelles et cognition. Paris: Hermes.
Desclès, Jean-Pierre & Frédérique Segond. 1992. "Topicalisation: Categorial Analysis and Applicative Grammar". Word Order in Categorial Grammar ed. by Alain Lecomte, 13-37. Clermont-Ferrand: Adosa.
Haddock, Nicholas. 1987. "Incremental Interpretation and Combinatory Categorial Grammar". Working Papers in Cognitive Science, I: Categorial Grammar, Unification Grammar and Parsing ed. by Nicholas Haddock et al., 71-84. University of Edinburgh.
Lecomte, Alain. 1994. Modèles logiques en théorie linguistique: Éléments pour une théorie informationnelle du langage. Work synthesis. Grenoble: Université de Grenoble.
Moortgat, Michael. 1989. Categorial Investigations: Logical and Linguistic Aspects of the Lambek Calculus. Dordrecht: Foris.
Oehrle, Richard T., Emmon Bach & Deirdre Wheeler. 1988. Categorial Grammars and Natural Language Structures. Dordrecht: Reidel.
Pareschi, Remo & Mark Steedman. 1987. "A Lazy Way to Chart Parse with Categorial Grammars". Proceedings of the 25th Annual Meeting of the Association for Computational Linguistics (ACL'87). Stanford, Calif.
Shaumyan, Sebastian K. 1987. A Semiotic Theory of Natural Language. Bloomington: Indiana University Press.
Steedman, Mark. 1989. Work in Progress: Combinators and Grammars in Natural Language Understanding. Summer Institute of Linguistics, Tucson, University of Arizona.
Szabolcsi, Anna. 1987. "On Combinatory Categorial Grammar". Proceedings of the Symposium on Logic and Language, 151-162. Budapest: Akadémiai Kiadó.
PARSETALK about Textual Ellipsis

UDO HAHN & MICHAEL STRUBE
Freiburg University

Abstract

We present a hybrid methodology for the resolution of textual ellipsis. It incorporates conceptual proximity criteria applied to ontologically well-engineered domain knowledge bases and an approach to centering based on functional topic/comment patterns. We state grammatical predicates for textual ellipsis and then turn to the procedural aspects of their evaluation within the framework of an actor-based implementation of a lexically distributed parser.

1  Introduction
Text phenomena, e.g., textual forms of anaphora or ellipsis, are a particularly challenging issue for the design of natural language parsers, since lacking recognition facilities result in referentially incohesive or invalid text knowledge representations. At the conceptual level, textual ellipsis (also called functional anaphora) relates an elliptical expression to its antecedent through conceptual attributes (or roles) associated with that antecedent (see, e.g., the relation between "Zugriffszeit" (access time) and "Laufwerk" (hard disk drive) in (3) and (2) below). It thus complements the phenomenon of nominal anaphora (cf. Strube & Hahn 1995), where an anaphoric expression is related to its antecedent in terms of conceptual generalisation (as, e.g., "Rechner" (computer) refers to "LTE-Lite/25" (a particular notebook) in (2) and (1) below). The resolution of text-level anaphora contributes to the construction of referentially valid text knowledge representations, while the resolution of textual ellipsis yields referentially cohesive text knowledge bases.
(1) Der LTE-Lite/25 wird mit der ST-3141 von Seagate ausgestattet. (The LTE-Lite/25 is - with the ST-3141 from Seagate - equipped.)
(2) Der Rechner hat durch dieses neue Laufwerk ausreichend Platz für Windows-Programme. (The computer provides - because of this new hard disk drive - sufficient storage for Windows programs.)
(3) Darüber hinaus ist die Zugriffszeit mit 25 ms sehr kurz. (Also - is - the access time of 25 ms - quite short.)
Fig. 1: Fragment of the information technology domain knowledge base

In the case of textual ellipsis, the conceptual entity that relates the topic of the current utterance to the discourse elements mentioned in the preceding one is not explicitly mentioned in the surface expression. Hence, the missing conceptual link must be inferred in order to establish the local coherence of the whole discourse (for an early statement of that idea, cf. Clark (1975)). For instance, in (3) the proper conceptual relation between "Zugriffszeit" (access time) and "Laufwerk" (hard disk drive) must be determined. This relation can only be made explicit if conceptual knowledge about the domain is supplied. It is obvious (see Figure 1)1 that the concept ACCESS-TIME is bound in a direct associative or aggregational relation, viz. access-time, to the concept HARD-DISK-DRIVE, while its relation to the instance LTE-LITE-25 is not so tight (assuming property inheritance). A relationship between ACCESS-TIME and STORAGE-SPACE or SOFTWARE is excluded at the conceptual level, since they are not linked via any conceptual role.

1 The following notational conventions apply to the knowledge base for the information technology domain to which we refer throughout the paper (see Figure 1): Angular boxes from which double arrows emanate contain instances (e.g., LTE-LITE-25), while rounded boxes contain generic concept classes (e.g., NOTEBOOK). Directed unlabelled links relate concepts via the isa relation (e.g., NOTEBOOK and COMPUTER-SYSTEM), while links labelled with an encircled square represent conceptual roles (definitional roles are marked by "d"). Their names and value constraints are attached to each circle (e.g., COMPUTER-SYSTEM - has-central-unit - CENTRAL-UNIT, with small italics emphasising the role name). Note that any subconcept or instance inherits the conceptual attributes from its superconcept or concept class (this is not explicitly shown in Figure 1).
Nevertheless, the association of concepts through conceptual roles is far too unconstrained to properly discriminate among several possible antecedents in the preceding discourse context. We therefore propose a basic heuristic of conceptual proximity which takes the path length between concept pairs into account. It is based on the common distinction between concepts and roles in classification-based terminological reasoning systems (cf. MacGregor (1991) for a survey). Conceptual proximity takes only conceptual roles into consideration, while it does not consider the generalisation hierarchy between concepts. The heuristic can be phrased as follows: if fully connected role chains between the concepts denoted by a possible antecedent and an elliptical expression exist via one or more conceptual roles, that particular role composition is preferred for the resolution of textual ellipsis whose path contains the least number of roles. Whenever several connected role chains of equal length exist, functional constraints based on topic/comment patterns apply for the selection of the proper antecedent. Hence, only under equal-length conditions is grammatical information from the preceding sentence brought into play (for a precise statement in terms of the underlying text grammar, cf. Table 5 in Section 4). To illustrate these principles, consider the sentences (1)-(3) and Figure 1. According to the convention above, HARD-DISK-DRIVE is conceptually most proximate to the elliptical occurrence of ACCESS-TIME (due to the direct conceptual role linking HARD-DISK-DRIVE - access-time - ACCESS-TIME, with unit length 1), while the relationship between LTE-LITE-25 and ACCESS-TIME exhibits a greater conceptual distance (counting with unit length 2, due to the composition of roles between LTE-LITE-25 - has-hd-drive - HARD-DISK-DRIVE - access-time - ACCESS-TIME).

2  Ontological engineering for ellipsis resolution
Metrical criteria incorporating path connectivity patterns in network-based knowledge bases have often been criticised for lacking generality and for introducing ad hoc criteria likely to be invalidated when applied to different domain knowledge bases (DKB). The crucial point about the presumed unreliability of path-length criteria is the problem of how the topology of such a network can be 'normalised' so that formal distance measures uniformly relate to intuitively plausible conceptual proximity judgements. Though we have no formal solution to this correspondence problem, we try to eliminate structural idiosyncrasies by postulating two ontology engineering (OE) principles (cf. also Simmons (1992) and Mars (1994)):
1. Clustering into Basic Categories. The specification of the upper level of the ontology of some domain (e.g., information technology (IT)) should be based on a stable set of abstract, yet domain-oriented ontologicai categories inducing an almost complete partition on the en tities of the domain at a comparable level of generality (e.g., hardware, software, companies in the IT world). Each specification of such a ba sic category and its taxonomic descendents constitutes the common ground for what Hayes (1985) calls clusters and Guha & Lenat (1990) refer to as micro theories, i.e., self-contained descriptions of concep tually related proposition sets about a reasonable portion of the commonsense world within a single knowledge base partition (subtheory). 2. Balanced Deepening. Specifications at lower levels of that onto logy, which deal with concrete objects of the domain (e.g., notebooks, laser printers, hard disk drives in the IT world), must be carefully balanced, i.e., the extraction of attributes for any particular category should proceed at a uniform degree of detail at each decomposition level. The ultimate goal is that any subtheory have the same level of representational granularity, although these granularities might differ among various subtheories (associated with different basic categories). Given an ontologically well-engineered DKB, the ellipsis resolution problem, finally, has to be projected from the knowledge to the symbol layer of repres entations. By this, we mean the abstract implementation of knowledge rep resentation structures in terms of concept graphs and their emerging path connectivity patterns. At this level, we draw on early experiments from cognitive psychologists such as Rips et al. (1973) and more recent research on similarity metrics (Rada et al. 1989) and spreading-activation-based inferencing, e.g., by Charniak (1986). 
They indicate that the definition of proximity in semantic networks in terms of the traversal of typed edges (e.g., only via generalisation or via attribute links) and the corresponding counting of nodes that are passed on that traversal is methodologically valid for computing semantically plausible connections between concepts.2 The OE principles mentioned above are supplemented by the following linguistic regularities which hold for textual ellipsis:

1. Adherence to a Focused Context. Valid antecedents of elliptical expressions mostly occur within subworld boundaries (i.e., they remain within a single knowledge base cluster, microtheory, etc.). Given the

2 An alternative to simple node counting for the computation of semantic similarity, which is based on a probabilistic measure of information content, has recently been proposed by Resnik (1995).
PARSETALK ABOUT TEXTUAL ELLIPSIS
OE constraints (in particular, the one requiring each subworld to be characterised by the same degree of conceptual density), path-length criteria make sense for estimating conceptual proximity.

2. Limited Path Length Inference. Valid pairs of possible antecedents and elliptical expressions denote concepts in the DKB whose conceptual relations (role chains) are constructed on the basis of rather restricted path-length conditions (in our experiments, no valid chain ever exceeded unit length 5). This corresponds to the implicit requirement that these role chains must be efficiently computable.
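The limited path-length criterion can be sketched as a bounded breadth-first search over the role links of the DKB. In the sketch below, the miniature ROLES graph and its concept names are our own illustrative stand-ins, not the authors' actual knowledge base:

```python
from collections import deque

# Hypothetical miniature domain knowledge base: each concept maps to the
# concepts reachable via one conceptual role (role labels omitted for brevity).
ROLES = {
    "NOTEBOOK": ["HARD-DISK-DRIVE", "CPU"],
    "HARD-DISK-DRIVE": ["ACCESS-TIME"],
    "CPU": ["CLOCK-RATE"],
}

MAX_PATH_LENGTH = 5  # in the authors' experiments, no valid chain exceeded unit length 5

def proximity_score(from_concept, to_concept):
    """Length of the shortest role chain linking the two concepts,
    or None if no chain of length <= MAX_PATH_LENGTH exists."""
    if from_concept == to_concept:
        return 0
    seen = {from_concept}
    queue = deque([(from_concept, 0)])
    while queue:
        concept, dist = queue.popleft()
        if dist == MAX_PATH_LENGTH:
            continue  # enforce the restricted path-length condition
        for nxt in ROLES.get(concept, []):
            if nxt == to_concept:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None
```

Bounding the search depth is what makes the role chains efficiently computable, as the requirement above demands.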
3 Functional centering principles
Conceptual criteria are of tremendous importance, but they are not sufficient for the proper resolution of textual ellipsis. Additional criteria have to be supplied in the case of equal role chain length for alternative antecedents. We therefore incorporate into our model various functional criteria in terms of topic/comment patterns which originate from (dependency) structure analyses of the underlying utterances. The framework for this type of information is provided by the well-known centering model (Grosz et al. 1995). Accordingly, we distinguish each utterance's backward-looking center (Cb(Un)) and its forward-looking centers (Cf(Un)). The ranking imposed on the elements of the Cf reflects the assumption that the most highly ranked element of Cf(Un) is the most preferred antecedent of an anaphoric or elliptical expression in the utterance Un+1, while the remaining elements are (partially) ordered according to decreasing preference for establishing referential links. The main difference between the original centering approach and our proposal concerns the criteria for ranking the forward-looking centers. While Grosz et al. assume (for the English language) that grammatical roles are the major determinant for the ranking on the Cf, we claim that for German - a language with relatively free word order - it is the functional information structure of the sentence in terms of topic/comment patterns. In this framework, the topic (theme) denotes the given information, while the comment (rheme) denotes the new information (for surveys, cf. Danes (1974) and Dahl (1974)). This distinction can easily be rephrased in terms of the centering model. The theme then corresponds to the Cb(Un), the most highly ranked element of Cf(Un-1) which occurs in Un.
The theme/rheme hierarchy of Un is determined by the Cf(Un-1): elements of Un which are contained in Cf(Un-1) (context-bound discourse elements) are less rhematic than elements of Un which are not contained in Cf(Un-1) (unbound elements). The distinction between context-bound and unbound elements is important for the ranking on the Cf, since bound elements are generally ranked higher than any other non-anaphoric elements. The rules for the ranking on the Cf are summarised in Table 1. They are organised at three layers. At the top level, >TCbase denotes the basic relation for the overall ranking of topic/comment (TC) patterns. The second relation in Table 1, >TCboundtype, denotes preference relations exclusively dealing with multiple occurrences of bound elements in the preceding utterance. The bottom level of Table 1 is constituted by >prec, which covers the preference order for multiple occurrences of the same type of any topic/comment pattern, e.g., the occurrence of two anaphora or two unbound elements (all heads in a sentence are ordered by linear precedence relative to their text position). The proposed ranking, though developed and tested for German, prima facie not only seems to account for other free word order languages as well but also extends to fixed word order languages like English, where grammatical roles and information structure, unless marked, coincide.

Table 1: Functional ranking on Cf based on topic/comment patterns

context-bound element(s) >TCbase unbound element(s)
anaphora >TCboundtype elliptical antecedent >TCboundtype elliptical expression
nominal head1 >prec nominal head2 >prec ... >prec nominal headn

Given these basic relations, we may define the composite relation >TC (cf. Table 2). It summarises the criteria for ordering the items on the forward-looking centers Cf (x and y denote lexical heads).

Table 2: Global topic/comment relation

>TC := { (x, y) |
  if x and y both represent the same type of TC patterns
    then the relation >prec applies to x and y
  else if x and y both represent different forms of bound elements
    then the relation >TCboundtype applies to x and y
  else the relation >TCbase applies to x and y }
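The three-layer ranking can be rendered as a lexicographic sort key. This is a rough sketch under our own encoding assumptions: each discourse element is a small record whose type labels and field names are ours, not the authors':

```python
# Hypothetical encoding of the topic/comment pattern types; the ordering of
# bound types mirrors >TCboundtype in Table 1.
BOUND_ORDER = {"anaphora": 0, "elliptical_antecedent": 1, "elliptical_expression": 2}

def tc_sort_key(element):
    """Key implementing >TC: bound before unbound (>TCbase), bound elements
    ordered by type (>TCboundtype), ties broken by text position (>prec)."""
    kind, position = element["type"], element["position"]
    bound = kind in BOUND_ORDER
    return (0 if bound else 1,          # >TCbase
            BOUND_ORDER.get(kind, 0),   # >TCboundtype
            position)                   # >prec (linear precedence)

def rank_cf(elements):
    """Return the forward-looking centers ordered by decreasing preference."""
    return sorted(elements, key=tc_sort_key)
```

Because Python compares tuples lexicographically, the first differing layer decides, exactly as the nested if/else of Table 2 prescribes.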
4 Grammatical predicates for textual ellipsis
We here build on the ParseTalk model, a fully lexicalised grammar theory which employs default inheritance for lexical hierarchies (Hahn et al. 1994). The grammar formalism is based on dependency relations between lexical heads and modifiers at the sentence level. The dependency specifications3 allow a tight integration of linguistic knowledge (grammar) and conceptual knowledge (domain model), thus making powerful terminological reasoning facilities directly available for the parsing process. Accordingly, syntactic analysis and semantic interpretation are closely coupled. The resolution of textual ellipsis is based on two criteria, a structural and a conceptual one. The structural condition is embodied in the predicate isPotentialEllipticAntecedent (cf. Table 3). An elliptical relation between two lexical items is restricted to pairs of nouns. The elliptical phrase which occurs in the n-th utterance is restricted to a definite NP, while the antecedent must be one of the forward-looking centers of the preceding utterance.

Table 3: Grammar predicate for a potential elliptical antecedent
isPotentialEllipticAntecedent(x, y, n) :⇔
  x isac* Noun ∧ y isac* Noun
  ∧ ∃z: (y head z ∧ z isac* DetDefinite)
  ∧ y ∈ Un ∧ x ∈ Cf(Un-1)
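The structural predicate of Table 3 can be sketched as follows; the toy word-class hierarchy and the dict-based encoding of words and utterances are assumptions of this sketch, not the ParseTalk implementation:

```python
# Illustrative word-class hierarchy; isac_star walks the assumed subclass chain.
ISA = {"Noun": "Nominal", "DetDefinite": "Nominal", "Nominal": "Word"}

def isac_star(cls, ancestor):
    """Reflexive-transitive closure isac* of the subclass relation."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = ISA.get(cls)
    return False

def is_potential_elliptic_antecedent(x, y, utterance_n, cf_prev):
    """Sketch of Table 3: x and y are dicts with a 'class' and (for y) a
    'dependents' list; utterance_n is the word list of Un, cf_prev the Cf
    of the previous utterance."""
    return (isac_star(x["class"], "Noun")
            and isac_star(y["class"], "Noun")
            # y must govern a definite determiner (y is a definite NP)
            and any(isac_star(z["class"], "DetDefinite") for z in y["dependents"])
            and y in utterance_n
            and x in cf_prev)
```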
The function ProximityScore (cf. Table 4) captures the basic conceptual condition in terms of the role-related distance between two concepts. More specifically, there must be a connected path linking the two concepts under consideration via a chain of conceptual roles. Finally, the predicate PreferredConceptualBridge (cf. Table 5) combines both criteria. A lexical item x is determined as the proper antecedent of the elliptical expression y if it is a potential antecedent and if there exists no alternative antecedent z whose ProximityScore is below that of x or, if their ProximityScores are equal, whose strength of preference under the TC relation is higher than that of x.3
3 We assume the following conventions to hold: C = {Word, Nominal, Noun, DetDefinite, ...} denotes the set of word classes, and isac = {(Nominal, Word), (Noun, Nominal), (DetDefinite, Nominal), ...} ⊆ C × C denotes the subclass relation which yields a hierarchical ordering among these classes. The concept hierarchy consists of a set of concept names F = {COMPUTER-SYSTEM, NOTEBOOK, ACCESS-TIME, TIME-MS-PAIR, ...} (cf. Figure 1) and a subclass relation isaF = {(NOTEBOOK, COMPUTER-SYSTEM), (ACCESS-TIME, TIME-MS-PAIR), ...} ⊆ F × F. The set of role names R = {has-part, has-hd-drive, has-property, access-time, ...} contains the labels of admitted conceptual roles. These role names are also ordered in terms of a conceptual hierarchy, viz. isaR = {(has-hd-drive, has-part), (access-time, has-property), ...} ⊆ R × R. The relation permit ⊆ F × R × F characterises the range of possible conceptual roles among concepts, e.g., (HARD-DISK-DRIVE, access-time, ACCESS-TIME) ∈ permit. Furthermore, object.c refers to the concept denoted by object, while head denotes a structural
Table 4: Conceptual distance function

ProximityScore(from-concept, to-concept)
PreferredConceptualBridge(x, y, n) :⇔
  isPotentialEllipticAntecedent(x, y, n)
  ∧ ¬∃z: isPotentialEllipticAntecedent(z, y, n)
    ∧ ( ProximityScore(z.c, y.c) < ProximityScore(x.c, y.c)
      ∨ ( ProximityScore(z.c, y.c) = ProximityScore(x.c, y.c) ∧ z >TC x ) )

Table 5: Preferred conceptual bridge for textual ellipsis
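The selection expressed in Table 5 can be sketched as taking the minimum over (ProximityScore, TC-rank) pairs. The score and rank tables below are hypothetical inputs; a lower tc_rank is assumed to mean more preferred on the Cf:

```python
def preferred_bridge(candidates, scores, tc_rank):
    """Pick the antecedent with the shortest role chain; break ties by the
    functional ranking. `candidates` is a list of concept names, `scores`
    maps candidate -> ProximityScore (None = no role chain exists), and
    `tc_rank` maps candidate -> its position in the Cf ordering."""
    viable = [c for c in candidates if scores.get(c) is not None]
    if not viable:
        return None
    # lexicographic minimum: conceptual proximity first, TC preference second
    return min(viable, key=lambda c: (scores[c], tc_rank[c]))
```

Taking the minimum of the pair is equivalent to the negative existential of Table 5: no viable z may beat the chosen x on proximity, nor tie on proximity while outranking it functionally.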
5 Text cohesion parsing: Ellipsis resolution
The actor computation model (Agha & Hewitt 1987) provides the background for the procedural interpretation of lexicalised grammar specifications in terms of so-called word actors (Hahn et al. 1994). Word actors communicate via asynchronous message passing; an actor can only send messages to other actors it knows about, its so-called acquaintances. The arrival of a message at an actor triggers the execution of a method that is composed of grammatical predicates, such as those given in the previous section. The resolution of textual ellipsis depends on the results of the resolution of nominal anaphora and on the termination of the semantic interpretation of the current sentence. A SearchTextEllipsisAntecedent message will only be triggered at the occurrence of a definite noun phrase NP when NP is not a nominal anaphor and NP is not already connected via a Pof-type relation (e.g., property-of, physical-part-of).4
relation within dependency trees, viz. x being the head of y.

4 Associated with the set R is the set of inverse roles R-1. This distinction becomes important for already established relations like has-property (subsuming access-time, etc.) or has-physical-part (subsuming has-hd-drive, etc.) insofar as they do not block the initialisation of the ellipsis resolution procedure, whereas the existence of their inverses, which we here refer to as Pof-type relations, viz. property-of (subsuming access-time-of, etc.) and physical-part-of (subsuming hd-drive-of, etc.), does. This is simply due to the fact that the semantic interpretation of a phrase like "the access time of the new hard disk drive", as opposed to that of its elliptical counterpart "the access time" in sentence (3), where the genitive object is elliptified (zeroed), already leads to the creation of the Pof-type relation the ellipsis resolution mechanism is supposed to determine. This blocking condition has been proposed and experimentally validated by Katja Markert.
(2) Der Rechner hat durch dieses neue Laufwerk ausreichend Platz für Windows-Programme. (3) Darüber hinaus ist die Zugriffszeit mit 25 ms sehr kurz.
'The computer provides - because of this new HD-drive - sufficient storage for Windows programs. Also - is - the access time of 25 ms - quite short.'
Fig. 2: Sample parse for text ellipsis resolution

The message passing protocol for establishing cohesive links based on the recognition of textual ellipsis consists of two phases:

1. In phase 1, the message is forwarded from its initiator to the sentence delimiter of the preceding sentence, where its state is set to phase 2.

2. In phase 2, the sentence delimiter's acquaintance Cf is tested for the predicate PreferredConceptualBridge. Note that only nouns and pronouns are capable of responding to the SearchTextEllipsisAntecedent message and of being tested as to whether they fulfil the required criteria for an elliptical relation. If the text ellipsis predicate PreferredConceptualBridge succeeds, the determined antecedent sends a TextEllipsisAntecedentFound message to the initiator of the SearchTextEllipsisAntecedent message.

Upon receipt of the TextEllipsisAntecedentFound message, the discourse referent of the elliptical expression is conceptually related to the antecedent's referent via the most specific (common) Pof-type relation, thus preserving local coherence at the conceptual level of text propositions. In Figure 2 we illustrate the protocol for establishing elliptical relations by referring to the already introduced text fragment (2)-(3) which is repeated at the bottom line of Figure 2. Sentence (3) contains the definite NP die Zugriffszeit (the access time). Since, at the conceptual level, ACCESS-TIME does not subsume any lexical item in the preceding text (cf. Figure 1), the anaphora test fails. The conceptual correlate of die Zugriffszeit has also not been integrated in terms of a Pof-type relation into the conceptual representation of the sentence as a result of the semantic interpretation. Consequently, a SearchTextEllipsisAntecedent message is created by the word actor for Zugriffszeit.
That message is sent directly to the sentence delimiter of the previous sentence (phase 1), where the predicate PreferredConceptualBridge is evaluated for the acquaintance Cf (phase 2).
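The two-phase protocol can be sketched with plain synchronous method calls standing in for asynchronous actor messages; the class and function names mirror the message names in the text, but the encoding is ours:

```python
# Toy rendition of the two-phase protocol: the word actor for the elliptical
# NP sends a search message to the previous sentence delimiter, which
# evaluates the bridge predicate over its Cf acquaintance and replies.

class SentenceDelimiter:
    def __init__(self, cf, bridge_test):
        self.cf = cf                    # acquaintance: forward-looking centers
        self.bridge_test = bridge_test  # stands in for PreferredConceptualBridge

    def receive_search(self, initiator):
        # phase 2: test the Cf elements in preference order
        for candidate in self.cf:
            if self.bridge_test(candidate, initiator):
                return ("TextEllipsisAntecedentFound", candidate)
        return None

def search_text_ellipsis_antecedent(initiator, prev_delimiter):
    # phase 1: forward the message to the previous sentence delimiter
    return prev_delimiter.receive_search(initiator)
```

In the real system the reply would itself be an asynchronous message back to the initiator rather than a return value.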
The concepts are examined in the order given by the Cf, first LTE-LITE-25 (unit length 2), then SEAGATE-ST-3141 (unit length 1). Since no paths shorter than those with unit length 1 can exist, the test terminates. Even if another item in the centering list following SEAGATE-ST-3141 had this shortest possible length, it would not be considered, due to the functional preference given to SEAGATE-ST-3141 in the Cf. Since SEAGATE-ST-3141 has been tested successfully, a TextEllipsisAntecedentFound message is sent to the initiator of the SearchTextEllipsisAntecedent message. An appropriate update links the corresponding instances via the role access-time-of and, thus, local coherence is established at the conceptual level of the text knowledge base.
6 Comparison with related approaches
As far as proposals for the analysis of textual ellipsis are concerned, none of the standard grammar theories (e.g., HPSG, LFG, GB, CG, TAG) covers this issue. This is not surprising at all, as their advocates pay almost no attention to the text level of linguistic description (with the exception of several forms of anaphora) and also do not take conceptual criteria seriously into account as part of grammatical descriptions. More specifically, they lack any systematic connection to well-developed reasoning systems accounting for conceptual knowledge of the underlying domain. This latter argument also holds for the framework of DRT, although Wada (1994) deals with restricted forms of textual ellipsis in the DRT context. Only a few systems exist which resolve textual ellipses. As an example, consider the PUNDIT system (Palmer et al. 1986), which provides an informal solution for a particular domain. We consider our proposal superior, since it provides a more general, domain-independent treatment at the level of a formalised text grammar. The approach reported in this paper also extends our own previous work on textual ellipsis (Hahn 1989) by the incorporation of a more general proximity metric and an elaborated model of functional preferences on Cf elements which constrains the set of possible antecedents according to topic/comment patterns.
7 Conclusion
In this paper, we have outlined a model of textual ellipsis parsing. It considers conceptual criteria to be of primary importance and provides a proximity measure in order to assess various possible antecedents as candidates for proper bridges (Clark 1975) to elliptical expressions. In addition,
functional constraints based on topic/comment patterns contribute further restrictions on elliptical antecedents. The anaphora resolution module (Strube & Hahn 1995) and the textual ellipsis handler have both been implemented in Smalltalk as part of a comprehensive text parser for German. Besides the information technology domain, experiments with this parser have also been successfully run on medical domain texts, thus indicating that the grammar predicates we developed are not bound to a particular domain (knowledge base). The current lexicon contains a hierarchy of approximately 100 word class specifications with nearly 3,000 lexical entries and corresponding concept descriptions from the LOOM knowledge representation system (MacGregor & Bates 1987) — 900 and 500 concept/role specifications for the information technology and medicine domain, respectively.

Acknowledgements. We would like to thank our colleagues in the CLIF Lab who read earlier versions of this paper. In particular, improvements were due to discussions we had with N. Bröker, K. Markert, S. Schacht, K. Schnattinger, and S. Staab. This work has been funded by LGFG Baden-Württemberg (1.1.4-7631.0; M. Strube) and a grant from DFG (Ha 2907/1-3; U. Hahn).

REFERENCES

Agha, Gul & Carl Hewitt. 1987. "Actors: A Conceptual Foundation for Concurrent Object-oriented Programming". Research Directions in Object-Oriented Programming ed. by B. Shriver et al., 49-74. Cambridge, Mass.: MIT Press.

Charniak, Eugene. 1986. "A Neat Theory of Marker Passing". Proceedings of the 5th National Conference on Artificial Intelligence (AAAI '86), vol.1, 584-588.

Clark, Herbert H. 1975. "Bridging". Proceedings of the Conference on Theoretical Issues in Natural Language Processing (TINLAP-1), Cambridge, Mass., ed. by Roger Schank & B. Nash-Webber, 169-174.

Dahl, Östen, ed. 1974. Topic and Comment, Contextual Boundness and Focus. Hamburg: Buske.

Danes, František, ed. 1974. Papers on Functional Sentence Perspective. Prague: Academia.

Grosz, Barbara J., Aravind K. Joshi & Scott Weinstein. 1995. "Centering: A Framework for Modeling the Local Coherence of Discourse". Computational Linguistics 21:2.203-225.

Guha, R. V. & Douglas B. Lenat. 1990. "CYC: A Midterm Report". AI Magazine 11:3.32-59.
Hahn, Udo. 1989. "Making Understanders out of Parsers: Semantically Driven Parsing as a Key Concept for Realistic Text Understanding Applications". International Journal of Intelligent Systems 4:3.345-393.

Hahn, Udo, Susanne Schacht & Norbert Bröker. 1994. "Concurrent, Object-oriented Natural Language Parsing: The ParseTalk Model". International Journal of Human-Computer Studies 41:1/2.179-222.

Hayes, Patrick J. 1985. "The Second Naive Physics Manifesto". Formal Theories of the Commonsense World ed. by J. Hobbs & R. Moore, 1-36. Norwood, N.J.: Ablex.

MacGregor, Robert. 1991. "The Evolving Technology of Classification-based Knowledge Representation Systems". Principles of Semantic Networks ed. by J. Sowa, 385-400. San Mateo, Calif.: Morgan Kaufmann.

MacGregor, Robert & Raymond Bates. 1987. The LOOM Knowledge Representation Language. Information Sciences Institute, University of Southern California (ISI/RS-87-188).

Mars, Nicolaas J. I. 1994. "The Role of Ontologies in Structuring Large Knowledge Bases". Knowledge Building and Knowledge Sharing ed. by K. Fuchi & T. Yokoi, 240-248. Tokyo: Ohmsha and Amsterdam: IOS Press.

Palmer, Martha S. et al. 1986. "Recovering Implicit Information". Proceedings of the 24th Annual Meeting of the Association for Computational Linguistics (ACL-86), 10-19. New York, N.Y.

Rada, Roy, Hafedh Mili, Ellen Bicknell & Maria Blettner. 1989. "Development and Application of a Metric on Semantic Nets". IEEE Transactions on Systems, Man, and Cybernetics 19:1.17-30.

Resnik, Philip. 1995. "Using Information Content to Evaluate Semantic Similarity in a Taxonomy". Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI-95), vol.1, 448-453. Montreal, Canada.

Rips, L. J., E. J. Shoben & E. E. Smith. 1973. "Semantic Distance and the Verification of Semantic Relations". Journal of Verbal Learning and Verbal Behavior 12:1.1-20.

Simmons, Geoff. 1992. "Empirical Methods for 'Ontological Engineering'. Case Study: Objects". Ontologie und Axiomatik der Wissensbasis von LILOG ed. by G. Klose, E. Lang & Th. Pirlein, 125-154. Berlin: Springer.

Strube, Michael & Udo Hahn. 1995. "ParseTalk about Sentence- and Text-level Anaphora". Proceedings of the 7th Conference of the European Chapter of the Association for Computational Linguistics (EACL'95), 237-244.

Wada, Hajime. 1994. "A Treatment of Functional Definite Descriptions". Proceedings of the 15th International Conference on Computational Linguistics (COLING-94), vol.II, 789-795. Kyoto, Japan.
Improving a Robust Morphological Analyser Using Lexical Transducers

IÑAKI ALEGRÍA, XABIER ARTOLA & KEPA SARASOLA
University of the Basque Country

Abstract

This paper describes the components of a robust and wide-coverage morphological analyser for Basque and their transformation into lexical transducers. The analyser is based on the two-level formalism and has been designed in an incremental way with three main modules: the standard analyser, the analyser of linguistic variants, and the analyser without lexicon which can recognise word-forms without having their lemmas in the lexicon. This analyser is a basic tool for current and future work on automatic processing of Basque, and its first applications are a commercial spelling corrector and a general purpose lemmatiser/tagger. The lexical transducers are generated as a result of compiling the lexicon and a cascade of two-level rules (Karttunen et al. 1994). Their main advantages are speed and expressive power. Using lexical transducers for our analyser we have improved both the speed and the description of the different components of the morphological system. Some slight limitations have been found too.

1 Introduction
The two-level model of morphology (Koskenniemi 1983) has become the most popular formalism for highly inflected and agglutinative languages. The two-level system is based on two main components: (i) a lexicon where the morphemes (lemmas and affixes) and the possible links among them (morphotactics) are defined; (ii) a set of rules which controls the mapping between the lexical level and the surface level due to the morphophonological transformations. The rules are compiled into transducers, so it is possible to apply the system for both analysis and generation. There is a freely available software package, PC-Kimmo (Antworth 1990), which is a useful tool to experiment with this formalism. Different flavours of two-level morphology have been developed, most of them replacing the continuation-class based morphotactics with unification-based mechanisms (Ritchie et al. 1992; Sproat 1992).
We did our own implementation of the two-level model with slight variations, and applied it to Basque (Agirre et al. 1992), a highly inflected and agglutinative language. In order to deal with a wide variety of linguistic data we built a Lexical Database (LDBB). This database is both source and support for the lexicons needed in several applications, and was designed with the objectives of being neutral in relation to linguistic formalisms, flexible, open and easy to use (Agirre et al. 1995). At present it contains over 60,000 entries, each with its associated linguistic features (category, sub-category, case, number, etc.). In order to increase the coverage and the robustness, the analyser has been designed in an incremental way. It is composed of three main modules (see Figure 1): the standard analyser, the analyser of linguistic variants produced due to dialectal uses and competence errors, and the analyser without lexicon which can recognise word-forms without having their lemmas in the lexicon. An important feature of the analyser is its homogeneity, as the three different steps are based on two-level morphology, far from ad-hoc solutions.
Fig. 1: Modules of the analyser

This analyser is a basic tool for current and future work on automatic processing of Basque, and its first two applications are a commercial spelling corrector (Aduriz et al. 1994) and a general purpose lemmatiser/tagger (Aduriz et al. 1995). Following an overview of lexical transducers, the application of the two-level model and of lexical transducers to the different steps of the morphological analysis of Basque is described.
IMPROVING MORPHOLOGY USING TRANSDUCERS 2
99
Lexical transducers
A lexical transducer (Karttunen et al. 1992; Karttunen 1994) is a finite-state automaton that maps inflected surface forms into lexical forms, and can be seen as an evolution of two-level morphology where:

• Morphological categories are represented as part of the lexical form. Thus it is possible to avoid the use of diacritics.

• Inflected forms of the same word are mapped to the same canonical dictionary form. This increases the distance between the lexical and surface forms. For instance, better is expressed through its canonical form good (good+COMP:better).

• Intersection and composition of transducers is possible (see Kaplan & Kay 1994). In this way the integration of the lexicon (itself another transducer) into the automaton can be resolved, and the changes between the lexical and surface levels can be expressed as a cascade of two-level rule systems (Figure 2).

Fig. 2: Lexical transducers (from Karttunen et al. 1992)

In addition, the morphological process using lexical transducers is very fast (thousands of words per second) and the transducer for a whole morphological description can be compacted in less than 1 MB.
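The cascade-and-compose idea can be illustrated with finite relations standing in for transducers. The good+COMP:better mapping follows the example above, while the intermediate form good+er is our own invented stand-in for whatever the rule system actually produces:

```python
# Minimal sketch: each level-to-level mapping is modelled as a finite relation
# (a dict of string pairs); composing two mappings yields the direct
# lexical-to-surface transducer. All forms below are illustrative.

lexicon = {"good+COMP": "good+er"}   # lexical form -> intermediate form
rules = {"good+er": "better"}        # intermediate form -> surface form

def compose(upper, lower):
    """Relation composition, the operation that merges a cascade of rule
    systems (and the lexicon) into a single transducer."""
    return {lex: lower[mid] for lex, mid in upper.items() if mid in lower}

lexical_transducer = compose(lexicon, rules)

def analyse(surface):
    """Run the composed transducer 'backwards': surface form -> lexical forms."""
    return [lex for lex, surf in lexical_transducer.items() if surf == surface]
```

Real lexical transducers compose regular relations (infinite in general), but the key property shown here carries over: after composition, the intermediate levels disappear and analysis is a single lookup.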
Different tools to build lexical transducers (Karttunen & Beesley 1992; Karttunen 1993) have been developed at Xerox and we are using them. Uses of lexical transducers are documented by Chanod (1994) and Kwon & Karttunen (1994).
3 The standard analyser
Basque is an agglutinative language; that is, for the formation of words the dictionary entry independently takes each of the elements necessary for the different functions (syntactic case included). More specifically, the affixes corresponding to the determinant, number and declension case are taken in this order and independently of each other (deep morphological structure). One of the principal characteristics of Basque is its declension system with numerous cases, which differentiates it from the languages spoken in the surrounding countries. We have applied the two-level model defining the following elements (Agirre et al. 1992; Alegría 1995):

• Lexicon: over 60,000 entries have been defined corresponding to lemmas and affixes, grouped into 154 sublexicons. The representation of the entries is not canonical because 18 diacritics are used to control the application of morphophonological rules.

• Continuation classes: they are groups of sublexicons to control the morphotactics. Each entry of the lexicon has its continuation class and all together define the morphotactics graph. The long distance dependencies among morphemes cannot be properly expressed by continuation classes, therefore in our implementation we extended their semantics, defining the so-called extended continuation classes.

• Morphophonological rules: 24 two-level rules have been defined to express the morphological, phonological and orthographic changes between the lexical and the surface levels that appear when the morphemes are combined.

The morphological analyser attaches to each input word-form all possible interpretations and its associated information, which is given as pairs of morphosyntactic features. The conversion of our description to a lexical transducer was done in the following steps:

1. Canonical forms and morphological categories were integrated in the lexicon from the lexical database.

2. Due to long distance dependencies among morphemes, which could not be resolved in the lexicon, two additional rules were written to ban some combinations of morphemes. These rules can be put in a different rule system near to the lexicon without mixing morphotactics and morphophonology (see Figure 3).

3. The standard rules could have been left without changes (mapping canonical forms in the lexicon and arbitrary forms) but were changed in order to replace diacritics with morphological features, yielding a clearer description of the morphology of the language.
Fig. 3: Lexical transducer for the standard analysis of Basque The resultant lexical transducer is about 500 times faster than the original system.
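The continuation-class mechanism described above can be sketched as a recursive lookup through sublexicons. The tiny lexicon below (two noun stems, a determiner, two case endings) is an invented illustration, not a fragment of the authors' 154-sublexicon description:

```python
# Hypothetical continuation-class lexicon: each sublexicon maps morphemes to
# the sublexicons that may follow them; "End" marks a licensed word end.
SUBLEXICONS = {
    "NounStems": {"etxe": ["Determiners", "Cases"], "kale": ["Determiners", "Cases"]},
    "Determiners": {"a": ["Cases", "End"]},
    "Cases": {"tik": ["End"], "ra": ["End"]},
}

def segmentations(word, sublexicon="NounStems"):
    """Enumerate the morpheme splits of `word` licensed by the continuation
    classes, starting from the given sublexicon."""
    if sublexicon == "End":
        return [[]] if word == "" else []
    results = []
    for morpheme, continuations in SUBLEXICONS[sublexicon].items():
        if word.startswith(morpheme):
            rest = word[len(morpheme):]
            for cont in continuations:
                for tail in segmentations(rest, cont):
                    results.append([morpheme] + tail)
    return results
```

Long-distance dependencies (one morpheme constraining a non-adjacent one) cannot be stated in this purely local scheme, which is why the authors extended the semantics of continuation classes and later moved such constraints into a separate rule system.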
4 The analysis and correction of linguistic variants
Because of the recent standardisation and the widespread dialectal use of Basque, the standard morphology is not enough to offer good results when analysing corpora. To increase the coverage of the morphological processor an additional two-level subsystem was added (Aduriz et al. 1993). This subsystem is also used in the spelling corrector to manage competence errors and has two main components:

1. New morphemes linked to the corresponding correct ones. They are added to the lexical system and they describe particular variations, mainly dialectal forms. Thus, the new entry tikan, dialectal form of the ablative singular morpheme, linked to its corresponding right entry tik, makes it possible to analyse and correct word-forms such as etxetikan, kaletikan, ... (variants of etxetik from the house, kaletik from the street, ...). By changing the continuation class of morphemes, morphotactic errors can be analysed.

2. New two-level rules describing the most likely regular changes that are produced in the variants. These rules have the same structure and treatment as the standard ones. Twenty-five new rules have been defined to cover the most common competence errors. For instance, the rule h:0 => V:V _ V:V describes that between vowels the h of the lexical level may disappear at the surface level. In this way the word-form bear, misspelling of behar, to need, can be analysed.

All these rules are optional and have to be compiled with the standard rules, but some inconsistencies have to be solved because some new changes were forbidden in the original rules. To correct a word-form, the result of the analysis has to be fed into morphological generation using the correct morphemes linked to the variants and the original rules. To correct beartzetikan, variant of behartzetik, two steps, analysis and generation, are followed, as shown in Figure 4.
When we decided to use lexical transducers for the treatment of linguistic variants, the following procedure was applied:

1. The additional morphemes linked to the standard ones are handled using the possibility of expressing two levels in the lexicon. In one level the non-standard morpheme is specified and in the other (the one corresponding to the result of the analysis) the standard morpheme.

2. The additional rules do not need to be integrated with the standard ones (Figure 5), and so it is not necessary to solve the inconsistencies.
Fig. 4: Steps for correction

As Figure 5 (B) shows, it is possible and clearer to put these rules in another plane near to the surface, because most of the additional rules are due to phonetic changes and do not require morphological information. Only the surface characters, the morpheme boundary and additional information about one change (the final a of lemmas) complete the intermediate level between the two rule systems.

3. In our original implementation it was possible to distinguish between standard and non-standard analyses (the additional rules are marked and this information can be obtained as a result of the analysis), and so the non-standard information can be additional; but with lexical transducers it is necessary to store two transducers, one for standard analysis and another for standard and non-standard analysis.

Although in the original system the speed of analysis using additional information was two or three times slower than the standard analysis, using lexical transducers the difference between both analyses is very slight.
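The effect of one optional variant rule, h:0 => V:V _ V:V, can be imitated with a regular expression. Treating the rule as all-or-nothing per word (rather than independently per occurrence) is a simplification of this sketch, and the standard lexicon used below is an invented toy:

```python
import re

VOWEL = "[aeiou]"
# optional rule h:0 => V:V _ V:V - a lexical h between vowels
# may be absent at the surface level
_intervocalic_h = re.compile(f"(?<={VOWEL})h(?={VOWEL})")

def surface_variants(standard):
    """Surface forms licensed by optionally applying h-deletion
    (applied either nowhere or everywhere, a simplification)."""
    return {standard, _intervocalic_h.sub("", standard)}

def analyse_variant(surface, standard_lexicon):
    """Return the standard forms of which `surface` is a licensed variant,
    i.e., the analysis step that precedes regeneration of the correct form."""
    return sorted(std for std in standard_lexicon
                  if surface in surface_variants(std))
```

Generating the variant set from the standard form and then inverting it mirrors the analysis/generation pair of Figure 4, where the misspelled bear is first analysed and the correct behar regenerated.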
The analysis of unknown words
Based on the idea used in speech synthesis (Black et al. 1991), a two-level mechanism for analysis without a lexicon was added to increase the robustness of the analyser.
INAKI ALEGRIA, XABIER ARTOLA & KEPA SARASOLA
Fig. 5: Lexical transducer for the analysis of linguistic variants (panels A and B)
This mechanism has the following two main components in order to be capable of treating unknown words:
1. generic lemmas represented by "??" (one for each possible open category or subcategory), which are organised with the affixes in a small two-level lexicon;
2. two additional rules expressing the relationship between the generic lemmas at the lexical level and any acceptable lemma of Basque, which are combined with the standard ones.
Some standard rules have to be modified because both surface and lexical levels are specified in them, and in this kind of analysis the lexical level of the lemmas changes. The two-level mechanism is also used to analyse the unknown forms, and obtaining at least one analysis is guaranteed. In order to eliminate the great number of ambiguities in the analysis, a local disambiguation process is carried out.
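The guessing mechanism can be sketched in Python; the suffix table, the lemma-shape pattern, and all names below are illustrative assumptions, not the actual two-level lexicon:

```python
import re

# Illustrative sketch of the guessing mechanism: generic lemmas ("??")
# for open categories combined with a small affix lexicon, plus a rule
# relating "??" to any acceptable lemma shape.

SUFFIXES = {"tik": "ablative", "ko": "locative genitive", "": "absolutive"}
LEMMA_SHAPE = re.compile(r"^[a-zñ]{2,}$")   # assumed stand-in for "any lemma"

def guess_analyses(word):
    """Return (hypothetical lemma, category, suffix gloss) triples;
    the whole word as a bare lemma guarantees at least one analysis."""
    analyses = []
    for suffix, gloss in SUFFIXES.items():
        stem = word[: len(word) - len(suffix)] if suffix else word
        if word.endswith(suffix) and LEMMA_SHAPE.match(stem):
            analyses.append((stem, "??/open-category", gloss))
    return analyses

print(guess_analyses("gasteiztik"))   # unknown place name + ablative -tik
```

As in the text, every word receives at least one analysis (the whole word as a bare generic lemma), and the resulting ambiguity would then be reduced by local disambiguation.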
By using lexical transducers the two additional rules can be placed independently (see Figure 6), and so the original rules can remain unchanged. In this case the additional subsystem is arranged close to the lexicon, because it maps the transformation between generic and hypothetical lemmas at the lexical level. The resulting lexical transducer is very compact and fast.
Fig. 6: Lexical transducer for the analysis of unknown words

Our system also has a user lexicon and an interface for the update process. Some information about the new entries (mainly part of speech) is necessary to add them to the user lexicon. The user lexicon is combined with the general one, increasing the coverage of the morphological analyser. This mechanism is very useful in the process of spelling correction, but on-line updating of the user lexicon is then necessary. This treatment is carried out in our original implementation; when we use lexical transducers, however, the updating operation is slow (it is necessary to compile everything together) and therefore on-line updating becomes problematic. Carter (1995) proposes compiling affixes and rules, but not lemmas, in order to have flexibility when dealing with open lexicons; this, however, presents problems managing compounds at run-time.
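The trade-off between the compiled general lexicon and an on-line user lexicon can be sketched with a minimal Python toy, with plain dicts standing in for the compiled transducers; all names here are illustrative assumptions:

```python
# Minimal sketch of the trade-off: the general lexicon is "compiled"
# (costly to rebuild), while the user lexicon is a plain dict that can
# be updated on-line without any recompilation.

class Analyser:
    def __init__(self, general):
        self.general = dict(general)   # "compiled": costly to rebuild
        self.user = {}                 # user lexicon: cheap on-line updates

    def add_user_entry(self, lemma, pos):
        self.user[lemma] = pos         # available immediately, no recompile

    def lookup(self, lemma):
        # user lexicon combined with the general one at lookup time
        return self.general.get(lemma) or self.user.get(lemma)

a = Analyser({"etxe": "noun"})
print(a.lookup("ordenagailu"))          # unknown word -> None
a.add_user_entry("ordenagailu", "noun")
print(a.lookup("ordenagailu"))          # after on-line update -> noun
```

With a single compiled lexical transducer, the second step (immediate availability of a new entry) is exactly what is lost, which is the problem discussed above.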
6 Conclusions
A morphological processor based on the two-level formalism has been designed in an incremental way in three main modules: the standard analyser; the analyser of linguistic variants produced by dialectal uses and competence errors; and the analyser without lexicon, which can recognise word-forms without having their lemmas in the lexicon. This analyser is a basic tool for current and future work on the automatic processing of Basque.

Concept                 A           B           A+B
Number of words         4.846       2.343       7.207
Different words         2.607       1.429       4.036
Unknown words           307         85          392
Linguistic variants     101         28          129
Analysed                85 (84%)    22 (79%)    107 (83%)
Full wrong analysis     21          4           25
Precision               99,2%       99,7%       99,4%

Table 1: Figures about the different kinds of analysis

Figures about the precision of the analyser are given in Table 1. Two different corpora were used: (A) a text from a magazine in which foreign names appear, and (B) a text about philosophy. The percentages of unknown words and precision are calculated over different words, so the results over the whole corpus would be better. Using lexical transducers for our analyser we have improved both the speed and the description of the different components of the tool. Some slight limitations have been found too.

Acknowledgements. This work had partial support from the local Government of Gipuzkoa and from the Government of the Basque Country. We would like to thank Xerox for letting us use their tools, and also Ken Beesley and Lauri Karttunen for their help in using these tools and designing the lexical transducers. We also want to thank Eneko Agirre for his help with the English version of this manuscript.
REFERENCES

Aduriz, Itziar, E. Agirre, I. Alegria, X. Arregi, J.M. Arriola, X. Artola, A. Diaz de Illarraza, N. Ezeiza, M. Maritxalar, K. Sarasola & M. Urkia. 1993. "A Morphological Analysis Based Method for Spelling Correction". Proceedings of the 6th Conference of the European Chapter of the Association for Computational Linguistics (EACL'93), 463-463. Utrecht, The Netherlands.
Aduriz, Itziar, E. Agirre, I. Alegria, X. Arregi, J.M. Arriola, X. Artola, A. Da Costa, A. Diaz de Illarraza, N. Ezeiza, M. Maritxalar, K. Sarasola & M. Urkia. 1994. "Xuxen-Mac: un corrector ortografico para textos en euskara". Proceedings of the 1st Conference Universidad y Macintosh (UNIMAC), vol. II, 305-310. Madrid, Spain.
Aduriz, Itziar, I. Alegria, J.M. Arriola, X. Artola, A. Diaz de Illarraza, N. Ezeiza, K. Gojenola & M. Maritxalar. 1995. "Different issues in the design of a lemmatiser/tagger for Basque". From Text to Tag Workshop, SIGDAT (EACL'95), 18-23. Dublin, Ireland.
Agirre, Eneko, I. Alegria, X. Arregi, X. Artola, A. Diaz de Illarraza, M. Maritxalar, K. Sarasola & M. Urkia. 1992. "XUXEN: A spelling checker/corrector for Basque based on Two-Level morphology". Proceedings of the 3rd Conference on Applied Natural Language Processing (ANLP'92), 119-125. Trento, Italy.
Agirre, Eneko, X. Arregi, J.M. Arriola, X. Artola, A. Diaz de Illarraza, J.M. Insausti & K. Sarasola. 1995. "Different issues in the design of a general-purpose Lexical Database for Basque". Proceedings of the 1st Workshop on Applications of Natural Language to Data Bases (NLDB'95), 299-313. Versailles, France.
Alegria, Iñaki. 1995. Euskal morfologiaren tratamendu automatikorako tresnak. Ph.D. dissertation, University of the Basque Country. Donostia, Basque Country.
Antworth, Evan L. 1990. PC-KIMMO: A two-level processor for morphological analysis. Dallas, Texas: Summer Institute of Linguistics.
Black, Alan W., Joke van de Plassche & Briony Williams. 1991. "Analysis of Unknown Words through Morphological Decomposition". Proceedings of the 5th Conference of the European Chapter of the Association for Computational Linguistics (EACL'91), vol. I, 101-106.
Carter, David. 1995. "Rapid development of morphological descriptions for full language processing systems". Proceedings of the 7th Conference of the European Chapter of the Association for Computational Linguistics (EACL'95), 202-209. Dublin, Ireland.
Chanod, Jean-Pierre. 1994. "Finite-state Composition of French Verb Morphology". Technical Report MLTT-005. Meylan, France: Rank Xerox Research Centre, Grenoble Laboratory.
Kaplan, Ronald M. & Martin Kay. 1994. "Regular models of phonological rule systems". Computational Linguistics 20:3.331-380.
Karttunen, Lauri & Kenneth R. Beesley. 1992. "Two-Level Rule Compiler". Technical Report ISTL-NLTT-1992-2. Palo Alto, Calif.: Xerox Palo Alto Research Center.
Karttunen, Lauri, Ronald M. Kaplan & Annie Zaenen. 1992. "Two-level morphology with composition". Proceedings of the 14th Conference on Computational Linguistics (COLING'92), vol. I, 141-148. Nantes, France.
Karttunen, Lauri. 1993. "Finite-State Lexicon Compiler". Technical Report ISTL-NLTT-1993-04-02. Palo Alto, Calif.: Xerox Palo Alto Research Center.
Karttunen, Lauri. 1994. "Constructing Lexical Transducers". Proceedings of the 15th Conference on Computational Linguistics (COLING'94), vol. I, 406-411. Kyoto, Japan.
Koskenniemi, Kimmo. 1983. Two-level Morphology: A General Computational Model for Word-Form Recognition and Production. Publications 11. Helsinki: University of Helsinki.
Kwon, Hyuk-Chul & Lauri Karttunen. 1994. "Incremental construction of a lexical transducer for Korean". Proceedings of the 15th Conference on Computational Linguistics (COLING'94), vol. II, 1262-1266. Kyoto, Japan.
Ritchie, Graeme D., Alan W. Black, Graham J. Russell & Stephen G. Pulman. 1992. Computational Morphology. Cambridge, Mass.: MIT Press.
Sproat, Richard. 1992. Morphology and Computation. Cambridge, Mass.: MIT Press.
II SEMANTICS AND DISAMBIGUATION
Context-Sensitive Word Distance by Adaptive Scaling of a Semantic Space

HIDEKI KOZIMA & AKIRA ITO
Communications Research Laboratory

Abstract

This paper proposes a computationally feasible method for measuring the context-sensitive semantic distance between words. The distance is computed by adaptive scaling of a semantic space. In the semantic space, each word in the vocabulary V is represented by a multi-dimensional vector which is extracted from an English dictionary through principal component analysis. Given a word set C which specifies a context, each dimension of the semantic space is scaled up or down according to the distribution of C in the semantic space. In the space thus transformed, the distance between words in V becomes dependent on the context C. An evaluation through a word prediction task shows that the proposed measurement successfully extracts the context of a text.

1 Introduction
Semantic distance (or similarity) between words is one of the basic measurements used in many fields of natural language processing, information retrieval, etc. Word distance provides bottom-up information for text understanding and generation, since it indicates semantic relationships between words that form a coherent text structure (Grosz & Sidner 1986); word distance also provides a basis for text retrieval (Schank 1990), since it works as associative links between texts. A number of methods for measuring semantic word distance have been proposed in the studies of psycholinguistics, computational linguistics, etc. One of the pioneering works in psycholinguistics is the 'semantic differential' (Osgood 1952), which analyses the meaning of words by means of psychological experiments on human subjects. Recent studies in computational linguistics have proposed computationally feasible methods for measuring semantic word distance. For example, Morris & Hirst (1991) used Roget's thesaurus as a knowledge base for determining whether or not two words are semantically related; Brown et al. (1992) classified a vocabulary into semantic classes according to the co-occurrence of words in large corpora;
Kozima & Furugori (1993) computed the similarity between words by means of spreading activation on a semantic network of an English dictionary. The measurements in these former studies are so-called context-free or static ones, since they measure word distance irrespective of contexts. However, word distance changes in different contexts. For example, from the word car, we can associate related words in the following two directions:

• car → bus, taxi, railway, ...
• car → engine, tire, seat, ...

The former is in the context of 'vehicle', and the latter is in the context of 'components of a car'. Even in free-association tasks, we often imagine a certain context for retrieving related words. In this paper, we will incorporate context-sensitivity into semantic distance between words. A context can be specified by a set C of keywords of the context (for example, {car, bus} for the context 'vehicle'). Now we can exemplify context-sensitive word association as follows:

• C = {car, bus} → taxi, railway, airplane, ...
• C = {car, engine} → tire, seat, headlight, ...

Generally, we observe a different distance for a different context. So, in this paper we will deal with the following problem: under the context specified by a given word set C, compute the semantic distance d(w, w' | C) between any two words w, w' in our vocabulary V. Our strategy for this context-sensitivity is 'adaptive scaling of a semantic space'. Section 2 introduces the semantic space, where each word in the vocabulary V is represented by a multi-dimensional semantic vector. Section 3 describes the adaptive scaling: for a given word set C that specifies a context, each dimension of the semantic space is scaled up or down according to the distribution of C in the semantic space. After this transformation, distance between Q-vectors becomes dependent on the given context. Section 4 shows some examples of the context-sensitive word distance thus computed.
Section 5 evaluates the proposed measurement through a word prediction task. Section 6 discusses some theoretical aspects of the proposed method, and Section 7 gives our conclusion and perspective.

2 Vector-representation of word meaning
Each word in the vocabulary V is represented by a multi-dimensional Q-vector. In order to obtain Q-vectors, we first generate 2851-dimensional
P-vectors by spreading activation on a semantic network of an English dictionary (Kozima & Furugori 1993). Next, through principal component analysis on P-vectors, we map each P-vector onto a Q-vector with a reduced number of dimensions (see Figure 1).

Fig. 1: Mapping words onto Q-vectors

2.1 From an English dictionary to P-vectors
Every word w in the vocabulary V is mapped onto a P-vector P(w) by spreading activation on the semantic network. The network is systematically constructed from a subset of the English dictionary, LDOCE (Longman Dictionary of Contemporary English). The network has 2851 nodes corresponding to the words in LDV (the Longman Defining Vocabulary, 2851 words). The network also has 295914 links between these nodes: each node has a set of links corresponding to the words in its definition in LDOCE. Since every headword in LDOCE is defined by using LDV only, the network becomes a closed cross-reference network of English words. Each node of the network can hold activity, and this activity flows through the links. Hence, activating a node in the network for a certain period of time causes the activity to spread over the network and forms a pattern of activity distribution on it. Figure 2 shows the pattern generated by activating the node red; the graph plots the activity values of 10 dominant nodes at each step in time. The P-vector P(w) of a word w is the pattern of activity distribution generated by activating the node corresponding to w. P(w) is a 2851-dimensional vector consisting of the activity values of the nodes at T = 10, as an approximation of the equilibrium. P(w) indicates how strongly each node of the network is semantically related to w. In this paper, we define the vocabulary V as LDV (2851 words) in order to make our argument and experiments simple. Although V is not a large vocabulary, it covers 83.07% of the 1006815 words in the Lancaster-Oslo/Bergen (LOB) corpus. In addition, V can be extended to the set of
Fig. 2: Spreading activation
Fig. 3: Clustering of P-vectors
all headwords in LDOCE (more than 56000 words), since a P-vector of a non-LDV word can be produced by activating the set of LDV-words in its dictionary definition. (Remember that every headword in LDOCE is defined using only LDV.) The P-vector P(w) represents the meaning of the word w in its relationship to other words in the vocabulary V. Geometric distance between two P-vectors P(w) and P(w') indicates the semantic distance between the words w and w'. Figure 3 shows a part of the result of hierarchical clustering on P-vectors, using Euclidean distance between centers of clusters. The dendrogram reflects intuitive semantic similarity between words: for instance, rat/mouse, tiger/lion/cat, etc. However, the similarity thus observed is context-free and static. The purpose of this paper is to make it context-sensitive and dynamic.
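The spreading-activation step can be sketched on a toy network in Python; the four-word network, decay constant and update rule below are assumptions for illustration (the real network has 2851 nodes and 295914 links):

```python
import numpy as np

# Toy sketch of spreading activation on a dictionary cross-reference
# network. Activating one node for T steps yields an activity pattern
# over all nodes, which serves as that word's P-vector.

words = ["red", "colour", "blood", "fire"]
# links[i][j] = 1 if word j occurs in the definition of word i (toy data)
links = np.array([[0, 1, 1, 1],
                  [1, 0, 0, 0],
                  [1, 1, 0, 0],
                  [1, 1, 0, 0]], dtype=float)
W = links / links.sum(axis=1, keepdims=True)   # normalise outgoing activity

def p_vector(word, T=10, decay=0.5):
    """Activity pattern after activating `word` for T steps (its P-vector)."""
    a = np.zeros(len(words))
    src = words.index(word)
    for _ in range(T):
        a = decay * (a @ W)   # activity flows along definition links
        a[src] += 1.0         # keep activating the source node
    return a

print(np.round(p_vector("red"), 3))
```

The activated node itself ends up with the largest activity, and words reachable through many definition links accumulate more activity than distant ones, mirroring the pattern in Figure 2.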
2.2 From P-vectors to Q-vectors
Through principal component analysis, we map every P-vector onto a Q-vector, for which we will define a context-sensitive distance later. The principal component analysis of P-vectors provides a series of 2851 principal components. The most significant m principal components work as new orthogonal axes that span an m-dimensional vector space. By these m principal components, every P-vector (with 2851 dimensions) can be mapped onto a Q-vector (with m dimensions). The value of m, which will be determined later, is much smaller than 2851. This brings about not only compression of the semantic information, but also elimination of the noise in P-vectors. First, we compute the principal components X1, X2, ..., X2851, each
of which is a 2851-dimensional vector, under the following conditions:

• For any Xi, its norm |Xi| is 1.
• For any Xi, Xj (i ≠ j), their inner product (Xi, Xj) is 0.
• The variance vi of the P-vectors projected onto Xi is not smaller than any vj (j > i).

In other words, X1 is the first principal component with the largest variance of P-vectors, X2 is the second principal component with the second-largest variance of P-vectors, and so on. Consequently, the set of principal components X1, X2, ..., X2851 provides a new orthonormal coordinate system for P-vectors. Next, we pick up the first m principal components X1, X2, ..., Xm. The principal components are in descending order of their significance, because the variance vi indicates the amount of information represented by Xi. We found that even the first 200 axes (7.02% of the 2851 axes) can represent 45.11% of the total information of P-vectors. The amount of information represented by Q-vectors increases with m: 66.21% for the first 500 axes, 82.80% for the first 1000 axes. However, for large m, each Q-vector would be isolated because of overfitting: a large number of parameters could not be estimated from a small number of data. We estimate the optimal number of dimensions of Q-vectors to be m = 281, which can represent 52.66% of the total information. This optimisation is done by minimising the proportion of noise remaining in Q-vectors. The amount of noise is estimated by Σw∈F |Q(w)|, where F (⊂ V) is a set of 210 function words: determiners, articles, prepositions, pronouns, and conjunctions. We estimated the proportion of noise for all m = 1, ..., 2851 and obtained the minimum for m = 281. Therefore, from now on we will use a 281-dimensional semantic space. Finally, we map each P-vector P(w) onto a 281-dimensional Q-vector Q(w). The i-th component of Q(w) is the projected value of P(w) on the principal component Xi; the origin of Xi is set to the average of the projected values on it.
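The P-vector to Q-vector mapping can be sketched with a toy principal component analysis (illustrative Python on random data; the actual setting uses 2851-dimensional P-vectors and keeps m = 281 components):

```python
import numpy as np

# Toy sketch of the P-vector -> Q-vector mapping via principal
# component analysis on random 10-dimensional data.

rng = np.random.default_rng(0)
P = rng.random((50, 10))            # 50 toy P-vectors
m = 3                               # number of principal components kept

P0 = P - P.mean(axis=0)             # set the origin to the average vector
cov = np.cov(P0, rowvar=False)      # covariance between dimensions
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]   # descending variance v1 >= v2 >= ...
X = eigvecs[:, order[:m]]           # first m orthonormal axes X1..Xm

Q = P0 @ X                          # Q-vectors: projections onto X1..Xm

# proportion of the total information (variance) represented by Q-vectors
print(Q.shape, eigvals[order[:m]].sum() / eigvals.sum())
```

The printed ratio corresponds to the "amount of information" figures quoted in the text (e.g. 52.66% for m = 281 in the real data).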
3 Adaptive scaling of the semantic space
Adaptive scaling of the semantic space of Q-vectors provides context-sensitive and dynamic distance between Q-vectors. Simple Euclidean distance between Q-vectors is not so different from that between P-vectors; both are context-free and static distances. The adaptive scaling process transforms the semantic space to adapt it to a given context C. In the semantic space thus
Fig. 4: Adaptive scaling
Fig. 5: Clusters in a subspace
transformed, simple Euclidean distance between Q-vectors becomes dependent on C (see Figure 4).

3.1 Semantic subspaces
A subspace of the semantic space of Q-vectors works as a simple device for semantic word clustering. In a semantic subspace with the dimensions appropriately selected, the Q-vectors of semantically related words are expected to form a cluster. The reasons for this are as follows:

• Semantically related words have similar P-vectors, as illustrated in Figure 3.
• The dimensions of Q-vectors are extracted from the correlations between P-vectors by means of principal component analysis.

As an example of word clustering in the semantic subspaces, let us consider the following 15 words: 1. after, 2. ago, 3. before, 4. bicycle, 5. bus, 6. car, 7. enjoy, 8. former, 9. glad, 10. good, 11. late, 12. pleasant, 13. railway, 14. satisfaction, 15. vehicle. We plotted these words on the subspace X2 × X3, namely the plane spanned by the second and third dimensions of Q-vectors. As shown in Figure 5, the words form three apparent clusters, namely 'goodness', 'vehicle', and 'past'. However, it is still difficult to select appropriate dimensions for making a semantic cluster for given words. In the example above, we used only two dimensions; most semantic clusters need more dimensions to be well-separated. Moreover, each of the 2851 dimensions is simply selected
CONTEXT-SENSITIVE WORD DISTANCE
117
or discarded; this ignores their possible contribution to the formation of clusters.

Fig. 6: Adaptive scaling of the semantic space

3.2 Adaptive scaling
Adaptive scaling of the semantic space provides a weight for each dimension in order to form a desired semantic cluster; these weights are given by scaling factors of the dimensions. This method makes the semantic space adapt to a given context C in the following way: each dimension of the semantic space is scaled up or down so as to make the words in C form a cluster in the semantic space. In the semantic space thus transformed, the distance between Q-vectors changes with C. For example, as illustrated in Figure 6, when C has an oval-shaped (generally, hyper-elliptic) distribution in the pre-scaling space, each dimension is scaled up or down so that C has a round-shaped (generally, hyper-spherical) distribution in the transformed space. This coordinate transformation changes the mutual distance among Q-vectors. In the raw semantic space (Figure 6, left), the Q-vector • is closer to C than the Q-vector o; in the transformed space (Figure 6, right), it is the other way round: o is closer to C, while • is further apart. The distance d(w, w' | C) between two words w, w' under the context C = {w1, ..., wn} is defined as follows:

d(w, w' | C) = sqrt( Σi=1..m (fi (qi − q'i))² )

where Q(w) and Q(w') are the m-dimensional Q-vectors of w and w', respectively: Q(w) = (q1, ..., qm), Q(w') = (q'1, ..., q'm).
The scaling factor fi ∈ [0,1] of the i-th dimension is a decreasing function of the ratio ri = SDi(C) / SDi(V), where SDi(C) is the standard deviation of the i-th component values of w1, ..., wn, and SDi(V) is that of the words in the whole vocabulary V. The operation of the adaptive scaling described above is summarised as follows:

• If C forms a compact cluster in the i-th dimension (ri → 0), the dimension is scaled up (fi → 1) to be sensitive to small differences in the dimension.
• If C does not form an apparent cluster in the i-th dimension (ri >> 0), the dimension is scaled down (fi → 0) to ignore small differences in the dimension.

Now we can tune the distance between Q-vectors to a given word set C which specifies the context for measuring the distance. In other words, we can tune the semantic space of Q-vectors to the context C. This tune-up procedure is not computationally expensive: once we have computed the set of Q-vectors and SD1(V), ..., SDm(V), all we have to do for a given word set C is to compute the scaling factors f1, ..., fm. Computing distance between Q-vectors in the transformed space is no more expensive than computing simple Euclidean distance between Q-vectors.

4 Examples of measuring the word distance
Let us see a few examples of the context-sensitive distance between words computed by adaptive scaling of the semantic space with 281 dimensions. Here we deal with the following problem: under the context specified by a given word set C, compute the distance d(w, C) between w and C, for every word w in our vocabulary V. The distance d(w, C) is defined as follows:

d(w, C) = sqrt( Σi=1..m (fi (qi − ci))² )

where (c1, ..., cm) is the center of the Q-vectors of the words in C. This means that the distance d(w, C) is equal to the distance between w and the center of C in the transformed semantic space. In other words, d(w, C) indicates the distance of w from the context C.
C = {bus, car, railway}            C = {bus, scenery, tour}
w ∈ C+(15)      d(w, C)            w ∈ C+(15)      d(w, C)
car_1           0.1039             bus_1           0.1008
railway_1       0.1131             scenery_1       0.1122
bus_1           0.1141             tour_2          0.1211
carriage_1      0.1439             tour_1          0.1288
motor_1         0.1649             abroad_1        0.1559
motor_2         0.1949             tourist_1       0.1593
track_2         0.1995             passenger_1     0.1622
track_1         0.2024             make_2          0.1691
road_1          0.2038             make_3          0.1706
passenger_1     0.2185             everywhere_1    0.1713
vehicle_1       0.2274             garage_1        0.1715
engine_1        0.2469             set_2           0.1723
garage_1        0.2770             machinery_1     0.1733
train_1         0.2792             something_1     0.1743
belt_1          0.2853             timetable_1     0.1744

Table 1: Association from a given word set C

Now we can extract a word set C+(k) which consists of the k closest words to the given context C. This extraction is done by the following procedure:
1. Sort all words in our vocabulary V in ascending order of d(w, C).
2. Let C+(k) be the word set which consists of the first k words in the sorted list.
Note that C+(k) may not include all words in C, even if k > |C|. Here we will see some examples of extracting C+(k) from a given context C. When the word set C = {bus, car, railway} is given, our context-sensitive word distance produces the cluster C+(15) shown in Table 1 (left). We can see from the list that our word distance successfully associates related words like motor and passenger in the context of 'vehicle'. On the other hand, from C = {bus, scenery, tour}, the cluster C+(15) shown in Table 1 (right) is obtained. We can see the context 'bus tour' from the list. Note that this list is quite different from that of the former example, though both contexts contain the word bus. When the word set C = {read, paper, magazine} is given, the following cluster C+(12) is obtained (the words are listed in ascending order of the distance): {paper_1, read_1, magazine_1, newspaper_1, print_2, book_1, print_1, wall_1, something_1, article_1, specialist_1, that_1}.
Note that words with different suffix numbers correspond to different headwords (i.e., homographs with different word classes) of the English dictionary LDOCE. For instance, motor_1 is a noun and motor_2 an adjective.
It is obvious that the extracted context is 'education' or 'study'. On the other hand, when C = {read, machine, memory}, the following word set C+(12) is obtained: {machine_1, memory_1, read_1, computer_1, remember_1, someone_1, have_2, that_1, instrument_1, feeling_2, that_2, what_2}. It seems that most of the words are related to 'computer' or 'mind'. These two clusters are quite different, in spite of the fact that both contexts contain the word read.

n:  1       2       3       4       5       6       7       8
ē:  0.3248  0.1838  0.1623  0.1602  0.1635  0.1696  0.1749  0.1801

Fig. 7: Word prediction task (left) and its result (right)

5 Evaluation through word prediction
We evaluate the context-sensitive word distance through predicting words in a text. When one is reading a text (for instance, a novel), he or she often predicts what is going to happen next by using what has happened already. Here we will deal with the following problem: for each sentence in a given text, predict the words in the sentence by using the preceding n sentences. This task is not so difficult for human adults, because a target sentence and the preceding sentences tend to share the same contexts. This means that predictability of the target sentence suggests how successfully we extract information about the context from the preceding sentences. Consider a text as a sequence S1, ..., SN, where Si is the i-th sentence of the text (see Figure 7, left). For a given target sentence Si, let Ci be the set of words in the concatenation of the preceding n sentences: Ci = {Si−n ... Si−1}. Then, the prediction error ei of Si is computed as follows:
1. Sort all the words in our vocabulary V′ in ascending order of d(w, Ci).
2. Compute the average rank ri of the words wij ∈ Si in the sorted list.
3. Let the prediction error ei be the relative average rank ri / |V′|.
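The three steps above can be sketched as follows (illustrative Python; the toy vocabulary and its ranking are assumed, not data from the experiment):

```python
# Sketch of the prediction-error computation for one target sentence,
# given the vocabulary already sorted by d(w, Ci).

def prediction_error(target_words, ranked_vocab):
    """Relative average rank of the target words in the sorted list."""
    rank = {w: i + 1 for i, w in enumerate(ranked_vocab)}
    avg = sum(rank[w] for w in target_words) / len(target_words)
    return avg / len(ranked_vocab)

# toy vocabulary of 10 words, assumed sorted by d(w, Ci)
vocab = ["spring", "menu", "table", "love", "letter",
         "type", "rain", "street", "snow", "iron"]
print(prediction_error(["spring", "menu"], vocab))   # -> 0.15
```

If the ranking were random, target words would sit in the middle of the list on average, giving an expected error of 0.5; values well below 0.5 therefore indicate successful context extraction.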
Note that here we use the vocabulary V′, which consists of 2641 words: we removed the 210 function words from the vocabulary V. Obviously, the prediction is successful when ei → 0. We used O. Henry's short story 'Springtime à la Carte' (Thornley 1960:56-62) for the evaluation. The text consists of 110 sentences (1620 words). We computed the average value ē of the prediction error ei over the target sentences Si (i = n+1, ..., 110). The average prediction error ē for different numbers of preceding sentences (n = 1, ..., 8) is shown in Figure 7 (right). If prediction is random, the expected value of the average prediction error ē is 0.5 (i.e., chance). Our method predicted the succeeding words better than randomly; the best result was observed for n = 4. Without adaptive scaling of the semantic space, simple Euclidean distance resulted in ē = 0.2905 for n = 4; our method is better than this, except for n = 1. When the succeeding words are predicted by using the prior probability of word occurrence, we obtained ē = 0.2291. The prior probability is estimated by word frequency in West's five-million-word corpus (West 1953). Again our result is better than this, except for n = 1.

6 Discussion

6.1 Semantic vectors
A monolingual dictionary describes the denotational meaning of words by using the words defined in it; a dictionary is a self-contained and self-sufficient system of words. Hence, a dictionary contains knowledge useful for natural language processing (Wilks et al. 1989). We represented the meaning of words by semantic vectors generated from the semantic network of the English dictionary LDOCE. While the semantic network ignores the syntactic structures in dictionary definitions, each semantic vector contains at least a part of the meaning of the headword (Kozima & Furugori 1993). Co-occurrence statistics on corpora also provide semantic information for natural language processing. For example, mutual information (Church & Hanks 1990) and n-grams (Brown et al. 1992) can extract semantic relationships between words. We can represent the meaning of words by co-occurrence vectors extracted from corpora. In spite of the sparseness of corpora, each co-occurrence vector contains at least a part of the meaning of the word. Semantic vectors from dictionaries and co-occurrence vectors from corpora would have different semantic information (Niwa & Nitta 1994). The former
displays paradigmatic relationships between words, and the latter syntagmatic relationships between words. We should incorporate both of these complementary knowledge sources into the vector-representation of word meaning.
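The corpus-based alternative can be sketched with pointwise mutual information over a toy corpus (illustrative Python; the corpus and window size are assumptions):

```python
import math
from collections import Counter

# Toy sketch of the corpus-based alternative: pointwise mutual
# information from co-occurrence counts within a small window.

corpus = ("the car has an engine the car has a tire "
          "the bus has an engine the dictionary has a headword").split()

def pmi(x, y, window=2):
    uni = Counter(corpus)
    pairs = Counter()
    for i, w in enumerate(corpus):
        for v in corpus[i + 1: i + 1 + window]:
            pairs[frozenset((w, v))] += 1
    n = len(corpus)
    p_xy = pairs[frozenset((x, y))] / n
    if p_xy == 0:
        return float("-inf")      # never co-occur within the window
    return math.log2(p_xy / ((uni[x] / n) * (uni[y] / n)))

print(pmi("car", "has") > pmi("car", "dictionary"))   # -> True
```

Such co-occurrence scores capture syntagmatic association (words that appear together), whereas the dictionary-derived vectors above capture paradigmatic relatedness (words that can substitute for each other).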
6.2 Word prediction and text structure
In the word prediction task described in Section 5, we observed the best average prediction error ē for n = 4, where n denotes the number of preceding sentences. It is likely that ē will decrease with increasing n, since the more we read of the preceding text, the better we can predict the succeeding text. However, we observed the best result for n = 4. Most studies on text structure assume that a text can be segmented into units that form a text structure (Grosz & Sidner 1986). Scenes in a text are contiguous and non-overlapping units, each of which describes certain objects (characters and properties) in a situation (time, place, and backgrounds). This means that different scenes have different contexts. The reason why n = 4 gives the best prediction lies in the alternation of the scenes in the text. When both a target sentence Si and the preceding sentences Ci are in one scene, prediction of Si from Ci would be successful. Otherwise, the prediction would fail. A psychological experiment (Kozima & Furugori 1994) supports this correlation with the text structure.

7 Conclusion
We proposed a context-sensitive and dynamic measurement of word distance computed by adaptive scaling of a semantic space. In the semantic space, each word in the vocabulary is represented by an m-dimensional Q-vector. Q-vectors are obtained through a principal component analysis on P-vectors. P-vectors are generated by spreading activation on a semantic network which is constructed systematically from the English dictionary (LDOCE). The number of dimensions, m = 281, is determined by minimising the noise remaining in Q-vectors. Given a word set C which specifies a context, each dimension of the Q-vector space is scaled up or down according to the distribution of C in the space. In the semantic space thus transformed, word distance becomes dependent on the context specified by C. An evaluation through predicting words in a text shows that the proposed measurement captures the context of the text well.
The context-sensitive and dynamic word distance proposed here can be applied in many fields of natural language processing, information retrieval, etc. For example, the proposed measurement can be used for word sense disambiguation, in that the extracted context provides a bias for resolving lexical ambiguity. Also, prediction of succeeding words can reduce the computational cost in speech recognition tasks. In future research, we will regard the adaptive scaling method as a model of human memory and attention that enables us to follow the current context, to restrict memory search, and to predict what is going to happen next.
REFERENCES
Brown, Peter F., Vincent J. Della Pietra, Peter V. deSouza, Jenifer C. Lai & Robert L. Mercer. 1992. "Class-Based n-gram Models of Natural Language". Computational Linguistics 18:4.467-479.
Church, Kenneth W. & Patrick Hanks. 1990. "Word Association Norms, Mutual Information, and Lexicography". Computational Linguistics 16:1.22-29.
Grosz, Barbara J. & Candace L. Sidner. 1986. "Attention, Intentions, and the Structure of Discourse". Computational Linguistics 12:3.175-204.
Kozima, Hideki & Teiji Furugori. 1993. "Similarity between Words Computed by Spreading Activation on an English Dictionary". Proceedings of the 6th Conference of the European Chapter of the Association for Computational Linguistics (EACL'93), 232-239. Utrecht, The Netherlands.
Kozima, Hideki & Teiji Furugori. 1994. "Segmenting Narrative Text into Coherent Scenes". Literary and Linguistic Computing 9:1.13-19.
Morris, Jane & Graeme Hirst. 1991. "Lexical Cohesion Computed by Thesaural Relations as an Indicator of the Structure of Text". Computational Linguistics 17:1.21-48.
Niwa, Yoshiki & Yoshihiko Nitta. 1994. "Co-occurrence Vectors from Corpora vs. Distance Vectors from Dictionaries". Proceedings of the 15th International Conference on Computational Linguistics (COLING-94), 304-309. Kyoto, Japan.
Osgood, Charles E. 1952.
"The Nature and Measurement of Meaning". Psychological Bulletin 49:3.197-237.
Schank, Roger C. 1990. Tell Me a Story: A New Look at Real and Artificial Memory. New York: Scribner.
Thornley, G. C. 1960. British and American Short Stories. Harlow: Longman.
West, Michael. 1953. A General Service List of English Words. Harlow: Longman.
Wilks, Yorick, Dan Fass, Cheng-Ming Guo, James McDonald, Tony Plate & Brian Slator. 1989. "A Tractable Machine Dictionary as a Resource for Computational Semantics". Computational Lexicography for Natural Language Processing ed. by Bran Boguraev & Ted Briscoe, 193-228. Harlow: Longman.
Towards a Sublanguage-Based Semantic Clustering Algorithm
M. VICTORIA ARRANZ,1 IAN RADFORD, SOFIA ANANIADOU & JUN-ICHI TSUJII
Centre for Computational Linguistics, UMIST
Abstract
This paper presents the implementation of a tool kit for the extraction of ontological knowledge from relatively small sublanguage-specific corpora. The fundamental idea behind this system, that of knowledge acquisition (KA) as an evolutionary process, is discussed in detail. Special emphasis is given to the modular and interactive approach of the system, which is carried out iteratively.
1 Introduction
Not knowing which knowledge to encode is one of the main reasons for the difficulties faced by current NLP applications. As mentioned by Grishman & Kittredge (1986), many of these language processing problems can fortunately be restricted to the specificities of language usage in a certain knowledge domain. The diversity of language encountered there is considerably smaller, and more systematic in structure and meaning, than that of the whole language. Approaching the extraction of knowledge on a sublanguage basis reduces the amount of knowledge to discover, as well as easing the discovery task. One such case of this sublanguage-based research is, for instance, the work carried out by Grishman & Sterling (1992) on selectional pattern acquisition from sample texts.
However, we should also bear in mind the necessity for systematic methodologies of knowledge acquisition, duly supported by software, as already emphasised by several authors (Grishman et al. 1986; Tsujii et al. 1992). Preparation of domain-specific knowledge for an NLP application still relies heavily on human introspection, due mainly to the non-trivial relationship between the ontological knowledge and the actual language usage. This makes the process complex and very time-consuming.
In addition, while traditional statistical techniques have proven useful for knowledge acquisition from large corpora (Church & Hanks 1989; Brown
1 Sponsored by the Departamento de Educación, Universidades e Investigación of the Basque Government, Spain.
et al. 1991), they still present two main drawbacks: opacity of the process and insufficient data.
The black-box nature of purely statistical processes makes them completely opaque to the human specialist. This causes great difficulty when judging whether intuitively uninterpretable results reflect actual language usage, or are simply errors due to insufficient data. Results therefore have to be either revised to meet the expert's intuition or accepted without revision.
To this problem one should also add the fact that statistical methods usually require very large corpora to obtain reasonable results, which is highly impractical and often unfeasible. This is especially the case if work takes place at a sublanguage level, as large corpora become even more inaccessible.
Following the research initiated in Arranz (1992) and based on the Epsilon system described in Tsujii & Ananiadou (1993), our aim is to discover a systematic methodology for sublanguage-specific semantic KA, applicable to different subject domains and multilingual corpora. The tool kit [Є] being developed at CCL supports the principles of KA as an evolutionary process and from relatively small corpora, making it very practical for current NLP applications. This work represents an iterative and modular approach to statistical language analysis, where the acquired knowledge is stored in a Central Knowledge Base (CKB), which is shared and easy to access and update by all subprocesses in the system.
Bearing these considerations in mind, we selected a highly specific corpus, the Unix manual, of about 100,000 words.
2 Epsilon [Є]: Knowledge acquisition as an evolutionary process
Epsilon's idea of knowledge acquisition as an evolutionary process avoids the above-mentioned problems by achieving the following:
Stepwise acquisition of semantic clusters. Our system acquires knowledge as a result of stepwise refinement, therefore avoiding the opacity derived from the single-shot techniques used by purely statistical methods. After every cycle, the specialist inspects the hypotheses of new pieces of knowledge proposed by the utility programs in [Є].
Design of robust discovery methods. Early stages of the KA process are particularly problematic for statistical programs, because the corpus is still very complex. We aim to reduce this complexity by
initially using more robust techniques (to cope, e.g., with words with a low frequency of occurrence) before applying statistical methods.
Inherent links between acquired knowledge and language usage. Epsilon easily deals with the opacity caused by the non-trivial nature of the mapping between the domain ontology and the language usage. Cases of words which denote several different ontological entities, or conversely, of one entity denoted by different words, are often encountered in actual corpora. [Є] keeps a record of the pseudo-texts produced during the KA process (cf. below), as well as of their relationships with the acquired knowledge, so that the specialist can check and understand why and when certain clusterings take place.
Effective minimum human intervention. As emphasised by Arad (1991) in her quasi-statistical system, human intervention is inevitable. However, in [Є] this intervention remains systematised and is only applied locally, whenever required by the process.
The general idea of knowledge acquisition as an evolutionary process is illustrated in Figure 1 (Tsujii & Ananiadou 1993). Application of utility programs to Text-i and human inspection of the results yield the next version of knowledge (the i-th version), which in turn is the input to the next cycle of KA. This general framework is simplified if the results of text description are text-like objects (pseudo-texts), where the i-th version presents a lesser degree of complexity than the previous pseudo-text.
The pseudo-texts obtained are characterised by the following: they present the same type of data structure as ordinary texts, i.e., an ordered sequence of words. The words contained in these pseudo-texts include both pseudo-words and ordinary words. Such pseudo-words can denote semantic categories to which the actual words belong, words with POS information, single concept names corresponding to multi-word terms, and disambiguated lexical items (as in Zernik 1991).
Also, these pseudo-texts are fully compatible with the existing utility programs, and neither the input data nor the tools themselves require any alteration. Finally, the degree of complexity of the text is approximated in terms of the number of different words and word tokens resulting from the several passes of the programs. Working on lipoprotein literature, Sager (1986) also shows that it is possible to measure quantitative features such as the complexity of the information contained in a sublanguage.
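The replacement step and the complexity measure just described can be sketched as follows; the cluster label and the sample tokens are invented for illustration.

```python
def replace_with_pseudo(tokens, cluster, pseudo_word):
    """Replace every member of a cluster with a single pseudo-word,
    producing the next, simpler pseudo-text."""
    cluster = set(cluster)
    return [pseudo_word if t in cluster else t for t in tokens]

def complexity(tokens):
    """Approximate text complexity as (distinct words, total tokens),
    following the measure described above."""
    return len(set(tokens)), len(tokens)

text = ["copy", "the", "input", "file", "and", "the", "output", "file"]
pseudo = replace_with_pseudo(text, {"input", "output"}, "Semantic-class1/NN")

before, _ = complexity(text)    # 6 distinct words
after, _ = complexity(pseudo)   # 5 distinct words: the pseudo-text is simpler
```

Each pass of the system applies such replacements and re-measures the pseudo-text, so the distinct-word count decreases monotonically over the KA cycles.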
Fig. 1: General scheme of KA as an evolutionary process
3 Knowledge acquisition process
3.1 POS information
Once the Classify subprocess (cf. Section 5) was first put into practice, it was observed that, since no part-of-speech information was provided, great confusion was caused at the replacement stage. A series of illegitimate substitutions were carried out, which resulted in serious incoherence within the generated pseudo-texts.
The input text was then preprocessed with Eric Brill's rule-based POS tagger (Brill 1993). The accuracy of the tagger on the corpus in current use oscillates between 87.89% and 88.64% before any training takes place, and reaches 94.05% with a single pass of training. This is quite impressive, if we take into consideration the specificity and technicality of the text.
After providing the sample text with POS information, the set of candidates for semantically related clusters was much more accurate, and the wrong replacements across mixed syntactic categories ceased to take place. In addition, this corpus annotation allowed us to establish a tag compatibility set, which helped recover part of the incorrectly rejected hypotheses posed for replacement. This tag compatibility set consisted of a group
of lines, each of them containing interchangeable part-of-speech markers. An example of one of these lines looks as follows: JJ JJR JJS VBN.
3.2 Modular configuration
The current version of the system consists of:
1. A Central Knowledge Base (CKB), which stores all the relationships among words and pseudo-words obtained during the KA process.
2. A record of the pseudo-texts created, as well as of the relationships between them, in terms of the replacements or clusterings taking place.
3. A number of separate subprocesses (detailed below) which are involved in each pass of the system.
These subprocesses rely upon the iterative application of simple analysis tools, updating the CKB with the knowledge acquired at each stage. The resulting modular system is simple to maintain and enhance. At present [Є] contains three major processes involved in the KA task: (i) Compound, which generates hypotheses of multi-word expressions; (ii) Classify, which generates semantically-related term clusters; (iii) Replacement, which reduces the complexity of the text by replacing the newly-found pieces of information within the corpus.
4 The Compound subprocess
4.1 Framework
This tool performs the search for those multi-word structures within the text that can be ranked as single ontological entities. This module was built to interact with the other existing module, Classify, and with the CKB, so as to achieve any required exchange or storage of semantic information.
Step 1. The first stage relies on the analysis of the corpus using a simple grammar, based upon pairs of words where the second word is a noun and the first belongs to one of the classes Noun, Gerund or Adjective. Using this grammar we extract descriptions of the structures of potential compound terms. Any single pass can thus only determine two-word compounds, requiring multiple passes if longer compounds are to be found. These potential compounds are then filtered by simply ensuring that they occur in the corpus more than once.
Step 2. The remaining candidates from Step 1 are then prioritised by calculating the mutual information (Church & Hanks 1989) of each pair.
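Steps 1 and 2 can be sketched as below. The tag sets, the toy corpus, and the exact MI normalisation are illustrative assumptions; only the grammar (first word Noun/Gerund/Adjective, second word Noun), the frequency filter, and the use of mutual information come from the text.

```python
import math
from collections import Counter

FIRST = {"NN", "VBG", "JJ"}    # Noun, Gerund, Adjective
SECOND = {"NN"}                # second word must be a noun

def compound_candidates(tagged):
    """tagged: list of (word, tag) pairs.

    Step 1: collect adjacent pairs matching the simple grammar and
    keep those occurring more than once.  Step 2: rank the survivors
    by pointwise mutual information, MI(x, y) = log2(P(x,y) / (P(x)P(y))).
    """
    n = len(tagged)
    unigrams = Counter(tagged)
    pairs = Counter(
        (tagged[i], tagged[i + 1])
        for i in range(n - 1)
        if tagged[i][1] in FIRST and tagged[i + 1][1] in SECOND
    )
    scored = []
    for (x, y), c in pairs.items():
        if c < 2:                       # Step 1 filter: more than one occurrence
            continue
        mi = math.log2((c / n) / ((unigrams[x] / n) * (unigrams[y] / n)))
        scored.append(((x[0], y[0]), mi))
    return sorted(scored, key=lambda item: -item[1])

corpus = [("set", "VB"), ("the", "DT"), ("environment", "NN"), ("variable", "NN"),
          ("then", "RB"), ("print", "VB"), ("the", "DT"),
          ("environment", "NN"), ("variable", "NN")]
candidates = compound_candidates(corpus)
# candidates -> [(('environment', 'variable'), ...)]
```

A single pass over this toy corpus proposes environment variable as a compound candidate, mirroring the example replaced in Step 3 below.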
Step 3. Once the set of compound term candidates has been verified by the human expert, each selected compound is replaced with a single token. At present, this token is a composite which retains all of the original information within the corpus entry. For instance, the compound generated from the nouns environment/NN and variable/NN looks as follows: compound(environment/NN~variable/NN)/NN, where the whole structure maintains the grammatical category NN.
Step 4. Among the potential compounds discovered, only 40% turned out to be positive cases (cf. Section 4.2). This problem was particularly acute in Adjective Noun and Gerund Noun cases, mainly as a result of the difficulty of distinguishing between such general-language and domain-specific syntactic pairs. Due to the low frequency of some of the compounds in the corpus, the resulting MI scores were noisy and led to rather irregular results. The specificity of the compounding candidates was therefore measured against a large corpus of general language (the LOB corpus (Johansson & Hofland 1989)). Using the formula shown in equation (1), we established a specificity coefficient, which indicates how specific a particular word is to the sublanguage.
(1)
Step 5. This is another replacement stage, where the verified compound terms are substituted by compound identifiers, such as Compound67/NN. These identifiers are directly related to the CKB, where a record of the information relating to this token is stored.
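The scan does not preserve equation (1), so the paper's exact coefficient is unrecoverable here. The sketch below uses one formula that is merely consistent with the scale described in Section 4.2 (1.0 for words unique to the sublanguage, negative values for words commoner in general language); it is an assumption, not the paper's equation.

```python
def specificity(word, sub_freq, gen_freq, sub_total, gen_total):
    """Hypothetical specificity coefficient (NOT the paper's equation 1):

        spec(w) = 1 - relfreq_general(w) / relfreq_sublanguage(w)

    = 1.0  when the word never occurs in the general corpus (e.g. LOB),
    < 0.0  when the word is relatively commoner in general language.
    """
    f_sub = sub_freq.get(word, 0) / sub_total
    f_gen = gen_freq.get(word, 0) / gen_total
    if f_sub == 0:
        raise ValueError("word not attested in the sublanguage corpus")
    return 1.0 - f_gen / f_sub

# Toy counts: 'recursive' is Unix-manual-only; 'nice' is everyday English.
sub = {"recursive": 40, "nice": 5}
gen = {"recursive": 0, "nice": 3000}
s_rec = specificity("recursive", sub, gen, sub_total=100_000, gen_total=1_000_000)
s_nice = specificity("nice", sub, gen, sub_total=100_000, gen_total=1_000_000)
```

On these invented counts the coefficient behaves as the text describes: the domain-only word scores 1.0 and the everyday word scores negative.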
4.2 Performance
Regarding the module's performance, the simple grammar in Step 1 succeeds in filtering the roughly 500 hypotheses of multi-word expressions originally produced, reducing them to around 70 candidates. Of these 70, 45 are Noun Noun pairs and the remaining 25 are Adjective Noun or Gerund Noun pairs. As already discussed in Section 4.1, only 40% of the hypotheses belonging to the latter type of compounds were actually correct, whereas the Noun Noun pairs presented 85% positive cases.
By means of the filtering carried out with the LOB corpus, and using a threshold of 0.9 on adjectives, performance improves from a disappointing 40% to a promising 64% for those troublesome cases, and adds up to a global 77.5%, just after the first pass. A value of 1.0 on the specificity scale implies that the word is unique to the sublanguage, while negative values represent a word which is more common in general language than in our subject-domain sample text. It should be pointed out, though, that currently the statistics regarding word frequencies in the LOB corpus do not take POS information into account, making this filtering a rather limited resource. The future application of an annotated general-language text is already being considered, so as to attempt to detect the remaining errors.
The replacement in Step 5 facilitates the storage of the information in the CKB and makes it more accessible to the subprocesses. Once formed, compound identifiers are treated as ordinary words with a particular syntactic label. The results obtained by the compounding module are shown in Figure 2.
Fig. 2: Compounding results (x-axis: iteration number)
5 The Classify subprocess
5.1 Inverse KWIC
This context matching module represents the initial stage in [Є]'s subprocess Classify. Based on the principle that linguistic contexts can provide us with enough information to characterise the properties of words, and to obtain accurate word classifications (Sekine et al. 1992; Tsujii et al. 1992), semantic clusters are extracted by means of the concordance program CIWK (or Inverse KWIC) (Arad 1991). The following is a sample output from CIWK
for a [3 3] parameter (three words preceding and three succeeding):
input/NN ; output/NN ; # name/NN of/IN the/DT $ bar-file/NN using/VBG the/DT
This indicates that both nouns input/NN and output/NN share the same context at least once in the corpus. Once the list of semantic clusters has been finalised, the corpus is updated, with all occurrences of the words within each cluster being replaced by the first word of that cluster. For instance, in the example above, all occurrences of input/NN and output/NN would be replaced by input/NN. For our experiments, a relatively small contextual size parameter has been selected (a [2 2]), so as to obtain a larger set of hypotheses. A list of about 700 semantic classes has been produced with this parameter.
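The inverse-KWIC principle (grouping words that share identical left and right contexts) can be sketched as below. CIWK itself is a separate concordance program, so this is only an assumed reconstruction of its matching idea, with an invented mini-corpus.

```python
from collections import defaultdict

def inverse_kwic(tokens, pre=2, post=2):
    """Map each (left context, right context) pair to the set of words
    seen in it; contexts shared by more than one word yield cluster
    hypotheses, as in the input/output example above."""
    ctx2words = defaultdict(set)
    for i, word in enumerate(tokens):
        left = tuple(tokens[i - pre:i]) if i >= pre else None
        right = tuple(tokens[i + 1:i + 1 + post])
        if left is not None and len(right) == post:
            ctx2words[(left, right)].add(word)
    # keep only contexts shared by at least two distinct words
    return {ctx: ws for ctx, ws in ctx2words.items() if len(ws) > 1}

tokens = ("name of input using the pipe "
          "name of output using the screen").split()
clusters = inverse_kwic(tokens, pre=2, post=2)
```

Here input and output share the [2 2] context ("name of" ... "using the"), so they are hypothesised as one semantic cluster; per the text, a later replacement pass would then rewrite both as input.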
5.2 Evaluation
Among the 700 clusters generated, an interesting number of cases present crucial ontological and contextual features for our KA process. Unfortunately, there is also a significant number of ambiguous clusters which require filtering. Work is currently taking place on this filtering process, and some preliminary results can already be seen in Section 7.2. In spite of the interesting results initially obtained from CIWK, the exact-matching technique this tool is based on is rather inflexible for the semantic clustering task. The semantic classes formed and the actual instances of each class can be seen in Figure 3.
6 Central knowledge base
Although not yet fully implemented, our Central Knowledge Base plays a very important role within the system's framework. Due to Epsilon's modular approach and the open nature of the links between the stored acquired knowledge and the different subprocesses within the system, there is no need to retain newly extracted information in the corpus. Everything is maintained in the CKB by means of referentials, such as Semantic-class18/NN (referring to a cluster resulting from Classify) or Compound67/NN (representing one of the acquired compound expressions). This provides an easy method of updating and improving the knowledge base, as well as an opportunity to add new modules to the whole configuration of the system.
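A minimal sketch of how such referentials might be stored; the class shape and method names are invented, and only the referential format (e.g. Compound67/NN) follows the text.

```python
class CentralKnowledgeBase:
    """Toy CKB: hands out referential tokens (e.g. 'Compound1/NN') and
    stores the acquired knowledge behind them, so subprocesses can share
    results without rewriting the corpus itself."""

    def __init__(self):
        self._entries = {}
        self._counters = {}

    def register(self, kind, payload, tag="NN"):
        # allocate the next referential of this kind and store the payload
        n = self._counters.get(kind, 0) + 1
        self._counters[kind] = n
        ref = f"{kind}{n}/{tag}"
        self._entries[ref] = payload
        return ref

    def lookup(self, ref):
        return self._entries[ref]

ckb = CentralKnowledgeBase()
ref = ckb.register("Compound", ("environment/NN", "variable/NN"))
# ref == 'Compound1/NN'; the compound's parts stay retrievable:
parts = ckb.lookup(ref)
```

The corpus then only carries the referential token, while the CKB keeps the full record, which is what lets new modules be added without touching the pseudo-texts already produced.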
Fig. 3: Semantic clustering results (x-axis: iteration number)
7 Dynamic context matching techniques for semantic clustering disambiguation
7.1 Word sense disambiguation
As mentioned in Section 5.2, a substantial number of ambiguous clusters arise from the use of Classify and are in need of filtering. However, the CIWK algorithm is very inflexible and will only accept candidates sharing exactly matching contexts. In practice we often encounter instances of semantically related words whose contexts vary slightly, for various reasons. On other occasions one might find that differing contexts within the same term, or between different words, represent the different ontologies of such word(s), and therefore need disambiguating. Work on such a filtering module is currently being undertaken, by means of a technique called Dynamic Alignment (Somers 1994).
7.2 Dynamic context matching
This technique allows us to compare the degree of similarity between two words, and it represents a much more flexible approach than the exact-matching technique used in CIWK. Its aim is to discover all potential matches between a given set of individual words, attaching a value to each match according to its level of importance. Then, the set of matches producing the highest total match strength is calculated. The obtained highest
134
ARRANZ, RADFORD, ANANIADOU & TSUJII
score is attributed to the pair of contexts, thus establishing a value for their similarity relation. For each pair of contexts, the best match value is calculated, which results in a correlation matrix. Figure 4 presents an example of the way all possible word matches are discovered for a particular pair of contexts. Given the constraint that the individual matches are not allowed to cross, the maximal set is chosen and its value calculated. The following is the output for the correlation matrix formed by the pair of words discussed/VBN and listed/VBN:
% dynamic discussed/VBN listed/VBN +5 -5 < corpus
Post context length set to 5
Pre context length set to 5
CIWK data read. 9 records found.
[9 x 9 correlation matrix; the numeric layout is not recoverable from the scan]
Fig. 4: Example match between two contexts (partial vs. full matches)
The clustering algorithm used to determine the strongest semantic cluster in the matrix operates in a simple manner. Initially, the pair of contexts with the highest correlation is selected as the core of the cluster. Then, each remaining context is considered in turn, adding to the cluster those
contexts which present a correlation value above a certain threshold with respect to more than half the contexts already in the cluster. This is repeated until no more contexts can be added to the cluster.
Although this process is still being tested, and the required thresholds and parameters are still being set, it has proved to have important advantages over Classify: it is more flexible, and it implicitly solves the ambiguity problem detailed above. The contexts provided contain the necessary ontological knowledge to extract the different senses of the cluster components; e.g., the above matrix yielded two different contextual clusters, showing two different meanings.
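The non-crossing maximum-strength matching described for Dynamic Alignment is, in essence, a weighted longest-common-subsequence computation. The sketch below assumes two match strengths (a full match on word plus tag, a partial match on tag only, cf. the "Partial Match / Full Match" legend of Fig. 4); the actual weights used in the system are not given in the text.

```python
def alignment_score(ctx1, ctx2, full=3, partial=1):
    """Best total strength of a set of non-crossing matches between two
    contexts (lists of (word, tag) pairs), via LCS-style dynamic
    programming.  The weights `full` and `partial` are assumptions."""
    n, m = len(ctx1), len(ctx2)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            w1, t1 = ctx1[i - 1]
            w2, t2 = ctx2[j - 1]
            w = full if (w1, t1) == (w2, t2) else (partial if t1 == t2 else 0)
            # either skip a token on one side, or match the current pair;
            # the left-to-right DP order is what forbids crossing matches
            dp[i][j] = max(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1] + w)
    return dp[n][m]

a = [("the", "DT"), ("topics", "NNS"), ("discussed", "VBN"), ("here", "RB")]
b = [("the", "DT"), ("files", "NNS"), ("listed", "VBN"), ("here", "RB")]
score = alignment_score(a, b)   # two full matches plus two partial matches
```

Computing this score for every pair of contexts yields exactly the kind of correlation matrix shown above, to which the threshold-based clustering step is then applied.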
8 Concluding remarks
This system attempts to avoid the pitfalls faced by purely statistical techniques of knowledge acquisition. To this end, the idea of KA as an evolutionary process is described in detail and applied to the task of sublanguage-specific KA from small corpora. The iterative nature of our system enables statistical measures to be applied, in spite of the relatively small size of our sample text. The interactive framework of our implementation provides a simple way to access and store the acquired ontological knowledge, and it also allows our subprocesses to exchange information so as to obtain the desirable results.
REFERENCES
Arad, Iris. 1991. A Quasi-Statistical Approach to Automatic Generation of Linguistic Knowledge. Ph.D. dissertation, CCL, UMIST, Manchester, U.K.
Arranz, Victoria. 1992. Construction of a Knowledge Domain from a Corpus. M.Sc. dissertation, CCL, UMIST, Manchester, U.K.
Brill, Eric. 1993. A Corpus-Based Approach to Language Learning. Ph.D. dissertation, University of Pennsylvania, Philadelphia.
Brown, Peter F., Stephen A. Della Pietra, Vincent J. Della Pietra & Robert L. Mercer. 1991. "Word-Sense Disambiguation Using Statistical Methods". Proceedings of the 29th Annual Conference of the Association for Computational Linguistics (ACL'91), Berkeley, Calif., 264-270. San Mateo, Calif.: Morgan Kaufmann.
Church, Kenneth W. & Patrick Hanks. 1989. "Word Association Norms, Mutual Information, and Lexicography". Proceedings of the 27th Annual Conference of the Association for Computational Linguistics (ACL'89), Vancouver, Canada, 76-82. San Mateo, Calif.: Morgan Kaufmann.
Grishman, Ralph & Richard Kittredge. 1986. Analysing Language in Restricted Domains: Sublanguage Description and Processing. New Jersey: Lawrence Erlbaum Associates.
Grishman, Ralph & John Sterling. 1992. "Acquisition of Selectional Patterns". Proceedings of the 14th International Conference on Computational Linguistics (COLING'92), Nantes, France, 658-664.
Grishman, Ralph, Lynette Hirschman & Ngo Thanh Nhan. 1986. "Discovery Procedures for Sublanguage Selectional Patterns: Initial Experiments". Computational Linguistics 12:3.205-215.
Johansson, Stig & Knut Hofland. 1989. Frequency Analysis of English Vocabulary and Grammar: Based on the LOB Corpus, vol. 1: Tag Frequencies and Word Frequencies. Oxford: Clarendon Press.
Sager, Naomi. 1986. "Sublanguage: Linguistic Phenomenon, Computational Tool". Analysing Language in Restricted Domains: Sublanguage Description and Processing ed. by Ralph Grishman & Richard Kittredge, 1-17. New Jersey: Lawrence Erlbaum Associates.
Sekine, Satoshi, Jeremy J. Carroll, Sofia Ananiadou & Jun-ichi Tsujii. 1992. "Automatic Learning for Semantic Collocation". Proceedings of the 3rd Conference on Applied Natural Language Processing (ANLP'92), Trento, Italy, 104-110. New Jersey: ACL.
Somers, Harold, Ian McLean & Daniel Jones. 1994. "Experiments in Multilingual Example-Based Generation". Proceedings of the 3rd Conference on the Cognitive Science of Natural Language Processing (CSNLP'94). Dublin, Ireland: Dublin City University.
Tsujii, Jun-ichi & Sofia Ananiadou. 1993. "Epsilon [Є]: Tool Kit for Knowledge Acquisition Based on a Hierarchy of Pseudo-Texts". Proceedings of the Natural Language Processing Pacific Rim Symposium (NLPRS'93), 93-101. Fukuoka, Japan.
Tsujii, Jun-ichi, Sofia Ananiadou, Iris Arad & Satoshi Sekine. 1992. "Linguistic Knowledge Acquisition from Corpora". Proceedings of the International Workshop on Fundamental Research for the Future Generation of Natural Language Processing (FGNLP), 61-81. Manchester, U.K.
Zernik, Uri. 1991. "Train1 vs. Train2: Tagging Word Senses in Corpus". Lexical Acquisition: Exploiting On-Line Resources to Build a Lexicon ed. by Uri Zernik, 91-112. New Jersey: Lawrence Erlbaum Associates.
Customising a Verb Classification to a Sublanguage
ROBERTO BASILI*, MICHELANGELO DELLA ROCCA*, MARIA TERESA PAZIENZA* & PAOLA VELARDI**
* Universita' di Tor Vergata, Roma
** Universita' di Ancona
Abstract
In this paper we study the relationships between a general-purpose, human-coded verb classification, proposed in the WordNet lexical reference system, and a corpus-driven classification model based on context analysis. We describe a context-based classifier that tunes WordNet to specific sublanguages and reduces its over-ambiguity.1
1 Sense disambiguation and sense tuning
The purpose of this study is to define a context-based statistical method to constrain and customise the WordNet type hierarchy according to a specific sublanguage. Our context-based method is expected to tune the initial WordNet categorisation to a given corpus, in order to:
• Reduce the initial ambiguity
• Order each sense according to its relevance in the corpus
• Identify new senses typical of the domain.
These results could be useful for any NLP system lacking human support for word categorisation. The problem that we consider in this paper is strongly related to the problem of word-sense disambiguation. Given a verb and a representative set of its occurrences in a corpus, we wish to determine the subset of its initial senses that may be found in the sublanguage. In some cases, new senses may be found that were not included in the initial classification.
Word-sense disambiguation is a long-standing problem. Recently, several statistically based algorithms have been proposed to automatically disambiguate word senses in sentences, but many of these methods are hopelessly unusable, because they require manual training for each ambiguous word.
1 This paper summarises the results presented at the International Conference on Recent Advances in Natural Language Processing. The interested reader may refer to the RANLP proceedings for additional details on the experiments.
Exceptions are the simulated annealing method proposed in (Cowie et al. 1992) and the context-based method proposed in (Yarowsky 1992). Simulated annealing attempts to select the optimal combination of senses for all the ambiguous words in a sentence S. The source data for disambiguation are the LDOCE dictionary definitions and subject codes associated with each ambiguous word in the sentence S. The basic idea is that word senses that co-occur in a sentence will have more words and subject codes in common in their definitions.
However, in (Basili et al. 1996) we experimentally observed that dictionary sense definitions for verbs might not capture the domain-specific use of a verb. For example, for the verb to obtain in the RSD we found patterns of use like: the algorithm obtains good results for the calculation..., data obtained from the radar..., the procedure obtains useful information by fitting..., etc., while the (Webster's) dictionary definitions for this verb are: (i) to gain possession of: to acquire, (ii) to be widely accepted, neither of which seems to fit the detected patterns. We hence think that the corpus itself, rather than dictionary definitions, should be used to derive disambiguation hints. One such approach is undertaken in (Yarowsky 1992), which inspired our method (Della Rocca 1994).
In this paper our objectives and methods are slightly different from those in (Yarowsky 1992). First, the aim of our verb classifier is to tune an existing verb hierarchy to an application domain, rather than selecting the best category for a word occurring in a context. Second, since in our approach the training is performed on an unbalanced corpus (and on verbs, which notoriously exhibit fuzzier contexts), we introduced local techniques to reduce spurious contexts and improve the reliability of learning.
Third, since we also expect domain-specific senses for a verb, during the classification phase we do not make any initial hypothesis on the subset of categories of a verb. Finally, we consider globally all the contexts in which the verb is encountered in a corpus, and compute a (domain-specific) probability distribution over its expected senses. In the next section the method is described in detail.
2 A context-based classifier
In his experiment, Yarowsky uses the 726 Roget's categories as the initial classification. In our study, we use a more recently conceived, widely available classification system, WordNet.
CATEGORY             #VERBS   #SYNSETS
body (BD)                78         76
change (CH)             287        412
cognition (CO)          200        218
communication (CM)      240        299
competition (CP)         63         73
consumption (CS)         48         41
contact (CT)            209        279
creation (CR)           124        133
emotion (EM)             47         50
perception (PE)          76         80
possession (PS)         122        156
social (SO)             217        240
stative (ST)            162        183

Table 1: Distribution of RSD verbs among the WordNet verb categories
We decided to adopt as the initial classification the 15 semantically distinct categories into which verbs are grouped in WordNet. Table 1 shows the distribution of a sample of 826 RSD verbs among these categories, according to the initial WordNet classification. The average ambiguity of verbs among these categories is 3.5 for our RSD sample. In what follows we describe an algorithm to re-assign verbs to these 15 categories, depending upon their surrounding contexts in corpora. Our aim is to tune the WordNet classification to the specific domain, as well as to capture rather technical verb uses that suggest semantic categories different from those proposed by WordNet. The method works as follows:
1. Select the most typical verbs for each category;
2. Acquire the collective contexts of these verbs and use them as a (distributional) description of each category;
3. Use the distributional descriptions to evaluate the (corpus-dependent) membership of each verb in the different categories.
In step 1 of the algorithm we learn a probabilistic model of the categories from the application corpus. When training is performed on an unbalanced corpus (or on verbs, which are highly ambiguous and have variable contexts), local techniques are needed to reduce the noise of spurious contexts. Hence, rather than training the classifier on all the verbs in the learning corpus, we select only a subset of prototypical verbs for each category. We call these verbs the salient verbs of a category C. We call the typicality Tv(C)
BASILI, DELLA ROCCA, PAZIENZA & VELARDI
CATEGORY            KERNEL VERBS
body (BD)           produce, acquire, emit, generate, cover
change (CH)         calibrate, reduce, increase, measure, coordinate
cognition (CG)      estimate, study, select, compare, plot, identify
communication (CM)  record, count, indicate, investigate, determine
competition (CP)    base, point, level, protect, encounter, deploy
consumption (CS)    sample, provide, supply, base, host, utilise
contact (CT)        function, operate, filter, segment, line, describe
creation (CR)       design, plot, create, generate, program, simulate
emotion (EM)        like, desire, heat, burst, shock, control
motion (MO)         well, flow, track, pulse, assess, rotate
perception (PC)     sense, monitor, display, detect, observe, show
possession (PS)     provide, account, assess, obtain, contribute, derive
social (SO)         experiment, include, manage, implement, test
stative (ST)        consist, correlate, depend, include, involve, exist
weather (WE)        scintillate, radiate, flare
Table 2: Excerpt of kernel verbs in the RSD

of v in C the following ratio:

Tv(C) = Nv,C / Nv    (1)

where Nv is the total number of synsets of a verb v, i.e., all the WordNet synonymy sets including v, and Nv,C is the number of synsets of v that belong to the semantic category C, i.e., synsets indexed with C in WordNet. The synonymy Sv(C) of v in C, i.e., the degree of synonymy shown by verbs other than v in the synsets of the class C in which v appears, is modeled by the following ratio:

Sv(C) = Ov,C / Ov    (2)

where Ov is the number of verbs in the corpus that appear in at least one of the synsets of v, and Ov,C is the number of verbs in the corpus appearing in at least one of the synsets of v that belong to C. Given (1) and (2), the salient verbs v for a category C can be identified by maximising the following function, which we call Score:

Scorev(C) = OAv × Tv(C) × Sv(C)    (3)

where OAv are the absolute occurrences of v in the corpus. The value of Score depends both on the corpus and on WordNet. OAv depends obviously
on the corpus. The typicality, instead, depends only on WordNet. A typical verb for a category C is one that is either unambiguously assigned to C in WordNet, or that has most of its senses (synsets) in C. Finally, the synonymy depends both on WordNet and on the corpus. A verb with a high degree of synonymy in C is one with a high number of synonyms in the corpus, with reference to a specific sense (synset) belonging to C. Salient verbs for C are frequent, typical, and have a high synonymy in C. The kernel of a category, kernel(C), is the set of salient verbs v with a 'high' Scorev(C). To select a kernel, we can either establish a threshold for Scorev(C) or fix the cardinality of kernel(C). We adopted the second choice, because of the relatively small number of verbs found in the medium-sized corpora that we used. Table 2 lists some of the kernel verbs in the RSD.

In step 2 of the algorithm, the collective contexts for each category are acquired. The collective contexts of a category C are acquired around the salient words for each category (see (Yarowsky 1992)), though we collect salient words using a ±10 window around the kernel verbs. Figure 1 plots the ratio of new words per context vs. the number of contexts acquired for each category, in the RSD and the MD. It can be seen that, on average and for both domains, very few new words are detected beyond the threshold of 1000 contexts. This phenomenon is called saturation and is rather typical of sublanguages. However, some of the categories (like weather and emotion in the RSD) have very few kernel verbs.

In step 3, we need to define a function to determine, given the set of contexts K of a verb v, the probability distribution of its senses in the corpus. For a given verb v, and for each category C, we evaluate the following function, which we call Sense(v, C) (equation 4); it accumulates, over the contexts Ki of v, a context-level score (equation 5) computed from the words w within each Ki. In (5), Pr(C) is the (non-uniform) probability of a class C, given by the ratio between the number of collective contexts for C and the total number of collective contexts. A verb v has a high Sense value in a category if:
Fig. 1: New words per context vs. number of contexts in MD and RSD

• it co-occurs 'often' with salient words of a category C;
• it has few contexts related to C, but these are more meaningful than the others, i.e., they include highly salient words for C.

The corpus-dependent distribution of the senses of v among the categories can be analysed through the function Sense. Notice that, during the classification phase 3, the initial WordNet classification of ambiguous verbs is no longer considered (unlike in (Yarowsky 1992)). WordNet is used only during the learning phase, in which the collective contexts are built. Hence, new senses may be detected for some verbs. We need to establish a threshold for Sense(v, C) below which the sense C is considered not relevant in the corpus for the verb v, given all its observed occurrences. Since the values of the Sense function do not have a uniform distribution across categories, we introduce the standard variable:

Nsense(v, C) = (Sense(v, C) − μC) / σC    (6)

where μC and σC are, respectively, the average value and the standard deviation of the Sense function for all the verbs of C.
A verb v is said to belong to the class C if

Nsense(v, C) ≥ Nsense0    (7)

Under the hypothesis of a normal distribution for the values of (6), we experimentally determined that a reasonable choice is

Nsense0 = 1    (8)
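Taken together, the scoring formulas (1)-(3) and the standardised threshold test (6)-(8) amount to the following sketch; the function names and the toy numbers below are ours, not the paper's, and the Sense values are assumed to be already computed:

```python
from statistics import mean, pstdev

def typicality(n_v, n_vc):
    """T_v(C) = N_{v,C} / N_v  (1): fraction of v's synsets indexed with C."""
    return n_vc / n_v

def synonymy(o_v, o_vc):
    """S_v(C) = O_{v,C} / O_v  (2): fraction of v's corpus synonyms in C."""
    return o_vc / o_v

def score(oa_v, n_v, n_vc, o_v, o_vc):
    """Score_v(C) = OA_v * T_v(C) * S_v(C)  (3)."""
    return oa_v * typicality(n_v, n_vc) * synonymy(o_v, o_vc)

def nsense(sense_vc, senses_in_c):
    """Standard variable (6): (Sense(v, C) - mu_C) / sigma_C."""
    return (sense_vc - mean(senses_in_c)) / pstdev(senses_in_c)

def belongs(sense_vc, senses_in_c, nsense0=1.0):
    """Membership condition (7), with the experimental choice Nsense0 = 1 (8)."""
    return nsense(sense_vc, senses_in_c) >= nsense0
```

Kernel selection then simply keeps, for each category, the verbs with the highest Score values (the paper fixes the cardinality of kernel(C) rather than thresholding Score).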
With this threshold, we assign to a category C only those verbs whose Sense value is equal to or higher than μC + σC. In a normal distribution, this threshold eliminates 84% of the classifications. In the next section we discuss and evaluate the experimental results obtained for the two corpora.

3   Discussion of the results
Table 3 shows the sense values that satisfy threshold (7), for an excerpt of randomly selected RSD verbs. The sign "*" indicates the initial WordNet classification. The average ambiguity of our sample of 826 RSD verbs is 2.2, while the initial WordNet ambiguity was 3.5. For 1,235 verbs of the MD, the average ambiguity is 2.1, while the initial was 2.9. We hence obtained a 30-40% reduction of the initial ambiguity. As expected, classes are more appropriate for the domain. Less relevant senses are eliminated (all empty boxes with a "*" in Table 3). New proposed categories are indicated by scores without the "*". The function Sense, defined in the previous section, produces a new, context-dependent distribution of categories. In this section we evaluate and discuss our data numerically. First, we wish to study the commonalities and divergences between WordNet and our classification method. We introduce the following definitions:

A = {(v, C) | Nsense(v, C) ≥ Nsense0}
W = {(v, C) | Scorev(C) > 0}
I = A ∩ W

where A is the set of verbs classified in C according to their context, W is the set of verbs classified in C according to WordNet, and I is the intersection of the two sets. Two performance measures that assume WordNet as an oracle are the recall, defined as |I| / |W|, and the precision, i.e., |I| / |A|.
BD CH CG CM CP CS CT CR MO PC PS SO ST apply 3.9* * * * 1.3* * calculate 1.1* * * change * * cover * * * * * * * * 1.1* gain * 1.38 * * 4.9* occur 3.8* * * operate 1.1* 3.0* * * point 1.0* * * 1.7* * 2.37 record * * 2.8* scan 2.1* 1.1* * * * * * survey 3.4* test * VERBS
Table 3: Sense values for an excerpt of RSD verbs This definition of recall measures the number of initial WordNet senses in agreement with our classifier. Under the perspective of sense tuning, the recall may be seen as measuring the capability of our classifier to reduce the WordNet initial ambiguity, while the percentage of new senses is given by 100% — precision. Domain Recall
RSD (200 verbs) 41%
MD (341 verbs) 40%
Table 4: A comparison between the corpus-driven classification and WordNet

Table 4 summarises recall and precision values for the two domains and shows that the corpus-driven classifications fit the expectations of the WordNet authors, while more than half of the initial senses (59% in RSD, 60% in MD) are pruned out! Furthermore, there are 13% and 18% newly detected categories in the MD and in the RSD, respectively. Of course, it is impossible to evaluate, if not manually, the plausibility of these new classifications. We will return to this problem at the end of this section. A second possible evaluation of the method is a comparison between the classifications of unambiguous verbs. We found that in the large majority of cases there is a concordance between WordNet and our classifier.

Verbs    BD     CH     CG     CM     CP     CS     CT     CR     MO     PC     PS     SO     ST
convoy   -2.53  -3.07  -1.94  -2.98  -3.08   2.08  -2.37   0.41  51.9*  -1.19  -1.68  -2.19  -4.59
flex     -2.50  -4.76  -2.23  -4.42  -3.86  -4.20  -3.94  -3.18  9.14*  -2.60  -1.97  -3.94  -5.51
wake     34.9*   0.21   0.21  -0.98  -1.34   1.70  -0.25  -0.17  -1.03  -0.58  -0.83  -0.08  -1.16

Table 5: Nsense values for three verbs unambiguous in WordNet

Table 5 shows the standard variable (6) values for some unambiguous verbs.
DOMAIN   RSD (140 verbs)   MD (170 verbs)
Recall        91%                85%

Table 6: Recall of the classification of unambiguous verbs

Table 6 globally evaluates the performance of the classifier over unambiguous verbs, for the two domains. We also attempted a global linguistic analysis of our data. We observed that for some verbs the collective contexts acquired may not express their intended meaning (i.e., category) in WordNet. Moreover, technical uses of some verbs are idiosyncratic with respect to their WordNet category. Consider for example the verb to record in the medical domain. This verb is automatically classified in the categories communication and contact. The contact classification is new, that is, it was not included among the WordNet categories for to record. Initially, we examined all the occurrences of this verb (45 sentences) with the purpose of manually evaluating the classification choices of our system. Each of the authors of this paper independently attempted to categorise each occurrence of the verb in the MD corpus as either belonging to the categories proposed by WordNet for to record (communication) or to the new class contact. However, since the WordNet authors provided only very schematic descriptions for each category, each of us used his personal intuition of the definition of each category. The result was a set of almost totally divergent classification choices! During the analysis of the sentences, we observed that the verb to record occurs in the medical domain in rather repetitive contexts, though the similarity of these contexts can only be appreciated through a generalisation process. Specifically, we found two highly recurrent generalised patterns:

A   record(Z,X,Y): subject(Z), object(physiological_state(X)), locative(individual(Y) or body_part(Y)).
(e.g., myelitis spinal cord injury tumors were recorded at the three levels parietal spinal cervical . . . ).

B   record(Z,X,Y): subject(Z), object(abstraction(X)), locative(information(Y)) or time(time_period(Y)).

(e.g., mortality rates were recorded in the study during the first month of life)

Above, the unary functors (e.g., individual, information, . . . ) are WordNet labels. We then attempted to re-classify all the occurrences of the verb as either fitting scheme A or scheme B, regardless of WordNet categories. Table 7 shows a subset of contexts for the verb to record. The symbol "#" indicates an occurrence of the verb.

( In, normal, patients, potentials, of, a, uniform, shape, were, #, during, flaccidity )
( At, cutoff, frequencies )
( Cavernous, electrical, activity, was, #, in, patients, with, erectile, dysfunction )
( Abnormal, findings, of, cavernous, electrical, activity, were, #, in, _, of, the, consecutive, impotent, patients )
( Morbidity, and, mortality, rates, were, #, in, the, first, month, of, life, Juveniles, and, yearlings, rarely )
( seconds, of, EMG, interference, pattern, were, #, at, a, maximum, voluntary, contractions, from, the, biceps )
( interference, pattern, IP, in, studies, were, #, using, a, concentric, needle, electrode, MUAPs, were, recorded )
( During, Hz, stimulation, twitches, #, by, measurement, of, the, ankle, dorsiflexor, group, displayed, increasing )
( Macro-electromyographic, MUAPs, were, #, from, patients, in, studies, MUAP, analysis, revealed )
( myelitis, spinal, cord, injury, tumours, The, SEPs, were, #, at, three, levels, parietal, spinal, cervical )

Table 7: Examples of contexts for the verb to record in MD

Out of 45 sentences, only 5 did not clearly fit one of the two schemes. There was almost no disagreement among the four human classifiers, and, surprisingly enough (but not so much), we found a very strong correspondence between our partition of the set of sentences and that proposed by our context-based classifier. If we name class A contact and class B communication, we found 37 correspondences over 40 sentences. In the three non-correspondent cases the context included physiological states and/or body parts, though not as direct objects or modifiers of the verb. The system hence classified the verb as contact, though we selected scheme B. Somehow, it seems that the context-based classifier categorises a verb as contact not so much because it implies the physical contact of entities, but because the arguments of the verb are physical and are the same as those of truly contact verbs. For the same verb, a similar analysis has been performed on its 170 RSD contexts, and comparable results have been obtained. This experiment suggests that, even if viable (especially but not exclusively for verb investigation), a mere statistical analysis of the surrounding context of a single ambiguous word does not bring sufficient linguistic insight, though it provides a good global domain representation. Verb semantics (although domain specific) is useful to explain and validate most of the acquired evidence.
As an improvement, we plan in the future to integrate the method described in this paper with a more semantically oriented, corpus-based classification method, described in (Basili et al. 1995).
4   Final remarks
It is broadly agreed that most successful implementations of NLP applications are based on lexica. However, ontological and relational structures in general purpose on-line lexica are often inadequate (i.e., redundant and over-ambiguous) at representing the semantics of specific sublanguages. In this paper we presented a context-based method to tune a general purpose on-line lexical reference system, WordNet, to sublanguages. The method was applied to verbs, one of the major sources of sense ambiguity. In order to acquire more statistically stable contextual descriptors, we used as the initial classification the 15 highest-level semantic categories defined in WordNet for verbs. We then used local (corpus-dependent) and global (WordNet-dependent) evidence to learn the collective contexts of each category and to compute the probability distribution of verb senses among the categories. This tuning method proved to be reliable for a lexical category, verbs, for which other statistically-based classifiers proposed in the literature obtained weak results. For two domains, we could eliminate about 60% of the initial WordNet ambiguity and identify 10-20% new senses. Furthermore, we observed that, for some categories, the collective context acquired may be spurious for the intended meaning of the category. A manual analysis revealed that a more semantically-oriented representation of a category context would be greatly helpful in improving the performance of the system and in gaining more linguistically oriented information on category descriptions.

REFERENCES

Basili, Roberto, Maria Teresa Pazienza & Paola Velardi. 1996. "A Context Driven Conceptual Clustering Method for Verb Classification". Corpus Processing for Lexical Acquisition ed. by Branimir Boguraev & James Pustejovsky. Cambridge, Mass.: MIT Press.

Basili, Roberto, Maria Teresa Pazienza & Paola Velardi. Forthcoming. "An Empirical Symbolic Approach to Natural Language Processing". To appear in Artificial Intelligence, vol. 85, August 1996.

Cowie, Jim, J. Guthrie & L. Guthrie. 1992. "Lexical Disambiguation Using Simulated Annealing". Proceedings of the 14th International Conference on Computational Linguistics (COLING-92), 359-365. Nantes, France.
Della Rocca, Michelangelo. 1994. Classificazione automatica dei termini di una lingua basata sulla elaborazione dei contesti [Context-Driven Automatic Classification of Natural Language Terms]. Ph.D. dissertation, Dept. of Electrical Engineering, Tor Vergata University, Rome.

Fellbaum, Christiane, R. Beckwith, D. Gross & G. Miller. 1993. "WordNet: A Lexical Database Organised on Psycholinguistic Principles". Lexical Acquisition: Exploiting On-Line Resources to Build a Lexicon ed. by U. Zernik, 211-232. Hillsdale, New Jersey: Lawrence Erlbaum Associates.

Yarowsky, David. 1992. "Word-Sense Disambiguation Using Statistical Models of Roget's Categories Trained on Large Corpora". Proceedings of the 14th International Conference on Computational Linguistics (COLING-92), 359-365. Nantes, France.
Concept-Driven Search Algorithm Incorporating Semantic Interpretation and Speech Recognition

AKITO NAGAI, YASUSHI ISHIKAWA & KUNIO NAKAJIMA
MITSUBISHI Electric Corporation

Abstract

This paper discusses issues concerning incorporating speech recognition with semantic interpretation based on concepts. In our approach, a concept is a unit of semantic interpretation, and an utterance is regarded as a sequence of concepts with an intention, so as to attain both linguistic robustness and constraints for speech recognition. First, we propose a basic search method for detecting concepts from a phrase lattice by island-driven search, evaluating the linguistic likelihood of concept hypotheses. Second, an improved method to search efficiently for the N-best meaning hypotheses is proposed. Experimental results of speech understanding are also reported.

1   Introduction
A 'spoken language system' for a naive user must have linguistic robustness, because utterances show a large variety of expressions, which are often ill-formed (Ward 1993:49-50; Zue 1994:707-710). How does a language model cover such a variety of sentences? There is a crucial issue closely related to linguistic robustness: how do we exploit linguistic constraints to improve 'speech recognition'? Syntactic constraint contributes to improving speech recognition, but it is not robust, because it limits sentential expressions. Several recent works have tried to solve these linguistic problems by relaxing grammatical constraints or applying the 'partial parsing' technique (Stallard 1992:305-310; Seneff 1992:299-304; Baggia 1993:123-126). This technique is based on the principle that a whole utterance can be analysed with a syntactic grammar even if the utterance is partly ill-formed. It is, however, likely that the partial parser cannot create even a partial tree for an utterance in free phrase order in 'spontaneous speech', and this linguistic feature is normal in Japanese. Thus, one key issue in attaining linguistic robustness is exploiting semantic knowledge to represent relations between phrases by semantic-driven
processing. One of the methods for doing this is to use case frames based on predicative usage. In this approach, a hypothesis explosion, owing to both word-sense ambiguity and many recognised candidates, occurs if only semantic constraint is used without syntactic constraint. Therefore, a framework to evaluate growing meaning hypotheses, based on both syntactic and semantic viewpoints, is indispensable in the process of 'semantic interpretation' from a 'phrase lattice' to a meaning representation. In our previous work (Nagai et al. 1994a, 1994b), we proposed a semantic interpretation method for obtaining both linguistic robustness and constraints for speech recognition. This paper aims to focus on issues concerning the integration of this semantic interpretation and speech recognition, and to evaluate the performance of 'speech understanding'.

2   Semantic interpretation based on concepts
Our approach is based on the idea that a semantic item represented by a partial expression can be a unit of semantic interpretation. We call this unit a concept. We consider that: (1) a concept is represented by phrases which are continuously uttered in a part of a sentence; (2) a sentence is regarded as a sequence of concepts; and (3) a user talks about concepts with an intention. A concept is defined to represent a target task: for example, concepts for the Hotel Reservation task are Date, Stay, Hotel Name, Room Type, Distance, Cost, Meal, etc. The representation is based on a semantic frame. An intention is defined as an attributive type of the meaning frame of a whole utterance. A meaning frame registers an intention that constrains a set of concept frames. The intention types are defined as reservation, change, cancel, WH-inquiry, Y/N-inquiry, and consultation.

2.1   Basic process
Figure 1 illustrates the principle of the proposed method. The total process can be divided into concept detection and meaning hypothesis generation. In detecting concepts, slots are filled by phrase candidates which can be concatenated in the phrase lattice, based on examining the semantic value and a particle. A phrase candidate which has no particle is examined using only its semantic value. This phrase candidate has case-level ambiguity, and each case is hypothesised. In generating meaning hypotheses, the main process consists of two subprocesses. First, an intention type is hypothesised using: (1) key predicates
which relate semantically to each intention; (2) a particle standing for an inquiry; and (3) interrogative adverbs. If a key predicate is not detected, the intention type is guessed using the semantic relation between concepts. Second, concept hypotheses are combined using the meaning frames which are associated with each intention type. All meaning hypotheses for an entire sentence are generated as the meaning frames which have slots filled with concept hypotheses.
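As an illustration, the combination step can be sketched as follows; the frame inventory and slot names are hypothetical examples in the spirit of the Hotel Reservation task, not the paper's actual frame definitions:

```python
# Hypothetical meaning frames: each intention type constrains which
# concept frames may fill the slots of the meaning frame.
MEANING_FRAMES = {
    "reservation": {"Hotel Name", "Date", "Stay", "Room Type"},
    "WH-inquiry": {"Hotel Name", "Date", "Cost"},
}

def generate_meaning_hypothesis(intention, concept_hypotheses):
    """Fill the meaning frame of the hypothesised intention with those
    concept hypotheses its frame admits; the rest are discarded."""
    allowed = MEANING_FRAMES[intention]
    slots = {c: v for c, v in concept_hypotheses.items() if c in allowed}
    return {"intention": intention, "slots": slots}

hyp = generate_meaning_hypothesis(
    "WH-inquiry", {"Hotel Name": "Fuji", "Cost": "?", "Meal": "dinner"})
```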
Fig. 1: Semantic interpretation based on concepts
2.2   Reduction of ambiguity in concept hypotheses
Many senseless meaning hypotheses remain, owing to ambiguity of word sense, of the cases of a phrase, and of the boundaries of concepts. Two methods are used to reduce the ambiguity. First, two existence conditions for a concept are assumed. One is that a concept should have filled slots which are indispensable to the gist of the concept. The other condition is that a concept should occupy a continuous part of a sentence. This assumes that a user talks about a semantic item as a chunk of phrases. Second, the linguistic likelihood of a concept hypothesis is evaluated by a scoring method which considers linguistic dependency between phrases. This method is based on penalising linguistic features instead of using syntactic rules, in order to obtain less rigid syntactic constraints. If a new
concept hypothesis is produced, it is examined on the basis of all penalty rules. The total score of all concept hypotheses is evaluated as the linguistic likelihood of a meaning hypothesis. Some principles for defining penalty rules are shown in Table 1.
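A scorer following this penalty principle might look like the sketch below; the two concrete rules, their weights, and the phrase representation are illustrative, not taken from the paper:

```python
# Each rule inspects a list of phrase descriptions and returns a penalty.
def missing_key_particle(phrases):
    """Penalise phrases that fill a case slot but lack their key particle."""
    return sum(1.0 for p in phrases if p.get("case") and not p.get("particle"))

def semantic_mismatch(phrases):
    """Penalise a concept hypothesis whose phrases carry clashing semantics."""
    sems = {p["sem"] for p in phrases if "sem" in p}
    return 2.0 if len(sems) > 1 else 0.0

PENALTY_RULES = [missing_key_particle, semantic_mismatch]

def linguistic_likelihood(phrases):
    """Total penalty, negated: higher (closer to 0) is more well-formed."""
    return -sum(rule(phrases) for rule in PENALTY_RULES)
```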
Syntactic features:
• Deletion of key particle
• Inversion of attributive case and substantive case
• Adverbial case without predicative case
• Inadequate conjugation of verbs
• Inversion of predicative case and other cases
• Predicative case without other cases

Semantic features:
• Semantic mismatch between phrase candidates
• Abstract noun without modifiers

Table 1: Principles for defining penalty rules

The advantageous features of this semantic interpretation method are considered to be: (1) better coverage of sentential expressions than syntactic rules for a sentence; (2) suppression of a hypothesis explosion, by treating a concept as the target of semantic constraints; and (3) portability of commonly defined concepts to be shared across different tasks.

3   Integrating speech recognition
For integration with speech recognition, we use 'island-driven search' for detecting concept hypotheses (Figure 2).

3.1   Basic process
First, the speech recogniser, based on 'phrase spotting', sends a phrase lattice and pause hypotheses to the semantic interpreter. A concept lattice is then generated from the phrase lattice by the island-driven search. In this process, reliable phrase candidates are selected as seeds for growing concept hypotheses. Each concept hypothesis is extended both forward and backward, considering the existence of gaps, overlaps, and pauses. To select phrase candidates for the extension, several criteria concerning the concatenation of phrase candidates are used, as follows: (1) Gaps and overlaps between phrases are permitted if their length is within the permitted limit. (2) Pauses are permitted between phrases, considering gaps and overlaps, within the permitted limit. (3) Phrases which satisfy the two existence conditions of a concept are connected. (4) Both acoustic and linguistic likelihood are
given to a concept hypothesis whenever it is extended to integrate a phrase candidate. If the likelihoods are worse than their thresholds, the hypothesis is abandoned. Finally, meaning hypotheses for a whole sentence are generated by concatenating concept hypotheses in the concept lattice. This search is performed in a best-first manner. In connecting concept hypotheses, the linguistic likelihood of growing meaning hypotheses is also evaluated, and the existence of gaps, overlaps, and pauses between concept hypotheses is considered within the permitted limit. The linguistic scoring method evaluates growing concept hypotheses and abandons hopeless hypotheses. The total score of acoustic and linguistic likelihood is given as ST = aSL + (1 − a)SA, where ST is the total score, SL is the linguistic score, SA is the acoustic score, and a is the weighting factor.
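The combination is a simple convex mixture of the two likelihoods; a minimal sketch (the default value of the weighting factor is ours, not the paper's):

```python
def total_score(linguistic, acoustic, alpha=0.5):
    """S_T = alpha * S_L + (1 - alpha) * S_A, with weighting factor alpha."""
    return alpha * linguistic + (1.0 - alpha) * acoustic
```

A hypothesis that is strong acoustically but heavily penalised linguistically can thus be overtaken by one that is merely good on both measures.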
Fig. 2: Detecting concept hypotheses
3.2   Speech understanding experiments
Experiments were performed on 50 utterances of one male speaker on the Hotel Reservation task. The uttered sentences were made by 10 subjects instructed to produce conversational sentences with no limitation on sentential expressions. The average number of phrases was 5.8 per sentence. An intra-phrase grammar with a 356-word vocabulary is converted into phrase networks. For the spotting model, the phrase networks are joined to background models which allow all connections of words or phrases (Hanazawa 1995:2137-2140). Speaker-independent phonemic 'hidden Markov models' ('HMM's)
are used. Phrase lattices provided by speech recognition included 'false alarms' from 10 to 30 times the average number of input phrases. The standards for judging an answer correct are: (1) concepts and their boundaries are correctly detected; (2) cases are correctly assigned to phrase candidates; and (3) semantic values are correctly extracted. A best performance of 92% at the first rank was achieved, as shown in Table 2. This shows that the proposed semantic interpretation method is capable of robustly understanding various spoken sentences. Moreover, we see that using the total score improves the performance of speech understanding. This is because totalising both acoustic and linguistic likelihood improves the likelihood of a correct meaning hypothesis which is not always best in both acoustic and linguistic likelihood.

background model   rank 1  ≤ 2  ≤ 3  ≤ 4  ≤ 5
word     A T   82 80 84 82 86 88 86 88
phrase   A T   82 92 84 94 90 92 96

A: ordered with priority to acoustic score. T: total score.

Table 2: Understanding rate (%): 50 utterances of one male

These results, however, leave room for some discussion. First, performance was hardly improved in the case of the word background model, although the total score was used. The reason for this is that the constraints of the linguistic penalty rules were not powerful enough to exclude more false alarms than in the case of the phrase background model. The penalty rules have to be designed in more detail. Second, the errors were mainly caused in the following cases: (1) when the length of gaps exceeded the permitted limit, owing to deletion errors of particles and pauses, causing failure of phrase connection; and (2) when seeds for concept hypotheses were not detected in the seed selection stage. To cope with these errors, (1) speech recognition has to be improved using, for example, context-dependent precise HMMs, and (2) a search strategy considering the seed deletion error is required.
4   Improving search efficiency
In this section, we propose an improved search method which overcomes computational problems arising from seed deletion errors (Nagai 1994:558-563). In searching a phrase lattice, it is very important to perform an efficient search, selecting reliable phrase candidates at as high a rank as possible. But if only reliable candidates are selected in order to limit the search space, correct phrase candidates with lower likelihoods will be missed, just as with seed deletion errors. This compels us to lower the threshold to avoid the deletion error, and, as a result, the computational load suddenly increases. To solve this problem, the improved method quickly generates initial meaning hypotheses which allow the deletion of concepts. Then, these initial meaning hypotheses are repaired by re-searching for missing concepts, using prediction knowledge associated with the initial meaning hypotheses.
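At a high level, the generate-then-repair strategy can be sketched as below; the lattice items and the re-search callback are hypothetical data structures, not the paper's implementation:

```python
def improved_search(phrase_lattice, seed_threshold, research_section):
    """Build an initial hypothesis from reliable candidates only, then
    repair the deletion sections (uncovered spans) by re-searching them."""
    seeds = [p for p in phrase_lattice if p["score"] >= seed_threshold]
    initial = sorted(seeds, key=lambda p: p["start"])
    repaired, last_end = [], 0
    for p in initial:
        if p["start"] > last_end:  # a deletion section: re-search it
            repaired.extend(research_section(last_end, p["start"]))
        repaired.append(p)
        last_end = p["end"]
    return repaired
```

The point of the design is that the expensive full search is confined to the (usually short) deletion sections instead of the whole lattice.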
Fig. 3: Principle of improved search method
4.1   Basic process
The total process is composed of concept lattice generation, initial meaning hypothesis generation, acceptance decision, and the repairing process (Figure 3). To start with, the concept lattice is generated, using only a small number of reliable phrase candidates, by the concept lattice generation module. In this process, the number of concept hypotheses is also reduced to
improve the quality of the concept lattice. Next, the initial meaning hypothesis generation module generates meaning hypotheses which are incomplete as regards coverage of an utterance, but are reliable. Deletion sections are penalised in proportion to their length, because the initial meaning hypotheses should cover an utterance as widely as possible. Then, the acceptance decision module judges whether the initial meaning hypotheses are acceptable or not. Acceptable means that an initial meaning hypothesis satisfies two conditions: (1) it covers a whole utterance fully, and (2) it would not be possible to attain a better meaning hypothesis by re-searching the phrase lattice. This process is illustrated in Figure 4. The best likelihood attainable after repairing hypotheses (set A) can be estimated, since the maximum likelihood in re-searching deletion sections will be less than the seed threshold value.
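The acceptance test can be sketched as follows; the bound relies on the fact, stated above, that anything re-searched in a deletion section scores below the seed threshold, while the per-unit accounting of deletion length is our simplification:

```python
def repaired_upper_bound(score, deletion_length, seed_threshold):
    """Optimistic bound on the total likelihood an incomplete hypothesis
    could reach if every re-searched unit in its deletion sections scored
    just under the seed threshold."""
    return score + deletion_length * seed_threshold

def acceptable(full_coverage_score, incomplete, seed_threshold):
    """Accept a hypothesis covering the whole utterance if no incomplete
    rival could beat it even after repair; `incomplete` is a list of
    (score, deletion_length) pairs."""
    return all(
        full_coverage_score >= repaired_upper_bound(s, d, seed_threshold)
        for s, d in incomplete)
```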
Fig. 4: Acceptance decision

If the hypotheses are not acceptable, the repairing process module re-searches the phrase lattice for concepts in the limited search space of the deletion sections. There is, however, a risk of failing to detect concepts, because both concept hypotheses neighbouring a deletion section are considered not to be reliable. Therefore, additional meaning hypotheses are also generated to be repaired, assuming that such errors occur in either concept. We use a simple method to make these hypotheses: either concept hypothesis of the unreliable two is deleted and replaced with a new concept hypothesis which is re-searched and can fill the deletion. The search space of the re-searching process can be reduced by limiting concepts. Such concepts can be associated with both concept hypotheses and the intention of the initial meaning hypothesis which is already attained. In the case shown in Figure 5, for example, the concepts "Cancel"
CONCEPT-DRIVEN SEMANTIC INTERPRETATION
157
or "Distance" can be abandoned considering a situation where an intention "HOW MUCH" and concepts "Hotel Name", "Room Type", and "Cost" are obtained. As concept prediction knowledge, three kinds of coexistence relations are defined, which concern (1) an intention and a verb, (2) an intention and a concept, and (3) two concepts.
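The pruning illustrated in Figure 5 can be sketched with simple coexistence tables. The table contents and the function name below are invented for illustration and are not the system's actual prediction knowledge:

```python
# Hypothetical coexistence knowledge: pairs that may occur together.
INTENTION_CONCEPT = {
    ("HOW MUCH", "Cost"), ("HOW MUCH", "Hotel Name"), ("HOW MUCH", "Room Type"),
}
CONCEPT_CONCEPT = {
    ("Hotel Name", "Cost"), ("Room Type", "Cost"), ("Hotel Name", "Room Type"),
}

def predictable(candidate, intention, obtained_concepts):
    # Keep a candidate concept for re-searching only if it can coexist
    # with the attained intention and with every concept already obtained.
    if (intention, candidate) not in INTENTION_CONCEPT:
        return False
    return all((a, candidate) in CONCEPT_CONCEPT
               or (candidate, a) in CONCEPT_CONCEPT
               for a in obtained_concepts)
```

Under these toy tables, "Cost" survives the filter for the intention "HOW MUCH", while "Cancel" is abandoned, mirroring the situation described in the text.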
Fig. 5: Prediction of concepts
4.2 Speech understanding experiments
To evaluate search efficiency, an experimental comparison was performed on two search methods: the basic search method mentioned in section 3 and this improved search method. The former searches all phrase candidates after detecting seeds in the stage of generating the concept lattice, while the latter searches limited reliable phrase candidates and re-searches predicted concepts if deletion sections exist. Experimental conditions were similar to those in section 3, but the number of false alarms in the phrase lattice was increased for the purpose of clarifying differences in processing time. The spotting model was the phrase background model. Thirteen types of intention were used. Table 3 shows the results of the baseline method without the re-searching technique, and Table 4 shows the results for the improved search method. 'Seeds' in Table 3 means seeds for concept hypotheses in generating concept lattices, while 'seeds' in Table 4 means reliable phrase candidates for generating initial meaning hypotheses. CPU times were measured on a DEC ALPHA 3600 workstation.
# seeds   rate (%): 1st rank   < 5th   CPU time (s.)
100 88 98 15.6
30 88 96 14.2
20 88 96 16.9
15 90 96 12.3
10 84 90 11.2
5 66 72 6.0
Table 3: Understanding rate and processing time: baseline search method, 50 utterances of one male. Number of false alarms: max. 227, ave. 75

# seeds   rate (%): 1st rank   < 5th   CPU time (s.)   # utterances repaired
30 88 98 1.7 2
20 88 96 1.2 3
15 88 96 3.1 10
10 84 94 3.8 13
5 64 76 3.7 27
Table 4: Understanding rate and processing time: improved search method, 50 utterances of one male

These results show that the proposed search method using the repairing technique achieved a successful reduction in processing time. Moreover, the repairing process effectively kept the understanding rate almost equal to that of the baseline method in cases where deletion errors occurred owing to a small number of seeds. Processing time, however, tends to increase with the number of repetitions of the repairing process. One reason for this is that the constraints of concept prediction were not very powerful in the Hotel Reservation task: relations between concepts and intentions are only slightly exclusive, because most concepts can coexist as parameter values for retrieving the hotel database. If this method is applied to a task where the relations between concepts and intentions are more distinct, for example a task where interrogative adverbs appear frequently, the constraints of the concepts should become stronger. There is ample room for further improvement in the re-search method for repairing initial meaning hypotheses. The present method does not use information concerning the two concept hypotheses neighbouring a deletion section, but only replaces them with re-searched concept hypotheses. Using this information would help reduce the search space in the repairing process. One possible improvement would be to try to extend both concept hypotheses in order to judge whether a better likelihood can be obtained before replacing them.
5 Concluding remarks
We proposed a two-stage semantic interpretation method for robustly understanding spontaneous speech and described its integration with speech recognition. In this approach, the proposed concept has three roles: as a robust interpreter of various partial expressions, as a target of semantic constraints, and as a basic unit for understanding a whole meaning. This semantic interpretation was successfully integrated with speech recognition by island-driven lattice search for generating a concept lattice and exploiting linguistic scoring knowledge. This baseline system achieved good performance with a 92% understanding rate at the first rank. Moreover, we developed an efficient search method which quickly generates initial meaning hypotheses allowing deletion errors of correct concepts, and repairs them by re-searching for missing concepts using prediction knowledge associated with the initial meaning hypotheses. This technique reduced search processing time considerably, to approximately one-tenth, in an experimental comparison with the baseline method. Future enhancements will include: (1) detailed design of general linguistic knowledge for scoring the linguistic likelihood of concepts, (2) evaluation of this semantic interpretation as applied to other tasks using spontaneous speech data from naive speakers, (3) development of an interpretation method for 'complex sentences' (Nagai 1996: Forthcoming), and (4) dealing with 'unknown words'.

REFERENCES

Baggia, Paolo & Claudio Rullent. 1993. "Partial Parsing as Robust Parsing Strategy". Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP'93), Minneapolis, Minn., vol. II, 123-126. New York: The Institute of Electrical and Electronics Engineers (IEEE).

Goodine, David, Eric Brill, James Glass, Christine Pao, Michael Phillips, Joseph Polifroni, Stephanie Seneff & Victor Zue. 1994. "GALAXY: A Human-Language Interface to On-Line Travel Information".
Proceedings of the International Conference on Spoken Language Processing (ICSLP'94), Yokohama, Japan, vol. II, 707-710. Tokyo: The Acoustical Society of Japan.

Hanazawa, Toshiyuki, Yoshiharu Abe & Kunio Nakajima. 1995. "Phrase Spotting using Pitch Pattern Information". Proceedings of the 4th European Conference on Speech Communication and Technology (EUROSPEECH'95), Madrid, Spain, vol. III, 2137-2140. Madrid: Graficas Brens.
Nagai, Akito, Yasushi Ishikawa & Kunio Nakajima. 1994a. "A Semantic Interpretation Based on Detecting Concepts for Spontaneous Speech Understanding". Proceedings of the International Conference on Spoken Language Processing (ICSLP'94), Yokohama, Japan, vol. I, 95-98. Tokyo: The Acoustical Society of Japan.

Nagai, Akito, Yasushi Ishikawa & Kunio Nakajima. 1994b. "Concept-Driven Semantic Interpretation for Robust Spontaneous Speech Understanding". Proceedings of the Fifth Australian International Conference on Speech Science and Technology (SST'94), Perth, W.A., Australia, vol. I, 558-563. Perth: Univ. of Western Australia.

Nagai, Akito, Yasushi Ishikawa & Kunio Nakajima. Forthcoming. "Integration of Concept-Driven Semantic Interpretation with Speech Recognition". To appear in Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP'96), Atlanta, Ga.

Seneff, Stephanie. 1992. "A Relaxation Method for Understanding Spontaneous Speech Utterances". Proceedings of the Defence Advanced Research Projects Agency (DARPA) Speech and Natural Language Workshop, Harriman, N.Y., 299-304. San Mateo, Calif.: Morgan Kaufmann.

Stallard, David & Robert Bobrow. 1992. "Fragment Processing in the DELPHI System". Proceedings of the Defence Advanced Research Projects Agency (DARPA) Speech and Natural Language Workshop, Harriman, N.Y., 305-310. San Mateo, Calif.: Morgan Kaufmann.

Ward, Wayne & Sheryl R. Young. 1993. "Flexible Use of Semantic Constraints in Speech Recognition". Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP'93), Minneapolis, Minn., vol. II, 49-50. New York: The Institute of Electrical and Electronics Engineers (IEEE).
A Proposal for Word Sense Disambiguation Using Conceptual Distance

ENEKO AGIRRE¹ & GERMAN RIGAU²
Euskal Herriko Unibertsitatea & Universitat Politecnica de Catalunya

Abstract

This paper presents a method for the resolution of lexical ambiguity and its automatic evaluation over the Brown Corpus. The method relies on the use of the wide-coverage noun taxonomy of WordNet and the notion of conceptual distance among concepts, captured by a Conceptual Density formula developed for this purpose. This fully automatic method requires no hand coding of lexical entries, hand tagging of text, nor any kind of training process. The results of the experiment have been automatically evaluated against SemCor, the sense-tagged version of the Brown Corpus.

1 Introduction
Word sense disambiguation is a long-standing problem in Computational Linguistics. Much recent work in lexical ambiguity resolution offers the prospect that a disambiguation system might be able to receive as input unrestricted text and tag each word with the most likely sense with fairly reasonable accuracy and efficiency. The most common approach is to attempt to use the context of the word to be disambiguated together with information about each of its word senses to solve this problem. Several interesting experiments in lexical ambiguity resolution have been performed in recent years using preexisting lexical knowledge resources. Cowie et al. (1992) and Guthrie et al. (1993) describe a method for lexical disambiguation of text using the definitions in the machine-readable version of the LDOCE dictionary, as in the method described in Lesk (1986), but using simulated annealing for efficiency reasons. Yarowsky (1992) combines the use of the Grolier encyclopaedia as a training corpus with the categories of Roget's International Thesaurus to create a statistical model for the word sense disambiguation problem, with excellent results. Wilks et al. (1993) perform several interesting statistical disambiguation experiments

1 Eneko Agirre was supported by a grant from the Basque Government.
2 German Rigau was supported by a grant from the Ministerio de Educación y Ciencia.
162
ENEKO AGIRRE & GERMAN RIGAU
using co-occurrence data collected from LDOCE. Sussna (1993), Voorhees (1993) and Richardson et al. (1994) define disambiguation programs based on WordNet with the goal of improving precision and coverage during document indexing. Although each of these techniques looks somewhat promising for disambiguation, they have either been applied only to a small number of words, to a few sentences, or not to a public domain corpus. For this reason we have tried to disambiguate all the nouns from real texts in the public domain sense-tagged version of the Brown Corpus (Francis & Kucera 1967; Miller et al. 1993), also called Semantic Concordance or SemCor for short. We also use a public domain lexical knowledge source, WordNet (Miller 1990). The advantage of this approach is clear, as SemCor provides an appropriate environment for testing our procedures in a fully automatic way. It also defines, for the purpose of this study, word sense as the sense present in WordNet. This paper presents a general automatic decision procedure for lexical ambiguity resolution based on a formula of the conceptual distance among concepts: Conceptual Density. The system needs to know how words are clustered in semantic classes, and how semantic classes are hierarchically organised. For this purpose, we have used a broad semantic taxonomy for English, WordNet. Given a piece of text from the Brown Corpus, our system tries to resolve the lexical ambiguity of nouns by finding the combination of senses from a set of contiguous nouns that maximises the total Conceptual Density among senses. Even if this technique is presented as stand-alone, it is our belief, following the ideas of McRoy (1992), that full-fledged lexical ambiguity resolution should combine several information sources; Conceptual Density might be only one piece of evidence for the plausibility of a certain word sense. Following this introduction, Section 2 presents the semantic knowledge sources used by the system.
Section 3 is devoted to the definition of Conceptual Density. Section 4 shows the disambiguation algorithm used in the experiment. In Section 5, we explain and evaluate the experiment performed. In the last section some conclusions are drawn.

2 WordNet and the semantic concordance
Sense is not a well-defined concept and often has subtle distinctions in topic, register, dialect, collocation, part of speech, etc. For the purpose of this study, we take as the senses of a word those present in WordNet
A PROPOSAL FOR WSD USING CD
163
version 1.4. WordNet is an on-line lexicon based on psycholinguistic theories (Miller 1990). It comprises nouns, verbs, adjectives and adverbs, organised in terms of their meanings around semantic relations, which include, among others, synonymy and antonymy, hypernymy and hyponymy, meronymy and holonymy. Lexicalised concepts, represented as sets of synonyms called synsets, are the basic elements of WordNet. The senses of a word are represented by synsets, one for each word sense. The version used in this work, WordNet 1.4, contains 83,800 words, 63,300 synsets (word senses) and 87,600 links between concepts. The nominal part of WordNet can be viewed as a tangled hierarchy of hypo-/hypernymy relations. Nominal relations also include three kinds of meronymic relations, which can be paraphrased as member-of, made-of and component-part-of. SemCor (Miller et al. 1993) is a corpus where a single part-of-speech tag and a single word sense tag (which corresponds to a WordNet synset) have been included for all open-class words. SemCor is a subset taken from the Brown Corpus (Francis & Kucera 1967) which comprises approximately 250,000 words out of a total of 1 million words. The coverage in WordNet of the senses for open-class words in SemCor reaches 96% according to the authors. The tagging was done manually, and the error rate measured by the authors is around 10% for polysemous words.

3 Conceptual density and word sense disambiguation
A measure of the relatedness among concepts can be a valuable prediction knowledge source for several decisions in Natural Language Processing. For example, the relatedness of a certain word sense to the context allows us to select that sense over the others, and actually disambiguate the word. Relatedness can be measured by a fine-grained conceptual distance (Miller & Teibel 1991) among concepts in a hierarchical semantic net such as WordNet. This measure would make it possible to reliably discover the lexical cohesion of a given set of words in English. Conceptual distance tries to provide a basis for determining closeness in meaning among words, taking as reference a structured hierarchical net. Conceptual distance between two concepts is defined in Rada et al. (1989) as the length of the shortest path that connects the concepts in a hierarchical semantic net. In a similar approach, Sussna (1993) employs the notion of conceptual distance between network nodes in order to improve precision during document indexing. Following these ideas, Agirre et al. (1994)
describe a new conceptual distance formula for the automatic spelling correction problem, and Rigau (1994), using this conceptual distance formula, presents a methodology to enrich dictionary senses with semantic tags extracted from WordNet. The measure of conceptual distance among concepts we are looking for should be sensitive to:
- the length of the shortest path that connects the concepts involved;
- the depth in the hierarchy: concepts in a deeper part of the hierarchy should be ranked closer;
- the density of concepts in the hierarchy: concepts in a dense part of the hierarchy are relatively closer than those in a more sparse region;
- and the measure should be independent of the number of concepts we are measuring.
We have experimented with several formulas that follow the four criteria presented above. Currently, we are working with the Conceptual Density formula, which compares areas of sub-hierarchies.
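The Rada et al. notion of conceptual distance mentioned above (shortest-path length in the semantic net) can be sketched as a breadth-first search over an undirected concept graph. The toy taxonomy below is invented for illustration:

```python
from collections import deque

def conceptual_distance(graph, a, b):
    # Length of the shortest path between concepts a and b in a
    # hierarchical semantic net, following Rada et al.'s definition;
    # edges are treated as undirected.
    seen, frontier = {a}, deque([(a, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if node == b:
            return dist
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                frontier.append((neighbour, dist + 1))
    return None  # no path: the concepts are unrelated in this net

# Toy taxonomy (child -> parents), symmetrised for the search.
edges = {"oak": ["tree"], "pine": ["tree"], "tree": ["plant"], "rose": ["plant"]}
graph = {}
for child, parents in edges.items():
    for p in parents:
        graph.setdefault(child, []).append(p)
        graph.setdefault(p, []).append(child)
```

In this toy net, oak and pine (siblings under tree) are at distance 2, while oak and rose are at distance 3, illustrating the first criterion in the list above; the Conceptual Density formula of the next section refines this path-based measure with depth and density.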
Word to be disambiguated: W
Context words: w1 w2 w3 w4 ...

Fig. 1: Senses of a word in WordNet

As an example of how Conceptual Density can help to disambiguate a word, in Figure 1 the word W has four senses and several context words. Each sense of the words belongs to a sub-hierarchy of WordNet. The dots in the sub-hierarchies represent the senses of either the word to be disambiguated (W) or the words in the context. Conceptual Density will yield the highest density for the sub-hierarchy containing the most of those senses, relative to the total amount of senses in the sub-hierarchy. The sense of W contained in the sub-hierarchy with highest Conceptual Density will be chosen as the
sense disambiguating W in the given context. In Figure 1, sense2 would be chosen.
Given a concept c at the top of a sub-hierarchy, and given nhyp and h (the mean number of hyponyms per node and the height of the sub-hierarchy, respectively), the Conceptual Density for c when its sub-hierarchy contains a number m (marks) of senses of the words to disambiguate is given by the formula below:

    CD(c, m) = ( sum_{i=0}^{m-1} nhyp^i ) / descendants_c        (1)

The numerator expresses the expected area for a sub-hierarchy containing m marks (senses of the words to be disambiguated), while the divisor is the actual area; that is, the formula gives the ratio between weighted marks below c and the number of descendant senses of concept c. In this way, formula 1 captures the relation between the weighted marks in the sub-hierarchy and the total area of the sub-hierarchy below c. The weight given to the marks tries to express that the height and the number of marks should be proportional. nhyp is computed for each concept in WordNet in such a way as to satisfy equation 2, which expresses the relation among height, averaged number of hyponyms of each sense and total number of senses in a sub-hierarchy if it were homogeneous and regular:

    descendants_c = sum_{i=0}^{h-1} nhyp^i                       (2)

Thus, if we had a concept c with a sub-hierarchy of height 5 and 31 descendants, equation 2 will hold that nhyp is 2 for c. Conceptual Density weights the number of senses of the words to be disambiguated in order to make density equal to 1 when the number m of senses below c is equal to the height of the hierarchy h, to make density smaller than 1 if m is smaller than h, and to make density bigger than 1 whenever m is bigger than h. The density can be kept constant for different m's provided a certain proportion between the number of marks m and the height h of the sub-hierarchy is maintained. Both hierarchies A and B in Figure 2, for instance, have Conceptual Density 1.³ In order to tune the Conceptual Density formula, we have made several experiments adding two parameters, α and β.
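Formulas 1 and 2 can be checked numerically. The sketch below is an assumption about how one might implement them (not the authors' code): it solves equation 2 for nhyp by bisection and evaluates formula 1:

```python
def nhyp_for(descendants, h):
    # Solve sum_{i=0}^{h-1} x**i = descendants for x (equation 2)
    # by bisection; x >= 1 for any non-trivial hierarchy.
    lo, hi = 1.0, float(descendants)
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if sum(mid ** i for i in range(h)) < descendants:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def conceptual_density(nhyp, descendants, m):
    # Formula 1: expected area for m marks over the actual area.
    return sum(nhyp ** i for i in range(m)) / descendants
```

For the example in the text (height 5, 31 descendants), bisection recovers nhyp = 2, and with m = h = 5 marks the density is exactly 1, matching the behaviour the paragraph describes.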
The α parameter modifies the strength of the exponential i in the numerator, because h ranges between 1 and 16 (the maximum number of levels in WordNet) while m ranges between 1 and the total number of senses in WordNet. Adding a constant β to nhyp, we tried to discover the role of the averaged number of hyponyms per concept. Formula 3 shows the resulting formula:

    CD(c, m) = ( sum_{i=0}^{m-1} (nhyp + β)^(i^α) ) / descendants_c    (3)

After an extended number of runs which were automatically checked, the results showed that β does not affect the behaviour of the formula, a strong indication that the formula is not sensitive to constant variations in the number of hyponyms. On the contrary, different values of α affect the performance consistently, yielding the best results in those experiments with α near 0.20. The actual formula which was used in the experiments was thus the following:

    CD(c, m) = ( sum_{i=0}^{m-1} nhyp^(i^0.20) ) / descendants_c       (4)

Fig. 2: Two hierarchies with CD

3 From formulas 1 and 2 we have:

4 The disambiguation algorithm using conceptual density
Given a window size, the program moves the window one word at a time from the beginning of the document towards its end, disambiguating in each step the word in the middle of the window and considering the other words in the window as context. The algorithm to disambiguate a given word w in the middle of a window of words W roughly proceeds as follows. First, the algorithm represents in a lattice the nouns present in the window, their senses and hypernyms (step 1). Then, the program computes the Conceptual Density of each concept in WordNet according to the senses it contains in its sub-hierarchy (step 2). It selects the concept c with highest density (step 3) and selects the senses
below it as the correct senses for the respective words (step 4). If a word from W:
- has a single sense under c, it has already been disambiguated;
- has no such sense, it is still ambiguous;
- has more than one such sense, we can eliminate all the other senses of w, but have not yet completely disambiguated w.
The algorithm then proceeds to compute the density for the remaining senses in the lattice, and continues to disambiguate words in W (back to steps 2, 3 and 4). When no further disambiguation is possible, the senses left for w are processed and the result is presented (step 5). To illustrate the process, consider the text in Figure 3, extracted from SemCor.

The jury(2) praised the administration(3) and operation(8) of the Atlanta Police_Department(1), the Fulton_Tax_Commissioner's_Office, the Bellwood and Alpharetta prison_farms(1), Grady_Hospital and the Fulton_Health_Department.

Fig. 3: Sample sentence from SemCor

The underlined words are nouns represented in WordNet, with the number of senses between brackets. The noun to be disambiguated in our example is operation, and a window size of five will be used. Each step goes as follows:
Step 1: Figure 4 partially shows the lattice for the example sentence. As prison_farm appears in a different hierarchy, we do not show it in the figure. The concepts in WordNet are represented as lists of synonyms. Word senses to be disambiguated are shown in bold. Underlined concepts are those selected with highest Conceptual Density. Monosemous nouns have sense number 0.
Step 2: One concept, for instance, has underneath 3 senses to be disambiguated and a sub-hierarchy size of 96, and therefore gets a Conceptual Density of 0.256; meanwhile, another, with 2 senses and a sub-hierarchy size of 86, gets 0.062.
Step 3: The concept with the highest Conceptual Density is selected.
Step 4: In the example, operation_3, police_department_0 and jury_1 are the senses chosen for operation, Police_Department and jury. All the other concepts below the selected concept are marked so that they are no longer selected. Other senses of those words are deleted from the lattice, e.g., jury_2. In the next loop of the algorithm the selected concept will have only one word to disambiguate below it, and therefore its density will be much
lower. At this point the algorithm detects that further disambiguation is not possible, and quits the loop.
Step 5: The algorithm has disambiguated operation_3, police_department_0, jury_1 and prison_farm_0 (because this word is monosemous in WordNet), but the word administration is still ambiguous. The output of the algorithm, thus, will be that the sense for operation in this context, i.e., for this window, is operation_3.

Fig. 4: Partial lattice for the sample sentence (synset chains linking police_department_0, jury_1, operation_3, administration_1 and jury_2 up through concepts such as administrative unit, social group, people and group)

The disambiguation window will move rightwards, and the algorithm will try to disambiguate Police_Department, taking as context administration, operation, prison_farms and whichever noun is first in the next sentence. The disambiguation algorithm has an intermediate outcome between completely disambiguating a word and failing to do so: in some cases the algorithm returns several possible senses for a word. In this experiment we treat these cases as failure to disambiguate.

5 The experiment
We selected one text from SemCor at random: br-a01 from the genre "Press: Reportage". This text is 2079 words long and contains 564 nouns. Of these, 100 were not found in WordNet. Of the 464 nouns in WordNet, 149 are monosemous (32%).
<s>
<wd>jury<sn>[noun.group.0]NN
<wd>administration<sn>[noun.act.0]NN
<wd>operation<sn>[noun.state.0]NN
<wd>Police_Department<sn>[noun.group.0]NN
<wd>prison_farms<mwd>prison_farm<msn>[noun.artifact.0]NN
Fig. 5: SemCor format

jury administration operation Police_Department prison_farm
Fig. 6: Input words

The text plays both the role of input file (without semantic tags) and of (tagged) test file. When it is treated as input file, we throw away all non-noun words, leaving only the lemmas of the nouns present in WordNet. The program does not face syntactic ambiguity, as the disambiguated part-of-speech information is in the input file. Multiple-word entries are also available in the input file, as long as they are present in WordNet. Proper nouns have a similar treatment: we only consider those that can be found in WordNet. Figure 5 shows the way the algorithm would input the example sentence in Figure 3 after stripping non-noun words. After erasing the irrelevant information we get the words shown in Figure 6.⁴ The algorithm then produces a file with sense tags that can be compared automatically with the original file (cf. Figure 5).
Deciding the optimum context size for disambiguating using Conceptual Density is an important issue. One could assume that the more context there is, the better the disambiguation results would be. Our experiment shows that precision⁵ increases for bigger windows, until it reaches window size 15, where it stabilises, starting to decrease for sizes bigger than 25 (cf. Figure 7). Coverage over polysemous nouns behaves similarly, but with a more significant improvement. It tends to reach its maximum over 80%, decreasing for window sizes bigger than 20. Precision is given in terms of polysemous nouns only. The graphs are drawn against the size of the context⁶ that was taken into account when disambiguating.
Figure 7 also shows the guessing baseline, given when selecting senses at random. First, it was calculated analytically using the polysemy counts for

4 Note that we already have the knowledge that police and prison farm are compound nouns, and that the lemma of prison farms is prison farm.
5 Precision is defined as the ratio between correctly disambiguated senses and the total number of answered senses. Coverage is given by the ratio between the total number of answered senses and the total number of senses.
6 Context size is given in terms of nouns.
Fig. 7: Precision and coverage

% w=25    polysemic   overall
Cover.    83.2        88.6
Prec.     47.3        66.4
Recall    39.4        58.8

Table 1: Overall data for the best window size

the file, which gave 30% of precision. This result was checked experimentally by running the algorithm ten times over the file, which confirmed the previous result. We also compare the performance of our algorithm with that of the 'most frequent' heuristic. The frequency counts for each sense were collected using the rest of SemCor, and then applied to the text. While the precision is similar to that of our algorithm, the coverage is nearly 10% worse. All the data for the best window size can be seen in Table 1. The precision and coverage shown in the preceding graph were for polysemous nouns only. If we also include monosemous nouns, precision rises from 47.3% to 66.4%, and coverage increases from 83.2% to 88.6%.
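The overall figures in Table 1 follow arithmetically from the polysemous figures plus the 149 monosemous nouns, which are always answered and always correct. The short check below reproduces them under that assumption:

```python
nouns = 464            # nouns found in WordNet
mono = 149             # monosemous nouns: answered and correct by definition
poly = nouns - mono    # 315 polysemous nouns

answered_poly = 0.832 * poly           # polysemic coverage: 83.2%
correct_poly = 0.473 * answered_poly   # polysemic precision: 47.3%

coverage = 100 * (answered_poly + mono) / nouns
precision = 100 * (correct_poly + mono) / (answered_poly + mono)
recall = 100 * (correct_poly + mono) / nouns

print(round(coverage, 1), round(precision, 1), round(recall, 1))  # 88.6 66.4 58.8
```

The computed 88.6 / 66.4 / 58.8 match the "overall" column of Table 1, confirming that the overall scores are the polysemous scores diluted by the trivially correct monosemous nouns.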
6 Conclusions
The automatic method for the disambiguation of nouns presented in this paper is ready to use in any general domain and on free-running text, given part-of-speech tags. It does not need any training and uses word sense tags from WordNet, an extensively used lexical database. The algorithm is theoretically motivated and founded, and offers a general measure of the
semantic relatedness for any number of nouns in a text. In the experiment, the algorithm disambiguated one text (2079 words long) of SemCor, a subset of the Brown Corpus. The results were obtained by automatically comparing the tags in SemCor with those computed by the algorithm, which allows comparison with other disambiguation methods. The results are promising, considering the difficulty of the task (free-running text, large number of senses per word in WordNet) and the lack of any discourse structure in the texts. More extensive experiments on additional SemCor texts, including among others the use of meronymic links, testing of homograph-level disambiguation and direct comparison with other approaches, are reported in Agirre et al. (1996). This methodology has also been used for disambiguating nominal entries of bilingual MRDs against WordNet (Rigau & Agirre 1995).

Acknowledgements. We wish to thank all the staff of the CRL and especially Jim Cowie, Joe Guthrie, Louise Guthrie and David Farwell. We would also like to thank Ander Murua for mathematical assistance, Xabier Arregi, Jose Mari Arriola, Xabier Artola, Arantxa Diaz de Ilarraza, Kepa Sarasola, and Aitor Soroa from the Computer Science Department of EHU, and Francesc Ribas, Horacio Rodriguez and Alicia Ageno from the Computer Science Department of UPC.

REFERENCES

Agirre, Eneko, Xabier Arregi, Arantza Diaz de Ilarraza & Kepa Sarasola. 1994. "Conceptual Distance and Automatic Spelling Correction". Workshop on Speech Recognition and Handwriting, 1-8. Leeds, U.K.

Agirre, Eneko & German Rigau. 1996. An Experiment in Word Sense Disambiguation of the Brown Corpus Using WordNet. Technical Report (MCCS-96-291). Las Cruces, New Mexico: Computing Research Laboratory, New Mexico State University.

Cowie, Jim, Joe Guthrie & Louise Guthrie. 1992. "Lexical Disambiguation Using Simulated Annealing". Proceedings of the DARPA Workshop on Speech and Natural Language, 238-242.

Francis, Nelson & Henry Kucera.
1982. Frequency Analysis of English Usage: Lexicon and Grammar. Boston, Mass.: Houghton-Mifflin.

Guthrie, Louise, Joe Guthrie & Jim Cowie. 1993. Resolving Lexical Ambiguity. Technical Report (MCCS-93-260). Las Cruces, New Mexico: Computing Research Laboratory, New Mexico State University.
Lesk, Michael. 1986. "Automatic Sense Disambiguation: How to Tell a Pine Cone from an Ice Cream Cone". Proceedings of the 1986 SIGDOC Conference, Association of Computing Machinery, 24-26.

McRoy, Susan W. 1992. "Using Multiple Knowledge Sources for Word Sense Discrimination". Computational Linguistics 18:1.1-30.

Miller, George A. 1990. "Five Papers on WordNet". Special Issue of the International Journal of Lexicography 3:4.

Miller, George A. & Daniel A. Teibel. 1991. "A Proposal for Lexical Disambiguation". Proceedings of the DARPA Workshop on Speech and Natural Language, 395-399.

Miller, George A., Claudia Leacock, Randee Tengi & Ross T. Bunker. 1993. "A Semantic Concordance". Proceedings of the DARPA Workshop on Human Language Technology, 303-308.

Rada, Roy, Hafedh Mili, Ellen Bicknell & Maria Blettner. 1989. "Development and Application of a Metric on Semantic Nets". IEEE Transactions on Systems, Man and Cybernetics 19:1.17-30.

Richardson, Ray, Allan F. Smeaton & John Murphy. 1994. Using WordNet as a Knowledge Base for Measuring Semantic Similarity between Words. Technical Report (CA-1294). Dublin, Ireland: School of Computer Applications, Dublin City University.

Rigau, German. 1995. "An Experiment on Semantic Tagging of Dictionary Definitions". Workshop "The Future of the Dictionary". Uriage-les-Bains, France.

Rigau, German & Eneko Agirre. 1995. "Disambiguating Bilingual Nominal Entries against WordNet". Proceedings of the Computational Lexicon Workshop, 7th European Summer School in Logic, Language and Information, 71-82. Barcelona, Spain.

Sussna, Michael. 1993. "Word Sense Disambiguation for Free-text Indexing Using a Massive Semantic Network". Proceedings of the 2nd International Conference on Information and Knowledge Management, 67-74. Arlington, Virginia, U.S.A.

Voorhees, Ellen. 1993. "Using WordNet to Disambiguate Word Senses for Text Retrieval". Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 171-180.

Wilks, Yorick et al.
1993. "Providing Machine Tractable Dictionary Tools". Semantics and the Lexicon ed. by James Pustejovsky, 341-401. Yarowsky, David. 1992. "Word-Sense Disambiguation Using Statistical Models of Roget's Categories Trained on Large Corpora", Proceedings of the ARPA Workshop on Human Language Technology, 266-271.
An Episodic Memory for Understanding and Learning

OLIVIER FERRET* & BRIGITTE GRAU* **
*LIMSI-CNRS
**IIE-CNAM
Abstract

In this article we examine the incorporation of pragmatic knowledge learning in natural language understanding systems. We argue that this kind of learning can and should be done incrementally. In order to do so we present a model that is able simultaneously to build a case library and to prepare the abstraction of schemata which represent general situations. Learning takes place on the basis of narratives whose representations are collected in an episodic memory.

1 Introduction
Text understanding requires pragmatic knowledge about stereotypical situations. One must go beyond the information given so that inferences can be performed to make explicit the links between utterances. By determining the relations between individual utterances, the global representation of the entire text can be computed. Unless one is dealing with specific domains, it is not reasonable to assume that a system has a priori all the information needed. In most cases texts are made of known and unknown bits and pieces of information. Text analysis is therefore best viewed as a complex process in which understanding and learning take place, and which must improve itself (Schank 1982). Methods of reasoning that are exclusively analytic are no longer sufficient to ensure the understanding of texts, as these typically include new situations. Hence alternatives such as synthetic and analogical reasoning, which use more contextualised knowledge, are also needed. Thus, a memory model dedicated to general knowledge must be extended with an episodic component that organises specific situations, and must be able to take into account the constraints coming from combining the understanding and learning processes. In the domain of learning pragmatic knowledge from texts, the shortcomings of one-dimensional approaches such as Similarity-Based Learning — IPP (Lebowitz 1983) — or Explanation-Based Learning — GENESIS (Mooney & DeJong 1985) — have become apparent and have given way to a multistrategy approach. OCCAM (Pazzani 1988) is an attempt in this
direction as it uses Similarity-Based Learning techniques in order to complete a domain theory for an Explanation-Based Learning process. Despite their differences, all these approaches share the same goal and means: each new causal representation constructed by the system is generalised as soon as possible in order to classify it on the basis of the system's background knowledge. However, learning is not an all-or-nothing process. We follow Vygotsky's (Vygotsky 1962) views on learning, namely, that learning is an incremental process whereby general knowledge is abstracted on the basis of cumulative, successive experiences (in our case, the representations of texts). In this perspective, generalisations should not occur every time a new situation is encountered. Rather, we suggest storing the representations in a buffer, the episodic memory, where abstraction takes place at a later stage. The result of this abstraction process is a graph of schemata, akin to the MOPs introduced by Schank (Schank 1982). Before us, other researchers have made related proposals. Case-Based Reasoning (CBR) systems such as SWALE (Schank & Leake 1989) and AQUA (Ram 1993) have been designed in order to exploit the kind of representations we are talking about. However, these systems start out with a lot of knowledge. They do not model the incremental aspect we are proposing, namely that an abstraction should be performed only when sufficiently reinforced information has been accumulated. Furthermore, the memory structure of these systems is fixed a priori. Thus, the criteria for determining whether a case can be considered as representative cannot be dynamically determined. Despite these shortcomings, CBR systems remain a very good model in the context of learning and must be taken into account when specifying a dynamic episodic memory.

2 Structure of the episodic memory

2.1 Text representation
Before examining the structure of the episodic memory, we will consider the form of its basic component: the text representations. In our case these representations come from short narratives such as the following.

A few years ago, [I was in a department store in Harlem](1) [with a few hundred people around me](2). [I was signing copies of my book "Stride toward Freedom"](3) [which relates the boycott of buses in Montgomery in 1955-56](4). Suddenly, while [I was appending my signature to a page](5), [I felt a pointed thing sinking brutally into my chest](6). [I had just been stabbed with a paper knife by
a woman](7) [who was acknowledged as mad afterwards](8). [I was taken immediately to the Harlem Hospital](9) [where I stayed on a bed during long hours](10) while [many preparations were made](11) [in order to remove the weapon from my body](12).

Revolution Non-Violente by Martin Luther King (based on a French version of the original text)
The texts' underlying meanings are expressed in terms of conceptual graphs (Sowa 1984). The clauses are organised according to the situations mentioned in the texts (see Figure 1). Hence, each of these situations (a dedication meeting in a department store, a murder attempt and a stay in hospital in our example) corresponds to a Thematic Unit (TU).
Fig. 1: The representation of the text about Martin Luther King (propositions 6 and 7, as well as 3 and 5, are each joined into one conceptual graph; this is made possible by the definition graphs associated with the concept types)

A text representation, which we call an episode, is a structured set of TUs which are thematically linked in either one of two ways:

• thematic deviation: this relation means that a part of a situation is elaborated. In our example, the hospital situation is considered to be a deviation from the murder attempt because these two situations are thematically related to Martin Luther King's wound. More precisely, a deviation is attached to one of the graphs of a TU. Here, the Hospital TU is connected to graph (9), expressing that Martin Luther King is taken to the hospital.

• thematic shift: this relation characterises the introduction of a new situation. In our example, there is a thematic shift between the dedication meeting situation and the murder attempt because they are not intrinsically tied together, fortunately for book writers.

Among all the TUs of an episode, at least one has the status of main topic (MT). In the Martin Luther King text, the Murder Attempt TU plays this role. More generally, a main topic is determined by applying heuristics based on the type of the links between the TUs (Grau 1984). TUs have a structure. Depending on the aspect of the situation they describe, graphs are distributed among three slots:
• circumstances (C): states under which the situation occurs;
• description (D): actions which characterise the situation;
• outcomes (O): states resulting from the situation.
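As an illustration only (the original system stores conceptual graphs, and names such as `ThematicUnit` are our own, not the authors'), the three-slot organisation of a TU can be sketched as:

```python
from dataclasses import dataclass, field

@dataclass
class ThematicUnit:
    """A Thematic Unit: the graphs of a situation, split over three slots."""
    circumstances: list = field(default_factory=list)  # states under which the situation occurs
    description: list = field(default_factory=list)    # actions which characterise the situation
    outcomes: list = field(default_factory=list)       # states resulting from the situation

    def is_valid(self) -> bool:
        # A TU is valid only if its description slot is not empty.
        return bool(self.description)

# The Hospital TU of Figure 1: the text gives no circumstances or outcomes,
# so those slots simply stay empty.
hospital = ThematicUnit(description=["taken-to-hospital (9)", "stay-on-bed (10)",
                                     "preparations (11)", "remove-weapon (12)"])
```

A TU with an empty description slot would be rejected as invalid by this check.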
A TU is valid only if its description slot is not empty. Nevertheless, as shown in the example, certain slots may remain empty if the corresponding information is not present in the text. Inside the description slot, graphs may be linked by temporal and causal relations. For example, in the Hospital TU graphically represented in Figure 1, graphs (10) and (11) are causally tied to graph (12). Text representations have so far been built manually. However, preliminary studies show that this analysis could be done automatically without using any particular abstract schemata. A CBR mechanism using both text representations and linguistic clues (such as connectives, temporal markers or other cohesive devices) is under study.

2.2 The episodic memory
The structure of the episodic memory is governed by one major principle: all similar elements are stored in the same structure. As a result, accumulation occurs and implicit generalisations are made by reinforcing the recurrent features of the episodes or the situations. This principle is applied to the episodes and the TUs, and the memory is organised by storing this information accordingly. That is, similar episodes and similar TUs are grouped so as to build aggregated episodes in one case and aggregated TUs in the other. We show an example of the memory in Figure 2. Episode 1 and episode 2, which talk about the same topic, a murder attempt with a knife, have been grouped together in one aggregated episode. In this episode, the TUs that describe more specifically the murder attempt have been gathered in the same aggregated TU. It should be noted that TUs coming from different episodes without being their main topic can still be grouped in the same aggregated TU (see the Scuffle TU or the Speech TU in Figure 2). The principle of aggregation is not applied at the memory scale for smaller elements such as concepts or graphs. Aggregated graphs exist in the memory, but their scope is limited to the slot of the aggregated TU containing them. An aggregated graph gathers only those similar graphs that belong to the same slot of similar TUs coming from different episodes. Similarly, an aggregated concept makes no sense in isolation from the aggregated graph of which it is part; hence, it cannot be found in another graph.
It is in fact the product of a generalisation applied to concepts which resemble each other in the context of graphs which are also considered to be similar. This explains why the accumulation process can be viewed as the first step of a generalisation process.
Fig. 2: The episodic memory

For instance, in the aggregated graph (a) of the description slot below (see Figure 3), Stab has Man for agent, because the type Man is the result of the aggregation of the more specific types Soldier and Young-man. On the other hand, we have no aggregated concept for recipient because the aggregation was unsuccessful for Arm and Stomach. The accumulation process has been designed in such a way as to make apparent the most relevant features of the situations by reinforcing them. This is done by storing similar elements in the same structure and by assigning them a weight. This weight quantifies the degree of recurrence of an element. Figure 3 shows these weights for aggregated graphs and aggregated concepts. These weights characterise the relative importance of aggregated graphs with regard to the aggregated TU, and the relative importance of aggregated concepts with regard to the aggregated graph. This principle of accumulation also holds for the relations between the entities. This is shown in Figure 3 for causal relations in the aggregated graphs. In a description slot, temporal and causal relations coming from different episodes are also aggregated, and similarly for the thematic relations between the TUs of an episode. This example illustrates not only the accumulative dimension of our memory model but also its potential for being a case library. Even though aggregated concepts are generalisations, they still maintain a link to the
[Figure 3 (caption below) displays the Murder Attempt aggregated TU: aggregated graphs such as [Quarrel], [Stab], [Arrest], [Attack], [Stumble], [Hit], [Located], [Wounded] and [Dead], distributed over the Circumstances, Description and Outcomes slots, together with their weighted aggregated concepts and relations.]

Legend: [Stab]: predicate of an aggregated graph. (1.0): weight value. [man]: aggregated concept. (agent): aggregated relation. soldier [1]: a concept, i.e., an instance, occurring in episode 1; it is linked to the aggregated concept above it. (recipient) [1,2]: a relation which occurs in episodes 1 and 2; it is linked to the aggregated relation above it.
Fig. 3: An aggregated TU (the Murder Attempt TU of Figure 2)

concepts from which they have been built². Thus, following the references to the episodes, we know that the agent of the Stab predicate in episode 1 is a Soldier. Hence, a Case-Based Reasoner will be able to use this fact in order to exploit the specific situations stored in the aggregates and improve an automatic comprehension process. Such a reasoner could use the aggregated information and the specific information simultaneously. The former would be used to evaluate the relative importance of a piece of data, and the latter to reason more precisely on the basis of similarities and differences. The multidimensional aspect of this model also has implications for the way of retrieving information from the memory when it is used as a case
² Unlike the aggregated concepts, concepts in texts, i.e., instances, may belong to several graphs and are therefore starting points for roles.
library. Unlike most CBR systems, the library here has a relatively flat structure: similar episodes and similar TUs are simply grouped together. Aggregated episodes can be considered as typical contexts for the aggregated TUs, which are the central elements, but there is no structural means (for instance, a hierarchical structure of relevant features) for searching a case. This operation is achieved in an associative way by a spreading activation mechanism which works on all the different knowledge levels. The interaction between the concepts and the structures of the memory (aggregated episodes, aggregated TUs or schemata) leads to a stabilised activation configuration from which the cases with the highest activation level are selected. This process is akin to what Lange and Dyer (Lange & Dyer 1989) call evidential activation. In our case, the weights upon which the propagation is based are those that characterise an element's relative importance in our memory model. This mechanism presents two major advantages from the search-phase point of view. First of all, no a priori indexing is necessary. This is useful in a learning situation where the setting is not stable. Secondly, a syntactic match is performed at the same time.
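The associative retrieval just described can be caricatured as follows. This is a deliberately minimal sketch of weighted spreading activation, not the authors' actual mechanism (which works over all knowledge levels and iterates to a stabilised configuration); all node names, weights and parameters here are invented for the example:

```python
def spread_activation(links, seeds, decay=0.5, iterations=10):
    """Propagate activation from seed concepts through weighted links.

    links: node -> list of (neighbour, weight); weights stand in for the
    relative-importance weights of the memory model.
    seeds: initially activated nodes (e.g., the concepts found in a new text).
    """
    activation = dict(seeds)
    for _ in range(iterations):
        incoming = {}
        for node, act in activation.items():
            for neighbour, weight in links.get(node, ()):
                incoming[neighbour] = incoming.get(neighbour, 0.0) + decay * weight * act
        # keep the stronger of current and freshly propagated activation
        for node, act in incoming.items():
            activation[node] = max(activation.get(node, 0.0), act)
    return activation

# Toy memory: two aggregated TUs reachable from the concepts of a new text.
links = {"stab": [("TU:murder-attempt", 1.0)],
         "knife": [("TU:murder-attempt", 1.0)],
         "book": [("TU:dedication", 1.0)]}
act = spread_activation(links, {"stab": 1.0, "knife": 1.0})
best = max((n for n in act if n.startswith("TU:")), key=act.get)
```

No indexing is needed: whichever aggregated structure accumulates the most activation from the text's concepts is selected.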
3 Episode matching and memorisation
When the building of the text's underlying meaning representation is completed, one, or possibly several, memorised episodes have been selected by the spreading activation mechanism. They are related either to the text's main situation, the main TU, or to a secondary one. Matching episodes thus amounts to comparing memorised TUs with TUs of the text. In this section we examine under what conditions TUs are similar.

3.1 Similarity of TUs
The relative similarity between two TUs depends on the degree of their slot matching. We proceed in two steps. First we compute two ratios obtained from the number of similar graphs, relative to the number of graphs present in the memorised slot and to the number of graphs in the text slot. Thus, we first evaluate each slot as a whole by comparing these ratios with an interval of thresholds [t1, t2] we have established. When the two ratios are under the lower limit, the similarity is rejected: neither the memorised slot nor the text slot contains a sufficient number of common points with regard to their differences. If one of these two ratios is above
the upper limit, the proportion of common points of one slot or the other is sufficient to consider the slots as highly similar. If both ratios happen to be within the interval, we conclude in favour of a moderate similarity that has to be evaluated by another, more precise method. In this case, we compute a score based on the importance of the graphs inside the slots. This computation is described in detail in the next section. When this score is above another given threshold t3, we conclude that there is a high similarity. Thus, two slots sharing an average number of graphs can be very similar if these graphs are important for this slot. The thresholds are parameters of the system. In the current version, t1 = 0.5, t2 = 0.8 and t3 = 0.7. Finally, two TUs are similar if they correspond to any of the following rules:

R1: highly similar circumstances and moderately similar description;
R2: similar circumstances and similar outcomes, with at least one of the two dimensions highly similar;
R3: moderately similar description and highly similar outcomes;
R4: highly similar description.
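Our reading of this two-step procedure and of rules R1-R4 can be summarised in the following sketch; the function names, and the encoding of "similar" in R2 as "not rejected", are our own interpretation of the text:

```python
T1, T2, T3 = 0.5, 0.8, 0.7  # thresholds of the current version

def slot_similarity(n_similar, n_mem, n_txt, score_fn=None):
    """Classify slot similarity as 'rejected', 'moderate' or 'high'."""
    r_mem = n_similar / n_mem if n_mem else 0.0
    r_txt = n_similar / n_txt if n_txt else 0.0
    if r_mem < T1 and r_txt < T1:
        return "rejected"
    if r_mem > T2 or r_txt > T2:
        return "high"
    # both ratios fall within [t1, t2]: use the finer weight-based score
    score = score_fn() if score_fn else 0.0
    return "high" if score > T3 else "moderate"

def similar_tus(circ, desc, out):
    """Rules R1-R4 over the three slot classifications."""
    if circ == "high" and desc == "moderate":
        return True                                          # R1
    if circ != "rejected" and out != "rejected" and "high" in (circ, out):
        return True                                          # R2
    if desc == "moderate" and out == "high":
        return True                                          # R3
    return desc == "high"                                    # R4
```

For instance, a slot sharing 3 of 5 graphs on each side (ratio 0.6) lands in the interval and is only moderate unless its weight-based score exceeds t3.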
3.2 Similarity of slots and similarity of graphs
The score of a slot is based on the scores of its similar graphs, weighted by their relative importance within the slot. We compute the score of two graphs only when they contain the same predicate and at least one similar concept related by an equivalent causal relation. Two concepts are similar if the most specific abstraction of their types is less than the concept type of the canonical graph. By definition, the graphs we compare are derived from the same canonical graph and, for each relation, their concept types are restrictions of the same type inside this canonical graph. In the comparison of two concepts, if the aggregated one does not exist, the resulting type is the one which abstracts the maximum number of concept occurrences. Thus, the evaluation function for the similarity of two graphs containing the same predicate is the following:

SimGraph(g, g') = Σi wci · SimConcept(ci, c'i) / Σi wci
with SimConcept(ci, c'i) = 1 when the concepts are similar and 0 otherwise, where wci is the weight of the concept ci inside the memorised graph and the ci are the concepts other than the predicate.
Two graphs, g and g', are similar if SimGraph(g, g') > 0. The weight wci is either the weight of the aggregated concept or the sum of the weights of the regrouped occurrences. The following illustrates the computation of the similarity between the graph (a) of the description slot in Figure 3 and the graph of the Martin Luther King text which has the same predicate (it corresponds to clauses 6 and 7):

[Stab] — (agent) → [woman]
       — (recipient) → [chest]
       — (part) → [man]
       — (instrument) → [paper-knife]
       — (manner) → [brutally]

SimGraph = (1.0 · SimConcept(man, woman) + 0.5 · SimConcept(chest, stomach or arm)
            + 1.0 · SimConcept(man, man) + 1.0 · SimConcept(knife, paper-knife)) / 3.5
         = (1.0 + 0.0 + 1.0 + 1.0) / 3.5 ≈ 0.86
We can now define the evaluation function for two identically named slots as follows:

SimSlot(txtslot, memslot) = Σi wpi · SimGraph(txtgi, memgi) / Σi wpi

where wpi is the weight of the aggregated predicate and only pairs of graphs with SimGraph(txtgi, memgi) > 0 are taken into account.
The possible presence of a chronological order between graphs in the description slots does not intervene in the similarity evaluation. We do not want to favour one unfolding of events over another, the various combinations having actually occurred in the original situations. More generally, the way in which the similarity between structures is computed resembles Kolodner and Simpson's (Kolodner & Simpson 1989) method, with the computation of an aggregate match score. There are, however, two big differences: first of all, the similarity is context dependent because the relative importance of any element is always evaluated within the context of the component to which it belongs. Second, this importance can change, since it is represented by the recurrence of the element and not by a hierarchy of features established on a priori grounds. Because situations are not related in the same way, nor with the same level of precision, the structure of episodes may be different even if they deal with the same topic. For instance, a TU may be detailed by another TU in one episode and not in another one. Hence, graphs that could be matched may be found in two different TUs, as we can see in Figure 4. This peculiarity must be taken into account when we compare two slots. We do so by first recognising similar graphs in identically named slots; then we try to find the remaining graphs in the appropriate slots of a possibly
Fig. 4: Matching two different structures (C: circumstances, D: description, O: outcomes; in the memorised TUs, TU2 gives details concerning the circumstances of TU1)

detailed TU. For example, when examining the similarity of the circumstance slots of the text TU and TU1 in Figure 4, the remaining states (g2) are searched for either in the outcomes slot of an associated TU (TU2), or in the resulting states of the actions in its description slot. This process will be applied to the remaining graphs of the text and to those of the memorised TU. The difference of structure is bypassed during the computation of the similarity measure, but it will not be neglected during the aggregation process. In such cases, the aggregation of the first similar graphs will take place while the other similar graphs will be represented in their respective TUs. No strengthening of the structure between the concerned TUs will occur.

3.3 Memorisation of an episode: The aggregation process
The spreading activation process leads to the selection of memorised episodes which are ordered according to their activation level. To decide whether one of these is a good candidate for aggregation with the incoming episode, even if this aggregation is only partial, we have to find similar TUs between them. Episodes can be aggregated only if their principal TUs are similar. If this similarity is rejected, we are brought back to the sole aggregation of TUs and the incoming episode leads to the creation of a new aggregated episode. Otherwise, the process continues in order to decide whether the topic structuring of the studied text is similar to the structuring of the held episode. If similar secondary TUs are found in the same relation network, their links will be reinforced accordingly. This last part of the process is applied even if no match is found at the episode level. The reinforcement of such links means that a more general context than a single TU is recurrent. Whatever level of matching is recognised, TUs are aggregated. In doing so, the graphs of the text TU are memorised according to the slot they belong to and to the result of the similarity process. If new predicates appear,
the corresponding graphs are added to the memorised slot with a weight equal to 1 divided by the number of times the TU has been aggregated. Graphs which contain an existing predicate but whose similarity has been rejected are joined with no strengthening of the predicate. New concepts related to existing causal relations are related to the corresponding aggregated concept. Existing aggregated concepts, which are the abstractions amalgamating the maximum number of occurrences, may be questioned when a new concept is added to a graph. If any of them no longer fulfils this definitional constraint, it is suppressed. Pre-generalisation and reinforcement occur when the graphs are similar. As a result, the weight of the predicate increases. According to the results of the similarity process, aggregated concepts may evolve and become more abstract. The weights of the modified concepts inside the graphs are computed so as to always equal the number of times the concept has been strengthened, divided by the number of the predicate's aggregations. The result of the aggregation of the Stab graph (see 3.2) coming from the Martin Luther King text (episode 5) with the Stab aggregated graph of the Murder Attempt aggregated TU (see Figure 3) is shown below:

[Stab] (1.0) —
  (agent) (1.0) → [human] (1.0); (agent) [1,2,5]: soldier [1], young-man [2], woman [5]
  (recipient) (1.0) → [ ]; (recipient) [1,2,5]: arm (0.33) [1], stomach (0.33) [2], chest (0.33) [5]
  (part) (1.0) → [man] (1.0); (part) [1,2,5]: head-of-state [1], young-man [2], man [5]
  (instrument) (1.0) → [knife] (1.0); (instrument) [1,2,5]: bayonet [1], flick knife [2], paper-knife [5]
4 Conclusion
Natural Language Understanding systems must be conceived in a learning perspective if they are not designed for a specific purpose. Within this approach, we argue that learning is an incremental process based on the memorisation of past experiences. That is why we have focused our work on the elaboration and implementation of an episodic memory that is able to account for progressive generalisations by aggregating similar situations and reinforcing recurrent structures. This memory model also constitutes a case library for analogical reasoning. It is characterised by the two levels of cases it provides. These cases give different sorts of information: on the one hand, specific cases can be used as sources, given the richness coming from the situations they represent. On the other hand, the aggregated cases,
being a more reliable source of knowledge, guide and validate the retrieval and the use of the specific cases. More generally, our approach prepares the induction of schemata and the selection of their general features, a step which is still necessary to stabilise and organise abstract knowledge. This approach also provides a robust model of learning insofar as it allows for weak text understanding. Even misunderstandings resulting from an incomplete domain theory will be compensated for on the basis of the treatment of many texts involving analogous subjects.

REFERENCES

Grau, Brigitte. 1984. "Stalking Coherence in the Topical Jungle". Proceedings of the 5th Generation Computer Systems Conference (FGCS'84). Tokyo, Japan.
Kolodner, Janet L. & R.L. Simpson. 1989. "The MEDIATOR: Analysis of an Early Case-Based Problem Solver". Cognitive Science 13:4.507-549.
Lange, Trent E. & Michael G. Dyer. 1989. "High-level Inferencing in a Connectionist Network". Connection Science 1:2.181-217.
Lebowitz, Michael. 1983. "Generalization from Natural Language Text". Cognitive Science 7.1-40.
Mooney, Raymond & Gerald DeJong. 1985. "Learning Schemata for Natural Language Processing". Proceedings of the 9th International Joint Conference on Artificial Intelligence (IJCAI'85), Los Angeles, 681-687.
Pazzani, Michael J. 1988. "Integrating Explanation-based and Empirical Learning Methods in OCCAM". Third European Working Session on Learning (EWSL'88) ed. by Derek Sleeman, 147-165.
Ram, Ashwin. 1993. "Indexing, Elaboration and Refinement: Incremental Learning of Explanatory Cases". Machine Learning (Special Issue on Case-Based Reasoning) ed. by Janet L. Kolodner, 10:3.201-248.
Schank, Roger C. 1982. Dynamic Memory: A Theory of Reminding and Learning in Computers and People. New York: Cambridge University Press.
Schank, Roger C. & David B. Leake. 1989. "Creativity and Learning in a Case-Based Explainer". Artificial Intelligence (Special Volume on Machine Learning) ed. by Jaime G. Carbonell, 40:1-3.353-385.
Sowa, John F. 1984. Conceptual Structures: Information Processing in Mind and Machine. Reading, Mass.: Addison-Wesley.
Vygotsky, Lev S. 1962. Thought and Language. Cambridge, Mass.: MIT Press.
Ambiguities & Ambiguity Labelling: Towards Ambiguity Data Bases

CHRISTIAN BOITET* & MUTSUKO TOMOKIYO**

*GETA, CLIPS, IMAG (UJF, CNRS & INPG)
**ATR Interpreting Telecommunications
Abstract

This paper has been prepared in the context of the MIDDIM project (ATR-CNRS). It introduces the concept of 'ambiguity labelling', and proposes a precise, text-processor-oriented format for labelling 'pieces' such as dialogues and texts. Several notions concerning ambiguities are made precise, and many examples are given. The ambiguities labelled are meant to be those which state-of-the-art speech analysers are believed not to be able to solve, and which would have to be solved interactively to produce the correct analysis. The proposed labelling has been specified with a view to storing the labelled pieces in a data base, in order to estimate the frequency of various types of ambiguities, the importance of solving them in the envisaged contexts, the scope of disambiguation decisions, and the knowledge needed for disambiguation. A complete example is given. Finally, an equivalent data-base-oriented format is sketched.

1 Introduction
As has been argued in detail in (Boitet 1993; Boitet & Loken-Kim 1993), interactive disambiguation technology must be developed in the context of research towards practical Interpreting Telecommunications systems as well as high-quality multi-target text translation systems. In the case of speech translation, this is because the state of the art in the foreseeable future is such that a black-box approach to spoken language analysis (speech recognition plus linguistic parsing) is likely to give a correct output for no more than 50 to 60% of the utterances ('Viterbi consistency' (Black, Garside & Leech 1993))¹, while users would presumably require an overall success rate of at least 90% to be able to use such systems at all. However, the same spoken language analysers may be able to produce
¹ According to a study by Cohen & Oviatt, the combined success rate is bigger than the product of the individual success rates by about 10% in the middle range. Using a formula such as S2 = S1*S1 + (1-S1)*A with A = 20%, we get:
sets of outputs containing the correct one in about 90% of the cases ('structural consistency' (Black, Garside & Leech 1993))². In the remaining cases, the system would be unable to analyse the input, or no output would be correct. Interactive disambiguation by the users of the interpretation or translation systems is then seen as a practical way to reach the necessary success rate. It must be stressed that interactive disambiguation is not to be used to solve all ambiguities. On the contrary, as many ambiguities as possible should be reduced automatically. The remaining ones should be solved by interaction as far as practically possible. What is left would have to be reduced automatically again, by using preferences and defaults. In other words, this research is complementary to the research in automatic disambiguation. Our stand is simply that, given the best automatic methods currently available, which use syntactic and semantic restrictions, limitations of lexicon and word senses by the generic task at hand, as well as prosodic and pragmatic cues, too many ambiguities will remain after automatic analysis, and the 'best' result will not be the correct one in too many cases. We suppose that the system will use a state-of-the-art language-based speech recogniser and multilevel analyser, producing syntactic, semantic and pragmatic information. We leave open two possibilities:

• an expert system specialised in the task at hand may be available;
• an expert human interpreter/translator may be called for help over the network.

The questions we want to address in this context are the following:

• what kinds of ambiguities (unsolvable by state-of-the-art speech analysers) are there in dialogues and texts to be handled by the envisaged systems?
• what are the possible methods of interactive disambiguation, for each ambiguity type?
• how can a system determine whether it is important or not for the overall communication goal to disambiguate a given ambiguity?
2
SR of 1 component (S1):  40% 45% 50% 55% 60% 65% 70% 75% 80% 85% 90% 95% 100%
SR of combination (S2):  28% 31% 35% 39% 44% 49% 55% 61% 68% 75% 83% 91% 100%
50~60% overall Viterbi consistency corresponds then to 65~75% individual success rate, which is already optimistic. According to the preceding table, this corresponds to a structural consistency of 95% for each component, which seems impossible to attain by strictly automatic means in practical applications involving general users.
AMBIGUITIES & AMBIGUITY LABELLING
• what kind of knowledge is necessary to solve a given ambiguity, or, in other words, whom should the system ask: the user, the interpreter, or the expert system, if any?
• in a given dialogue or document, how far do solutions to ambiguities carry over: to the end of the piece, to a limited distance, or not at all?

In order to answer these questions, it seems necessary to build a data base of ambiguities occurring in the intended contexts. In this report, we are not interested in any specific data base management software, but in the collection of data, that is, in 'ambiguity labelling'. First, we make more precise several notions, such as ambiguous representation, ambiguity, ambiguity kernel, ambiguity type, etc. Second, we specify the attributes and values used for manual labelling, and give a text processor oriented format. Third, we give a complete example of ambiguity labelling of a short dialogue, with comments. Finally, we define a data-base oriented exchange format.

2 A formal view of ambiguities
2.1 Levels and contexts of ambiguities
2.1.1 Three levels of granularity for ambiguity labelling
First, we distinguish three levels of granularity for considering ambiguities. There is an ambiguity at the level of a dialogue (resp. a text) if it can be segmented in at least two different ways into turns (resp. paragraphs). We speak of ambiguity of segmentation into turns or into paragraphs. There is an ambiguity at the level of a turn (resp. a paragraph) if it can be segmented in at least two different ways into utterances (we use the term 'utterance' for dialogues and texts, to stress that the 'units of analysis' are not always sentences, but may be titles, interjections, etc.). We speak of ambiguity of segmentation into utterances. There is an ambiguity at the level of an utterance if it can be analysed in at least two different ways, whereby the analysis is performed in view of translation into one or several languages in the context of a certain generic task. There are various types of utterance-level ambiguities.

Ambiguities of segmentation into paragraphs may occur in written texts, if, for example, there is a separation by a (new-line) character only, without an explicit (paragraph) mark. They are much more frequent and problematic in dialogues.
For example, in ATR's transcriptions of Wizard of Oz interpretation dialogues (Park, Loken-KIM, Mizunashi & Fais 1995), there are an agent (A), a client (C), and an interpreter (I). In many cases, there are two successive turns of I, one in Japanese and one in English. Sometimes, there are even three in a row (ATR-ITL 1994: J-E-J-32, E-J-J-33). If I does not help the system by pressing a button, this ambiguity will force the system to do language identification every time there may be a change of language. There are also cases of two successive turns by C (ATR-ITL 1994: E-27), and even three by A (ATR-ITL 1994: J-52) and I (ATR-ITL 1994: J-E-J-55, E-E-J-80) or four (ATR-ITL 1994: I,E-J-E-J-99). Studying these ambiguities is important for discourse analysis, which assumes a correct analysis in terms of turns. Also, if successive turns in the same language are collapsed, this may add ambiguities of segmentation into utterances, leading in turn to more utterance-level ambiguities.

Ambiguities of segmentation into utterances are very frequent, and most annoying, as we assume that the analysers will work utterance by utterance, even if they have access to the result of processing of the preceding context. There are for instance several examples of "right |? now |? turn left...". Or (Park, Loken-KIM, Mizunashi & Fais 1995:50): "OK |? so go back and is this number three |? right there |? shall I wait here for the bus?".

An utterance may be spoken or written, may be a sentence, a phrase, a sequence of words, syllables, etc. In the usual sense, there is an ambiguity in an utterance if there are at least two ways of understanding it. This, however, does not give us a precise criterion for defining ambiguities, and even less so for labelling them and storing them as objects in a data base.
Because human understanding heavily depends on the context and the communicative situation, it is indeed a very common experience that something is ambiguous for one person and not for another. Hence, we say that an utterance is ambiguous if it has an ambiguous representation in some formal representation system. We return to that later.

2.1.2 Task-derived limitations on utterance-level ambiguities
As far as utterance-level ambiguities are concerned, we will consider only those which we feel should be produced by any state-of-the-art analyser constrained by the task. For instance, we should not consider that "good morning" is ambiguous with "good mourning", in a conference registration task. It could be different in the case of funeral arrangements.
Because the analyser is supposed to be state-of-the-art, "help" should not give rise to the possible meaning "help oneself" in "can I help you". Knowledge of the valencies and semantic restrictions on arguments of the verb "help" should eliminate this possibility. In the same way, "Please state your phone number" should not be deemed ambiguous, as no complete analysis should allow "state" to be a noun, or "phone" to be a verb. That could be different in a context where "state" could be construed as a proper noun, "State", for example in a dialogue where the State Department is involved. However, we should consider as ambiguous such cases as: "Please state (N/V) office phone number" (ATR-ITL 1994:33), where "phone" as a verb could be eliminated on grammatical grounds, but not "state office phone" as a noun, with "number" as a verb in the imperative form. The case would of course be different if the transcription contained prosodic marks, but the point would continue to hold in general.

2.1.3 Necessity to consider utterance-level ambiguities in the context of full utterances
Let us take another example. Consider the utterance:
(1) Do you know where the international telephone services are located?
The underlined fragment has an ambiguity of attachment, because it has two different 'skeleton' representations (Black, Garside & Leech 1993):
[international telephone] services / international [telephone services]
As a title, this sequence presents the same ambiguity. However, it is not enough to consider it in isolation. Take for example:
(2) The international telephone services many countries.
The ambiguity has disappeared! It is indeed frequent that an ambiguity relative to a fragment appears, disappears and reappears as one broadens its context in an utterance. For example, in
(3) The international telephone services many countries have established are very reliable.
the ambiguity has reappeared. From the examples above, we see that, in order to define properly what an ambiguity is, we must consider the fragment within an utterance, and clarify the idea that the fragment is the smallest (within the utterance) in which the ambiguity can be observed.
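The appearing-and-disappearing behaviour of such attachment ambiguities can be checked mechanically by counting parses. A minimal sketch in Python (the toy grammar and category names are our own assumptions, not those of any analyser discussed here):

```python
lexicon = {"international": {"Adj"}, "telephone": {"Nom"}, "services": {"Nom"}}
# Toy binary rules: nominal compounding, and adjectival modification of a nominal
rules = {("Adj", "Nom"): "Nom", ("Nom", "Nom"): "Nom"}

def count_parses(words, goal="Nom"):
    """CKY-style chart that counts distinct parse trees of `words` as `goal`."""
    n = len(words)
    chart = {(i, i + 1): {c: 1 for c in lexicon[w]} for i, w in enumerate(words)}
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            cell = {}
            for k in range(i + 1, j):         # split point
                for (l, r), parent in rules.items():
                    c = chart[(i, k)].get(l, 0) * chart[(k, j)].get(r, 0)
                    if c:
                        cell[parent] = cell.get(parent, 0) + c
            chart[(i, j)] = cell
    return chart[(0, n)].get(goal, 0)

print(count_parses("international telephone services".split()))  # 2
```

With two parse counts for the fragment and one for an unambiguous context, the chart makes the fragment-in-context behaviour observable rather than a matter of intuition.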
2.2 Representation systems
2.2.1 Types of formal representation systems
Classical representation systems are based on lists of binary features, flat or complex attribute structures (property lists), labeled or decorated trees, various types of feature-structures, graphs or networks, and logical formulae. What is an 'ambiguous representation'? This question is not as trivial as it seems, because it is often not clear what we exactly mean by 'the' representation of an utterance. In the case of a classical context-free grammar G, shall we say that a representation of U is any tree T associated to U via G, or that it is the set of all such trees? Usually, linguists say that U has several representations with reference to G. But if we use f-structures with disjunctions, U will always have one (or zero!) associated structure S. Then, we would like to say that S is ambiguous if it contains at least one disjunction. Returning to G, we might then say that 'the' representation of U is a disjunction of trees T. In practice, however, developers prefer to use hybrid data structures to represent utterances. Trees decorated with various types of structures are very popular. For speech and language processing, lattices bearing such trees are also used, which means at least 3 levels at which a representation may be ambiguous.

2.2.2 Computable representations and 'reasonable' analysers
Now, we are still left with two questions:
1. which representation system(s) do we choose?
2. how do we determine the representation or representations of a particular utterance in a specific representation system?
The answer to the first question is a practical one. The representation system(s) must be fine-grained enough to allow the intended operations. For instance, text-to-speech requires less detail than translation. On the other hand, it is counter-productive to make too many distinctions. For example, what is the use of defining a system of 1000 semantic features if no system and no lexicographers may assign them to terms in an efficient and reliable way? There is also a matter of taste and consensus. Although different representation systems may be formally equivalent, researchers and developers have their preferences. Finally, we should prefer representations amenable to efficient computer processing.

As far as the second question is concerned, two aspects should be distinguished. First, the consensus on a representation system goes with a consensus on its semantics. This means that people using a particular representation system should develop guidelines enabling them to decide which representations an utterance should have, at each level, and to create them by hand if challenged to do so. Second, these guidelines should be refined to the point where they may be used to specify and implement a parser producing all and only the intended representations for any utterance in the intended domain of discourse.

A 'computable' representation system is a representation system for which a 'reasonable' parser can be developed. A 'reasonable' parser is a parser such that:
• its size and time complexity are tractable over the class of intended utterances;
• if it is not yet completed, assumptions about its ultimate capabilities, especially about its disambiguation capabilities, are realistic given the state of the art.

Suppose, then, that we have defined a computable representation. We may not have the resources to build an adequate parser for it, or the one we have built may not yet be adequate. In that case, given the fact that we are specifying what the parser should and could produce, we may anticipate and say that an utterance presents an ambiguity of such and such types. This only means that we expect that an adequate parser will produce an ambiguous representation for the utterance at the considered level.

2.2.3 Expectations for a system of manual labelling
Our manual labelling should be such that:
• it is compatible with the representation systems used by the actual or intended analysers.
• it is clear and simple enough for linguists to do the labelling in a reliable way and in a reasonable amount of time.
Representation systems may concern one or several levels of linguistic analysis. We will hence say that an utterance is phonetically ambiguous if it has an ambiguous phonetic representation, or if the phonetic part of its description in a 'multilevel' representation system is ambiguous, and so forth for all the levels of linguistic analysis, from phonetic to orthographic, morphological, morphosyntactic, syntagmatic, functional, logical, semantic, and pragmatic.
In the labelling, we should only be concerned with the final result of analysis, not in any intermediate stage, because we want to retain only ambiguities which would remain unsolved after the complete automatic analysis process has been performed.

2.3 Ambiguous representations
A representation will be said to be ambiguous if it is multiple or underspecified.

2.3.1 Proper representations
In all known representation systems, it is possible to define 'proper representations', extracted from the usual representations, and ambiguity-free. For example, if we represent "We read books" by the unique decorated dependency tree:

[["We"    ((lex "I-Pro") (cat pronoun) (person 1) (number plur) ...)]
 "read"   ((lex "read-V") (cat verb) (person 1) (number plur) (tense ({pres past})) ...)
 ["books" ((lex "book-N") (cat noun) ...)]]
there would be 2 proper representations, one with (tense pres), and the other with (tense past). For defining the proper representations of a representation system, it is necessary to specify which disjunctions are exclusive, and which are inclusive.

Proper and multiple representations. A representation in a formal representation system is proper if it contains no exclusive disjunction. The set of proper representations associated to a representation R is obtained by expanding all exclusive disjunctions of R (and eliminating duplicates); it is denoted here by Proper(R). R is multiple if |Proper(R)| > 1, that is, if (and only if) it is not proper.
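The expansion of exclusive disjunctions into Proper(R) can be sketched as follows (a minimal sketch under our own assumption of a flat attribute-value encoding in which a Python set marks an exclusive disjunction):

```python
from itertools import product

def proper(rep):
    """Expand every exclusive disjunction of rep and return Proper(rep)."""
    keys = list(rep)
    choices = [sorted(rep[k]) if isinstance(rep[k], set) else [rep[k]]
               for k in keys]
    return [dict(zip(keys, combo)) for combo in product(*choices)]

# The decoration of "read" above, with the exclusive disjunction {pres past}
R = {"lex": "read-V", "cat": "verb", "person": 1, "number": "plur",
     "tense": {"pres", "past"}}
expansions = proper(R)
print(len(expansions))  # 2: one proper representation per tense value
```

Since |Proper(R)| = 2 > 1, this R is multiple, hence ambiguous in the sense just defined.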
2.3.2 Underspecified representations
A proper representation P is underspecified if it is undefined with respect to some necessary information.
There are two cases: the information may be specified, but its value is unknown, or it is missing altogether. The first case often happens in the case of anaphors: (ref ?), or in the case where some information has not been exactly computed, e.g. (task_domain ?), (decade.of.month ?), but is necessary for translating into at least one of the considered target languages. It is quite natural to consider this as ambiguous. For example, an anaphoric reference should be said to be ambiguous
• if several possible referents appear in the representation, which will give rise to several proper representations,
• and also if the referent is simply marked as unknown, which causes no disjunction.
The second case may never occur in representations such as Ariane-G5 decorated trees, where all attributes are always present in each decoration. But, in a standard f-structure, there is no way to force the presence of an attribute, so that a necessary attribute may be missing: then, (ref ?) is equivalent to the absence of the attribute ref. For any formal representation system, then, we must specify what the 'necessary information' is. Contrary to what is needed for defining Proper(R), this may vary with the intended application.

2.3.3 Ambiguous representations
Our final definition is now simple to state. A representation R is ambiguous if it is multiple or if Proper(R) contains an underspecified P.

2.4 Scope, occurrence, kernel and type of ambiguity
2.4.1 Informal presentation
Although we have said that ambiguities have to be considered in the context of the utterances, it is clear that a sequence like "international telephone services" is ambiguous in the same way in utterances (1) and (3) above. We will call this an 'ambiguity kernel', and reserve the term 'ambiguities' for what we will label, that is, occurrences of ambiguities. The distinction is the same as that between dictionary words and text words. It is also clear that another sequence, such as "important business addresses", would present the same sort of ambiguity in analogous contexts. This we want to define as 'ambiguity type'. In this case, linguists speak of
'ambiguity of attachment', or 'structural ambiguity'. Other types concern the acceptions (word senses), the functions (syntactic or semantic), etc. Our list will be given with the specification of the labelling conventions. Ambiguity patterns are more specific kinds of ambiguity types, usable to trigger disambiguation actions, such as the production of a certain kind of disambiguating dialogue. For example, there may be various patterns of structural ambiguities.

2.4.2 Scope of an ambiguity
We take it for granted that, for each considered representation system, we know how to define, for each fragment V of an utterance U having a proper representation P, the part of P which represents V. For example, given a context-free grammar and an associated tree structure P for U, the part of P representing a substring V of U is the smallest sub-tree Q containing all leaves corresponding to V. Q is not necessarily the whole subtree of P rooted at the root of Q. Conversely, for each part Q of P, we suppose that we know how to define the fragment V of U represented by Q.

a. Scope of an ambiguity of underspecification
Let P be a proper representation of U. Q is a minimal underspecified part of P if it does not contain any strictly smaller underspecified part Q'. Let P be a proper representation of U and Q be a minimal underspecified part of P. The scope of the ambiguity of underspecification exhibited by Q is the fragment V represented by Q. In the case of an anaphoric element, Q will presumably correspond to one word or term V. In the case of an indeterminacy of semantic relation (deep case), e.g. on some argument of a predicate, Q would correspond to a whole phrase V.

b. Scope of an ambiguity of multiplicity
A fragment V presents an ambiguity of multiplicity n (n ≥ 2) in an utterance U if it has n different proper representations which are part of n or more proper representations of U. V is an ambiguity scope if it is minimal relative to that ambiguity. This means that any strictly smaller fragment W of U will have strictly less than n associated subrepresentations (at least two of the representations of V are equal with respect to W).
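The 'smallest sub-tree Q containing all leaves corresponding to V' can be computed by a simple descent. A sketch under our own encoding (trees as (label, children) pairs, leaves carrying word positions); note that it returns the whole subtree rooted at Q, whereas, as said above, the part representing V need not be horizontally complete:

```python
def leaf_set(t):
    """Word positions covered by node t = (label, children) or (label, position)."""
    label, kids = t
    return {kids} if isinstance(kids, int) else set().union(*map(leaf_set, kids))

def smallest_subtree(t, target):
    """Deepest node whose leaves include every position in `target`."""
    label, kids = t
    if not isinstance(kids, int):
        for k in kids:
            if target <= leaf_set(k):      # descend while one child still covers V
                return smallest_subtree(k, target)
    return t

# Toy constituent tree for "the international telephone services are located"
tree = ("S",
        [("NP", [("Det", 0),
                 ("Nom", [("Adj", 1), ("Nom", [("N", 2), ("N", 3)])])]),
         ("VP", [("V", 4), ("V", 5)])])
q = smallest_subtree(tree, {1, 2, 3})  # scope "international telephone services"
print(q[0])  # Nom
```

A discontinuous fragment such as positions {0, 4} is forced all the way up to the root, which is why scopes are "usually, but not necessarily" connected.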
In example (1) above, then, the fragment "the international telephone services", together with the two skeleton representations
the [international telephone] services / the international [telephone services]
is not minimal, because it and its two representations can be reduced to the subfragment "international telephone services" and its two representations (which are minimal). This leads us to consider that, in syntactic trees, the representation of a fragment is not necessarily a 'horizontally complete' subtree (diagram on the right).

Fig. 1: [figure: a complete subtree (left) and a non-'horizontally complete' subtree (right) representing a fragment]

In the case above, for example, we might have the configurations given in the figure below. In the first pair (constituent structures), "international telephone services" is represented by a complete subtree. In the second pair (dependency structures), the representing subtrees are not complete subtrees of the whole tree.
2.4.3 Occurrence and kernel of an ambiguity

a. Ambiguity (occurrence)
An ambiguity occurrence, or simply ambiguity, A of multiplicity n (n ≥ 2) relative to a representation system R, may be formally defined as A = (U, V, (P1, P2 ... Pm), (p1, p2 ... pn)), where m ≥ n and:
• U is a complete utterance, called the context of the ambiguity.
• V is a fragment of U, usually, but not necessarily, connected: the scope of the ambiguity.
• P1, P2 ... Pm are all the proper representations of U in R, and p1, p2 ... pn are the parts of them which represent V.
• For any fragment W of U strictly contained in V, if q1, q2 ... qn are the parts of p1, p2 ... pn corresponding to W, there is at least one pair qi, qj (i ≠ j) such that qi = qj.
This may be illustrated by the following diagram, where we take the representations to be tree structures represented by triangles (see Figure 2). Here, P2 and P3 have the same part p2 representing V, so that m > n.
Fig. 2: [figure: the proper representations P1 ... Pm of U as triangles, with the parts p1 ... pn representing the scope V]

b. Ambiguity kernel
The kernel of an ambiguity A = (U, V, (P1, P2 ... Pm), (p1, p2 ... pn)) is the scope of A together with its (proper) representations: K(A) = (V, (p1, p2 ... pn)). In a data base, it will be enough to store only the kernels, and references to the kernels from the utterances.
2.4.4 Ambiguity type and ambiguity pattern

a. Ambiguity type
The type of A is the way in which the pi differ, and must be defined relative to each particular R. If the representations are complex, the difference between two representations is defined recursively. For example, two decorated trees may differ in their geometry or not. If not, at least two corresponding nodes must differ in their decorations. Further refinements can be made only with respect to the intended interpretation of the representations. For example, anaphoric references and syntactic functions may be coded by the same formal kind of attribute-value pairs, but linguists usually consider them as different ambiguity types. When we define ambiguity types, the linguistic intuition should be the main factor to consider, because it is the basis for any disambiguation method. For example, syntactic dependencies may be coded geometrically in one representation system, and with features in another, but disambiguating questions should be the same.

b. Ambiguity pattern
An ambiguity pattern is a schema with variables which can be instantiated to a (usually unbounded) set of ambiguity kernels. Here is an ambiguity pattern of multiplicity 2 corresponding to the example above:
NP[ x1 NP[ x2 x3 ] ] , NP[ NP[ x1 x2 ] x3 ]
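Instantiating such a pattern against a candidate representation can be sketched as follows (our own minimal encoding, not any analyser's: trees and patterns as nested tuples, strings beginning with 'x' as pattern variables):

```python
def match(pattern, tree, env=None):
    """Return a variable binding if `tree` instantiates `pattern`, else None."""
    env = dict(env or {})
    if isinstance(pattern, str) and pattern.startswith("x"):
        if pattern in env:                       # variable already bound
            return env if env[pattern] == tree else None
        env[pattern] = tree
        return env
    if isinstance(pattern, tuple) and isinstance(tree, tuple) \
            and pattern[0] == tree[0] and len(pattern) == len(tree):
        for p, t in zip(pattern[1:], tree[1:]):  # match children in order
            env = match(p, t, env)
            if env is None:
                return None
        return env
    return None

p1 = ("NP", "x1", ("NP", "x2", "x3"))   # NP[ x1 NP[ x2 x3 ] ]
p2 = ("NP", ("NP", "x1", "x2"), "x3")   # NP[ NP[ x1 x2 ] x3 ]
t = ("NP", "international", ("NP", "telephone", "services"))
print(match(p1, t))  # bindings for x1, x2, x3
print(match(p2, t))  # None: t has the other attachment
```

A kernel whose two representations instantiate p1 and p2 respectively would trigger whatever disambiguating dialogue is associated with this pattern.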
We don't elaborate, as ambiguity patterns are specific to a particular representation system and a particular analyser.

3 Attributes and values used in manual labelling
The proposed text processor oriented format for ambiguity labelling is a first version, resulting from several attempts by the second author to label transcriptions of spoken and multimodal dialogues. We describe this format with the help of a classical context-free grammar, written in the font used here for our examples, and insert comments and explanations in the usual font.
3.1 Top level (piece)
::= |
::=
::= 'LABELLED TEXT:'
::=
::= '"' '"'
::= <paragraph> [<parag_sep> <paragraph>]*
<paragraph> ::= [ ]*
::= '||?'
::=
::= 'LABELLED DIALOGUE:'
::=
::= [ ]*
::= [ ]*
::= <speaker_code> ':'
This means that the labelling begins by listing the text or the transcription of the dialogue, thereby indicating segmentation problems with the mark "||?".

3.2 Paragraph or turn level
3.2.1 Structure of the list and associated separators
The labelling continues with the next level of granularity, paragraphs or turns. The difference is that a turn begins with a speaker's code.

::= +
::= <parag_text> | 'PARAG' <parag_text> ['/PARAG']
<parag_text> ::= [ ]*
The mark PARAG must be used if there is more than one utterance. /PARAG is optional and should be inserted to close the list of utterances, that is if the next paragraph contains only one utterance and does not begin with PARAG. This kind of convention is inspired by SGML, and it might actually be a good idea in the future to write down this grammar in the SGML format.
::= [ ]*
::= '|?'
::= +
::= | 'TURN' ['/TURN']
We use the same convention for TURN and /TURN as for PARAG and /PARAG.
::= <speaker_code> ':' <parag_text>

3.2.2 Representation of ambiguities of segmentation
If there is an ambiguity of segmentation in paragraphs or turns, there may be more labelled paragraphs or turns than in the source. For example, A ||? B ||? C may give rise to A-B||C and A||B-C, and not to A-B-C and A||B||C. Which combinations are possible should be determined by the person doing the labelling. The same remark applies to utterances. Take one of the examples given at the beginning of this paper: OK |? so go back and is this number three |? right there |? shall I wait here for the bus?
This is an A |? B |? C |? D pattern, giving rise to 10 utterance possibilities. If the labeller considers only the 4 possibilities A|B|C-D, A|B|C|D, A|B-C|D, and A-B-C|D, the following 7 utterances will be labelled:

A      OK
A-B-C  OK so go back and is this number three right there
B      so go back and is this number three
B-C    so go back and is this number three right there
C      right there
C-D    right there shall I wait here for the bus?
D      shall I wait here for the bus?
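The counts above (10 candidate utterances, 7 retained) follow from simple span enumeration; a sketch in Python, with the retained segmentations written as lists of (start, end) spans over the units A..D:

```python
units = ["OK", "so go back and is this number three", "right there",
         "shall I wait here for the bus?"]

# All contiguous spans of n units: n(n+1)/2 candidate utterances
n = len(units)
candidates = [(i, j) for i in range(n) for j in range(i + 1, n + 1)]
print(len(candidates))  # 10

# The 4 segmentations retained by the labeller: A|B|C-D, A|B|C|D, A|B-C|D, A-B-C|D
segmentations = [[(0, 1), (1, 2), (2, 4)],
                 [(0, 1), (1, 2), (2, 3), (3, 4)],
                 [(0, 1), (1, 3), (3, 4)],
                 [(0, 3), (3, 4)]]
# Labelling needs one entry per distinct segment used by any retained segmentation
to_label = {span for seg in segmentations for span in seg}
print(len(to_label))  # 7 distinct utterances to label
```

The restriction from 10 candidates to 7 labelled utterances is exactly the labeller's judgement about which combinations are possible.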
3.3 Utterance level
3.3.1 Structure of the lists and associated separators
::= | ['UTTERANCES'] +
::=
(I-text) means 'indexed text': at the end of the scope of an ambiguity, we insert a reference to the corresponding ambiguity kernel, exactly as one inserts citation marks in a text.
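A hypothetical helper (the function name and the exact shape of the inserted reference are our assumptions, not part of the format specification) makes the convention concrete:

```python
def indexed_text(words, scopes):
    """Insert a kernel reference after the last word of each ambiguity scope,
    the way one inserts citation marks in a text.
    scopes: list of (start, end, kernel_id) over word positions."""
    out = []
    for i, w in enumerate(words):
        out.append(w)
        # append a reference for every scope ending at this word
        out += [f"({kernel})" for start, end, kernel in scopes if i == end - 1]
    return " ".join(out)

print(indexed_text("please state office phone number".split(),
                   [(1, 5, "EMMI10a-2'-5.1")]))
# please state office phone number (EMMI10a-2'-5.1)
```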
3.3.2 Headers of ambiguity kernels
::=*
There may be no ambiguity in the utterance, hence the use of "*" instead of "+" as above.
::= '(' ')'
::= 'ambiguity' ['-' ]
::= '-' [ ]*
For example, a kernel header may be: "ambiguity EMMI10a-2'-5.1". This is ambiguity kernel number 2' in dialogue EMMI 10a, noted here EMMI10a, and 5.1 is M. Tomokiyo's hierarchical code.
::=

3.3.3 Obligatory labels
::= <scope> {<status>}

By { A B C }, we mean any permutation of A B C: we don't insist that the labeller follows a specific order, only that the obligatory labels come first, with the scope as very first.

a. Scope
<scope> ::= '(scope' ')'

b. Status
<status> ::= '(status' <status_value> ')'
<status_value> ::= 'expert_system' | 'interpreter' | 'user'
The status expresses the kind of supplementary knowledge needed to reliably solve the considered ambiguity. If 'expert_system' is given, and if a disambiguation strategy decides to solve this ambiguity interactively, it may ask: the expert system, if any; the interpreter, if any; or the user (speaker). If 'interpreter' is given, it means that an expert system of the generic task at hand could not be expected to solve the ambiguity.

c. Importance
::= '(importance' ')'
::= 'crucial' | 'important' | 'not-important' | 'negligible'
This expresses the impact of solving the ambiguity in the context of the intended task. An ambiguity of negation scope is often crucial, because it may lead to two opposed understandings, as in "A did not push B to annoy C" (did A push B or not?). An ambiguity of attachment is often only important, as the corresponding meanings are not so different, and users may correct a wrong decision themselves. That is the case in the famous example "John saw Mary in the park with a telescope". From Japanese into English, although the number is very often ambiguous, we may also very often consider it as 'not-important'. 'Negligible'
ambiguities don't really put obstacles to the communication. For example, "bus" in English may be "autobus" (intra-town bus) or "autocar" (inter-town bus) in French, but either translation will almost always be perfectly understandable given the situation.

d. Type
::= '(type' ')'
::= ('structure' | 'attachment') '(' <structure>+ ')'
  | ('communication_act' | 'CA') '(' + ')'
  | ('class' | 'cat') '(' <morpho_syntactic_class>+ ')'
  | 'meaning' '(' <definition>+ ')'
  | '(' + ')'
  | 'reference'
  | 'address' '(' + ')'
  | 'situation' <situation>
  | 'mode' <mode>
  | ...

The linguists may define more types.

<structure> ::= '<' ( | <structure>+) '>'
::= 'yes' | 'acknowledge' | 'yn-question' | 'inform' | 'confirmation-question'
<morpho_syntactic_class> ::= 'N' | 'V' | 'Adj' | 'Adv' | ...
<definition> ::= '(' (<defined_ref_value> | )+ ')'
<defined_ref_value> ::= '*somebody' | '*something' | '*speaker' | '*hearer' | '*client' | '*agent' | '*interpreter'
<situation> ::= ...
<mode> ::= 'infinitive' | 'indicative' | 'conjunctive' | 'imperative' | 'gerund'

3.3.4 Other labels
Other labels are not obligatory. Their list is to be completed in the future as more ambiguity labelling is performed.
::= [ | <multimodality> ...]*
::= 'definitive' | 'long_term' | 'short_term' | 'local'
<multimodality> ::= 'multimodal' (<multimodal_help> | '(' <multimodal_help>+ ')')
<multimodal_help> ::= 'prosody' | 'pause' | 'pointing' | 'gesture' | 'facial_expression' | ...
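To illustrate how the obligatory labels combine, here is a hypothetical emitter for one kernel entry (the helper itself and the exact serialisation are our assumptions; only the attribute names come from section 3.3):

```python
def kernel_entry(ident, scope, status, importance, type_label, values):
    """Serialise one ambiguity kernel with its obligatory labels, scope first."""
    labels = [f"(scope {scope})", f"(status {status})",
              f"(importance {importance})",
              f"(type {type_label} ({' '.join(values)}))"]
    return f"(ambiguity {ident} " + " ".join(labels) + ")"

# The "state (N/V)" example of section 2.1.2, labelled as a category ambiguity
print(kernel_entry("EMMI10a-2'-5.1", "state office phone number",
                   "user", "important", "cat", ["N", "V"]))
```

Expressed this way, the obligatory part of an entry is a fixed record, and the optional labels of section 3.3.4 could simply be appended before the closing parenthesis.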
4 Conclusions
Although many studies on ambiguities have been published, the specific goal of studying ambiguities in the perspective of interactive disambiguation in automated text and speech translation systems has led us to explore some new ground and to propose the new concept of 'ambiguity labelling'. Several dialogues from EMMI-1 (ATR-ITL 1994) and EMMI-2 (Park & Loken-KIM 1994) have already been labelled (in Japanese and English). Attempts have also been made on French texts and dialogues. In the near future, we hope to refine our ambiguity labelling, and to label WOZ dialogues from EMMI-3 (Park, Loken-KIM, Mizunashi & Fais 1995). In parallel, the specification of MIDDIM-DB, a HyperCard based support for the ambiguity data base under construction, is being reshaped to implement the new notions introduced here: ambiguity kernels, occurrences, and types.

Acknowledgements. We are very grateful to Dr. Y. Yamazaki, president of ATR-ITL, Mr. T. Morimoto, head of Department 4, and Dr. Loken-Kim K-H., for their constant support to this project, which is one of the projects funded by CNRS and ATR in the context of a memorandum of understanding on scientific cooperation. Thanks should also go to M. Axtmeyer, L. Fais and H. Blanchon, who have contributed to the study of ambiguities in real texts and dialogues, and to M. Kurihara, for his programming skills.
REFERENCES

ATR-ITL. 1994. "Transcriptions of English Oral Dialogues Collected by ATR-ITL using EMMI (from TR-IT-0029, ATR-ITL)" ed. by GETA. EMMI report. Grenoble & Kyoto.

Axtmeyer, Monique. 1994. "Analysis of Ambiguities in a Written Abstract (MIDDIM project)". Internal Report. Grenoble, France: GETA, IMAG (UJF & CNRS).
Black, Ezra, R. Garside & G. Leech. 1993. Statistically-Driven Grammars of English: The IBM/Lancaster Approach ed. by J. Aarts & W. Mejs (= Language and Computers: Studies in Practical Linguistics, 8). Amsterdam: Rodopi.

Blanchon, Hervé. 1993. "Report on a Stay at ATR". Project Report (MIDDIM). Grenoble & Kyoto: GETA & ATR-ITL.

———. 1994. "Perspectives of DBMT for Monolingual Authors on the Basis of LIDIA-1, an Implemented Mockup". Proceedings of the 15th International Conference on Computational Linguistics (COLING-94), vol. I, 115-119. Kyoto, Japan.

———. 1994. "Pattern-Based Approach to Interactive Disambiguation: First Definition and Experimentation". Technical Report 0073. Kyoto, Japan: ATR-ITL.

Boitet, Christian. 1989. "Speech Synthesis and Dialogue Based Machine Translation". Proceedings of the ATR Symposium on Basic Research for Telephone Interpretation, 22-22. Kyoto, Japan.

——— & H. Blanchon. 1993. "Dialogue-based MT for Monolingual Authors and the LIDIA Project". Rapport de Recherche (RR-918-I). Grenoble: IMAG, GETA, UJF & CNRS.

———. 1993. "Practical Speech Translation Systems will Integrate Human Expertise, Multimodal Communication, and Interactive Disambiguation". Proceedings of the 4th Machine Translation Summit, 173-176. Kobe, Japan.

———. 1993. "Human-Oriented Design and Human-Machine-Human Interactions in Machine Interpretation". Technical Report 0013. Kyoto: ATR-ITL.

———. 1993. "Multimodal Interactive Disambiguation: First Report on the MIDDIM Project". Technical Report 0014. Kyoto: ATR-ITL.

——— & K-H. Loken-Kim. 1993. "Human-Machine-Human Interactions in Interpreting Telecommunications". Proceedings of the International Symposium on Spoken Dialogue. Tokyo, Japan.

——— & M. Axtmeyer. 1994. "Documents Prepared for Inclusion in MIDDIM-DB". Internal Report. Grenoble: GETA, IMAG (UJF & CNRS).

———. 1994. "On the Design of MIDDIM-DB, a Data Base of Ambiguities and Disambiguation Methods". Technical Report 0072. Kyoto & Grenoble: ATR-ITL & GETA-IMAG.

——— & H. Blanchon. 1995. "Multilingual Dialogue-Based MT for Monolingual Authors: the LIDIA Project and a First Mockup". Seminar Report on Machine Translation. Grenoble.
CHRISTIAN BOITET & MUTSUKO TOMOKIYO
Example of a short dialogue

I. Complete labelling in text-processor-oriented format

The numbers in square brackets are not part of the labelling format and are only given for convenience.
I.1 Text of the dialogue

LABELLED DIALOGUE: "EMMI 10a"

[1] A: Good morning conference office how can I help you
[2] AA: [ah] yes good morning could you tell me please how to get from Kyoto station to your conference center
[3] A: /ls/ [ah] yes (can you tell me) [ah] (you) you're going to the conference center today
[4] AA: yes I am to attend thi [uh] Second International Symposium {on} Interpreting Telecommunications
[5] A: {[o?]} OK n' where are you calling from right now
[6] AA: calling from Kyoto station
[7] A: /ls/ OK, you're at Kyoto station right now
[8] AA: {yes}
[9] A: {/breath/} and to get to the International Conference Center you can either travel by taxi bus or subway how would you like to go
[10] AA: I think subway sounds like the best way to me
[11] A: OK [ah] you wanna go by subway and you're at the station right now
[12] AA: yes
[13] A: OK so [ah] you'll want to get back on thi subway going north
[14] AA: [hmm]
[15] A: and you'll take the subway north to Sanjo station
[16] AA: OK
[17] A: /ls/ at Sanjo station you'll get off and change trains to thi Keihan Kyotsu line
[18] AA: [hmm]
[19] A: OK

I.2 Turns

LABELLED TURNS OF DIALOGUE "EMMI 10a"

TURN
[1] A: Good morning, conference office, |? How can I help you?

UTTERANCES
A: Good morning, conference office(1)
(ambiguity EMMI10a-1-2.2.8.3
  ((scope "conference office")
   (status expert_system)
   (address (*speaker *hearer))
   (importance not-important)
   (multimodal facial-expression)
   (disambiguation_scope definitive)))
A: How can I help you?
/TURN is not necessary here because another TURN appears.

TURN
[2] AA: [ah] yes, good morning. | Could you tell me please how to get from Kyoto station to your conference center?
The labeller distinguishes here a sure segmentation into 2 utterances.

UTTERANCES
AA: [ah] yes(2), good morning.
(ambiguity EMMI10a-2-5.1
  ((scope "yes")
   (status user)
   (type CA (yes acknowledge))
   (importance crucial)
   (multimodal prosody)))
AA: Could you tell me please how to get from Kyoto station to your conference center(3)?
(ambiguity EMMI10a-3-2.2.2
  ((scope "your conference center")
   (status user)
   (type structure («your conference><center» «your><conference center»))
   (importance negligible)
   (multimodal prosody)))
/TURN

[3] A: /ls/ [ah] yes (can you tell me) [ah] (you) you're going to the conference center today(4)
(ambiguity EMMI10a-4-5.2
  ((scope "today")
   (status expert_system)
   (situation "the day they are speaking")
   (importance negligible)
   (multimodal "built-in calendar on screen")))
[4] AA: yes I am to(5) attend thi [uh] Second International Symposium {on} Interpreting Telecommunications
(ambiguity EMMI10a-5-3.1.2
  ((scope "am to")
   (status user)
   (type Japanese)
   (importance important)))
[6] AA: calling from Kyoto station
[7] A: /ls/ OK, you're at Kyoto station(8) right now.
(ambiguity EMMI10a-8-5.1
  ((scope "you're at Kyoto station")
   (status expert_system)
   (type CA (yn-question inform))
   (importance crucial)
   (multimodal prosody)))
[8] AA: {yes}
/TURN is not necessary if there is only one utterance with no ambiguity of segmentation.

TURN
[9] A: {/breath/} and to get to the International Conference Center you can either travel by taxi bus or subway. | how would you like to go

UTTERANCE
A: {/breath/} and to get to the International Conference Center you can(9) either travel(9', 9") by taxi bus or subway(10).
(ambiguity EMMI10a-9-2.1
  ((scope "can")
   (status expert_system)
   (type class (verb modal_verb))
   (importance crucial)))
(ambiguity EMMI10a-9'-2.1
  ((scope "the International Conference Center you can either travel")
   (status expert_system)
   (type structure («the International Conference Center><you can» «either travel» «the International Conference Center>…»))
   (importance crucial)
   (multimodal prosody)))
(ambiguity EMMI10a-9"-2.1
  ((scope "travel")
   (status expert_system)
   (mode (infinitive imperative))
   (importance crucial)))
(ambiguity EMMI10a-10-2.2.2
  ((scope "taxi bus or subway")
   (status expert_system)
   (type structure ())
   (importance important)
   (multimodal prosody)))
A: How would you like to go
/TURN

[11] A: OK, [ah] you wanna go by subway and you're at the station right now(12).
(ambiguity EMMI10a-12-5.1
  ((scope "you wanna go by subway and you're at the station right now")
   (status expert_system)
   (type CA (yn-question inform))
   (importance crucial)
   (multimodal prosody)))
[12] AA: yes
[13] A: OK so [ah] you'll want to(13) get back on thi subway going north(14)
(ambiguity EMMI10a-13-3.1.2
  ((scope "want to")
   (status interpreter)
   (type Japanese (type French ("vouloir" "devoir")))
   (importance important)))
(ambiguity EMMI10a-14-2.2.2
  ((scope "get back on thi subway going north")
   (status user)
   (type structure (…north»»))
   (importance important)
   (multimodal prosody)))
This example is of the same kind as the very famous one, "Time flies like an arrow"! "Linguist's examples" are often derided, but they really appear in texts and dialogues. However, as soon as they are taken out of context, they …
[14] AA: [hmm]
[15] A: and you'll take the subway north to Sanjo station
(… ((status interpreter)
    (type cat (verb noun))
    (importance crucial)
    (multimodal (prosody pause))))
[16] AA: OK
[17] A: /ls/ at Sanjo station you'll get off(15) and change trains to thi Keihan Kyotsu line
(ambiguity EMMI10a-15-5.2
  ((scope "get off and change trains")
   (status user)
   (type structure («get off and change><trains» «get off><and change trains»))
   (importance negligible)
   (multimodal pause)))
[18] AA: [hmm]
[19] A: OK

II. Fragment in a database-oriented format

The idea is simply to use a line-oriented format, each line beginning with a keyword corresponding to the part being labelled. If the information does not fit on one line, the keyword is repeated at the beginning of the next line. The following fragment (turns 1-7) illustrates the idea. The main point is that such a format is easier to handle by traditional DBMS systems. The details of the formats may vary, but it is always required that translation from one format into the other is possible without loss of information.

II.1 Text of the dialogue

HEADING: LABELLED DIALOGUE: "EMMI 10a"
TEXT: A:Good morning conference
TEXT: office how can I help you
TEXT: AA:[ah] yes good morning
TEXT: could you tell me please how
TEXT: to get from Kyoto station to
TEXT: your conference center
TEXT: A:/ls/[ah] yes (can you
TEXT: tell me)[ah](you) you're
TEXT: going to the conference
TEXT: center today

II.2 Turns

Blank lines have been inserted only to make reading easier.

TURNS: LABELLED TURNS OF DIALOGUE
TURNS: "EMMI 10a"

TURN: [1]
TEXT: A:Good morning, conference
TEXT: office, |? How can I help you?

UTTERANCE: [1.1]
TEXT: Good morning, conference office(1)
AMBIGUITY: EMMI10a-1-2.2.8.3
SCOPE: "conference office"
STATUS: expert_system
ADDRESS: (*speaker *hearer)
IMPORTANCE: not-important
MULTIMODAL: facial-expression
DISAMBIGUATION_SCOPE: definitive

UTTERANCE: [1.2]
TEXT: A:How can I help you?

TURN: [2]
TEXT: AA:[ah] yes, good morning. |
TEXT: Could you tell me please
TEXT: how to get from Kyoto station
TEXT: to your conference center?
COMMENT: Sure segmentation into 2 utterances.

UTTERANCE: [2.1]
TEXT: AA:[ah] yes(2), good morning.
AMBIGUITY: EMMI10a-2-5.1
SCOPE: "yes"
STATUS: expert_system
TYPE: CA (yes acknowledge)
IMPORTANCE: crucial
MULTIMODAL: prosody
UTTERANCE: [2.2]
TEXT: AA:Could you tell me please
TEXT: how to get from Kyoto station
TEXT: to your conference center(3)?
AMBIGUITY: EMMI10a-3-2.2.2
SCOPE: "your conference center"
STATUS: user
TYPE: structure («your conference><center»
TYPE: «your><conference center»)
TURN: [5]
TEXT: A:{[o?]} OK n' where are
TEXT: you calling(6) from right
TEXT: now(7)
AMBIGUITY: EMMI10a-6-3.1.2
SCOPE: "calling"
STATUS: expert_system
TYPE: Japanese
IMPORTANCE: crucial
AMBIGUITY: EMMI10a-7-2.1
SCOPE: "calling from right now"
STATUS: user
TYPE: structure («calling from>
TYPE: <right now»)
IMPORTANCE: crucial
MULTIMODAL: prosody

TURN: [6]
TEXT: AA:calling from Kyoto station

TURN: [7]
TEXT: A:/ls/ OK, you're at Kyoto
TEXT: station(8) right now.
AMBIGUITY: EMMI10a-8-5.1
SCOPE: "you're at Kyoto station"
STATUS: expert_system
TYPE: CA (yn-question inform)
IMPORTANCE: crucial
MULTIMODAL: prosody
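Records in this line-oriented format are straightforward for a script or DBMS loader to consume. The following minimal parser is our illustrative sketch, not part of the original labelling scheme; it assumes only the keyword-colon layout shown above, with a repeated keyword continuing the previous field, as the format prescribes:

```python
def parse_labelled_turns(lines):
    """Parse the line-oriented labelling format into records.

    Each line is 'KEYWORD: value'; a repeated keyword on the next
    line continues the previous field. TURN / UTTERANCE / AMBIGUITY
    lines open a new record.
    """
    records, current = [], None
    for line in lines:
        if not line.strip():
            continue
        # split at the FIRST colon only: values may themselves contain ':'
        keyword, _, value = line.partition(":")
        keyword, value = keyword.strip(), value.strip()
        if keyword in ("TURN", "UTTERANCE", "AMBIGUITY"):
            current = {"kind": keyword, "id": value}
            records.append(current)
        elif current is not None:
            # repeated keyword -> continuation of the same field
            prev = current.get(keyword, "")
            current[keyword] = (prev + " " + value).strip()
    return records

sample = [
    "TURN: [6]",
    "TEXT: AA:calling from",
    "TEXT: Kyoto station",
]
print(parse_labelled_turns(sample))
# [{'kind': 'TURN', 'id': '[6]', 'TEXT': 'AA:calling from Kyoto station'}]
```

Because every field sits on its own keyword-prefixed line, a loader like this needs no grammar beyond "first colon separates keyword from value", which is what makes the format easy to handle with traditional database tools.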
III DISCOURSE
Incorporating Discourse Aspects in English - Polish MT
MALGORZATA E. STYŚ & STEFAN S. ZEMKE
University of Cambridge & Linköping University

Abstract
English orders constituents in utterances according to their grammatical function. Polish places them with regard to their informational salience and stylistic criteria. This raises two problems when translating: how to determine informational salience, and which potential order to prefer. The former is addressed by providing an extended version of the centering algorithm; the latter, by extracting order preferences from statistical data.

1 Introduction
Machine translation systems tend to concentrate on conveying the meaning and structure of individual sentences. However, since translation has to be accurate not only lexically and grammatically but also needs to carry across the contextual meaning of each utterance, incorporating discourse aspects is necessary. English and Polish exhibit certain idiosyncratic features which impose different ways of expressing the information status of constituents. Unlike English, in which constituent order is grammatically determined, Polish displays an ordering tendency according to constituents' degree of salience, so that the most informationally salient elements are placed towards the end of the clause. Such ordering requires solid knowledge about the constituents' degree of salience. This paper is organised as follows. The next section includes a description of the centering algorithm for English and our extensions of the notion in view of English - Polish machine translation. We then go on to describe the idiosyncratic properties of Polish and their implications for center transfer. Finally, the rules for ordering Polish constituents are outlined.

2 Centering model for English analysis
Centering as introduced by (Grosz et al. 1986) is a discourse model proposing rules for tracking down given information units on the local discourse
level. The center, expressed as a noun phrase, is a pragmatic construct, and it is intuitively defined as the discourse entity that the utterance is about. Each utterance Un is assigned a forward-looking center list Cf(Un) of all nominal expressions within the utterance, ordered by their grammatical function, which corresponds to the linear order of constituents in English. The backward-looking center Cb(Un), the center proper, is the highest-ranked element of Un which is also (if possible) realised in Cf(Un-1). Pronominalisation and subjecthood are the main criteria underlying this ranking. Generally, (resolvable) pronouns are the preferred center candidates. For possible relations between subsequent utterances, see (Brennan et al. 1987).

2.1 Extension to the centering algorithm
Various refinements have been added to the centering model since its introduction (Brennan et al. 1987; Kameyama 1986; Mitkov 1994; Walker et al. 1994). A description of our practically motivated extensions follows.
2.1.1 Definiteness. Definite articles often point to a center. However, the correlation between definiteness and an entity having been introduced in previous discourse is high but not total. (For example, proper names can be textually new yet definite.) We therefore include definiteness among the factors contributing to center evaluation. Indefinite noun phrases are treated as new discourse entities.
2.1.2 Lexical reiteration. Lexically reiterated items include repeated or synonymous noun phrases, possibly preceded by articles, possessives or demonstratives. We also propose to consider semantic equivalence, based on the synonyms coded in the lexicon, as valid instances of reiteration.
2.1.3 Referential distance. For pronouns and reiterated nouns, we propose that the allowed maximal referential distance, measured in the number of clauses scanned back, correlate with the word length of the constituent involved (Siewierska 1993a). This relates to the observation that short referring expressions have their resolvents closer than longer ones. Such a precaution limiting the referential distance minimises the danger of over-interpretation of common generic expressions such as it. We have not yet experimented with various functions relating the type of referent to its allowed referential distance; a simple linear dependence (with factor 1-2) seems to be reasonable.
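The simple linear dependence suggested in 2.1.3 can be sketched directly. The function name and the word-based length measure below are our illustrative assumptions, with the factor drawn from the 1-2 range mentioned above:

```python
def allowed_referential_distance(referring_expression, factor=2):
    """Maximal number of clauses to scan back for an antecedent.

    Heuristic from 2.1.3: short referring expressions (e.g. 'it')
    have their resolvents closer than longer ones, so the allowed
    distance grows linearly with the word length of the expression.
    """
    n_words = len(referring_expression.split())
    return factor * n_words

# A bare pronoun may only look back 2 clauses; a longer definite
# description is allowed a wider search window.
print(allowed_referential_distance("it"))
print(allowed_referential_distance("the second symposium"))
```

With factor 2, this caps the search for a pronoun like it at two clauses, which is what prevents a generic pronoun from being over-interpreted against distant material.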
   CONSTRUCT             MARKERS                       CENTER VALUE

Center-pointing constructions (Point. 1-4)
1  Cleft                 it+Be+Nc+that/who             center(Nc) := 3
2  Fronted               Nf+Sentence                   center(Nf) := 3
3  Prompted              Prompt+Np, Sentence           center(Np) := 3
4  There-insertion       there+Be+Nt                   center(Nt) := 2

Pronominal centers (Pron. 1-2)
1  Personal              I/you/it/he/she/we/they       center(Pron_pers) := 2
2  Demonstrative         this/that/these/those         center(Pron_demo) := 1

Other (Non. 1-3)
1  Indefinites           a/an/another/other            center(N_indef) := -1
2  Proper names          e.g., Mary/Chicago            center(Proper) := 1
3  Default for any NP    cases not listed elsewhere    center(NP) := 0

Composite Centers (C. 1-5)
1  Reiterated nominal    N_reit <-ref-dist-> N_reit    center(N_reit)+1
2  Definite expressions  the/such/this/that etc. + N   center(N)+1
3  Possessives           its/his/her etc. + N          center(N)+1
4  Genitives             N's+Np, Np+of+N               center(Np)+center(N)
5  Resolved pronoun      Pron <-ref-dist-> NP_match    center(Pron)+1
Table 1: Center values for different types of NP

2.1.4 Center-pointing constructions. Certain English constructions unambiguously point to the center, thus making more detailed analysis unnecessary. The cleft construction uses a dummy subject it to introduce the center, e.g., It was John who came. The center can also be fronted, e.g., Apples, Adam likes, or introduced by a prompt such as as for, concerning, with regard to etc., e.g., As for Adam, he doesn't like apples.
2.1.5 Composite center value. The rules for Composite Centers in Table 1 allow us to calculate the center value increase over the default 0. Thus, for example, the center value for the scientists' colleagues will be arrived at by adding the contribution for the (+1) to the contributions for scientists and colleagues (each 0 or 1 depending on whether the item is reiterated), giving a value between 1 and 3 depending on the context. A constituent is assumed to be assigned the highest possible center value allowed by our rules.
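The additive computation of 2.1.5 can be sketched as follows. The dictionary-based feature representation and rule encoding are our assumptions; the base values and the +1 increments follow Table 1:

```python
# Base center values from Table 1 (pronominal / other constructs)
BASE = {
    "personal_pronoun": 2,
    "demonstrative_pronoun": 1,
    "proper_name": 1,
    "indefinite": -1,
    "default": 0,
}

def center_value(np):
    """Additive center value of an NP, following Table 1.

    `np` is a dict with a `base` class plus boolean 'givenness'
    markers; each applicable Composite rule adds 1 (C.1-C.3, C.5).
    Genitives (C.4) would instead sum the values of both nominals.
    """
    value = BASE[np.get("base", "default")]
    if np.get("reiterated"):        # C.1: lexically reiterated nominal
        value += 1
    if np.get("definite"):          # C.2: the/such/this/that + N
        value += 1
    if np.get("possessive"):        # C.3: its/his/her + N
        value += 1
    if np.get("resolved_pronoun"):  # C.5: resolvable pronoun
        value += 1
    return value

# 'the tests' (definite + reiterated) -> 2, matching clause 2 of Table 2
print(center_value({"base": "default", "definite": True, "reiterated": True}))
```

Note how the resolved-pronoun case reproduces the 3 = 2+1 computation for they in the example table: a personal pronoun starts at 2 and gains 1 by C.5.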
2.1.6 Center gradation. Considering the priority scale of referential items, the mechanisms underlying centering in English could then be outlined as follows:
• Preference of pronouns over full nouns.
• Preference of definites over indefinites.
• Preference of reiterated items over non-reiterated ones.
• Preference of constituents involving more 'givenness' indicators.
These, considered along with special center-pointing constructions, lead to the numerical guidelines presented in Table 1. Some of them agree with the idea of the givenness hierarchy, cf. (Gundel 1993). In Table 2 we illustrate the application of the rules included in Table 1. We choose the constituent with the highest center value as the discrete center of an utterance. If more than one constituent has been assigned the same value, we take the entity that is highest-ranked according to the ranking introduced in the original algorithm (Grosz et al. 1986; Grosz et al. 1995; Brennan et al. 1987).

   UTTERANCE                                RULES            VALUES         CENTER
1  The scientists conducted many tests.     Comp.2; Non.3    1 = 1+0; 0     scientists
2  The tests were thorough.                 Comp.1,2         2 = 1+1+0      tests
3  The results were looked at by their      Comp.2;          1 = 1+0;       colleagues
   colleagues.                              Comp.3,5         2 = 1+1+0
4  They were acknowledged.                  Pron.1, Comp.5   3 = 2+1        they
5  The scientists' colleagues accepted      Comp.1,2,4;      3 = 1+1+1+0;   colleagues
   the tests.                               Comp.1,2         2 = 1+1+0

Table 2: Center values for example clauses
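The selection step just described — highest center value, ties broken by the grammatical ranking of the original algorithm — could be coded along these lines (the candidate representation is our assumption):

```python
def discrete_center(candidates):
    """Pick the discrete center of an utterance.

    `candidates` is a list of (np, center_value, rank) triples, where
    `rank` is the position in the grammatical-function ordering of the
    original centering algorithm (lower = higher ranked). The highest
    center value wins; ties fall back to that ranking, as in the text.
    """
    # tuple key: compare by value first, then by (negated) rank
    return max(candidates, key=lambda c: (c[1], -c[2]))[0]

# Clause 5 of Table 2: colleagues (value 3) beats tests (value 2).
cands = [("colleagues", 3, 1), ("tests", 2, 2)]
print(discrete_center(cands))
```

Encoding the tie-break as a second key component keeps the whole choice a single comparison, which mirrors the two-step rule (value, then ranking) exactly.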
3 Local discourse mechanisms in translation
In discourse analysis, we relate particular utterances to their linguistic and non-linguistic environment. Below, we shall describe the relationship
between the grammatical sentence pattern (Subject Verb Object) and the communicative pattern (Theme Transition Rheme). Functional sentence perspective (FSP) is an approach used by the Prague School of linguists to analyse utterances of Slavic languages in terms of their information content (Firbas 1992). In a coherent discourse, the given or known information, the theme, usually appears first, thus forming a co-referential link with the preceding text. The new unit of information, the rheme, provides some specification of the theme. It is the essential piece of information of the utterance. There are clear linear effects of FSP.¹ Utterance non-final positions usually have a given-information interpretation, whereas the final position represents the new. This could be motivated by word order arranged in such a way that first come words referring to details already familiar from the preceding utterances/external context, and only then words describing new detail. Similarly, in perception first comes identification and only then augmentation by details individually connected with the given idea (Szwedek 1976). Constituent order in Polish generally follows the communicative order from given to new. Since the grammatical function is determined by inflection, there is great scope for the order to express contextual distinctions, and the order often seems free due to the virtual absence of structural obstacles. However, there are also other, mostly stylistic, factors influencing the final order, which can co-specify or even override the 'given precedes new' tendency. This presents a delicate task of balancing a number of clues in selecting the most justified choice. The degree of emphasis is also a factor, and it is worth noting that the more frequently an order occurs the less emphatic it is (Siewierska 1993a).
4 Ordering of Polish constituents
Our choice of ordering criteria has been directly inspired by the findings of the Prague School discussed above, by our own linguistic experience (both of us are bilingual, native speakers of Polish), by statistical data provided in (Siewierska 1987; 1993a,b), and by the feasibility of implementation. The intended approach to ordering could be characterised as follows:

¹ The information structure also changes depending on the accentuation pattern, but we shall leave the intonation aspects aside in this presentation.
Permissive: Generate more (imperfect) versions rather than none at all. If need be, restrict by further filters.
Composite: Generate all plausible orders before some of them are discriminated. (This approach is side-tracked when a special construction is encountered.)
Discrete: No gradings/probability measures are assigned to competing orders so as to discriminate between them. This could be an extension.

4.1 Ordering criteria
Below we present some rules which are obeyed by Polish clauses under usual conditions:
• End weight principle: the last primary constituent is the anti-center;
• Given information fronting: constituents belonging to the given information sequence are fronted;
• Short precedes long principle: shorter constituents go first;
• Relative order principle: certain partial orders are only compatible with specific patterns of constituents.
Additionally, there is a strong tendency to omit subject pronouns. Such omission, however, exhibits different degrees of optionality. What follows is a list of constructs used in subsequent tables to generate plausible orders of (translated) Polish constituents.
Center information: has the highest rank in the ordering procedure and is used in three aspects:
• center(Constituent) returns the center value of the Constituent's NP, or 0 if undefined,
• center_shift(Utterance) holds if Utterance relates to the preceding one in the way allowed by the shift transition, cf. (Grosz et al. 1986),
• discrete_center(Constituent) holds if Constituent is the chosen center of the current utterance.
Length of constituents: length(Constituent) returns the number of words of the resulting Polish Constituent.² Although not as important as center information, this rough measure can discriminate certain orders on the basis of the 'short precedes long' principle.³

² This measure, to a great extent, depends on the translation of constituents. It could be approximated by the length of the original English, instead of Polish, units. We use that in the example.
³ However, for the otherwise rare order OSV, the opposite applies.
219
Positioning of certain constituents: (or indeed their lack) can in turn in duce other constituents to occupy certain positions. Some orders are only possible in certain configurations, e.g., with frontal Adjunct (X-), whereas others require just its presence (-X-), or absence (X=[ ]). Syntactic phenomena: • grammatical function of a constituent, e.g., being a subject (S) or object (0). • pron(s) & pron(O) if both subject and object are pronominal or Sub(Un) = Sub(Un-1) - if subject stays the same. • certain expressions, e.g., a focus binding expression such as 'only', can trigger specific translation patterns. Features of next utterance: e.g., center(S,Un+1) > 0, can be used together with the features of the current utterance in order to obtain more specific conditions. In the following tables S denotes (Polish) subject, V - verb, - object, X - adjunct, Prim - S or 0 , "-" - (sequence of) any, [ ] - omitted constituent. The difference for ">>" to hold must be at least 2. 4.2
Building on orders of constituents
The Preference Table ?? presents some of the main PREFERENCES for gener ating orders of Polish constituents depending on specific CONDITIONS. Each line of the table can be treated as an independent if-then rule co-specifying (certain aspects of) an order. Different rules can be applied independ ently thus possibly better determining a given order4. The JUSTIFICATION column provides some explanation of the validity of each rule. It might be the case that as a result of applying the Preference table, we obtain too many orders. The Discrimination Table 4 provides some ra tionale for excluding those matching ORDERS for which one of their DISCRI MINATION conditions fails. If the building stage left us with no possible or ders at all, we could allow any order and pick only those which successfully pass all their discrimination tests. It is purposeful that all orders apart from the canonical SVO have some discrimination conditions attached to them. The rarer the order tends to be the more strict the condition. Therefore, SVO is expected to prevail. Both the Preference table and the Discrim ination table are mostly based on statistical data described in (Siewierska 1987; 1993a,b). There remains a number of cases which escape simple characterisation in terms of 'preferred and not-discriminated'. The Preprocessing Table 5 4
Orders derived by co-operation of several rules could be preferred in some way.
220
MALGORZATA STYS & STEFAN ZEMKE
Pref.
CONDITIONS
i ii iii iiib
Orderings implied by center information center (Any) < 0 -Any Final position of new center(Cl) >> center(A2) -Anyl-Any2Given-new principle center(X) > 1 XAdjunct topic fronted discrete_center(Prim) (X-)(V-)Prim- Primary center fronted
iv ν vi vii viii ix X
xi xii xiii
PREFERENCE
JUSTIFICATION
Statistical positioning preferences Statistical XV-S-O-V-S-O- & -XStatistical XV-O-S-0-S- & XStatistical XV-O-S-V-O-S- & -XStatistical XS-V-O-S-V-O- & -XS-V-OX Statistical -S-V-O- & -XStatistical O-V-SX -O-V-S- & -XO-VXS Statistical -O-V-S- & -XPron(S) (& center_shift(Un)) Stylistic -vsGeneral Statistical -v-oStatistical preferences -s-o-
(66%) (53%) (32%) (30%) (29%) (26%)
(89%+) (81%)
Table 3: Center values for example clauses offers some solutions under such circumstances. It is to be checked for its conditions before any of the previous tables are involved. If a condition holds, its result (e.g., 0-anaphora) should be noted and only then the other tables applied to co-specify features of the translation as described above. The Preprocessing table can yield erroneous results when applied repeatedly for the same clause. Therefore, unlike the other tables, it should be used only once per utterance. In Table we continue the example from Table 3. The orderings built on by a cooperation of the Preprocessing/Preference and not refused by the Discrimination table appear in the last column. 5
Conclusion
One of the aims of this research was to exploit the notion of center in Polish and put it forward in context of machine translation. The fact that centers are conceptualised and coded differently in Polish and English has clear repercussions in the process of translation. Through exploring the pragmatic, semantic and syntactic conditions underlying the organisation
DISCOURSE ASPECTS IN ENGLISH - POLISH MT Discr,. i ii iii iv V
vi vii viii ix X
ORDER
DISCRIMINATION
221
JUSTIFICATION
-V-S-O- length(S) < length(O) -V-S-O- -V-S-0 -V-S-O- Pron(S) -V-O-S- length(O) < length(S) -V-O-S- -X- present -S-O-V- SOV -S-O-V- center(S,[Un+1) > 0 -O-S-V-O-S-V- length(O) > length(S) -O-V-S length(O) > length(S)
osvx
Statistical Statistical Stylistic Statistical Statistical Statistical Statistical Statistical Statistical Statistical
(99%) (87%) (96%) (89%) (50%+) (79%) (100%) (64%)
Table 4: Discrimination table RESULT
JUSTIFICATION
Pre.
CONDITIONS
i ii iii iv
0-anaphora S='we' S=[ ] pron(O) & pron(S) S=[ ] Sub(Un) = S u b ( U n ) (& pron(S)) S=[ ] center_continuing(Un) S=[ ]
Rhythmic Stylistic Stylistic Stylistic
v vi
Special constructions -'only' SV- & pron(S) -'tylko' SVX=[ ] & pron(O) SOV
Focus binding expr. Special: S,0,V only
Table 5: Preprocessing table of utterances in both languages, we have been able to devise a set of rules for communicatively motivated ordering of Polish constituents. Among the main factors determining this positioning are pronominalisation, lexical reiteration, definiteness, grammatical function and special centered constructions in the source language. Their degree of topicality is coded by the derived center values. Those along with additional factors, such as the length of the originating Polish constituents and the presence of adjuncts, are used to determine justifiable constituent order in the resulting Polish clauses.
222
MALGORZATA STYS & STEFAN ZEMKE
1
2
PREFERENCE
PARTIAL
DISCRIMINATION
CRITERIA
ORDERS
(FAILING)
Pref.xii Pref.xiii
VSO
SVO (Discr.iii)
No rules apply, order unchanged
3
Pref.iiib (Pref.xii)
OVS VOS OSV
4
Pre.iii Pref.xi
S=[] -VS-
5
Pref.iiib (Pref.xii)
SVO VSO
Discr.x (Discr.v) (Discr.viii)
RESULTING ORDER(S)
SVO
SVX OVS
V[S]X
SVO (Discr.i)
Table 6: Example continued: Deriving constituent orders

In further research, we wish to extend the scope of translated constructions to di-transitives and passives. We shall also give due attention to relative clauses. Centering in English can be further refined by allowing verbal and adjectival centers, as well as by determining anti-center constructs. We have thus tackled the question of information distribution in terms of communicative functions and examined its influence on the syntactic structure of the source and target utterances. How and why intersentential relations are to be transmitted across the two languages remains an intricate question, but we believe to have partially contributed to the solution of this problem.

REFERENCES

Brennan, Susan E., Marilyn W. Friedman & Carl J. Pollard. 1987. "A Centering Approach to Pronouns". Proceedings of the Annual Conference of the Association for Computational Linguistics (ACL'87), 155-162. Stanford, Calif.
Firbas, Jan. 1992. Functional Sentence Perspective in Written and Spoken Communication. Cambridge: Cambridge University Press.
Grosz, Barbara J. 1986. "The Representation and Use of Focus in a System for Understanding Dialogs". Readings in Natural Language Processing ed. by B. Grosz, K. Sparck Jones & B. Webber, 353-362. Los Altos, Calif.: Morgan Kaufmann.
——, Aravind K. Joshi & Scott Weinstein. 1995. "Centering: A Framework for Modelling the Local Coherence of Discourse". Computational Linguistics 21:2.203-225.
Gundel, Jeanette K. 1993. "Centering and the Givenness Hierarchy: A Proposed Synthesis". Workshop on Centering Theory in Naturally Occurring Discourses. Philadelphia: University of Pennsylvania.
Kameyama, Megumi. 1986. "A Property Sharing Constraint in Centering". Proceedings of the 24th Annual Conference of the Association for Computational Linguistics (ACL'86), 200-206. Columbia, N.Y.
Mitkov, Ruslan. 1994. "A New Approach for Tracking Center". Proceedings of the International Conference "New Methods in Language Processing", 150-154. Manchester: UMIST.
Siewierska, Anna. 1987. "Postverbal Subject Pronouns in Polish in the Light of Topic Continuity and the Topic/Focus Distinction". Getting One's Words into Line ed. by J. Nuyts & G. de Schutter, 147-161. Dordrecht: Foris.
——. 1993a. "Subject and Object Order in Written Polish: Some Statistical Data". Folia Linguistica 27:1/2.147-169.
——. 1993b. "Syntactic Weight vs. Information Structure and Word Order Variation in Polish". Journal of Linguistics 29:233-265.
Szwedek, Aleksander J. 1976. Word Order, Sentence Stress and Reference in English and Polish. Edmonton: Linguistic Research, Inc.
Walker, Marilyn A., Masayo Iida & S. Cote. 1994. "Japanese Discourse and the Process of Centering". Computational Linguistics 20:2.193-227.
Two Engines Are Better Than One: Generating More Power and Confidence in the Search for the Antecedent

RUSLAN MITKOV
University of Wolverhampton

Abstract

The paper presents a new combined strategy for anaphor resolution based on the interaction of two engines which, separately, have been successful in anaphor resolution. The first engine incorporates the constraints and preferences of the integrated approach for anaphor resolution reported in (Mitkov 1994), while the second engine follows the principles of the uncertainty reasoning approach described in (Mitkov 1995). The combination of a traditional and an alternative approach aims at providing maximal efficiency in tackling the tough problem of anaphor resolution. Preliminary results already show improved performance when both approaches are united into a more powerful and confident searcher for the antecedent.
1 Introduction
Approaches to anaphor resolution have so far been mostly linguistic (Carbonell & Brown 1988; Hayes 1981; Hobbs 1978; Ingria & Stallard 1989; Lappin & McCord 1990; Nasukawa 1994; Pérez 1994; Preuß et al. 1994; Rich & LuperFoy 1988; Rolbert 1989), with the exception of a few projects where statistical (Dagan & Itai 1990) or machine learning (Connoly, Burger & Day 1994) methods have been developed. Given the complexity of the problem and its central importance in Natural Language Processing, it would be wise to consider a combination of various approaches to complement the traditional methods and increase the chances of success by combining the advantages of each method used.

We have already reported on an integrated approach for anaphor resolution based on linguistic constraints and preferences and a statistical method for center tracking (Mitkov 1994). As an alternative, we have successfully developed an uncertainty reasoning approach (Mitkov 1995). To improve performance, we have recently developed a combined strategy based on two engines: the first engine searches for the antecedent using the integrated approach, whereas the second engine performs uncertainty reasoning to rate
the candidates for antecedents. The preliminary tests show encouraging results.
2 An integrated anaphor resolution approach
Our anaphor resolution model described in (Mitkov 1994) incorporates modules containing different types of knowledge: syntactic, semantic, domain, discourse and heuristic (Figure 1).
Fig. 1: An integrated anaphor resolution architecture

The syntactic module, for example, knows that the anaphor and antecedent must agree in number, gender and person. It checks if the c-command constraints hold and establishes disjoint reference. In cases of syntactic parallelism, it prefers the noun phrase with the same syntactic role as the anaphor as the most probable antecedent. It knows when cataphora is
TWO-ENGINE APPROACH TO ANAPHOR RESOLUTION
possible and can indicate syntactically topicalised noun phrases, which are more likely to be antecedents than non-topicalised ones.

The semantic module checks for semantic consistency between the anaphor and the possible antecedent. It filters out semantically incompatible candidates on the basis of verb semantics or the animacy of the candidate. In cases of semantic parallelism, it prefers the noun phrase which has the same semantic role as the anaphor as the most likely antecedent. Finally, it generates a set of possible antecedents whenever necessary.

The syntactic and semantic modules have been enhanced by a discourse module, which plays a very important role because it keeps track of the centers of each discourse segment (it is the center which is, in most cases, the most probable candidate for an antecedent). Based on empirical studies of the sublanguage of computer science, we have developed a statistical approach to determining the probability of a noun (verb) phrase being the center of a sentence. Unlike the approaches known so far, our method is able to propose the center with high probability in every discourse sentence, including the first one. The approach uses an inference engine based on Bayes' formula, which draws an inference in the light of some new piece of evidence: it calculates the new probability, given the old probability plus some new piece of evidence (Mitkov 1994).

The domain knowledge module is practically a knowledge base of the concepts of the domain considered, and the discourse knowledge module knows how to track the center of the current discourse segment. The heuristic knowledge module can sometimes be helpful in locating the antecedent. It has a set of useful rules (e.g., the antecedent is preferably located in the current sentence or in the previous one) and can forestall certain impractical search procedures.
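The Bayes-based recalculation used for centre tracking can be illustrated with a small numerical sketch. The prior and likelihood values below are invented for illustration only; the real model's parameters come from the corpus statistics reported in (Mitkov 1994):

```python
def bayes_update(prior: float, p_e_given_h: float, p_e_given_not_h: float) -> float:
    """Posterior P(H | E): the new probability of the hypothesis H ('this
    phrase is the center') after observing a piece of evidence E, computed
    from the old probability (the prior) and the likelihoods of E."""
    numerator = p_e_given_h * prior
    return numerator / (numerator + p_e_given_not_h * (1.0 - prior))

# A noun phrase starts as a center candidate with a hypothetical prior of 0.4;
# two pieces of evidence (hypothetical likelihoods) raise it step by step.
p = bayes_update(0.4, 0.8, 0.3)   # -> 0.64
p = bayes_update(p, 0.7, 0.4)     # -> about 0.757
```

Each piece of evidence is folded in by reusing the previous posterior as the new prior, which matches the "old probability plus some new piece of evidence" description above.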
The referential expression filter plays an important role in filtering out impersonal 'it' expressions (e.g., "it is important", "it is necessary", "it should be pointed out", etc.), where 'it' is not anaphoric.

The syntactic and semantic modules usually filter the possible candidates and do not propose an antecedent (with the exception of syntactic and semantic parallelism). Generally, the proposal for an antecedent comes from the domain, heuristic and discourse modules.
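The division of labour described above (constraint modules filter candidates; they rarely propose one) can be sketched as follows. The feature representation, the toy candidates and the impersonal-'it' patterns are simplified assumptions for illustration, not the system's actual knowledge sources:

```python
# Patterns that flag a non-anaphoric, impersonal 'it' (cf. "it is important").
IMPERSONAL_IT = ("it is important", "it is necessary", "it should be pointed out")

def is_impersonal_it(clause: str) -> bool:
    """Referential expression filter: True if the clause-initial 'it' is
    one of the listed impersonal patterns and hence not anaphoric."""
    return any(clause.lower().startswith(p) for p in IMPERSONAL_IT)

def agreement_filter(anaphor: dict, candidates: list) -> list:
    """Syntactic filtering: keep only candidates that agree with the
    anaphor in number and person (gender is omitted in this toy version)."""
    return [c for c in candidates
            if c["number"] == anaphor["number"] and c["person"] == anaphor["person"]]

anaphor = {"form": "they", "number": "pl", "person": 3}
candidates = [
    {"form": "system programs", "number": "pl", "person": 3},
    {"form": "assembly version", "number": "sg", "person": 3},
]
survivors = agreement_filter(anaphor, candidates)   # only 'system programs' survives
```

Note that the filter only discards candidates; ranking the survivors is left to the discourse, domain and heuristic modules, as in the architecture above.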
3 An uncertainty reasoning approach
We have developed a new uncertainty reasoning approach for anaphor resolution (Mitkov 1995). The strategy for determining the antecedent of a
pronoun uses AI uncertainty reasoning techniques. Uncertainty reasoning was selected as an alternative because:

1. in Natural Language Understanding, the program is likely to estimate the antecedent of an anaphor on the basis of incomplete information: even if information about constraints and preferences is available, it is natural to assume that a Natural Language Understanding program is not able to understand the input completely;

2. the necessary initial constraint and preference scores are determined by human beings; the scores are therefore originally subjective and should be regarded as uncertain facts.

The uncertainty reasoning approach makes use of various 'anaphor resolution symptoms' which have already been studied in detail. Apart from the widely used syntactic and semantic constraints and preferences such as agreement, c-command constraints, parallelism, topicalisation and verb-case role, the approach makes use of further symptoms based on empirical evidence, such as subject preference, domain concept preference, object preference, section head preference, reiteration preference, definiteness preference, main clause preference, etc.

The availability or non-availability of a certain symptom corresponds to an appropriate score or certainty factor (CF) attached to it. For instance, the availability of a certain symptom s is assigned CF_s^av (0 < CF_s^av <= 1), whereas its non-availability corresponds to CF_s^non-av (-1 <= CF_s^non-av < 0). For easier reference and brevity, we associate with the symptom s only the certainty factor CF_s, which we regard as a two-valued function (CF_s ∈ {CF_s^av, CF_s^non-av}).

The antecedent searching procedure makes use of an uncertainty reasoning strategy: the search for an antecedent is regarded as an affirmation (or rejection) of the hypothesis that a certain noun phrase is the correct antecedent. The certainty factor (CF) serves as a quantitative approximation of the hypothesis.
The availability or non-availability of each symptom s causes a recalculation of the global hypothesis certainty factor CF_hyp (an increase or decrease) until CF_hyp >= CF_threshold, for affirmation of the hypothesis, or CF_hyp < CF_min, for its rejection. The evaluation process is clearly divided into two steps: 1. proposal of a hypothesis on the basis of preliminary (usually 3-5) tests on the most 'significant' symptoms; and 2. hypothesis verification.
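The affirm/reject control regime can be sketched abstractly. The update rule is left as a parameter (the paper's own recalculation formula is a modified version of Pavlov, Mitkov & Filev 1989), and the threshold values below are illustrative only:

```python
def evaluate_hypothesis(symptom_cfs, update, cf_threshold=0.9, cf_min=-0.25):
    """Consume symptom CFs one by one, recalculating CF_hyp until it crosses
    the affirmation threshold or drops below the rejection floor."""
    cf_hyp = 0.0
    for cf in symptom_cfs:
        cf_hyp = update(cf_hyp, cf)
        if cf_hyp >= cf_threshold:
            return "affirmed", cf_hyp
        if cf_hyp < cf_min:
            return "rejected", cf_hyp
    return "undecided", cf_hyp

# With a simple clamped-sum update (a stand-in for the real formula):
clamp = lambda a, b: max(-1.0, min(1.0, a + b))
print(evaluate_hypothesis([0.5, 0.5], clamp))         # ('affirmed', 1.0)
print(evaluate_hypothesis([-0.25, -0.25], clamp))     # ('rejected', -0.5)
```

The early exit on either threshold is what makes the step-1/step-2 split pay off: a few significant symptoms can settle a hypothesis without examining the rest.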
We use a hypothesis verification formula for recalculating the hypothesis on the basis of the availability (in our case also the non-availability) of certain symptoms. The present version of the formula is a modified version of the formula in (Pavlov, Mitkov & Filev 1989), which we have already used successfully in adaptive testing.
4 The two-engine strategy
Two engines are better than one: on the basis of the two approaches developed and tested above, we have studied and proposed a combined strategy which incorporates the advantages of each of these approaches, generating more power and confidence in the search for the antecedent.

The two-engine strategy evaluates each candidate for the antecedent from the point of view of both the integrated approach and the uncertainty reasoning approach. If opinions coincide, the evaluating process is stopped earlier than would be the case if only one engine were acting. This also makes the searching process shorter: our preliminary tests show that the integrated approach engine needs only about 90% of the search it would make when operating on its own; similarly, the uncertainty reasoning engine does only 67% of the search it would do when operating as a separate system. In addition, the results of using both approaches are more accurate (see Table 1 below).

This combined strategy enables the system to consider all the symptoms in a consistent way; it does not regard any symptom as absolute or unconditional. This 'behaviour' is very suitable for symptoms like 'gender' (which could be regarded as absolute in languages like English but 'conditional' in languages like German) or 'number'.[1]

The rationale for selecting a two-engine approach is the following:

1. two independent estimations, if confirmed, bring more confidence in proposing the antecedent;
2. the use of two approaches can be usefully complementary: e.g., the conditionality of gender is better captured by uncertainty reasoning;
3. in sentences with more than one pronoun, center tracking alone (and therefore the integrated approach) is not very helpful for determining the corresponding antecedents;
4. though the uncertainty reasoning approach may be considered more stable in such situations, it is comparatively slow but could adopt a
[1] In German, for instance, "Mädchen" (girl) is neuter, but one can refer to "Mädchen" with a feminine pronoun (sie). In other languages, some singular nouns (e.g., nouns denoting a collective notion) may be referred to by a plural pronoun.
lower CF if intermediate results obtained by both engines are reported to be close (a lower CF will make its process faster); and
5. the two-engine strategy does not depend exclusively on the notion of 'center', which is often considered 'intuitive' in nature.

We see that in certain situations one of the engines can be expected to operate more successfully than the other and complement it, but it is also the parallel confirmation of the results obtained that generates more confidence in the search for the antecedent.

We have implemented the integrated model as a program which runs on Macintosh computers, and the following table shows its success rate. Four text excerpts served as inputs, each taken from a computer science book. Excerpts ranging from 500 to 1000 words and estimated to contain a comparatively high number of pronouns were selected (it was not always easy to find paragraphs abundant in pronominal anaphors). These documents were different from the corpus initially used for the development of the various 'symptom rules' and were hand-annotated (syntactic and semantic roles). Other versions of these excerpts, which contained anaphoric references marked by a human expert, were used as an evaluation corpus.

We tested on these inputs the three programs: (i) the integrated approach, (ii) the uncertainty reasoning approach, and (iii) the two-engine approach. The results (Table 1) show an improvement in resolving anaphors when the integrated approach and the uncertainty reasoning approach are combined into a two-engine strategy.

Table 1: Anaphor resolution success rate using different approaches
          INTEGRATED   UNCERTAINTY   TWO-ENGINE
          APPROACH     REASONING     STRATEGY
Text 1    89.1 %       87.3 %        91.7 %
Text 2    92.6 %       93.6 %        95.1 %
Text 3    91.7 %       90.4 %        93.8 %
Text 4    88.6 %       89.2 %        93.7 %
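The early-stopping idea behind the two-engine control can be sketched in a few lines. The engines themselves are stubbed out; the candidate names and scores below are placeholders, not output of the real system:

```python
# Toy sketch of the two-engine strategy: each engine independently ranks the
# candidates; when their top choices coincide, evaluation stops early.
# Otherwise (one illustrative fallback, not the paper's exact procedure) the
# two engines' scores are summed and the best combined candidate is returned.

def two_engine_pick(candidates, score_a, score_b):
    best_a = max(candidates, key=score_a)
    best_b = max(candidates, key=score_b)
    if best_a == best_b:               # opinions coincide: stop early
        return best_a
    return max(candidates, key=lambda c: score_a(c) + score_b(c))

cands = ["system programs", "machine languages", "user's programs"]
score_a = {"system programs": 0.8, "machine languages": 0.3, "user's programs": 0.2}.get
score_b = {"system programs": 0.91, "machine languages": -0.18, "user's programs": -0.3}.get
print(two_engine_pick(cands, score_a, score_b))   # system programs
```

When the engines agree, no further symptoms need be consumed, which is the source of the reported 90%/67% reductions in search effort.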
5 Illustration
As an illustration of how the new approach works, consider the following sample text:
SYSTEM PROGRAMS
System programs, such as the supervisor and the language translator, should not have to be translated every time they_i are used; otherwise this would result in a serious increase in the time spent in processing user's programs. System programs_i are usually written in the assembly version of machine languages and are translated once into the machine code itself. From then on they_i can be loaded into memory in machine code without the need for any intermediate translation phase.

Step 1. Integrated approach engine:

ANAPHOR = they
CANDIDATES: {system programs, machine languages, assembly version, machine code, user's programs}
AGREEMENT CONSTRAINTS: {system programs, machine languages, user's programs}
SEMANTIC CONSTRAINTS: {system programs, machine languages, user's programs} (no discrimination)
CENTER TRACKING: {system programs} (proposed with higher probability)

Step 2. Uncertainty reasoning approach engine:

ANAPHOR = they
CANDIDATES: {system programs, machine languages, assembly version, machine code, user's programs}

Candidate 1: system programs
symptom 1: number, CF_number = 0.3
symptom 2: person, CF_person = 0.3; CF_hyp = 0.3 + 0.3 - 0.3 * 0.3 = 0.51
symptom 3: gender, CF_gender = 0.3; CF_hyp = 0.657
symptom 4: verb case role, CF_verb-case-role = 0.3; CF_hyp = 0.7029
symptom 5: syntactic parallelism, CF_syntactic-parallelism = -0.2; CF_hyp = (0.7029 - 0.2)/[1 - min(|0.7029|, |-0.2|)] = 0.6286
symptom 6: semantic parallelism, CF_semantic-parallelism = 0.5; CF_hyp = 0.8143
symptom 7: topicalisation, CF_topicalisation = 0.0; CF_hyp = 0.8143
symptom 8: subject, CF_subject = 0.25; CF_hyp = 0.8607
symptom 9: repetition, CF_repetition = 0.6; CF_hyp = 0.8906
symptom 10: head, CF_head = 0.35; CF_hyp = 0.8923
symptom 11: previous, CF_previous = 0.15; CF_hyp = 0.9085
Candidate 1 accepted

Candidate 2: machine languages
symptom 1: number, CF_number = 0.3
symptom 2: person, CF_person = 0.3; CF_hyp = 0.3 + 0.3 - 0.3 * 0.3 = 0.51
symptom 3: gender, CF_gender = -0.6; CF_hyp = (0.51 - 0.6)/[1 - min(|0.51|, |-0.6|)] = -0.1836
symptom 4: verb case role, CF_verb-case-role = 0.1; CF_hyp = (-0.1836 + 0.1)/[1 - min(|0.1|, |-0.1836|)] = -0.0929
symptom 5: syntactic parallelism, CF_syntactic-parallelism = -0.2; CF_hyp = -0.1672
symptom 6: semantic parallelism, CF_semantic-parallelism = -0.2; CF_hyp = -0.2938

Candidate 2 rejected

Due to results similar to those for candidate 2, candidates 3, 4 and 5 are also rejected.

Step 3. The results of Steps 1 and 2 confirm the selection of "system programs" as the antecedent.

Using the uncertainty reasoning approach within the two-engine approach would mean that the CF threshold could be set lower: a CF of 0.89 would in this case be satisfactory, shortening the procedure by two steps. On the other hand, the uncertainty approach confirms the proposal of the integrated approach.
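Several of the recalculations above follow MYCIN-style certainty-factor combination: two non-negative CFs combine as x + y - xy, two non-positive ones as x + y + xy, and CFs of opposite sign as (x + y)/(1 - min(|x|, |y|)). A sketch reproducing the first steps of each candidate (note that the paper uses a modified version of the Pavlov, Mitkov & Filev 1989 formula, so not every printed intermediate value is reproduced by this plain rule):

```python
def combine(x: float, y: float) -> float:
    """MYCIN-style combination of two certainty factors in [-1, 1]."""
    if x >= 0 and y >= 0:
        return x + y - x * y
    if x <= 0 and y <= 0:
        return x + y + x * y
    return (x + y) / (1 - min(abs(x), abs(y)))

# Candidate 1 ("system programs"): number + person, then gender
cf = combine(0.3, 0.3)       # 0.51, as in the text
cf = combine(cf, 0.3)        # 0.657

# Candidate 2 ("machine languages"): gender disagreement pulls the CF down
cf2 = combine(0.51, -0.6)    # (0.51 - 0.6) / (1 - 0.51) = -0.1836...
cf2 = combine(cf2, 0.1)      # about -0.093, matching the text's -0.0929
```

The mixed-sign rule is what lets a single strongly negative symptom (here, gender) overturn an otherwise well-supported hypothesis without sending the CF straight to -1.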
6 Conclusion
We have presented a two-engine strategy for pronoun resolution which combines the engines and advantages of an integrated architecture for anaphor resolution (Mitkov 1994) and of an uncertainty-based anaphor resolution model (Mitkov 1995). Preliminary evaluations show an improvement in performance, and though further investigations and comparisons have to be carried out, we believe that the first results can be regarded as promising.
REFERENCES

Barros, Flávia A. & Anne Deroeck. 1994. "Resolving Anaphora in a Portable Natural Language Front End to Databases". Proceedings of the 4th Conference on Applied Natural Language Processing (ANLP'94), 119-124. Stuttgart, Germany.

Carbonell, James G. & Ralf D. Brown. 1988. "Anaphora Resolution: A Multi-Strategy Approach". Proceedings of the 12th International Conference on Computational Linguistics (COLING-88), vol. I, 96-101. Budapest, Hungary.

Connoly, Dennis, John D. Burger & David S. Day. 1994. "A Machine Learning Approach to Anaphoric Reference". Proceedings of the International Conference "New Methods in Language Processing", 255-261. Manchester: UMIST.

Dagan, Ido & Alon Itai. 1990. "Automatic Processing of Large Corpora for the Resolution of Anaphora References". Proceedings of the 13th International Conference on Computational Linguistics (COLING-90), vol. III, 1-3. Helsinki, Finland.

Hayes, Philip J. 1981. "Anaphora for Limited Domain Systems". Proceedings of the 7th International Joint Conference on Artificial Intelligence (IJCAI'81), 416-422. Vancouver, Canada.

Hirst, Graeme. 1981. Anaphora in Natural Language Understanding. Berlin: Springer Verlag.

Hobbs, Jerry R. 1978. "Resolving Pronoun References". Lingua 44.339-352.

Ingria, Robert J.P. & David Stallard. 1989. "A Computational Mechanism for Pronominal Reference". Proceedings of the 27th Annual Meeting of the ACL, 262-271. Vancouver, British Columbia.

Lappin, Shalom & Michael McCord. 1990. "Anaphora Resolution in Slot Grammar". Computational Linguistics 16:4.197-212.

Mitkov, Ruslan. 1994a. "An Integrated Model for Anaphora Resolution". Proceedings of the 15th International Conference on Computational Linguistics (COLING-94), 1170-1176. Kyoto, Japan.

Mitkov, Ruslan. 1994b. "A New Approach for Tracking Center". Proceedings of the International Conference "New Methods in Language Processing", 150-154. Manchester: UMIST.

Mitkov, Ruslan. 1995. "An Uncertainty Reasoning Approach for Anaphora Resolution". Proceedings of the Natural Language Processing Pacific Rim Symposium (NLPRS'95), 149-154. Seoul, Korea.

Nasukawa, Tetsuya. 1994. "Robust Method of Pronoun Resolution Using Full-Text Information". Proceedings of the 15th International Conference on Computational Linguistics (COLING-94), 1157-1163. Kyoto, Japan.
Pavlov, Radoslav, Ruslan Mitkov & Philip Filev. 1989. "An Adaptive Uncertainty Reasoning-Based Model for Computerised Testing". Proceedings of the 3rd International Conference "Children in the Information Age", 92-98. Sofia, Bulgaria.

Rico Pérez, Celia. 1994. Statistical-Algebraic Approximation to Discourse Anaphora. Ph.D. dissertation, Department of English Philology, University of Alicante. Alicante, Spain. [In Spanish.]

Preuß, Susanne, Birte Schmitz, Christa Hauenschild & Carla Umbach. 1994. "Anaphora Resolution in Machine Translation". Studies in Machine Translation and Natural Language Processing, vol. VI (Text and Content in Machine Translation: Aspects of Discourse Representation and Discourse Processing) ed. by Wiebke Ramm, 29-52. Luxembourg: Office for Official Publications of the European Community.

Rich, Elaine & Susann LuperFoy. 1988. "An Architecture for Anaphora Resolution". Proceedings of the 2nd Conference on Applied Natural Language Processing, 18-24. Austin, Texas.

Rolbert, Monique. 1989. Résolution de formes pronominales dans l'interface d'interrogation d'une base de données [Resolution of Pronouns in Natural Language Front Ends]. Ph.D. dissertation, Faculty of Science, Luminy, France.

Sidner, Candace L. 1986. "Focusing in the Comprehension of Definite Anaphora". Readings in Natural Language Processing ed. by Barbara J. Grosz et al., 363-394. Los Altos, Calif.: Morgan Kaufmann.
Effects of Grammatical Annotation on a Topic Identification Task

TADASHI NOMOTO
Advanced Research Laboratory, Hitachi Ltd.

Abstract

The paper describes a new method for discovering topical words in discourse. It shows that text categorisation techniques can be turned into an effective tool for dealing with the topic discovery problem. Experiments were done on a large Japanese newspaper corpus. It was found that training the model on annotated corpora does lead to an improvement on the topic recognition task.
1 Introduction
The problem of identifying a topic or subject matter of discourse has long attracted attention from diverse research paradigms. In computational linguistics, the problem more or less takes the form of resolving anaphora (Hobbs 1978; Grosz & Sidner 1986; Lappin & Leass 1994) or locating the discourse center (Joshi & Weinstein 1981; Walker et al. 1994). In information retrieval (IR), the problem came to be known as text categorisation (TC), which concerns classifying documents under a set of pre-defined categories (Lewis 1992; Finch 1994).

While incorporating some of the insights from computational linguistics, the present work extends the text categorisation paradigm to solve the topic identification issue. The paper recasts topic identification as a task of finding in a text representative words under which that text is most likely to classify. Thus identifying a topic in a text requires working with an unbounded set of categories rather than with a bounded, possibly small number of pre-defined categories. Although the idea of using a complex representation of text has yet to prove its value in text classification (Lewis 1992), recent years have seen some progress in the area of corpus-based NLP toward exploiting linguistic representation more sophisticated than the simple word form (Hindle 1990). As our contribution to research in this direction, we are going to report some promising results from experiments with Japanese corpora. In particular, we will show that a representation based on postpositional
phrases (PP), i.e., phrases composed of a noun and a following case particle, is more effective for the topic identification task than a simple word-based representation.

(Futsu) (gin) -ga (Kiev) -ni (chuuzai) (in) (jimusho)
French bank SBJ Kiev at resident staff office

[The body of the annotated example, a romanised Japanese news story with each noun in parentheses, each case particle marked by a preposed dash and each particle-final group bracketed, together with its word-by-word English gloss, is not fully reproducible here; the headline gloss above and the translation below convey its content.]
MAJOR FRENCH BANK OPENS OFFICE IN KIEV
Société Général, a major French bank, disclosed on the 15th a plan to open a resident office in Kiev, capital of Ukraine. The bank has already obtained permission from the city authority, sources say.

Fig. 1: Annotating a news story
2 Topic recognition model
This section describes an approach to the topic identification problem. What we are going to do is formulate the problem as a text categorisation task, i.e., one of classifying documents with respect to a set of pre-defined categories. The formulation is fairly straightforward: we define the problem of finding a topic in text as that of categorising a text with respect to a set of nouns derived from that text. A most probable topic is, then, one with which the text in question is most likely to classify. The job of text categorisation is to estimate L(c | d), the likelihood that a document d is assigned to a category c. Let us call a word with which to classify the text a potential topic of the text. Given a set W(d) of words comprising a text d and a set S(d) of potential topics for d, the job of topic identification is to find an estimate L(c | d), for c ∈ S(d), where S(d) ⊆ W(d).
Now let us consider a likelihood function defined by:

L(c | d) = Σ_t P(c | t) P(t | d)

which is meant to be a relativisation of the relationship between c and d to some index t (Fuhr 1989); the index t could be a simple word or a linguistically more sophisticated representation. The set of such indices is said to represent a text. Assume that every index t will be assigned to some category. Then, by Bayes' theorem, we have:[1]

P(c | t) = P(t | c) P(c) / P(t)

Given a set R(d) of indices for a text d, we define the likelihood function, for c ∈ S(d), by:

L(c | d) = Σ_{w ∈ R(d)} [P(T = w | c) P(c) / P(T = w)] P(T = w | d)

We refer to the formula above as 'TRM' hereafter. 'T = w' denotes the event that a randomly selected word from a document coincides with w. P(c) represents the probability that a randomly selected document is assigned to category c; P(T = w | c) is the probability that a word randomly selected from a document coincides with w, given that category c is assigned to that document; P(T = w) denotes the probability that a word w is selected from a randomly chosen document; P(T = w | d) is the probability that word w is randomly selected from document d. We estimate the component probabilities by:

P(c) = Dc / D
P(T = w | c) = Fwc / F*c
P(T = w | d) = Fwd / F*d
P(T = w) = FwD / F*D

where D is the number of texts found in the training corpus, Dc is the number of texts whose title contains the term c, F*c is the number of token indices in Dc, Fwc is the frequency of w in Dc, F*d is the count of token indices in d, Fwd is the count of w in d, FwD is the frequency of w in the training corpus, and F*D is the total number of token indices in the training corpus.

TRM is based on the simple assumption that if an index is more typical or characteristic of a text, it is more likely to be associated with a topic of that text. For instance, turmeric is a popular coloring spice used in many of the recipes for Indian food. Thus the word 'turmeric' appears very often in an Indian cookbook and therefore does not serve to indicate a particular dish. (For that matter, peas or beans may better indicate what a particular recipe is for.) How typical an index is of a particular text is determined statistically by measuring the degree of its maldistribution or skewness (Umino 1988). TRM uses the following measure for evaluating the typicalness of an index (call it Icd(w)):

Icd(w) = P(T = w | c) P(T = w | d) / P(T = w)

[1] There are some choices we can make as to the nature of t. In binary term indexing (Lewis 1992), a document is represented as a binary vector {0, 1}^n, which records the presence or absence of terms for the document, and t ranges over a set of possible documents. On the other hand, in weighted term indexing (Iwayama & Tokunaga 1994), a document is represented as a set of term frequencies and t ranges over a set of possible terms {w1, ..., wn}. A most important difference between the two indexing policies is that the former is concerned with document frequency, i.e., the number of documents in which a term occurs, while the latter is concerned with term frequency, the frequency of a term within a document. We decided to go along with the weighted term indexing policy, because the binary policy, as it stands, is known to fail where training data is not sufficiently available (Iwayama & Tokunaga 1994).
Let x and y be indices for a text d. Suppose that Fxc = Fyc and FxD = FyD. If Fxd > Fyd, then x has a more skewed distribution and contributes more to the I value than y, i.e., Icd(x) > Icd(y). The same result follows if Fxc = Fyc, Fxd = Fyd, and FxD < FyD. In either case, x is said to be more typical of d than y.
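Under the estimates above, TRM reduces to simple corpus counts. A minimal sketch with an invented toy corpus (titles serve as categories and bodies as index-token lists; the counts have nothing to do with the newspaper corpus used in the experiments below):

```python
from collections import Counter

# Toy training corpus: each document is (title_terms, body_index_tokens).
train = [
    ({"bank"},   ["bank", "office", "Kiev", "bank"]),
    ({"bank"},   ["bank", "loan", "office"]),
    ({"office"}, ["office", "staff", "staff"]),
]

D = len(train)
Dc = Counter()                 # number of documents whose title contains c
Fwc = {}                       # per-category frequency of w (Fwc)
FwD = Counter()                # corpus-wide frequency of w (FwD)
for titles, body in train:
    FwD.update(body)
    for c in titles:
        Dc[c] += 1
        Fwc.setdefault(c, Counter()).update(body)
FD = sum(FwD.values())         # F*D: total token indices in the corpus

def trm(c, doc):
    """TRM score: P(c) * sum over doc tokens of P(T=w|c) P(T=w|d) / P(T=w)."""
    Fd = Counter(doc)
    Fc = sum(Fwc.get(c, Counter()).values())   # F*c
    score = 0.0
    for w, n in Fd.items():
        p_w = FwD[w] / FD
        if p_w == 0 or Fc == 0:
            continue
        score += (Fwc[c][w] / Fc) * (n / len(doc)) / p_w
    return (Dc[c] / D) * score

doc = ["bank", "office", "loan"]
best = max({"bank", "office"}, key=lambda c: trm(c, doc))
print(best)   # bank
```

The skewness intuition shows directly in the per-token term: a word frequent within the category and the document but rare corpus-wide contributes the most, exactly as Icd(w) prescribes.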
3 Text representation
A major concern of this paper is to find out whether annotating corpora with grammatical information affects the model's performance on the topic recognition task. In text categorisation, a text is represented in terms of an indexing language, a set of indices constructed from the vocabulary that makes up that text. We make use of two languages for indexing a text: one is formed from the nouns that occur in the text, and the other from nouns tagged with the postposition of the phrase in which they occur. For a text d, let R+(d) be an indexing language with taggings and R-(d) be one without. Annotating a text goes through two processes: tokenising the text into an array of words, and tagging the words in a postpositional phrase with its postposition, or case particle. We start by dividing a text, which is nothing but a stream of characters, into words. The procedure is carried out with
the use of a program called JUMAN, a popular public-domain tool for the morphological analysis of Japanese (Matsumoto et al. 1993). Since there was no Japanese parser robust enough to deal with free texts such as the one used here, postpositional phrases were identified using a very simple strategy of breaking an array of word tokens into groups at punctuation marks ('.' and ',') as well as at case particles. After examining the results on 10 to 20 texts, we decided that the strategy was good enough. Each token in a group was tagged with the case particle which is the postposition of the group.

Figure 1 lists a sample news article from the test data used in our experiments. The part above the horizontal line corresponds to the headline; the part below the line corresponds to the body of the article. We indicate nouns by parentheses '( )' and case particles by a preposed dash '-'. In addition, we use square brackets '[ ]' to indicate a phrase for which a case particle is a postposition. A tokenisation error is marked with a single star ('*'); a parsing error is doubly starred ('**'). 'ø' indicates that the noun it attaches to is part of the verbal morphology and thus does not take a regular case particle. For the sake of readability, we adopt the convention of representing Japanese index words by their English equivalents.

R-(d) = { French, bank, big-name, Societé, General, on 15th, U-, kra-, ine, capital, Kiev, resident, staff, office, open, disclose, city, authority, permission }

R+(d) = { Frenchα, bankα, big-nameα, Societéβ, Generalβ, on 15thα, U-α, kra-α, ineα, capitalγ, Kievγ, residentδ, staffδ, officeδ, openø, discloseø, Kievα, cityα, authorityα, permissionε }

Fig. 2: Indexing languages

A plain index language is made up of nouns found in the sample article; an annotated index language is like a plain one except that the nouns are tagged with case particles (denoted by superscripts). The list of the particles is given along with explanations in Table 1. Shown in Figure 2 are the two kinds of indexing vocabulary derived from the news article example in Figure 1. The superscripts α, β, γ, δ and ε correspond to the particles no, wa, ni, wo and mo, respectively; thus 'Societéβ', for instance, represents the Japanese wa-annotated term 'Societé wa', and similarly for the others. Notice that unlike the plain index language, the language with annotation contains two

ga    SUBJECT
no    OF, WHOSE
wo    OBJECT
wa    AS FOR, AS REGARDS TO
ni    FOR, TO
to    AND
de    AT, IN
e     TO, IN THE DIRECTION OF
mo    AS WELL
ka    OR
kara  FROM
yori  FROM
Table 1: Case particles, based on (Sahuma 1983)

instances of 'Kiev', i.e., 'Kievγ' and 'Kievα', reflecting the fact that there are two particles in the news piece (no, ni) which are found to occur with the word 'Kiev'.
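Once a text is segmented into (noun, particle) groups, both indexing languages fall out mechanically. A sketch (the pair list below is a hypothetical fragment, not actual JUMAN output, and the superscript annotation of the paper is approximated here with a hyphen):

```python
# Build plain (R-) and annotated (R+) index terms from pre-segmented
# (noun, particle) pairs; particle None marks a noun inside verbal
# morphology (the 'ø' case), which takes no regular case particle.

def index_terms(pairs, annotated=True):
    terms = []
    for noun, particle in pairs:
        if annotated:
            terms.append(f"{noun}-{particle or 'ø'}")
        else:
            terms.append(noun)
    return terms

pairs = [("Kiev", "ni"), ("jimusho", "wo"), ("kaisetsu", None), ("Kiev", "no")]
print(index_terms(pairs, annotated=False))  # ['Kiev', 'jimusho', 'kaisetsu', 'Kiev']
print(index_terms(pairs))                   # ['Kiev-ni', 'jimusho-wo', 'kaisetsu-ø', 'Kiev-no']
```

As in the R+(d) example above, the annotated language keeps the two 'Kiev' occurrences apart because they carry different particles, while the plain language conflates them.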
4 Experiments
In this section, we will report the performance of the topic recognition model on indexing languages with and without grammatical annotation. Recall that an indexing language is something that represents text corpora and usually consists of a set of terms derived in one way or another from the corpora. Our experiments used a total of 44,001 full-text news stories from Nihon Keizai Shimbun, a Japanese economics newspaper. All of the stories appeared in the first half of 1992. Of these, 40,401 stories, which appeared on May 31, 1992 or earlier, were used for training, and the remaining 3,600 articles, which appeared on June 1, 1992 or later, were used for testing.
4.1 Test setting
We divided the test set into nine subsets of stories according to the length of the story. Each subset contained 400 stories. Test set 1, for instance, contains stories less than 100 (Japanese) characters in length; test set 2 consists of stories between 100 and 200 characters in length; and test set 3 contains stories whose length ranges from 200 to 300 characters (Table 2).
test set   length (in char.)   num. of doc.
1          < 100               400
2          100-200             400
3          200-300             400
4          300-400             400
5          400-500             400
6          500-600             400
7          600-700             400
8          700-800             400
9          800-900             400

Table 2: Test sets
The topic identification is a two-step process: (1) it estimates, for each potential topic, the degree of its relationship with the text, i.e., L(c | d), and (2) it then identifies a potential topic which is likely to be an actual topic of the text². This involves using decision strategies like k-per-doc, proportional assignment and probabilistic thresholding. The estimating part uses TRM as a measure of the relationship between a potential topic c and a text d, for c ∈ S(d) and d ∈ D. TRM takes as inputs a text d from the test corpus and a potential topic c, and determines how actual c is with respect to d.

² A potential topic is said to be actual if it occurs in the text's headline.

Here are some details on how to estimate the probabilities. The training set of 40,401 stories was used to determine prior probabilities for P(c), P(T = w), and P(T = w | c). P(c) is the probability that a story chosen randomly from the training set is assigned a title term c. As mentioned in Section 2, we estimated this probability as Dc/D, where Dc is the number of texts whose title has an occurrence of c, and D is the total number of texts, i.e., D = 40,401. The estimation of P(T = w) and P(T = w | c) ignored the frequency of w in a title. P(T = w) was estimated as FwD/F*D, with F*D = 3,213,617, the number of noun tokens found in the training corpus. We estimated P(T = w | c) by Fwc/F*c, where Fwc = Σd∈Dc Fwd and F*c = Σd∈Dc F*d. Again, in estimating P(T = w | c), we counted out any of w's occurrences in a headline. P(T = w | d) was estimated as Fwd/F*d for an input text d. We would have F*d = 19 for the text in Figure 1, which contains 19 noun tokens.

Now for the deciding part. Based on the probability estimates of L(c | d), we need to figure out which topic(s) should be assigned to the text. The text categorisation literature makes available several strategies for doing this (Lewis 1992). In the probabilistic thresholding scheme, a category (= potential topic) is assigned to a document d just in case L(c | d) > s, for some threshold constant s.³ In a k-per-doc strategy, a category is assigned to the documents with the top scores on that category. Another commonly used strategy is called proportional assignment: a category is assigned to its top-scoring documents in proportion to the number of times the category is assigned in the training corpus. In the experiments, we adopted the probabilistic thresholding scheme.⁴ Although it is perfectly all right to use k-per-doc here, the empirical truth is that the text categorisation fares better with probabilistic thresholding than with k-per-doc.
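The frequency-based estimates described above can be sketched on a toy corpus. This is an illustration only: the data and variable names are invented, and the TRM measure itself (defined earlier in the paper) is not reproduced here.

```python
# Sketch of the frequency-based estimates on toy data (not the paper's corpus).
# Each training story is (title_terms, body_nouns); headline occurrences of a
# word are excluded from the body counts, as described in the text.
from collections import Counter

train = [({"bank"}, ["bank", "merger", "bank"]),
         ({"merger"}, ["merger", "deal"]),
         ({"bank"}, ["loan", "bank"])]

D = len(train)
D_c = Counter()                       # number of stories with c in the title
F_wD = Counter()                      # noun-token counts over the whole corpus
F_wc = Counter()                      # noun-token counts within stories titled c
for titles, nouns in train:
    for c in titles:
        D_c[c] += 1
        for w in nouns:
            F_wc[(w, c)] += 1
    F_wD.update(nouns)
F_star_D = sum(F_wD.values())         # total noun tokens in the corpus
F_star_c = Counter({c: sum(n for (w, c2), n in F_wc.items() if c2 == c)
                    for c in D_c})    # total noun tokens in stories titled c

def P_c(c):              return D_c[c] / D            # P(c) = Dc / D
def P_T_w(w):            return F_wD[w] / F_star_D    # P(T=w) = FwD / F*D
def P_T_w_given_c(w, c): return F_wc[(w, c)] / F_star_c[c]  # Fwc / F*c

print(P_c("bank"))                    # 2/3
print(P_T_w("merger"))                # 2/7
print(P_T_w_given_c("bank", "bank"))  # 3/5
```

A thresholding decision then simply assigns a category c to d whenever the estimated score exceeds a constant s.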
4.2 Result and analysis
In what follows, we discuss some of the results of the performance of the topic recognition model. The model was tested on the nine test sets in Table 2. For each of the test sets, we experimented with two indexing languages, one with annotation and one without, to observe any effects annotation might have on the recognition task. The goal was to determine the terms most likely to indicate a topic of the article on the basis of estimates of L(c | d) for each indexing term in the article. Following Gale et al. (1992), we compare our model against a baseline model, which establishes lower bounds on the recognition task. We estimate the lower bound as the probability that a title term is chosen randomly from the document, i.e., P(c | d). The baseline represents a simple, straw-man approach to the task, which should be outperformed by any reasonable model: it embodies the idea that a more frequent word is a more likely candidate for topichood.

³ One of the important assumptions it makes is that the probability estimates are comparable across categories as well as across documents: that is, it assumes that it is possible to have an ordering, L(c1 | d1) > L(c1 | d2) > ··· > L(cn | dm), among the possible category/document pairs in the test corpus.
⁴ There is an obvious reason for not using the proportional assignment policy in our experiments. Since the set of categories (title terms) in the training corpus is open-ended and thus not part of the fixed vocabulary, it is difficult to imagine how the assignment ratio of a category in the training corpus is reflected on the test set.

Figure 3 shows the performance of the recognition model on plain and annotated indexing languages for a test corpus with stories less than 100 characters long (test set 1). The baseline performance is also shown for comparison. As it turns out, at the break-even point⁵, the model's performance is higher by 5% on the annotated language (54%) than on the plain language (49%). Either score is much higher than the baseline (19%).

⁵ A break-even point is defined to be the highest point at which precision and recall are equal. It is intended to be a summary figure for a recall-precision curve.

Table 3 summarises the results for all of the test sets, with R-(d) giving performance on the plain language and R+(d) performance on the annotated language:

    test set    length (in char.)    R-(d)    R+(d)    baseline
    1           < 100                49%      54%      19%
    2           100-200              42%      44%      33%
    3           200-300              35%      37%      30%
    4           300-400              31%      32%      32%
    5           400-500              31%      33%      35%
    6           500-600              30%      31%      35%
    7           600-700              28%      29%      37%
    8           700-800              25%      26%      34%
    9           800-900              26%      26%      35%

Table 3: Summary statistics

We see from the table that grammatical annotation does enhance the model's performance⁶. Note, however, that as the length of a story increases, the model's performance rapidly degrades, falling below the baseline at test set 5. This happens regardless of whether the model is equipped with the extra information. The reason appears to be that the benefits from annotating the text are cancelled out by the greater amount of irrelevancy or noise contained in a larger text. The increase in text length affects factors like S(d) and R(d), which we assumed to be equal. Recall that the former denotes a set of potential topics and the latter a set of indices or nouns extracted from the text. Thus the increase in text length causes both R(d) and S(d) to grow accordingly. Since the title length stays rather constant over the test corpus, the possibility that an actual topic is identified by chance would be higher for short texts than for lengthy ones. Indeed, we found that 13% of the index terms were actual for test set 1, while the rate went down to 3% for test set 9.

⁶ Figures in the table are micro-averaged, i.e., expected probabilities of recall/precision per categorisation decision (Lewis 1992).

One way to increase the model's resistance to noise would be to turn to the idea of mutual information (Hindle 1990), or to use only those terms which strongly predict a title term (Finch 1994). Or one may try the less sophisticated approach of reducing the number of category assignments to, say, the average length of the title.
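A break-even point over micro-averaged precision and recall, i.e., one figure per categorisation decision rather than per category, can be sketched as follows. The scores are invented for illustration; they are not the paper's data.

```python
# Sketch: micro-averaged precision/recall over categorisation decisions,
# swept over a threshold to locate the break-even point (precision = recall).

def micro_pr(decisions, threshold):
    """decisions: list of (score, is_actual_topic) pairs, one per
    category/document pair; a category is assigned iff score > threshold."""
    assigned = [(s, a) for s, a in decisions if s > threshold]
    tp = sum(1 for _, a in assigned if a)                 # correct assignments
    fn = sum(1 for s, a in decisions if a and s <= threshold)  # missed topics
    precision = tp / len(assigned) if assigned else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    return precision, recall

decisions = [(0.9, True), (0.8, True), (0.7, False), (0.6, True),
             (0.5, False), (0.4, False), (0.3, True)]

# The break-even point is the highest value of min(precision, recall)
# reached as the threshold varies over the observed scores.
thresholds = [s for s, _ in decisions]
best = max(min(micro_pr(decisions, t)) for t in thresholds)
print(round(best, 2))   # 0.75, reached at threshold 0.5 where p = r = 0.75
```

At threshold 0.5 the sketch assigns four category/document pairs, three of them correct, giving precision = recall = 0.75.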
5 Conclusion
In this paper, we have proposed a method for identifying topical words in Japanese text, based on probabilistic models of text categorisation (Fuhr 1989; Iwayama & Tokunaga 1994). The novelty of the present approach lies in the idea that the problem of identifying a discourse topic can be recast as that of classifying a text with terms occurring in that text. The results of experiments with the Japanese corpus showed that the model's performance is well above the baseline for texts less than 100 characters in length, though it degrades as the text length increases. Also shown in the paper was that annotating the corpus with extra information is worth the trouble, at least for short texts. Furthermore, the model applies to other less inflectional languages, in so far as it works on a word-based representation. The next step to take would be to supply the ranking model with information on the structure of discourse and develop it into a model of anaphora resolution (Hearst 1994; Nomoto & Nitta 1994; Fox 1987).
Acknowledgements. The author is indebted to Makoto Iwayama and Yoshiki Niwa for discussions and suggestions about the work.

REFERENCES

Finch, Steven. 1994. "Exploiting Sophisticated Representations for Document Retrieval". Proceedings of the 4th Conference on Applied Natural Language Processing, 65-71. Stuttgart, Germany: Institute for Computational Linguistics, University of Stuttgart.

Fox, Barbara A. 1987. Discourse Structure and Anaphora. (= Cambridge Studies in Linguistics, 48). Cambridge: Cambridge University Press.

Fuhr, Norbert. 1989. "Models for Retrieval with Probabilistic Indexing". Information Processing & Management 25:1.55-72.

Gale, William, Kenneth W. Church & David Yarowsky. 1992. "Estimating Upper and Lower Bounds on the Performance of Word-Sense Disambiguation Programs". Proceedings of the 30th Annual Meeting of the Association for Computational Linguistics (ACL'92), 249-256.

Grosz, Barbara & Candace Sidner. 1986. "Attention, Intentions and the Structure of Discourse". Computational Linguistics 12:3.175-204.

Hearst, Marti A. 1994. "Multi-Paragraph Segmentation of Expository Text". Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics (ACL'94), 9-16.

Hindle, Donald. 1990. "Noun Classification from Predicate-Argument Structures". Proceedings of the 28th Annual Meeting of the Association for Computational Linguistics, 268-275.

Hobbs, Jerry. 1978. "Resolving Pronoun References". Lingua 44.311-338.

Iwayama, Makoto & Takenobu Tokunaga. 1994. "A Probabilistic Model for Text Categorisation: Based on a Single Random Variable with Multiple Values". Proceedings of the 4th Conference on Applied Natural Language Processing, 162-167.

Joshi, Aravind K. & Scott Weinstein. 1981. "Control of Inference: Role of Some Aspects of Discourse Structure - Centering". Proceedings of the International Joint Conference on Artificial Intelligence, 385-387.

Lappin, Shalom & Herbert J. Leass. 1994. "An Algorithm for Pronominal Anaphora Resolution". Computational Linguistics 20:4.535-561.

Lewis, David D. 1992. "An Evaluation of Phrasal and Clustered Representations on a Text Categorisation Task". Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 37-50.

Matsumoto, Yuji, Sadao Kurohashi, Takehito Utsuro, Yutaka Taeki & Makoto Nagao. 1993. Japanese Morphological Analysis System JUMAN Manual. Kyoto, Japan: Kyoto University. [In Japanese.]

Nomoto, Tadashi & Yoshihiko Nitta. 1994. "A Grammatico-Statistical Approach to Discourse Partitioning". Proceedings of the 15th International Conference on Computational Linguistics (COLING-94), 1145-1149. Kyoto, Japan.

Sakuma, Kanae. 1983. Gendai Nihongohō-no Kenkyu [A Study on the Grammar of Modern Japanese]. Tokyo, Japan: Kuroshio-Shuppan.

Umino, Bin. 1988. "Shutsugen-Hindo-Jyouhou ni-motozuku Tango-Omomizuke-no-Genri [Some Principles of Weighting Methods Based on Word Frequencies for Automatic Indexing]". Library and Information Science 26.67-88.

Walker, Marilyn, Masayo Iida & Sharon Cote. 1994. "Japanese Discourse and the Process of Centering". Computational Linguistics 20:2.193-232.
Discourse Constraints on Theme Selection

WIEBKE RAMM
University of the Saarland

Abstract

In this paper we deal with the area of thematisation as a grammatical as well as a discourse phenomenon. We investigate how discourse parameters such as text type and subject matter can affect sentence-level theme selection as one of the grounding devices of language. Aspects of local and global thematisation are described in terms of a systemic-functionally oriented framework, and it is argued that correlations between text-level and sentence-level discourse features can be modelled as inter-stratal constraints in a stratificational text generation architecture.

1 Introduction
Our starting point is the observation that language is quite flexible regarding how a piece of information can be communicated; the same state of affairs can often be expressed by very different linguistic means, such as word order alternatives, different lexical material, or different grammatical constructions. In most cases these options are not arbitrarily interchangeable, however, since in addition to the transmission of propositional meaning, a linguistic utterance also aims to achieve certain pragmatic effects which can only be reached when the information is presented in an appropriate manner. To this end, language is provided with special grammatical and semantic devices guiding the foregrounding and backgrounding of particular parts of the information in a sentence (cf. Ramm et al. 1995:34f.):

- Focusing¹ is a textual means responsible for the information distribution in a clause. The focus, which is usually intonationally marked, is the locus of principal inferential effort within each message (cf. Lavid 1994a:24) and has a typical correlation with what is new (in contrast to what is given) in a sentence.
¹ The notions of focus as well as theme have found diverging interpretations in different linguistic and computational-linguistic schools (for a comparison cf. Lavid 1994a). The definitions we are working with here are mainly inspired by the theory of systemic-functional linguistics (SFL). We will outline some central concepts of this approach below.
- Thematisation (in its sentence-grammatical notion) guides the local contextualisation of a sentence by assigning particular thematic prominence to a part of the message, the theme. "The theme is the element which serves as the point of departure of the message; it is that with which the clause is concerned. The remainder of the message, the part in which the theme is developed, is called ... the rheme" (Halliday 1994:37).
- Ranking relates to how an element of a situation (e.g., an event or entity) is encoded grammatically, for instance, whether it is realised as a verbal construction, a nominalisation, a complement or a circumstance. The grammatical mechanisms of ranking closely interact with the textual means of focusing and thematisation.
- Taxis, with its basic options hypotaxis and parataxis, provides another type of grounding distinction rooted in grammar, this time in terms of a dependency structure holding between clauses.

How these linguistic devices are actually deployed in the realisation of a message in order to achieve a particular communicative goal depends on factors such as the (local) textual context in which it appears, but also on global parameters, such as the text type to which the whole discourse, of which the message forms a part, belongs, and the subject matter it is about. In this paper we will focus on the area of thematisation in German. In particular, we will investigate in which way aspects of global discourse organisation, namely text type and subject matter, may influence the selection of grammatical theme at sentence level. The types of correlations we are looking for can be relevant for different NLP applications where the local, sentence-level, as well as the global, text-level, organisation of discourse has to be accounted for.
Our application domain is text generation, where one of the notorious problems is the gap between global-level text planning (strategic generation) and lexico-grammatical expression (tactical generation), which has been termed the generation gap (cf. Meteer 1992). The output quality of many full-scale generators is compromised because the text planner cannot exercise sufficient control over the fine-grained distinctions available in the grammar. We argue that some of the problems can be accounted for by recognising the variety of linguistic resources involved as distinct modules or strata in a multi-stratal language architecture, and by representing characteristic correlations between selections on different strata as inter-stratal constraints.
2 Text type, subject matter and theme selection
Before having a look at the realisation of theme in some concrete text examples, we will start with a few more words on the conception of grammatical theme we are proceeding from, and on the options the German language provides according to our model. As mentioned at the beginning, our notion of theme is inspired by the theory of systemic-functional linguistics (SFL) (for an overview of the basic ideas of SFL cf. Halliday 1994; Matthiessen & Bateman 1991), according to which theme is a textual resource of the language system which, together with other cohesive and structural means such as reference, substitution, ellipsis, conjunction, lexical cohesion and focus, is responsible for the coherence of a text. Theme as "the resource for setting up the local context" (Matthiessen 1992:449) in which each clause is to be interpreted, the point of departure in Halliday's definition (see above), provides only one of the textually significant variation possibilities of word order; it closely interacts with other resources such as focus, transitivity, voice/diathesis, and mood. The theme is a function with a particular textual status (thematic prominence) in the clause and becomes the resource for manipulating the contextualisation of the clause. Theme in this systemic-functional meaning has originally been described with respect to English grammar; the account of theme in the German clause, some basic ideas of which we will briefly summarise now, is described in more detail in Steiner & Ramm (1995). For the realisation of theme in German, there is a clear rough correspondence with what is described as the 'Vorfeld' in other approaches (see e.g., Hoberg 1981), i.e., the theme is realised in the position before the finite verb. One of the typical features of a systemic-functional account of theme is the observation that the theme can be realised by metafunctionally different elements, i.e., it can be ideational, interpersonal or textual.
Metafunctional diversification is a central notion of systemic-functional theory that reflects the view of language as being functionally diversified into three generalised functions: the ideational, which is concerned with the propositional-content type of linguistic information; the interpersonal, which provides the speaker/writer with the resources for creating and maintaining social relations with the listener/reader; and the textual, which provides the resources for contextualising the other two types of information, i.e., presents ideational and interpersonal information as text in context (cf. Matthiessen & Bateman 1991:68).
250
WIEBKE RAMM
draws on circumstantial and participant roles of an event, e.g., Ich werde ge hen. (I will go.) In grammatical terms, this is a subject-theme. An example of contextualisation by interpersonal means is thematisation of an interac tion marker, such as a modal circumstantial role, e.g., Vielleicht werde ich gehen. (Possibly I will go.) On the grammatical level, the theme is filled by a modal adjunct. Contextualisation by textual means operates on the resource of logico-semantic relations, expressed grammatically by conjunc tions or conjunctive adjuncts, e.g., Daher werde ich gehen. (Therefore I will go.) Theme variation in German comprises two further dimensions, namely simple vs. multiple, and unmarked vs. marked theme. The former distinguishes themes realised by a single semantic function from those filled by more than one, the latter relates to whether a certain theme choice leads to marked intonation which closely relates to the area of focus. We will now investigate how these options surface in 'real-life' texts of different text types. The two texts we are going to have a look at are taken from a more representative corpus of short texts covering text types ranging from narrative, descriptive and expository to argumentative and instructive texts. The texts which have been selected in correspondence with a parallel corpus of English texts (cf. Lavid 1994b) have been analysed according to a number of parameters such as discourse purpose, subject matter, global chaining strategy, and focus category (cf. Villiger 1995). The first sample text — a section from a travel guide — is of the descriptive type. Text 1: "Sevilla" (from: T. Schröder: Andalusion. M. Müller Verlag, Erlangen, 1993, pp.332-333.) 2 (01) Sevillas Zentrum liegt östlich eines Seitenkanals des Rio Guadalquivir, der die Stadt etwa in Nord-Süd-Richtung durchzieht. (The Centre of Seville is situated east of a side canal of the Rio Guadalquivir which runs through the city roughly from north to south.) 
(02) Hauptstraße ist die Avenida de la Constitucion; (The main street is the Avenida de la Constitucion;)
(03) in ihrer unmittelbaren Umgebung liegen mit Kathedrale und Giralda
sowie der Alcazaba die bedeutendsten Sehenswürdigkeiten der Stadt. (in its surroundings,
immediate
the most important sights of the city, the cathedral, the Giralda, and the
Alcazaba, are situated.)
(04) Östlich schließt sich das Barrio de Santa Cruz an, Sevil
las lauschiges Vorzeigeviertel. (In the east, the Barrio de Santa Cruz, Seville's
secluded
showpiece quarter, borders on the city.) (05) Die Avenida de la Constitucion beginnt im Süden am Verkehrsknotenpunkt Puerta de Jerez und mündet im Norden in den Dop2
English glosses of the German text passages are given in italics; the sentence theme of each clause is underlined. If English theme is roughly equivalent in type and meaning, we have also underlined the themes in the English version.
DISCOURSE CONSTRAINTS ON THEME SELECTION
251
pelplatz Plaza San Francisco/Plaza Nueva; (The Avenida de la Constitucion begins in the south at the Puerta de Jerez junction and leads into the double square Plaza San Francis co/Plaza Nueva in the north.) (06) Hier liegt auch das Geschäftsviertel um die Haupteinkaufsstraße Calle Sierpes. (Here also the shopping centre around the main shop ping street, Calle Sierpes, is situated.) (07) Südlich des engeren Zentrums erstrecken sich der Park Parque de Maria Luisa und das Weltausstellungsgelände von 1929, die Plaza de Espana. (South of the immediate centre the park Parque de Maria Luisa and the site of the world fair 1929, the Plaza de Espana, are located.) (08) Jenseits des Gualdalquivir sind zwei ehemals selbständige Siedlungen zu abendlichen und nächtlichen Anlaufad ressen avanciert: das volkstümliche Barrio de Triana auf Höhe des Zentrums und, südlich anschließend, das neuzeitlichere Barrio de los Remedios auf Höhe des Parque de Maria Luisa. (Beyond the Guadalquivir two formerly independent settlements have developed into places to go to in the evenings and at night: the traditional Barrio de Triana, which is on a level with the centre and, bordering on this area in the south, the more modern Barrio de los Remedios, which is on a level with the Parque de Maña Luisa.) The sentence themes 3 in this text constantly are ideational elements real ised as subject theme ((01) and (05)), subject complement (02), or circum stantials ((03), (04), (06), (07), and (08)). In terms of semantic categories, these themes are participants ((01), (05) and (02)), or circumstances (time & place) ((03), (04), (06), (07), and (08)). Before analysing the text in more detail, consider the thematic choice in another example. The second text is argumentative, a satirical article published in the commentary part of a German newspaper: Text 2: "Nostalgiekarte Jahrgang 1992" (Nostalgia map of the year 1992) (From: Saarbrücker Zeitung, December 14./15. 
1991, p.5) (01) So war die politische Geographie einmal zu fernen Zeiten. (This is how the political geography used to be a long time ago.) (02) Deutschland noch nicht vereint, (Germany not yet united,) (03) der Saar-Lor-Lux-Raum ein weißer Fleck auf der Landkarte. (the Saar— Lor-Lux region a blank area on the map.) (04) Zu fernen Zeiten? (A long time ago?) (05) Mitnichten!!! (Far from it!!!) (06) Die oben abgebildete Deutschlandkarte fin det sich im neuen Taschen-Terminkalender 1992 der Sparkasse Saarbrücken. (The map of Germany shown above is published in the new pocket diary 1992 of the savings bank of Saarbrücken.) (07) Dort hat man noch nicht mitbekommen, (There no-one has yet 3
We have not applied our theme analysis to dependent clauses since in most cases, theme in dependent clauses is more or less grammaticalised (typically. realised by elements such as conjunctions or wh-elements), i.e., there is no real choice regarding what can appear in theme position. For a few further remarks on theme in dependent clauses see Steiner & Ramm (1995:75ff).
252
WIEBKE RAMM
noticed (08) daß Deutschland um die Kleinigkeit von fünf neuen Bundesländern größer geworden ist. (that Germany has grown by the trifling amount of five new
Bundesländer.)
(09) Zudem gibt es jenseits der alten DDR-Grenze noch andere Städte als Leipzig und Berlin, so zum Beispiel Rostock, Dresden, Magdeburg oder Saarbrückens Partnerstadt Cottbus. (Moreover, there are still other cities beyond the former frontier to the GDR apart from Leipzig and Berlin, for example Rostock, Dresden, Magdeburg, or Cottbus, the twin city of Saarbrücken.)
(10) Außerdem scheint den Herren von der Sparkasse entgan
gen zu sein, (Besides, it seems that the gentlemen of the savings bank didn't realise) (11) daß am Ende des Terminello-Jahres 1992 der europäische Binnenmarkt steht. (that at the end of the Terminello-Year
1992 the Single European Market will come into force.)
(12) Nicht zuletzt vermittelt das Kärtchen den Eindruck, (Last but not least the little map suggests,) (13) daß Saarbrücken der Nabel (Alt-)Deutschlands zu sein scheint. (that Saarbrücken was the navel of (the former) Germany.)
(14) Je nach anatomischer Sichtweis,
kann es aber auch ein anderes Körperteil sein, (depending on the anatomical point of view, however, it might also refer to another part of the body.)
Here we have a clear priority of ideational themes in the first part of the text (propositions (02), (03), (06) and (07)), whereas the rest of the text is dominated by textual themes, as in (09), (10) and (12). The question is now, how theme selection in text is motivated and whether the differences between the two texts are typical for the respective text types they belong to. As Fries (1983) shows for English texts, different kinds of theme selection patterns correlate both with different text types or genres and are closely related to the subject matter or domain of the text. In particular, there is a close relation between thematic content, i.e., the semantic content of the themes of a text segment, and the method of development of a text which comprises general organisations such as spatial, temporal, general to specific, object to attribute, object to parts or compare and contrast. As also Danes (1974:113) points out, theme plays a decisive constructional role for building up the structure of a text. Note that the method of development is not the same as Danes' thematic progression: the former relates to the semantic content of the grammatical themes and the relations holding between the themes, whereas the latter refers to possible types of patterns built between themes and rhemes of a text. Turning back to our sample texts, the most characteristic feature of Text 1 is its reflection of the spatial organisation of the underlying domain. This is a typical property of many descriptive texts and in this case leads to the incremental creation of a cognitive map of the domain, 'centre of Seville'. The centrality of the domain structure for the construction of the mean ing of the text is mirrored in its linguistic appearance, also with respect to
thematic choice: all sentence themes in this text refer to spatial conceptualisations which are inherently ideational, with a clear difference regarding linguistic realisation between object concepts (realised semantically as participant themes, as in (01), (02) and (05)) and (spatial) relational concepts (realised as circumstance themes, as in (03), (04), (06), (07) and (08)). As a result, the sequence of concepts verbalised as themes allows the reader of the text to navigate through a cognitive map of the domain by keeping to a strict spatial method of development. In each of the clauses, the rhematic part (which includes the focus) elaborates on the specific spatial concept introduced as theme, i.e., adds certain attributes in the form of other spatial concepts in order to build up a spatial representation of the domain. What can be observed here is a typical 'division of labour' between theme and rheme, namely that the themes play the decisive constructional role by introducing new domain concepts, whereas the foci, contained in the rhemes, add new pieces of information.

In terms of subject matter, the second text basically deals with spatial information too, but here the domain is not the main factor responsible for the structuring of the text. In this case, the underlying main discourse purpose is not to inform the reader about some state of affairs, as in the descriptive example, but rather to argue in favour of some opinion taken by the author. This is clearly reflected in the linguistic structure of the text: propositions (01)-(06) represent the contra-argumentation, in the sense of providing the facts/arguments against which the author is going to argue. The task of this discourse segment is to present the background information on which the subsequent pro-argumentation ((07)-(14)), in which the author develops her/his opinion, is based. The different communicative functions of these two stages of the macro structure of the text are also reflected in the means deployed for local contextualisation (i.e., thematisation): ideational elements referring to relevant concepts of the domain predominate in the contra-argumentation, whereas textual elements are chosen to guide the local contextualisation in the pro-argumentation. In this second segment of the text, a sequence of conjunctive themes functions as the indicator of an (additive) argumentative chain formed by the rhematic parts of the respective sentences: 'zudem' (09), 'außerdem' (10), 'nicht zuletzt' (12). To sum up our text analyses, in the descriptive text we have found a clear, text-type-specific correlation between the structure of the domain and the method of development of the text (realised by ideational themes). The argumentative text, in contrast, exhibited two characteristic thematisation
strategies, one constructing the state of affairs under discussion and one supporting the chain of argumentation. So, what these sample analyses show is not only that text type and subject matter constrain theme options, but that the theme pattern is also sensitive to the individual stages of the macro structure (or generic structure) of a text.
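The theme-pattern contrast summarised above can be recorded as a simple clause-level annotation and tallied. This is a sketch of such an annotation; only the clauses explicitly labelled in the analyses are included.

```python
# Sketch: tallying theme types per sample text, following the clause-by-clause
# analyses above (Text 1: all themes ideational; Text 2: ideational themes in
# the contra-argumentation, textual themes in the pro-argumentation).
from collections import Counter

themes = {
    "Text 1": {f"(0{i})": "ideational" for i in range(1, 9)},
    "Text 2": {"(02)": "ideational", "(03)": "ideational",
               "(06)": "ideational", "(07)": "ideational",
               "(09)": "textual", "(10)": "textual", "(12)": "textual"},
}

for text, clauses in themes.items():
    print(text, dict(Counter(clauses.values())))
# Text 1 {'ideational': 8}
# Text 2 {'ideational': 4, 'textual': 3}
```

Even this crude tally makes the genre difference visible: the descriptive text is uniformly ideational in its themes, while the argumentative one mixes ideational and textual themes by stage.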
3 Theme selection as inter-stratal constraints
How can such types of correlations between discourse features and sentence-level realisation be accounted for? Correlations between the discourse characteristics of a text and lexico-grammatical features such as the ones illustrated in the previous section can be straightforwardly employed for generation in an architecture that recognises the information types of text type and subject matter as necessary constraints on the well-formedness of a text. One such architecture is implemented in the systemic-functionally oriented KOMET-PENMAN text generation system (cf. Teich & Bateman 1994; Bateman & Teich 1995), a German spin-off of the English PENMAN system (cf. Mann 1983). The system architecture reflects the stratificational organisation of the language system presupposed by systemic-functional theory, according to which a linguistic utterance is the result of a complex choice process which recursively selects among options provided by interconnected networks of semantic, grammatical and lexical choice systems associated with different levels of abstraction, strata, such as lexico-grammar, (sentence-)semantics, register and genre (cf. again Matthiessen & Bateman 1991 for an overview). Features of the text type are represented at the most abstract strata of genre and register (encoding the contexts of culture and situation). The typical structural configuration of the texts of a genre, i.e., their typical (global) syntagmatic organisation, is accounted for by representing their so-called generic structure potential (GSP) (cf. Hasan 1984). A GSP consists of those stages that must occur in the development of a text in order to classify it as belonging to that specific genre. These stages roughly correlate with what is called 'macrostructures' in other approaches (cf. van Dijk 1980). Linguistic resources at all strata are represented as system networks which constitute multiple inheritance hierarchies consisting of various linguistic types.
Proceeding from such an architecture, the correlation between text type and theme selection can be conceived of as a set of inter-stratal constraints between the global-level textual resource and the lexico-grammatical resource which is mediated via a semantic stratum of a local-level textual resource that abstracts from the purely grammatical distinctions provided by the grammar. The representation of such inter-stratal constraints follows the lines presented in Teich & Bateman (1994): At the level of genre, a typology of texts is modelled as a system network (based on Martin 1992:560ff.) which covers various descriptive, expository, narrative and argumentative types of texts. Typical GSP structures are associated with individual genres providing the guideline for syntagmatic realisation in the form of global discourse structures. Moreover, depending on the specific communicative functions pursued, either whole texts or single GSP stages are characterised by three metafunctionally distinct register parameters, namely field (referring to ideational properties, for instance, of the subject matter), tenor (describing the interpersonal relations among the participants in the discourse) and mode (the textual dimension, characterising the medium or channel of the language activity). Choices at the stratum of register have characteristic consequences on the lexico-grammatical level, i.e., lead to selections on the lower-level resources of the language system realising the higher ones by appropriate lexical and grammatical means.
Fig. 1: Theme selection as interstratal constraint

This architecture also gives room for modelling aspects of discourse constraints on thematisation such as those addressed in this paper: properties of the domain or subject matter can be accounted for by the choice
of appropriate field options at register level which are reflected at the ideational-semantic stratum as specific conceptual configurations (the domain model) with clear mappings defined for lexical and grammatical instantiation (covered by the ideational-semantic resource, the 'upper model', cf. Bateman et al. 1990). Global thematisation strategies have to be addressed at register level as well and are paradigmatically reflected on the individual GSP stages for which a certain method of development holds. The choice of a certain method of development for (a stage of) a text constrains the options at the textual-semantic⁴ and textual-grammatical level. For a (simplified) illustration of how this might work, for instance, with respect to a descriptive text with spatial method of development — say, a travel guide (or a section from it) — see Figure 1. Two kinds of operations support the control of thematisation: The realisation operation of preselection takes as arguments a function inserted at a higher stratum (e.g., a stage inserted in the discourse structure) and a feature of a system at a lower stratum (e.g., a feature of the SEMANTIC-THEME system (cf. Ramm et al. 1995)). In the figure, inter-stratal preselection is marked by the arrow between (1) and (2). The chooser/inquiry interface (Mann 1983) is used to interface lexico-grammar and semantics (denoted in Figure 1 by the arrow between (2) and (3)). Each system at the lexico-grammatical stratum is equipped with a number of inquiries that are organised in a decision tree (a chooser). The inquiries are implemented to access information from the higher adjacent stratum (here: the local-level textual resource).
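The interplay of preselection and a chooser can be caricatured in a few lines of Python. This is an illustrative sketch only: the stage, system and feature labels below are invented for the purpose of the example and are not taken from the actual generation system discussed here.

```python
# Toy rendering of inter-stratal preselection and a chooser (a small
# decision tree of inquiries). All labels are hypothetical.

# (1) -> (2): a register-level stage with a spatial method of development
# preselects a feature of the semantic-theme system at the stratum below.
PRESELECTIONS = {
    ("describe-location", "spatial"): "spatial-setting-theme",
}

# (2) -> (3): a chooser for a theme-type system inquires into the semantic
# theme and decides between a circumstance and a participant theme.
def theme_type_chooser(semantic_theme):
    if semantic_theme == "spatial-setting-theme":
        return "circumstance-theme"   # typically realised as a PP
    return "participant-theme"        # typically realised as an NP

feature = PRESELECTIONS[("describe-location", "spatial")]
print(theme_type_chooser(feature))  # -> circumstance-theme
```

In a real system the chooser would pose several chained inquiries rather than a single test, but the control flow is the same: higher strata constrain, lower strata realise.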
The inquiries of the chooser of the lexico-grammatical system THEME-TYPE, e.g., must be provided with information about semantic theme selection in order to decide whether to generate a circumstance (for instance, as a prepositional phrase) or a participant theme (e.g., as a nominal phrase).

4 Conclusions
What we have tried to illustrate in this paper is how discourse parameters such as text-type and subject-matter can affect thematisation as one of the grounding devices of language. We have described aspects of local and global thematisations in terms of a systemic-functionally oriented framework that also underlies an implementation in a text generation system. We have suggested to model correlations between text-level and sentence-level discourse features as interstratal constraints holding between different levels of the language system. The approach as it is now is certainly still limited, since the mechanisms currently deployed are quite strict and inflexible; they should be enhanced, for instance, by better micro-planning. However, although we could only very roughly sketch our ideas here, we feel that they could provide a step towards closing the generation gap between global and local text planning.

⁴ Due to lack of space, we cannot go into details regarding this stratum here. For its motivation and description, see Erich Steiner's contribution in Ramm et al. 1995:36ff.

Acknowledgements. Most of the research described in this paper was done in the context of the Esprit Basic Research Project 6665 DANDELION. I am grateful to Elke Teich for her extensive feedback and support both with previous versions of this paper and with the implementation. I would also like to thank Claudia Villiger for providing the text corpus and the analyses on which this work is grounded. Last but not least, thanks are due to Erich Steiner for helping with the English — with full responsibility for still existing weaknesses remaining with the author, of course.

REFERENCES

Bateman, John A., R. T. Kasper, J. D. Moore & R. A. Whitney. 1990. "A General Organization of Knowledge for Natural Language Processing: the PENMAN Upper Model". Technical Report (ISI/RS-90-192). Marina del Rey, Calif.: Information Sciences Institute, Univ. of Southern California.

Bateman, John A. & E. Teich. 1995. "Selective Information Presentation in an Integrated Publication System: an Application of Genre-Driven Text Generation". Information Processing and Management 31:5. 753-767.

Daneš, František. 1974. "Functional Sentence Perspective and the Organization of the Text". Papers on Functional Sentence Perspective ed. by F. Daneš, 106-128. Prague: Academia.

Fries, Peter H. 1983. "On the Status of Theme in English: Arguments from Discourse". Micro and Macro Connexity of Discourse ed. by J. S. Petöfi & E. Sözer (Papiere zur Textlinguistik 45), 116-152. Hamburg: Buske.

Halliday, Michael A. K. 1994.
An Introduction to Functional Grammar. 2nd edition. London: Edward Arnold.

Hasan, Ruqaiya. 1984. "The Nursery Tale as a Genre". Nottingham Linguistic Circular 13. 71-192.

Hoberg, Ursula. 1981. Die Wortstellung in der geschriebenen deutschen Gegenwartssprache. München: Hueber.

Lavid, Julia. 1994a. "Thematic Development in Texts". Deliverable R1.2.1, ESPRIT Project 6665 DANDELION. Madrid: Universidad Complutense de Madrid.
Lavid, Julia. 1994b. "Theme, Discourse Topic, and Information Structuring". Deliverable R1.2.2b, ESPRIT Project 6665 DANDELION. Madrid: Universidad Complutense de Madrid.
Mann, William C. 1983. "An Overview of the PENMAN Text Generation System". Proceedings of the National Conference on Artificial Intelligence (83), 261-265.

Martin, James R. 1992. English Text: System and Structure. Amsterdam & Philadelphia: John Benjamins.

Matthiessen, Christian M. I. M. 1988. "Semantics for a Systemic Grammar: the Chooser and Inquiry Framework". Linguistics in a Systemic Perspective ed. by J. D. Benson, M. Cummings & W. S. Greaves. Amsterdam & Philadelphia: John Benjamins.

Matthiessen, Christian M. I. M. & J. A. Bateman. 1991. Text Generation and Systemic-Functional Linguistics: Experiences from English and Japanese. London: Frances Pinter.

Matthiessen, Christian M. I. M. Forthcoming. Lexicogrammatical Cartography: English Systems. Technical Report, Dept. of Linguistics. Sydney: University of Sydney.

Meteer, Marie W. 1992. Expressibility and the Problem of Efficient Text Planning. London: Pinter.

Ramm, Wiebke, A. Rothkegel, E. Steiner & C. Villiger. 1995. "Discourse Grammar for German". Deliverable R2.3.2, ESPRIT Project 6665 DANDELION. Saarbrücken: University of the Saarland.

Steiner, Erich & W. Ramm. 1995. "On Theme as a Grammatical Notion for German". Functions of Language 2:1. 57-93.

Teich, Elke & J. A. Bateman. 1994. "Towards the Application of Text Generation in an Integrated Publication System". Proceedings of the 7th International Workshop on Natural Language Generation, 153-162. Kennebunkport, Maine.

van Dijk, Teun A. 1980. Macrostructures: An Interdisciplinary Study of Global Structures in Discourse, Interaction and Cognition. Hillsdale, New Jersey: Erlbaum.

Villiger, Claudia. 1995. "Theme, Discourse Topic, and Information Structuring in German Texts". Deliverable R1.2.2c, ESPRIT Project 6665 DANDELION. Saarbrücken: University of the Saarland.
Discerning Relevant Information in Discourses Using TFA

GEERT-JAN M. KRUIJFF¹ & JAN SCHAAKE
University of Twente

Abstract
When taking the stance that discourses are intended to convey information, it becomes important to recognise the relevant information when processing a discourse. A way to analyse a discourse with regard to the information expressed in it is to observe the Topic-Focus Articulation. In order to distinguish relevant information in particularly a turn in a dialogue, we attempt to establish the way in which the topics and foci of that turn are structured into a story line. In this paper we shall come to specifying the way in which the information structure of a turn can be recognised, and what relevant information means in this context.

1 Introduction
Discourses, whether written or spoken, are intended to convey information. Obviously, it is important to the processing of discourses that one is able to recognise the information that is relevant. The need for a criterion for relevance of information arises out of the idea of developing a tool assisting in the extraction of definitions from philosophical discourses (PAPER/HCRAES projects). A way to analyse a discourse with regard to the information expressed in it is to observe the Topic-Focus Articulation. A topic of (part of) a discourse can be conceived of as already available information, to which more information is added by means of one or more foci. Several topics and foci of a discourse are organised in certain structures, characterised by a thematical progression ('story-line'). The theories about TFA and thematic progression have been developed by the Prague School of Linguistics. Particularised to our purposes, in order to discern the relevant information in a discourse, we try to establish the thematic progression(s) in a turn of a dialogue. It will turn out that it is important, not only how topics and foci relate to each other with regard to the thematic progression (sequentially, parallelly, etc.), but also how the topics and foci are related rhetorically (e.g. by negation). In this paper we shall come to specifying the way in which
¹ Currently at the Dept. of Mathematics and Physics, Charles University, Prague.
the information structure of a turn can be recognised, and what relevant information means in this context. In order to develop and to test these definitions we regarded it necessary to choose a domain of small texts where discerning relevant information is also needed. This domain we found in the SCHISMA project. The SCHISMA project is devoted to the development of a theatre information and booking system. One of the problems to be met in analysing dialogues is to discern what exactly is or are the point(s) made in a turn of the client. As we will see below, in one turn a client may make just one relevant remark, the rest being noise or background information that is not relevant to the system. It may also be the case that two or more relevant points are made in just one turn. These points have to be discerned as being both relevant. Throughout the paper examples of the occurrence of relevant information in a turn will be given. In sections 2 and 3, Thematic Progression and Rhetorical Structure Theory will be applied to dialogues taken from the SCHISMA corpus. In section 4, relevant information will be related to what will be called generic tasks: tasks that perform a small function centred around the goal of acquiring a specific piece of information (Chandrasekaran 1986). Conclusions will be drawn in the final section.

2 The communication of information
Surely, it might almost sound like a commonplace that a dialogue conveys, or communicates, information². But what can we say about the exact features of such communication? If we want a logical theory of information to be of any use, we should elucidate how we arrive at the information we express in information states (Van der Hoeven et al. 1994). Such elucidation is the issue of the current section. The assumption we make about the dialogues to be considered is that they are coherent. Rather than being a set of utterances bearing no relation to each other, a dialogue, by the assumption, should have a 'story line'. For example, the utterances can therein be related by referring to a common topic, or by elaborating a little further upon a topic that was previously introduced. More formally, we shall consider utterances to be constituted of a Topic and Focus pair. The Topic of an utterance stands for given information, while the Focus of an utterance stands for new information.
² Supposing that the dialogue is meant to be purposeful, of course. Otherwise, it is called 'parasitic' with respect to communicative dialogues (cf. Habermas).
The theory of the articulation of Topic and Focus (TFA) has been developed by members of the Modern Prague School, notably by Hajičová (Hajičová 1993; Hajičová 1994). Consequently, the 'story line' of a dialogue becomes describable in terms of relations between Topics and Foci. The communication of information thus is describable in terms of how given information is used and new information is provided. The relations between Topics and Foci may be conceived of in two ways, basically: thematically, and rhetorically. The thematical way concerns basically the coreferential aspect, while the rhetorical way concerns the functional relationship between portions of a discourse. Let us therefore have a closer look at each of these ways, and how they are related to each other. First, the relations between Topics and Foci can be examined at the level of individual utterances. In that case we shall speak of thematic relations, elucidating the thematic progression. Thematic progression is a term introduced in (Daneš 1979) as a means to analyse the thematic build-up of texts. We shall use it here in the analysis of the manner in which given and new information are bound to each other by utterances in a dialogue. According to Daneš, there are three possibilities in which Topics and Foci are bindable, which are described as follows:

1. Sequential progression: The Focus of utterance m, Fm, is constitutive for the Topic of a (the) next utterance n, Tn. Diagrammatically:
2. Parallel progression: The Topic of utterance m, Tm, bears much similarity to the Topic of a (the) next utterance n, Tn. Diagrammatically:
3. Hypertheme progression: The Topic of utterance m, Tm, as well as the Topic of utterance n, Tn, refer to an overall Topic called the Hypertheme, TH. Utterances m and n are said to be related hyperthematically. Diagrammatically:
The following sentences are examples of these different kinds of progression:

(1) The brand of GJ's car is Trabant. The Trabant has a two-stroke engine.
(2) Trabis are famous for their funny motor-sound. Trabis are also well-known for the blue clouds they puff.
(3) Being a car for the whole family, the Trabant has several interesting features. One feature is that about every person can repair it. Another feature is that a child's finger-paint can easily enhance the permanent outlook of the car.

It might be tempting to try to determine the kind of thematic progression between utterances by merely looking at the predicates and entities involved. In other words, directly in terms of information states. Especially sentences like (1) and (2) tend to underline such a standpoint. However, consider the following revision of (1), named (1'):

(1') GJ has a Trabant. The motor is a cute two-stroke engine.

Similar to (1) we would like to regard (1') as a sequential progression. Yet, if we would consider only predicates and entities, we would not be able to arrive at that preferred interpretation. It is for that reason that we propose to determine the kind of thematic progression obtaining between two utterances as follows. Instead of discerning whether the predicates and entities of a Topic Tm or a Focus Fm are the same as those of a Topic Tn, we want to establish whether Fm or Tm and Tn are coreferring. We take coreference to mean that two expressions, E1 and E2, a) are referring to the same concept, or b) are referring to a conceptual structure, where E1 is referring to a concept CE1 which is the parent of a concept CE2, to which E2 is referring. Hence, the following relations hold³:

1. Fm and Tn are coreferring → sequential progression
2. Tm and Tn are coreferring → parallel progression
3. TH, Tm and Tn are coreferring → hypertheme progression

By identifying a coreference obtaining between a focus or topic and a subsequent topic, we conclude that such a pair has the same intensional content — they are about the same concept. Under the assumption that a concept is only instantiated once in a turn, we could even conclude further here that
³ The presented ideas about thematic progression and coreference result from discussions between Geert-Jan Kruijff and Ivana Korbayová.
the focus or topic and subsequent topic have the same referential content — they refer to the same instantiation of the concept at hand. Clearly, if we would lift the assumption of single instantiation, it would be necessary to establish whether the instantiations of the concept employed in the expressions are identical.

3 Rhetorical structure of turns
For our purposes we establish the thematic progression between a number of utterances making up a single turn in a dialogue. As we already noted, utterances can also be related rhetorically besides thematically. Whereas the thematic progression shows us how information is being communicated by individual utterances, the rhetorical structure elucidates how parts of the communicated information function in relation to other parts of information communicated within the same turn. In other words, the rhetorical structure considers the function of the information communicated by clusters of one or more utterances of a single turn. Such clusters will be called segments hereafter. When performing an analysis in order to explicate the rhetorical structure, we make use of Mann and Thompson's Rhetorical Structure Theory (RST) as laid down in Mann & Thompson (1987). Basically, RST enables us to structure a turn into separate segments that are functionally related to each other by means of so-called rhetorical relations. What is important is that rhetorical relations hold between segments, and that each segment in a rhetorical relation has an import relative to the other segment(s). Basically, two kinds can therein be distinguished: a nucleus N, and a satellite S. The distinction between them can be pointed out as follows. A nucleus is defined as a segment that serves as the focus of attention. A satellite is a segment that gains its significance through a nucleus. The concept of nuclearity is important to us: We would still have a coherent dialogue if we would consider the nuclei only. In our understanding, nuclearity is thus an expressive source that directs the response to a turn of a dialogue. Examples of such rhetorical relations are:

(4) Segment S is evidence for segment N:
(N) The engine of my car works really well nowadays.
(S) It started yesterday within one minute.
(5) Segment S provides background for segment N:
(S) I spend a significant part of the year in Prague.
(N) Nowadays, I am the proud owner of a Trabant.
(6)
Segment S is a justification for segment N:
(S) When parking a little carelessly, I broke one of the rear lights.
(N) I should buy a new rear light.

A study of a corpus of dialogues we have gathered reveals that within our domain the following rhetorical relations are of importance:
1. Solutionhood: S provides the solution for N; "Yes, but grandma is a little cripple, so, well, then we'll go with the two of us."
2. Background: S provides background for N; "I would like to go to an opera. Is there one on Saturday?"
3. Conditional: S is a condition for N; "If the first row is right opposite to the stage, then the first row, please."
4. Elaboration: S elaborates on N; "I would like to go to Wittgenstein, because he was really entertaining last time."
5. Restatement: S restates or summarises N; "So I have made a reservation for ..."
6. Contrast: Several N's are contrasted; "I would like to, but my friend does not. So, then we'd better not go to an opera; can we go to another performance?"
7. Joint: Several N's are joined; "How expensive would that be, and are there still vacant seats?"

In case of rhetorical relations 1 through 3 the S is uttered after N, while in case of relations 4 through 5 S is uttered before N. Relations 6 and 7 are constituted by multiple nuclei. These orders are called canonical orders. Revisiting the thematic and rhetorical structure of a turn in a dialogue, we observe the following. The established thematic progression elucidates the actual flow of communicated information. Therein, we can observe which utterances convey what information. The rhetorical structure clarifies how information expressed by nuclei and satellites is functionally related. Clearly, the question that might be raised subsequently is: How does the segmentation of a turn into nuclei and satellites arise from the thematic progression?
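The relation inventory and its canonical orders can be encoded directly as a lookup table. The following Python sketch is ours, not part of the system described here; it simply transcribes the list above:

```python
# Rhetorical relations of the theatre-booking domain with their canonical
# orders: "S-after-N" = satellite follows nucleus, "S-before-N" = satellite
# precedes nucleus; multinuclear relations have nuclei only.
CANONICAL_ORDER = {
    "solutionhood": "S-after-N",
    "background":   "S-after-N",
    "conditional":  "S-after-N",
    "elaboration":  "S-before-N",
    "restatement":  "S-before-N",
    "contrast":     "multinuclear",
    "joint":        "multinuclear",
}

def satellite_position(relation):
    """Where the satellite sits relative to the nucleus, per canonical order."""
    return CANONICAL_ORDER[relation]

print(satellite_position("background"))  # -> S-after-N
```

Once a relation has been recognised, this table is what lets the analysis read off which segment is the nucleus and which the satellite.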
To answer the question, we should realise that we are actually dealing with three smaller problems: First, how does a thematic progression segment a turn? A thematic progression divides a turn into discernible segments according to the flow of information. Intuitively, one might say that every time a new flow of information is commenced, a new segment
is introduced. As we shall see in the example provided below, this means in general that when a parallel progression or hypertheme progression is invoked, a new segment starts. Second, how do we recognise the rhetorical relations involved? Mann and Thompson describe how rhetorical relations can be recognised by means of conditions (or constraints) that should hold for the textual structure. We conjecture that, in terms of our approach, rhetorical relations can be recognised by taking the thematic progression and the formed conceptual structure into account. Rephrased, rhetorical relations are conditioned by the thematic progression and the conceptual structure involved. Once the rhetorical relation has been recognised, the third problem of recognising nuclei and satellites is also solved (as Mann and Thompson state), for their characterisations follow inter alia from the canonical order of each rhetorical relation.

4 An example
Here, we provide an example analysis of a turn into thematic progression and ensuing rhetorical structure. As will become obvious from the example, recognising the thematic progression as well as the rhetorical structure enables us to observe which parts of a turn are to be considered as relevant. The issue of discerning relevance will be elaborated upon in the next section.

(7) For Wittgenstein tonight it is, yes. For four persons is fine. But the other one doesn't know. And because it is his birthday we would like to have our picture taken. Can you ask that too? Oh yes, and my husband would like to join us for dinner if that would be possible. No foreign stuff. So that is for three. Are you also in charge of the food?
Assuming that we have decent means to analyse the dialogue linguistically, let us commence with discerning the thematic progression. The schema displays sequential progressions (∥seq) and parallel progressions (∥par) — see Figure 1. T3 and T3 refer hyperthematically to F3, being "(members of) the group that is going to the performance", but we shall not consider such in the case at hand. More interesting to observe is that the thematic progression quite naturally segments the turn of the dialogue, as we conjectured. Let us call the three segments ST1, ST3 and ST6, the subscript denoting the Topic that initiates the segment. Subsequently, the segments can be said — quite uncontroversially, hopefully — to be rhetorically related as shown in Figure 2.
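The coreference-based classification given earlier can be sketched in a few lines of Python. This is our own toy rendition under stated assumptions: utterances are (topic, focus) pairs of concept identifiers, the concept names and the parent table are invented, and the equality-plus-parent test stands in for a real coreference check over a conceptual hierarchy.

```python
# Toy classifier for the thematic progression between two utterances,
# following the rules: coreferring Fm/Tn -> sequential, Tm/Tn -> parallel,
# TH/Tm/Tn -> hypertheme. All concept ids are illustrative.
CONCEPT_PARENT = {
    "feature-repairable": "trabant-features",
    "feature-paint":      "trabant-features",
}

def corefer(e1, e2):
    """Same concept, or e1 is the parent of the concept e2 refers to."""
    if e1 is None or e2 is None:
        return False
    return e1 == e2 or CONCEPT_PARENT.get(e2) == e1

def progression(prev, cur, hypertheme=None):
    prev_topic, prev_focus = prev
    cur_topic, _ = cur
    if corefer(prev_focus, cur_topic):
        return "sequential"
    if corefer(prev_topic, cur_topic):
        return "parallel"
    if corefer(hypertheme, prev_topic) and corefer(hypertheme, cur_topic):
        return "hypertheme"
    return None  # no progression found: a new segment starts here

# The first step of (7): F1 [Wittgenstein tonight] feeds the next topic.
print(progression(("it", "wittgenstein-tonight"),
                  ("wittgenstein-tonight", "four-persons")))  # -> sequential
```

Returning `None` when no rule fires corresponds to the segmentation intuition above: a break in the flow of information opens a new segment.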
[Fig. 1 diagram, garbled in the source; recoverable elements: T1 [it] → F1 [Wittgenstein tonight] ∥seq; T2 (ellipsis) → F3 [four persons]; F3 [the other one] ∥seq; T4 [his birthday] → F4 [picture] ∥seq; T5 Question; T6 [husband] → F6 [to join for dinner] ∥seq; T7 (ellipsis) → F7 [foreign stuff]; T8 (dinner) ∥par → F8 [three (persons)]; T9 Question]
Fig. 1: Thematic progression in (7)

ST1 ←[elaboration]→ ST3
ST6 ←[elaboration]→ ST3
Fig. 2: Segments rhetorically related (a)

Using the canonical order noted earlier, we can consequently determine the nuclei and satellites and construct the following hierarchical organisation (see Figure 3). Apparently, it suffices to maintain only the nucleus ST1 and still have a coherent and justly purposeful dialogue. As we stated already, the concept of nuclearity is important to us. It directs the response to the turn of the dialogue, which in this case could for example be that there is no performance by Wittgenstein tonight at all.

5 Discerning relevant information
The current section will explain the fashion in which we discern relevant information in a dialogue, thereby building on the previous section. First and foremost we should then clarify what we understand by relevance.

ST1 = Nucleus
  | [elaboration]
ST3 = Satellite wrt ST1 / Nucleus wrt ST6
  | [elaboration]
ST6 = Nucleus
Fig. 3: Segments rhetorically related (b)

When we state that a particular piece of information is relevant, we mean that it is relevant from a certain point of view. We do not want to take all the information that is provided into consideration. Rather, we are looking for information that fits our purposes. And what are these purposes? Recall the discussion above, where the concept of generic tasks was introduced. Generic tasks were presented as units to carry out simple tasks, units which could be combined into an overall structure that would remain flexible due to the functional individuality of the simple tasks. These generic tasks are our 'purposes'. More specifically, when carrying out a generic task, we look among the nuclei found in the rhetorical structure for one that presents us with the information that we need for performing the task at hand. In other words, such a nucleus presents us with relevant information. For example, when carrying out the task IDENTIFY_PERFORMANCE, the following information is of importance to uniquely identify a performance: a) the name of the entertainer, the performing group, or the performance itself; b) the day (and if more performances on one day, also the time). Obviously, the nucleus ST1 is highly relevant to this task, for it provides us with both ENTERTAINER_NAME as well as PERFORMANCE_DAY. Interesting to note is that once we have such information, a proper response can be generated by the dialogue manager. For example, the system could respond that there is no performance by the entertainer on the mentioned day, or ask (in case of several performances on the same day) whether one would like to go in the afternoon or in the evening. Furthermore, things also work the other way around. As we noted earlier, a nucleus directs response. Therefore, a nucleus should also be regarded as a possibility to initiate the execution of a particular generic task.
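Matching nuclei against generic tasks can be as simple as set intersection. The sketch below is a hedged illustration: the task inventory and concept type names are made up after the IDENTIFY_PERFORMANCE example above, and are not the actual system's task definitions.

```python
# Each generic task names the concept types it needs; a nucleus is relevant
# to a task when the concepts it supplies overlap with those needs.
GENERIC_TASKS = {
    "IDENTIFY_PERFORMANCE": {"ENTERTAINER_NAME", "PERFORMANCE_DAY"},
    "BOOK_SEATS":           {"NUMBER_OF_PERSONS"},
}

def relevant_tasks(nucleus_concepts):
    """Generic tasks that the concepts of a nucleus could feed or initiate."""
    return {task for task, needed in GENERIC_TASKS.items()
            if needed & nucleus_concepts}

# Nucleus ST1 of example (7) supplies both the entertainer and the day:
print(relevant_tasks({"ENTERTAINER_NAME", "PERFORMANCE_DAY"}))
```

The same function captures both directions discussed in the text: concepts found in a nucleus either feed a task already under way or pick out a new task to initiate.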
This requires the following assumptions, though. First of all, a linguistic analysis should provide us with the concepts that are related to words or word-groups. Observe that this assumption has been made already above. Second, from each generic task it should be known which concepts are involved in the performance of that task, thus, what kinds of information it gathers. It basically boils down to the following, then: if we know the concepts involved, we should be able to identify the generic task that should be initiated to respond properly to the user. It is realistic to assume that, based on all the information the user provides, several generic tasks might be invoked. Such tasks should then be placed in an order that would appear natural to the user. We must note, though, that it will not be the case that different generic tasks will be invoked based on identical information. Each generic task is functionally independent and has a simple goal, and as such works with information that is not relevant to other generic tasks. Recapitulating, we perceive of relevance in terms of information that is needed for the performance of tasks that are functionally independent and have simple goals: the so-called generic tasks. Based on the thematic progression and the rhetorical structure, we look for information in the nuclei that we have identified. If the information found is needed for a task that is currently being carried out, or if it can be used to initiate a new task, then we consider the information to be relevant information. Clearly, our system thereby no longer organises its responses strictly according to prefixed scripts nor strictly according to a recognition of the user's intentions. Due to our use of generic tasks, integrated with our understanding of relevant information, our system carries out its tasks corresponding to the way the user provides it with information. Thus, the system is able to respond more flexibly as well as more naturally to the user.

6 Conclusions
In this paper we stated that the information we are basically interested in is relevant information, and we provided the means by which one can arrive at relevant information. For that purpose, we discussed the Prague concepts of Topic and Focus Articulation (TFA) and thematic progression, the structure in which Topics and Foci get organised. Subsequently, we examined rhetorical structures in the light of Rhetorical Structure Theory, and showed how the rhetorical structure of a turn builds forth upon the turn's thematic progression. We identified genuine nuclei in a rhetorical structure to be
potential providers of relevant information, that is, information that a currently running generic task would need or that could initiate a generic task. We closed our discussion by noting how this leads to a system that is capable of responding to a user in a flexible and natural way. A couple of concluding remarks can be made. First of all, in the discussion we do not treat thematic progressions spanning over more than one turn. Currently, thematic progressions and thus rhetorical structures are bound to single turns of a dialogue. We intend to lift this restriction after examining how we can completely integrate our logical theory of information with the views presented here. Second, we would like to elaborate on how the mechanisms described here would fit into a dialogue manager that parses dialogues on the level of generic tasks. Regarding the segmentation of discourses and its relation to the dynamics of the communication of information, a topic for further research could be to compare our point of view to that of Firbas' Communicative Dynamism as described in (Firbas 1992).

REFERENCES

Chandrasekaran, B. 1986. "Generic Tasks in Knowledge-Based Reasoning: High-Level Building Blocks for Expert System Design". IEEE Expert.

Daneš, František. "Functional Sentence Perspective and the Organisation of Text". Papers on Functional Sentence Perspective ed. by F. Daneš. Prague: Academia.

Firbas, Jan. 1992. Functional Sentence Perspective in Written and Spoken Communication. Cambridge: Cambridge University Press.

Hajičová, Eva. 1993. From the Topic/Focus Articulation of the Sentence to Discourse Patterns. Vilém Mathesius Courses in Linguistics and Semiotics, Prague.

Hajičová, Eva. 1993. Issues of Sentence Structure and Discourse Patterns. Prague: Charles University.

Hajičová, Eva. 1994. Topic/Focus Articulation and Its Semantic Relevance. Vilém Mathesius Courses in Linguistics and Semiotics, Prague.

Hajičová, Eva. 1994. "Topic/Focus and Related Research".
Prague School of Structural and Functional Linguistics ed. by Philip A. Luelsdorff, 245-275. Amsterdam & Philadelphia: John Benjamins.
Van der Hoeven, G.F., T.A. Andernach, S.P. van de Burgt, G.J.M. Kruijff, A. Nijholt, J. Schaake & F.M.G. de Jong. 1994. "SCHISMA: A Natural Language Accessible Theatre Information and Booking System". TWLT 8:
Speech and Language Engineering ed. by L. Boves & A. Nijholt, 137-149. Enschede: Twente University.
Mann, William C. & Thompson, Sandra A. 1987. Rhetorical Structure Theory: A Theory of Text Organisation. Reprint, Marina del Rey, Calif.: Information Sciences Institute.
IV GENERATION
Approximate Chart Generation from Non-Hierarchical Representations NICOLAS NICOLOV, CHRIS MELLISH & GRAEME RITCHIE
Dept. of Artificial Intelligence, University of Edinburgh
Abstract
This paper presents a technique for sentence generation. We argue that the input to generators should have a non-hierarchical nature. This allows us to investigate a more general version of the sentence generation problem where one is not pre-committed to a choice of the syntactically prominent elements in the initial semantics. We also consider that a generator can happen to convey more (or less) information than is originally specified in its semantic input. In order to constrain this approximate matching of the input we impose additional restrictions on the semantics of the generated sentence. Our technique provides flexibility to address cases where the entire input cannot be precisely expressed in a single sentence. Thus the generator does not rely on the strategic component having linguistic knowledge. We show clearly how the semantic structure is declaratively related to linguistically motivated syntactic representation. We also discuss a semantic-indexed memoing technique for non-deterministic, backtracking generators.
1 Introduction
Natural language generation is the process of realising communicative intentions as text (or speech). The generation task is standardly broken down into the following processes: content determination (what is the meaning to be conveyed), sentence planning1 (chunking the meaning into sentence-sized units, choosing words), surface realisation (determining the syntactic structure), morphology (inflection of words), and synthesising speech or formatting the text output. In this paper we address aspects of sentence planning (how content words are chosen but not how the semantics is chunked in units realisable as sentences) and surface realisation (how syntactic structures are computed). We thus discuss what in the literature is sometimes referred to as tactical generation, that is "how to say it" — as opposed to strategic generation
1 Note that this does not involve planning mechanisms!
— "what to say". We look at ways of realising a non-hierarchical semantic representation as a sentence, and explore the interactions between syntax and semantics. Before giving a more detailed description of our proposals, we first motivate the non-hierarchical nature of the input for sentence generators and review some approaches to generation from non-hierarchical representations — semantic networks (Section 2). We proceed with some background about the grammatical framework we will employ — D-Tree Grammars (Section 3) — and after describing the knowledge sources available to the generator (Section 4) we present the generation algorithm (Section 5). This is followed by a step-by-step illustration of the generation of one sentence (Section 6). We then discuss further semantic aspects of the generation (Section 7), the memoing technique used by the generator (Section 8) and the implementation (Section 9). We conclude with a discussion of some issues related to the proposed technique (Section 10).
2 Generation from non-hierarchical representations
The input for generation systems varies radically from system to system. Many generators expect their input to be cast in a tree-like notation which enables the actual systems to assume that nodes higher in the semantic structure are more prominent than lower nodes. The semantic representations used are variations of a predicate with its arguments. The predicate is realised as the main verb of the sentence and the arguments are realised as complements of the main verb — thus the control information is to a large extent encoded in the tree-like semantic structure. Unfortunately, such dominance relationships between nodes in the semantics often stem from language considerations and are not always preserved across languages. Moreover, if the semantic input comes from other applications, it is hard for these applications to determine the most prominent concepts because linguistic knowledge is crucial for this task. The tree-like semantics assumption leads to simplifications which reduce the paraphrasing power of the generator (especially in the context of multilingual generation).2 In contrast, the use of a non-hierarchical representation for the underlying semantics allows the input to contain as few language commitments as possible and makes it possible to address the generation strategy from an unbiased position. We have chosen a particular type of non-hierarchical knowledge representation formalism, conceptual graphs (Sowa 1992), to represent the input to
2 The tree-like semantics imposes some restrictions which the language may not support.
our generator. This has the added advantage that the representation has well-defined deductive mechanisms. A graph is a set of concepts connected with relations. The types of the concepts and the relations form generalisation lattices which also help define a subsumption relation between graphs. Graphs can also be embedded within one another. The counterpart of the unification operation for conceptual graphs is maximal join (which is non-deterministic). Figure 1 shows a simple conceptual graph which does not have cycles. The arrows of the conceptual relations indicate the domain and range of the relation and do not impose a dominance relationship.
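The graph machinery just described can be sketched in a few lines. The following is a hypothetical Python rendering, not the authors' implementation; the type names, the parent-map stand-in for the generalisation lattice, and the AGNT/MANR relation labels are our assumptions.

```python
# Hypothetical sketch of a conceptual graph: typed concept nodes connected
# by relations whose arcs mark domain/range but impose no dominance.
# A simple parent map stands in for the generalisation lattice.

TYPE_PARENT = {"MAN": "PERSON", "PERSON": "ENTITY", "LIMP": "MOVE",
               "MOVE": "ACT", "QUICK": "MANNER"}

def subsumes(general, specific):
    """True if type `general` equals or is a generalisation of `specific`."""
    while specific is not None:
        if specific == general:
            return True
        specific = TYPE_PARENT.get(specific)
    return False

class ConceptualGraph:
    def __init__(self):
        self.concepts = {}      # concept id -> type label
        self.relations = []     # (relation label, domain id, range id)

    def add_concept(self, cid, ctype):
        self.concepts[cid] = ctype

    def relate(self, rel, dom, rng):
        self.relations.append((rel, dom, rng))

# Roughly the graph underlying "Fred limped quickly" (cf. Figure 1):
g = ConceptualGraph()
g.add_concept("c1", "PERSON")   # Fred
g.add_concept("c2", "LIMP")
g.add_concept("c3", "QUICK")
g.relate("AGNT", "c2", "c1")    # agent of the limping
g.relate("MANR", "c2", "c3")    # manner of the limping
```

The point of the sketch is only that the relations carry no dominance: nothing in `g.relations` makes `c2` "higher" than `c1` or `c3`.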
Fig. 1: A simple conceptual graph
The use of semantic networks in generation is not new (Simmons & Slocum 1972; Shapiro 1982). Two main approaches have been employed for generation from semantic networks: utterance path traversal and incremental consumption.3 An utterance path is the sequence of nodes and arcs that are traversed in the process of mapping a graph to a sentence. Generation is performed by finding a cyclic path in the graph which visits each node at least once. If a node is visited more than once, grammar rules determine when and how much of its content will be uttered (Sowa 1984). It is not surprising that the early approaches to generation from semantic networks employed the notion of an utterance path — the then popular grammatical framework (Augmented Transition Networks) also involved a notion of path traversal. The utterance path approach imposes unnecessary restrictions on the resources (i.e., that the generator can look at a limited portion of the input — usually the concepts of a single relation); this imposes a local view of the generation process. In addition, a directionality of processing is introduced which is difficult to motivate; sometimes linguistic knowledge is used to traverse the network (adverbs of manner are to be visited before adverbs of time); finally, stating the relation between syntax and semantics involves the notion of how many times a concept has been visited.
3 Here the incremental consumption approach does not refer to incremental generation!
Under the second approach, that of incremental consumption, generation is done by gradually relating (consuming) pieces of the input semantics to linguistic structure (Boyer & Lapalme 1985; Nogier 1991). Such covering of the semantic structure avoids some of the limitations of the utterance path approach and is also the general mechanism we have adopted (we do not rely on the directionality of the conceptual relations per se — the primitive operation that we use when consuming pieces of the input semantics is maximal join, which is akin to pattern matching). The borderline between the two paradigms is not clear-cut. Some researchers (Smith et al. 1994) are looking at finding an appropriate sequence of expansions of concepts and reductions of subparts of the semantic network until all concepts have realisations in the language. Others assume all concepts are expressible and try to substitute syntactic relations for conceptual relations (Antonacci 1992). Other work addressing surface realisation from semantic networks includes: generation using Meaning-Text Theory (Iordanskaja 1991), generation using the SNEPS representation formalism (Shapiro 1989), and generation from conceptual dependency graphs (van Rijn 1991). Among those that have looked at generation with conceptual graphs are: generation using Lexical Conceptual Grammar (Oh et al. 1992), and generating from CGs using categorial grammar in the domain of technical documentation (Svenberg 1994). This work improves on existing generation approaches in the following respects: (i) Unlike the majority of generators this one takes a non-hierarchical (logically well-defined) semantic representation as its input.
This allows us to look at a more general version of the realisation problem which in turn has direct ramifications for the increased paraphrasing power and usability of the generator; (ii) Following Nogier & Zock (1992), we take the view that lexical choice is essentially (pattern) matching, but unlike them we assume that the meaning representation may not be entirely consumed at the end of the generation process. Our generator uses a notion of approximate matching and can happen to convey more (or less) information than is originally specified in its semantic input. We have a principled way to constrain this. We build the corresponding semantics of the generated sentence and aim for it to be as close as possible to the input semantics. (i) and (ii) thus allow for the input to come from a module that need not have linguistic knowledge; (iii) We show how the semantics is systematically related to syntactic structures in a declarative framework. Alternative processing strategies using the same knowledge sources can therefore be envisaged.
3 D-Tree Grammars
Our generator uses a particular syntactic theory — D-Tree Grammar (DTG) — which we briefly introduce because the generation strategy is influenced by the linguistic structures and the operations on them. D-Tree Grammar (DTG) (Rambow, Vijay-Shanker & Weir 1995) is a new grammar formalism (also in the mathematical sense), which arises from work on Tree-Adjoining Grammars (TAG) (Joshi 1987).4 In the context of generation, TAGs have been used in a number of systems: MUMBLE (McDonald & Pustejovsky 1985), SPOKESMAN (Meteer 1990), WIP (Wahlster et al. 1991), the system reported by McCoy (1992), the first version of PROTECTOR5 (Nicolov, Mellish & Ritchie 1995), and recently SPUD (by Stone & Doran). TAGs have been given a prominent place in the VERBMOBIL project — they have been chosen to be the framework for the generation module (Caspari & Schmid 1994; Harbusch et al. 1994). In the area of grammar development TAG has been the basis of one of the largest grammars developed for English (Doran 1994). Unlike TAGs, DTGs provide a uniform treatment of complementation and modification at the syntactic level. DTGs are seen as attractive for generation because a close match between semantic and syntactic operations leads to simplifications in the overall generation architecture. DTGs try to overcome the problems associated with TAGs while remaining faithful to what is seen as the key advantages of TAGs (Joshi 1987): 1. the extended domain of locality over which syntactic dependencies are stated; and 2. function argument structure is captured within a single initial construction in the grammar. DTG assumes the existence of elementary structures and uses two operations to form larger structures from smaller ones. The elementary structures are tree descriptions6 which are trees in which nodes are linked with two types of links: domination links (d-links) and immediate domination links (i-links) expressing (reflexive) domination and immediate domination relations between nodes.
Graphically we will use a dashed line to indicate a d-link (see Figure 2). D-trees allow us to view the operations for composing trees as monotonic. The two combination operations that DTG uses are subsertion and sister-adjunction.
4 DTG and TAG are very similar, yet they are not equivalent (Weir p.c.).
5 PROTECTOR is the generation system described in this paper.
6 They are called d-trees, hence the name of the formalism.
Fig. 2: Subsertion
Subsertion. When a d-tree α is subserted into another d-tree β, a component7 of α is substituted at a frontier nonterminal node (a substitution node) of β and all components of α that are above the substituted component are inserted into d-links above the substituted node or placed above the root node of β (see Figure 2). It is possible for components above the substituted node to drift arbitrarily far up the d-tree and distribute themselves within domination links, or above the root, in any way that is compatible with the domination relationships present in the substituted d-tree. In order to constrain the way in which the non-substituted components can be interspersed, DTG uses subsertion-insertion constraints which explicitly specify what components from what trees can appear within certain d-links. Subsertion as defined is a non-deterministic operation. Subsertion can model both adjunction and substitution in TAG.
Fig. 3: Sister-adjunction
Sister-adjunction. When a d-tree α is sister-adjoined at a node η in a d-tree β, the composed d-tree γ results from the addition to β of α as a new leftmost or rightmost sub-d-tree below η. Sister-adjunction involves the addition of exactly one new immediate domination link. In addition, several sister-adjunctions can occur at the same node. Sister-adjoining constraints associated with nodes in the d-trees specify which other d-trees can be sister-adjoined at this node and whether they will be right- or left-sister-adjoined. For more details on DTGs see (Rambow, Vijay-Shanker & Weir 1995a) and (Rambow, Vijay-Shanker & Weir 1995b).
7 A tree component is a subtree which contains only immediate dominance links.
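As a rough illustration of the two link types and of sister-adjunction, here is a sketch under our own simplified data structures; it is not the DTG formalism itself (subsertion-insertion and sister-adjoining constraints are omitted).

```python
# Hypothetical d-tree sketch: each node carries a category; daughters are
# attached either by immediate-domination links ("i") or domination links
# ("d", which may later be stretched apart).  Sister-adjunction adds
# exactly one new i-link, as a new leftmost or rightmost daughter.

class DNode:
    def __init__(self, cat):
        self.cat = cat
        self.children = []          # list of (link_kind, DNode)

    def add(self, kind, child):
        assert kind in ("i", "d")   # i-link or d-link
        self.children.append((kind, child))

def sister_adjoin(host, sub_dtree, leftmost=False):
    """Attach `sub_dtree` below `host` with one new immediate-domination link."""
    if leftmost:
        host.children.insert(0, ("i", sub_dtree))
    else:
        host.children.append(("i", sub_dtree))

# A VP immediately dominating a verb; an adverb d-tree is
# right-sister-adjoined, so the VP gains exactly one new i-link.
vp = DNode("VP")
vp.add("i", DNode("V"))
adv = DNode("ADV")
sister_adjoin(vp, adv)
```

Several such calls at the same node model the paper's observation that multiple sister-adjunctions can occur at one node.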
4 Knowledge sources
The generator assumes it is given as input an input semantics (InputSem) and 'boundary' constraints for the semantics of the generated sentence (BuiltSem, which in general is different from InputSem8). The boundary constraints are two graphs (UpperSem and LowerSem) which convey the notion of the least and the most that should be expressed. So we want BuiltSem to satisfy: LowerSem ≤ BuiltSem ≤ UpperSem.9 If the generator happens to introduce more semantic information by choosing a particular expression, LowerSem is the place where such additions can be checked for consistency. Such constraints on BuiltSem are useful because in general InputSem and BuiltSem can happen to be incomparable (neither one subsumes the other). In a practical scenario LowerSem can be the knowledge base to which the generator has access minus any contentious bits. UpperSem can be the minimum information that necessarily has to be conveyed in order for the generator to achieve the initial communicative intentions. The goal of the generator is to produce a sentence whose corresponding semantics is as close as possible to the input semantics, i.e., the realisation adds as little extra material as possible and misses as little as possible of the original input. In generation, similar constraints have been used in the generation of referring expressions, where the expressions should not be too general so that discriminatory power is not lost and not too specific so that the referring expression is in a sense minimal. Our model is a generalisation of the paradigm presented in (Reiter 1991) where issues of mismatch in lexical choice are discussed. We return to how UpperSem and LowerSem are actually used in Section 7.
8 This can come about from a mismatch between the input and the semantic structures expressible by the generator.
9 The notation G1 ≤ G2 means that G1 is subsumed by G2. We consider UpperSem to be a generalisation of BuiltSem and LowerSem a specialisation of BuiltSem (in terms of the conceptual graphs that represent them).
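The boundary constraints can be illustrated with a toy model in which a graph is just the set of facts it asserts, so a more specific graph carries more facts and subsumption becomes set containment. The fact strings below are invented for illustration; real subsumption over conceptual graphs also involves the type lattice.

```python
# Toy rendering of LowerSem <= BuiltSem <= UpperSem: a graph is modelled
# as the set of facts it asserts, so "G1 is subsumed by G2" (G1 <= G2,
# G1 more specific) becomes: G2's facts are a subset of G1's facts.

def subsumed_by(g1, g2):
    """G1 <= G2 in the paper's notation: G2 generalises G1."""
    return set(g2) <= set(g1)

def within_bounds(built, lower, upper):
    """LowerSem <= BuiltSem <= UpperSem."""
    return subsumed_by(lower, built) and subsumed_by(built, upper)

upper = {"attack(alexander, town)"}                        # least to convey
lower = {"attack(alexander, town)", "full_scale(attack)",
         "located(town, asia)"}                            # most that is known
built = {"attack(alexander, town)", "full_scale(attack)"}

ok = within_bounds(built, lower, upper)
```

Here `built` says everything `upper` demands, and adds nothing that `lower` (the knowledge base) does not license, so the check succeeds.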
Fig. 4: A mapping rule for transitive constructions
4.1 Mapping rules
Mapping rules state how the semantics is related to the syntactic representation. We do not impose any intrinsic directionality on the mapping rules and view them as declarative statements. In our generator a mapping rule is represented as a d-tree in which certain nodes are annotated with semantic information. Mapping rules are a mixed syntactic-semantic representation. The nodes in the syntactic structure are feature structures and we use unification to combine two syntactic nodes (Kay 1983). The semantic annotations of the syntactic nodes are either conceptual graphs or instructions indicating how to compute the semantics of the syntactic node from the semantics of the daughter syntactic nodes. Graphically we use dotted lines to show the coreference between graphs (or concepts). Each graph appearing in the rule has a single node ('the semantic head') which acts as a root (indicated by an arrow in Figure 4). This hierarchical structure is imposed by the rule, and is not part of the semantic input. Every mapping rule has an associated applicability semantics which is used to license its application. The applicability semantics can be viewed as an evaluation of the semantic instruction associated with the top syntactic node in the tree description.
Figure 4 shows an example of a mapping rule. The applicability semantics of this mapping rule is the graph shown at the top of Figure 4. If this structure matches part of the input semantics (we explain more precisely what we mean by matching later on) then this rule can be triggered (if it is syntactically appropriate — see Section 5). The internal generation goals (shaded areas) express the following: (1) generate the action concept as a verb and subsert (substitute, attach) the verb's syntactic structure at the V0 node; (2) generate the agent concept as a noun phrase and subsert the newly built structure at NP0; and (3) generate the patient concept as another noun phrase and subsert the newly built structure at NP1. The newly built structures are also mixed syntactic-semantic representations (annotated d-trees) and they are incorporated in the mixed structure corresponding to the current status of the generated sentence.
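A hypothetical rendering of a mapping rule's ingredients (applicability semantics, optional semantic additions, internal generation goals pairing a semantic index with a syntactic slot) might look as follows; plain set inclusion stands in for maximal join, and all predicate and slot names are our own.

```python
# Sketch of a mapping rule as data: `applicability` is the semantics that
# licenses the rule, `additions` any extra semantics it introduces (to be
# checked against LowerSem), and `goals` its internal generation goals,
# pairing a semantic index with the syntactic slot it must fill.

from dataclasses import dataclass, field

@dataclass
class MappingRule:
    name: str
    applicability: frozenset             # facts licensing the rule
    additions: frozenset = frozenset()   # extra semantics the rule brings in
    goals: list = field(default_factory=list)  # (concept index, slot)

transitive = MappingRule(
    name="transitive-S",
    applicability=frozenset({"act(e)", "agnt(e, x)", "ptnt(e, y)"}),
    goals=[("e", "V0"), ("x", "NP0"), ("y", "NP1")])

def triggers(rule, input_sem):
    """The rule may fire if its applicability semantics matches the input
    (set inclusion here; maximal join in the real system)."""
    return rule.applicability <= input_sem

fires = triggers(transitive,
                 {"act(e)", "agnt(e, x)", "ptnt(e, y)", "manr(e, m)"})
```

Exploring the three `goals` recursively is what the text describes as executing internal generation goals.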
5 Sentence generation
In this section we informally describe the generation algorithm. In Figure 5 and later in Figure 8, which illustrate some semantic aspects of the processing, we use a diagrammatic notation to describe semantic structures which are actually encoded using conceptual graphs. The input to the generator is InputSem, LowerSem, UpperSem and a mixed structure, Partial, which contains a syntactic part (usually just one node but possibly something more complex) and a semantic part which takes the form of semantic annotations on the syntactic nodes in the syntactic part. Initially Partial represents the syntactic-semantic correspondences which are imposed on the generator.10 It has the format of a mixed structure like the representation used to express mapping rules (Figure 4). Later during the generation Partial is enriched, and at any stage of processing it represents the current syntactic-semantic correspondences. We have augmented the DTG formalism so that the semantic structures associated with syntactic nodes are updated appropriately during the subsertion and sister-adjunction operations. The stages of generation are: (1) building an initial skeletal structure; (2) attempting to consume as much as possible of the semantics not covered in the previous stage; and (3) converting the partial syntactic structure into a complete syntactic tree.
10 In dialogue and question answering, for example, the syntactic form of the generated sentence may be constrained.
5.1 Building a skeletal structure
Generation starts by first trying to find a mapping rule whose semantic structure matches11 part of the initial graph and whose syntactic structure is compatible with the goal syntax (the syntactic part of Partial). If the initial goal has a more elaborate syntactic structure and requires parts of the semantics to be expressed as certain syntactic structures, this has to be respected by the mapping rule. Such an initial mapping rule will have a syntactic structure that will provide the skeleton syntax for the sentence. If Lexicalised DTG is used as the base syntactic formalism, at this stage the mapping rule will introduce the head of the sentence structure — the main verb. If the rule has internal generation goals then these are explored recursively (possibly via an agenda — we will ignore here the issue of the order in which internal generation goals are executed12). Because of the minimality of the mapping rule, the syntactic structure that is produced by this initial stage is very basic — for example only obligatory complements are considered. Any mapping rule can introduce additional semantics and such additions are checked against the lower semantic bound. When applying a mapping rule the generator keeps track of how much of the initial semantic structure has been covered/consumed. Thus at the point when all internal generation goals of the first (skeletal) mapping rule have been exhausted, the generator knows how much of the initial graph remains to be expressed.
5.2 Covering the remaining semantics
In the second stage the generator aims to find mapping rules in order to cover most of the remaining semantics (see Figure 5). The choice of mapping rules is influenced by the following criteria:
Connectivity: The semantics of the mapping rule has to match (cover) part of the covered semantics and part of the remaining semantics.
Integration: It should be possible to incorporate the semantics of the mapping rule into the semantics of the current structure being built by the generator.
Realisability: It should be possible to incorporate the partial syntactic structure of the mapping rule into the current syntactic structure being built by the generator.
11 Via the maximal join operation. Also note that the arcs to/from the conceptual relations do not reflect any directionality of the processing — they can be 'traversed'/accessed from any of the nodes they connect.
12 Different ways of exploring the agenda will reflect different processing strategies.
Fig. 5: Covering the remaining semantics with mapping rules
Note that the connectivity condition restricts the choice of mapping rules so that a rule that matches part of the remaining semantics and the extra semantics added by previous mapping rules cannot be chosen (e.g., the 'bad mapping' in Figure 5). While in the stage of fleshing out the skeleton sentence structure (Section 5.1) the syntactic integration involves subsertion, in the stage of covering the remaining semantics it is sister-adjunction that is used. When incorporating semantic structures the semantic head has to be preserved — for example, when sister-adjoining the d-tree for an adverbial construction, the semantic head of the top syntactic node has to be the same as the semantic head of the node at which sister-adjunction is done. This explicit marking of the semantic head concepts differs from (Shieber et al. 1990) where the semantic head is a PROLOG term with exactly the same structure as the input semantics.
5.3 Completing a derivation
In the preceding stages of building the skeletal sentence structure and covering the remaining semantics, the generator is mainly concerned with consuming the initial semantic structure. In those processes, parts of the semantics are mapped onto partial syntactic structures which are integrated, and the result is still a partial syntactic structure. That is why a final step of 'closing off' the derivation is needed. The generator tries to convert the partial syntactic structure into a complete syntactic tree. A morphological post-processor reads the leaves of the final syntactic tree and inflects the words.
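The bookkeeping across the three stages might be sketched as follows, again over the set-of-facts toy model. The real system operates on conceptual graphs and d-trees, so this only mirrors the covered/remaining accounting of Section 5.1 and the connectivity test of Section 5.2; rule contents are invented.

```python
# Sketch of the three generation stages as bookkeeping over fact sets:
# stage 1 consumes what the skeletal rule covers, stage 2 repeatedly adds
# rules that pass the connectivity test, stage 3 (closing off the syntax)
# is outside this semantic sketch.

def connected(rule_sem, covered, remaining):
    """Stage-2 connectivity: the rule must touch both the covered and the
    remaining semantics (so it attaches to what is already built)."""
    return bool(rule_sem & covered) and bool(rule_sem & remaining)

def generate(input_sem, skeletal_rule, other_rules):
    covered = set(skeletal_rule) & input_sem      # stage 1: skeleton
    remaining = input_sem - covered
    for rule_sem in other_rules:                  # stage 2: cover the rest
        if connected(rule_sem, covered, remaining):
            covered |= rule_sem & input_sem
            remaining = input_sem - covered
    return covered, remaining

# "Fred limped quickly": the skeleton covers the limping and its agent,
# an adverbial rule then picks up the manner fact.
sem = {"limp(e)", "agnt(e, fred)", "manr(e, quick)"}
covered, remaining = generate(sem,
                              {"limp(e)", "agnt(e, fred)"},
                              [{"limp(e)", "manr(e, quick)"}])
```

A rule touching only already-covered material, or only remaining material, fails `connected`, which is exactly the restriction that rules out the 'bad mapping' of Figure 5.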
6 Example
In this section we illustrate how the algorithm works by means of a simple example.13 Suppose we start with an initial semantics as given in Figure 1. This semantics can be expressed in a number of ways: Fred limped quickly, Fred hurried with a limp, Fred's limping was quick, The quickness of Fred's limping..., etc. Here we show how the first paraphrase is generated.
In the stage of building the skeletal structure the mapping rule (i) in Figure 6 is used. Its internal generation goals are to realise the instantiation of the limping concept as a verb and similarly the concept for Fred as a noun phrase. The generation of the subject noun phrase is not discussed here. The main verb is generated using the terminal mapping rule14 (iii) in Figure 6.15 The skeletal structure thus generated is Fred limp(ed) (see (i) in Figure 7). An interesting point is that although the internal generation goal for the verb referred only to the limping concept in the initial semantics, all of the information suggested by the terminal mapping rule (iii) in Figure 6 is consumed. We will say more about how this is done in Section 7. At this stage the only concept that remains to be consumed is the quickness concept. This is done in the stage of covering the remaining semantics when the
13 For expository purposes some VP nodes normally connected by d-edges have been merged.
14 Terminal mapping rules are mapping rules which have no internal generation goals and in which all terminal nodes of the syntactic structure are labelled with terminal symbols (lexemes).
15 In Lexicalised DTGs the main verbs would be already present in the initial trees.
mapping rule (ii) is used. This rule has an internal generation goal to generate the instantiation of the quickness concept as an adverb, which yields quickly. The structure suggested by this rule has to be integrated in the skeletal structure. On the syntactic side this is done using sister-adjunction. The final mixed syntactic-semantic structure is shown on the right in Figure 7. In the syntactic part of this structure we have no domination links. Also, all of the input semantics has been consumed.
Fig. 7: Skeletal structure and final structure
The semantic annotations of the S and VP nodes are instructions about how the graphs/concepts of their daughters are to be combined. If we evaluate the semantics of the S node in a bottom-up fashion, we will get the same result as the input semantics in Figure 1. After morphological post-processing the result is Fred limped quickly. An alternative paraphrase like Fred hurried with a limp16 can be generated using a lexical mapping rule for the verb hurry which groups the limping and quickness concepts together and another mapping rule expressing the limp as a PP. To get both paraphrases would be hard for generators relying on hierarchical representations.
7 Matching the applicability semantics of mapping rules
Matching of the applicability semantics of mapping rules against other semantic structures occurs in the following cases: when looking for a skeletal structure; when exploring an internal generation goal; and when looking for mapping rules in the phase of covering the remaining semantics. During the exploration of internal generation goals the applicability semantics of
16 Our example is based on Iordanskaja et al.'s notion of maximal reductions of a semantic net (see Iordanskaja 1991:300). It is also similar to the example in (Nogier & Zock 1992).
a mapping rule is matched against the semantics of an internal generation goal. We assume that the following conditions hold: 1. The applicability semantics of the mapping rule can be maximally joined with the goal semantics. 2. Any information introduced by the mapping rule that is more specialised than the goal semantics (additional concepts/relations, further type instantiation, etc.) must be within the lower semantic bound (LowerSem). If this additional information is within the input semantics, then information can propagate from the input semantics to the mapping rule (the shaded area 2 in Figure 8). If the mapping rule's semantic additions are merely in LowerSem, then information cannot flow from LowerSem to the mapping rule (area 1 in Figure 8).
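The two conditions can be mimicked in the set-of-facts toy model used above: condition 1 as a non-empty overlap standing in for a successful maximal join, condition 2 as a check that the rule's extra material lies within LowerSem. The facts below follow the hurry/limp paraphrase and are our own invention.

```python
# Condition 1: the rule's applicability semantics must join with the goal
# semantics.  Condition 2: whatever the rule adds beyond the goal must be
# licensed by the lower semantic bound (LowerSem).

def join_possible(applicability, goal_sem):
    """Crude stand-in for maximal join: the two structures must overlap."""
    return bool(applicability & goal_sem)

def additions_licensed(applicability, goal_sem, lower_sem):
    """Everything the rule introduces beyond the goal must be in LowerSem."""
    extra = applicability - goal_sem
    return extra <= lower_sem

goal  = {"hurry(e)", "agnt(e, fred)"}
lower = {"hurry(e)", "agnt(e, fred)", "manr(e, limp)"}
rule  = {"hurry(e)", "manr(e, limp)"}   # "hurry" groups motion and manner

ok = join_possible(rule, goal) and additions_licensed(rule, goal, lower)
```

A rule whose extras fall outside `lower` is rejected, which is how the generator conveys only what the input (plus what the language forces) warrants.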
Fig. 8: Interactions involving the applicability semantics of a mapping rule
Similar conditions hold when, in the phase of covering the remaining semantics, the applicability semantics of a mapping rule is matched against the initial semantics. This way of matching allows the generator to convey only the information in the original semantics and what the language forces one to convey, even though more information might be known about the particular situation. In the same spirit, after the generator has consumed/expressed a concept in the input semantics the system checks that the lexical semantics of the generated word is more specific than the corresponding concept (if there is one) in the upper semantic bound.
8 Preference-based chart generation
During generation appropriate mapping rules have to be found. However, at each stage a number of rules might be applicable. Due to possible interactions between some rules the generator may have to explore different allowable sequences of choices before actually being able to produce a sentence. Thus, generation is in essence a search problem. Our generator uses a non-deterministic generation strategy to explore the search space.17 The generator explores each one of the applicable mapping rules in turn through backtracking. In practice this means that whenever the generator reaches a dead end (a point in the process where none of the available alternatives are consistent with the choices made so far) it has to undo some previous commitments and return to an earlier choice point where there are still unexplored options. It often happens that computations in one branch of the search space have to be re-done in another even if the first branch did not lead to a solution of the generation goal. Consider a situation where the semantics in Figure 9 is to be expressed.
Fig. 9: Alexander attacked the town. The attack was full-scale.
17 This is in contrast to systemic and classification approaches, which are deterministic.
18 The syntactic structure of the mapping rule is a simple declarative transitive tree.
19 It can be argued that the problem with reaching a dead end above is due to the fact that the two available mapping rules have been distinguished too early. Both
NICOLAS NICOLOV, CHRIS MELLISH & GRAEME RITCHIE
In order to address the problem of recomputing structures we have explored aspects of a new semantic-indexed memoing technique based on on-line caching of generated constituents. The general idea is very simple: every time a constituent is generated it is stored, and every time a generation goal is explored the system first checks whether the result is stored already. Following the corresponding term in parsing, this technique has come to be known as chart generation. Information about partial structures is kept in a chart which is indexed not on string positions (because a certain constituent might appear in different positions in different paraphrases) but on the heads of the headed conceptual graphs which represent the built semantics for the subphrases.20
We also introduce agenda-based control for chart generators, which allows for an easy way to define an array of alternative processing strategies simply as different ways of exploring the agenda. Having a system that allows for easy definition of different generation strategies provides for the eventual possibility of comparing different algorithms based on the uniform processing mechanism of the agenda-based control for chart generation.21 One particular aspect which we are currently investigating is the use of syntactic and semantic preferences for rating intermediate results. Syntactic/stylistic preferences are helpful in cases where the semantics of two paraphrases are the same. One such instance of use of syntactic preferences is avoiding (giving lower rating to) heavy constituents in split verb-particle constructions. With regard to semantic preferences we have defined a novel measure which compares two graphs (say the applicability semantics of two mapping rules) with respect to a third (in our case this is the input semantics). Given a conceptual graph, the measure defines what it means for one graph to be a better approximate match than another.22 Thus,
alternatives share a lot of structure and neither can be ruled out in favour of the other during the stage of generating their skeletal structures. Obviously, if we used a 'parallel' generation technique that explores shared forests of structure, there would be less need for backtracking. This aspect has remained underexplored in generation work.
20 The major assumption about memoing techniques like chart generation is that retrieving the result is cheaper than computing it from scratch. For a very long time this was the accepted wisdom in parsing, yet new results show that storing all constituents might not always lead to the best performance (van Noord forthcoming).
21 Chart generation has been investigated by Shieber (1988), Haruno et al. (1993), Pianesi (1993), Neumann (1994), Kay (1996), and Shemtov (forthcoming).
22 For a good discussion of preference-driven processing of natural language (mainly parsing) see Erbach (1995).
APPROXIMATE CHART GENERATION
the generator finds all possible solutions (i.e., it is complete), producing the 'best' one first.
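The caching and agenda ideas described above can be sketched as follows. This is an illustrative sketch only, not the PROTECTOR implementation; all class, function and parameter names (`GenerationChart`, `explore`, `expand`, the edge tuples) are our own assumptions. The chart is keyed on the semantic head of each built subgraph rather than on string positions, and the agenda's exploration order determines the processing strategy.

```python
# Sketch of semantic-head-indexed chart generation with agenda-based
# control (names and data shapes are assumptions, not from the paper).
from collections import defaultdict, deque

class GenerationChart:
    def __init__(self):
        # head concept -> list of (category, semantics, phrase) edges
        self._edges = defaultdict(list)

    def lookup(self, head, category):
        """Return cached constituents whose semantics is headed by `head`."""
        return [e for e in self._edges[head] if e[0] == category]

    def add(self, head, category, semantics, phrase):
        edge = (category, semantics, phrase)
        if edge not in self._edges[head]:
            self._edges[head].append(edge)
            return True          # a new edge may trigger further goals
        return False             # already memoed: no recomputation

def explore(goals, expand, strategy="depth-first"):
    """Process generation goals from an agenda; `expand` maps a goal
    (consulting the chart first) to result edges plus follow-up goals."""
    chart = GenerationChart()
    agenda = deque(goals)
    pop = agenda.pop if strategy == "depth-first" else agenda.popleft
    while agenda:
        goal = pop()
        for head, cat, sem, phrase, subgoals in expand(goal, chart):
            if chart.add(head, cat, sem, phrase):
                agenda.extend(subgoals)
    return chart
```

Switching `strategy` between depth-first and breadth-first changes only how the agenda is popped, which is the sense in which alternative processing strategies reduce to different ways of exploring the agenda.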
9  Implementation
We have developed a sentence generator called PROTECTOR (approximate PROduction of TExts from Conceptual graphs in a declaraTive framewORk). PROTECTOR is implemented in LIFE (Aït-Kaci & Podelski 1993). The syntactic coverage of the generator is influenced by the XTAG system (the first version of PROTECTOR in fact used TAGs23). By using DTGs we can use most of the analysis of XTAG while the generation algorithm is simpler because complementation and modification on the semantic side correspond to subsertion and sister-adjunction on the syntactic side. Thus in the stage of building a skeletal structure only subsertion is used. In covering the remaining semantics only sister-adjunction is used. We are in a position to express subparts of the input semantics as different syntactic categories as appropriate for the current generation goal (e.g., VPs and nominalisations). The syntactic coverage of PROTECTOR includes: intransitive, transitive, and ditransitive verbs, topicalisation, verb particles, passive, sentential complements, control constructions, relative clauses, nominalisations and a variety of idioms. On backtracking PROTECTOR returns all solutions. We are also looking at the advantages that our approach offers for multilingual generation.
10  Discussion
In the previous section we mentioned that generation is a search problem. In order to guide the search a number of heuristics can be used. In (Nogier & Zock 1992) the number of matching nodes has been used to rate different matches, which is similar to finding maximal reductions in (Iordanskaja 1991:300). Alternatively, a notion of semantic distance (cf. Foo 1992) might be employed. In PROTECTOR we will use a much more sophisticated notion of what it is for a conceptual graph to match the initial semantics better than another graph. This captures the intuition that the generator should try to express as much as possible from the input while adding as little as possible extra material. We use instructions showing how the semantics of a mother syntactic node is computed because we want to be able to correctly update the semantics of nodes higher than the place where substitution or adjunction has
23 PROTECTOR-95 was implemented in PROLOG.
taken place — i.e., we want to be able to propagate the substitution or adjunction semantics up the mixed structure whose backbone is the syntactic tree. We also use a notion of headed conceptual graphs, i.e., graphs that have a certain node chosen as the semantic head. The initial semantics need not be marked for its semantic head. This allows the generator to choose an appropriate (for the natural language) perspective. The notion of semantic heads and their connectivity is a way to introduce a hierarchical view on the semantic structure which is dependent on the language. When matching two conceptual graphs we require that their heads be the same. This reduces the search space and speeds up the generation process. Our generator is neither coherent nor complete (i.e., it can produce sentences with more general/specific semantics than the input semantics). We try to generate sentences whose semantics is as close as possible to the input in the sense that they introduce little extra material and leave uncovered a small part of the input semantics. We keep track of more structures as the generation proceeds and are in a position to make finer distinctions than was done in previous research. The generator never produces sentences with semantics which is more specific than the lower semantic bound, which gives some degree of coherence. Our generation technique provides flexibility to address cases where the entire input cannot be expressed in a single sentence by first generating a 'best match' sentence and allowing the remaining semantics to be generated in a follow-up sentence. Our approach can be seen as a generalisation of semantic head-driven generation (Shieber et al. 1990) — we deal with a non-hierarchical input and non-concatenative grammars. The use of Lexicalised DTG means that the algorithm in effect looks first for a syntactic head. This aspect is similar to syntax-driven generation (König 1994).
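The two requirements above (heads must be equal; prefer candidates that cover more of the input while adding less extra material) can be illustrated with a small sketch. The representation and function names are our own simplification, not PROTECTOR's measure: a graph is reduced to a head concept plus a set of relation edges.

```python
# Illustrative sketch of approximate semantic matching (our own
# simplification): a graph is a head concept plus (relation, concept)
# edges; heads are required to be equal before any comparison.
def match_score(candidate, goal):
    """Return None if the heads differ; otherwise (covered, extra):
    how many goal edges the candidate expresses, and how many edges
    it adds beyond the goal."""
    if candidate["head"] != goal["head"]:
        return None
    cand, want = set(candidate["edges"]), set(goal["edges"])
    return (len(cand & want), len(cand - want))

def better_match(a, b, goal):
    """True iff candidate `a` approximates `goal` better than `b`:
    more input covered, ties broken by less extra material."""
    sa, sb = match_score(a, goal), match_score(b, goal)
    if sa is None:
        return False
    if sb is None:
        return True
    return (sa[0], -sa[1]) > (sb[0], -sb[1])
```

For the earlier example, a candidate covering ALEXANDER, TOWN and FULL SCALE scores better than one covering only ALEXANDER and TOWN, which is the intuition behind preferring `launched a full scale attack on` over plain `attacked`.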
Unlike semantic head-driven generation we generate modifiers after the corresponding syntactic head has been generated, which allows for better treatment of collocations. We have specified a declarative definition of 'derivation' in our framework (including the semantic aspects of the approximate generation), yet due to space constraints we omit a full discussion of it here. The notion of derivation in generation is an important one. It allows one to abstract from the procedural details of a particular implementation and to consider the logical relationships between the structures that are manipulated. If alternative generation strategies are to be developed, clearly stating what a derivation is is an important prerequisite. If similar research had been done for other frameworks we could make comparisons with relevant generation
work; regrettably this is not the case.24
Potentially the information in the mapping rules can be used by a natural language understanding system too. However, parsing algorithms for the particular linguistic theory that we employ (DTG) have a complexity of O(n^(4k+3)), where n is the number of words in the input string and k is the number of d-edges in elementary d-trees. This is a serious overhead and we have not tried to use the mapping rules in reverse for the task of understanding.25 The algorithm has to be checked against more linguistic data and we intend to do more work on additional control mechanisms and also on using alternative generation strategies using knowledge sources free from control information.
11  Conclusion
We have presented a technique for sentence generation from conceptual graphs. The use of a non-hierarchical representation for the semantics and approximate semantic matching increases the paraphrasing power of the generator and enables the production of sentences with radically different syntactic structure due to alternative ways of grouping concepts into words. This is particularly useful for multilingual generation and in practical generators which are given input from non-linguistic applications. The use of a syntactic theory (D-Tree Grammars) allows for the production of linguistically motivated syntactic structures which will pay off in terms of better coverage of the language and overall maintainability of the generator. The syntactic theory also affects the processing — we have augmented the syntactic operations to account for the integration of the semantics. The generation architecture makes explicit the decisions that have to be taken and allows for experiments with different generation strategies using the same declarative knowledge sources.26
24 Yet there has been work on a unified approach to systemic, unification and classification approaches to generation. For more details see (Mellish 1991).
25 The first author is involved in a large project (with David Weir & John Carroll at the University of Sussex) for "Analysis of Naturally-Occurring English Text with Stochastic Lexicalised Grammars" which uses the same grammar formalism (D-Tree Grammars). The goal of the project is to develop a wide-coverage parsing system for English. From the point of view of generation it is interesting to investigate the bidirectionality of the grammar, i.e., whether the grammar used for parsing can be used for generation. More details about the above-mentioned project can be found at http://www.cogs.susx.ac.uk/lab/nlp/dtg/.
26 More details about the PROTECTOR generation system are available on the world-wide web: http://www.cogs.susx.ac.uk/lab/nlp/nicolas/.
REFERENCES
Aït-Kaci, Hassan & Andreas Podelski. 1993. "Towards a Meaning of LIFE". Journal of Logic Programming 16:3&4.195-234.
Antonacci, F. et al. 1992. "Analysis and Generation of Italian Sentences". Conceptual Structures: Current Research and Practice ed. by Timothy Nagle et al., 437-460. London: Ellis Horwood.
Boyer, Michel & Guy Lapalme. 1985. "Generating Paraphrases from Meaning-Text Semantic Networks". Computational Intelligence 1:1.103-117.
Caspari, Rudolf & Ludwig Schmid. 1994. "Parsing and Generation in TrUG [in German]". Verbmobil Report 40. Siemens AG.
Doran, Christine et al. 1994. "XTAG — A Wide Coverage Grammar for English". Proceedings of the 15th International Conference on Computational Linguistics (COLING-94), 922-928. Kyoto, Japan.
Erbach, Gregor. 1995. Bottom-Up Earley Deduction for Preference-Driven Natural Language Processing. Ph.D. dissertation, University of the Saarland. Saarbrücken, Germany.
Foo, Norman et al. 1992. "Semantic Distance in Conceptual Graphs". Conceptual Structures: Current Research and Practice ed. by Timothy Nagle et al., 149-154. London: Ellis Horwood.
Harbusch, Karin, G. Kikui & A. Kilger. 1994. "Default Handling in Incremental Generation". Proceedings of the 15th International Conference on Computational Linguistics (COLING-94), 356-362. Kyoto, Japan.
Iordanskaja, Lidija, Richard Kittredge & Alain Polguère. 1991. "Lexical Selection and Paraphrase in a Meaning-Text Generation Model". Natural Language Generation in Artificial Intelligence and Computational Linguistics ed. by C. Paris, W. Swartout & W. Mann, 293-312. Dordrecht, The Netherlands: Kluwer.
Joshi, Aravind. 1987. "The Relevance of Tree Adjoining Grammar to Generation". Natural Language Generation ed. by Gerard Kempen, 233-252. Dordrecht, The Netherlands: Kluwer.
Kay, Martin. 1983. "Unification Grammar". Technical Report. Palo Alto, Calif.: Xerox Palo Alto Research Center.
Kay, Martin. 1996. "Chart Generation". Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics (ACL'96), 200-204. Santa Cruz, Calif.: Association for Computational Linguistics.
König, Esther. 1994. "Syntactic Head-Driven Generation". Proceedings of the 15th International Conference on Computational Linguistics (COLING-94), 475-481. Kyoto, Japan.
McCoy, Kathleen F., K. Vijay-Shanker & G. Yang. 1992. "A Functional Approach to Generation with TAG". Proceedings of the 30th Annual Meeting of the Association for Computational Linguistics (ACL'92), 48-55. Delaware: Association for Computational Linguistics.
McDonald, David & James Pustejovsky. 1985. "TAGs as a Grammatical Formalism for Generation". Proceedings of the 23rd Annual Meeting of the Association for Computational Linguistics (ACL'85), 94-103. Chicago, Illinois: Association for Computational Linguistics.
Mellish, Chris. 1991. "Approaches to Realization in Natural Language Generation". Natural Language and Speech ed. by Ewan Klein & Frank Veltman, 95-116. Berlin: Springer-Verlag.
Meteer, Marie. 1990. The "Generation Gap": The Problem of Expressibility in Text Planning. Ph.D. dissertation, University of Massachusetts, Mass. (Also available as COINS TR 90-04.)
Neumann, Günter. 1994. A Uniform Computational Model for Natural Language Parsing and Generation. Ph.D. dissertation, University of the Saarland, Saarbrücken, Germany.
Nicolov, Nicolas, Chris Mellish & Graeme Ritchie. 1995. "Sentence Generation from Conceptual Graphs". Conceptual Structures: Applications, Implementation and Theory (LNAI 954) ed. by G. Ellis, R. Levinson, W. Rich & J. Sowa, 74-88. Berlin: Springer.
Nogier, Jean-François. 1991. Génération Automatique de Langage et Graphes Conceptuels. Paris: Hermès.
Nogier, Jean-François & Michael Zock. 1992. "Lexical Choice as Pattern Matching". Conceptual Structures: Current Research and Practice ed. by Timothy Nagle et al., 413-436. London: Ellis Horwood.
van Noord, Gertjan. Forthcoming. "An Efficient Implementation of the Head-Corner Parser". To appear in Computational Linguistics.
Oh, Jonathan et al. 1992. "NLP: Natural Language Parsers and Generators". Proceedings of the 1st International Workshop on PEIRCE: A Conceptual Graph Workbench, 48-55. Las Cruces: New Mexico State University.
Pianesi, Fabio. 1993. "Head-Driven Bottom-Up Generation and Government and Binding: A Unified Perspective". New Concepts in Natural Language Generation ed. by H. Horacek & M. Zock, 187-214. London: Pinter.
Rambow, Owen, K. Vijay-Shanker & David Weir. 1995a. "D-Tree Grammars". Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics (ACL'95), 151-158. Boston, Mass.: Association for Computational Linguistics.
Rambow, Owen, K. Vijay-Shanker & David Weir. 1995b. "Parsing D-Tree Grammars". Proceedings of the International Workshop on Parsing Technologies (IWPT'95), 252-259. Prague.
Reiter, Ehud. 1991. "A New Model of Lexical Choice for Nouns". Computational Intelligence (Special Issue on Natural Language Generation) 7:4.240-251.
Shapiro, Stuart. 1982. "Generalized Augmented Transition Network Grammars for Generation from Semantic Networks". Computational Linguistics 8:1.12-25.
Shapiro, Stuart. 1989. "The CASSIE Projects: An Approach to NL Competence". Proceedings of the 4th Portuguese Conference on AI: EPIA-89 (LNAI 390), 362-380. Berlin: Springer.
Shemtov, Hadar. Forthcoming. "Generation of Paraphrases from Ambiguous Logical Forms".
Shieber, Stuart, Gertjan van Noord, Robert Moore & Fernando Pereira. 1990. "A Semantic Head-Driven Generation Algorithm for Unification-Based Formalisms". Computational Linguistics 16:1.30-42.
Simmons, R. & J. Slocum. 1972. "Generating English Discourse from Semantic Networks". Communications of the Association for Computing Machinery (CACM) 15:10.891-905.
Smith, Mark, Roberto Garigliano & Richard Morgan. 1994. "Generation in the LOLITA System: An Engineering Approach". Proceedings of the 7th International Workshop on Natural Language Generation, 241-244. Kennebunkport, Maine, U.S.A.
Sowa, John. 1984. Conceptual Structures: Information Processing in Mind and Machine. Reading, Mass.: Addison-Wesley.
Sowa, John. 1992. "Conceptual Graphs Summary". Conceptual Structures: Current Research and Practice ed. by Timothy Nagle et al., 3-51. London: Ellis Horwood.
Svenberg, Stefan. 1994. "Representing Conceptual and Linguistic Knowledge for Multilingual Generation in a Technical Domain". Proceedings of the 7th International Workshop on Natural Language Generation (IWNLG'94), 245-248. Kennebunkport, Maine, U.S.A.
van Rijn, Afke. 1991. Natural Language Communication between Man and Machine. Ph.D. dissertation, Technical University Delft, The Netherlands.
Wahlster, Wolfgang et al. 1991. "WIP: The Coordinated Generation of Multimodal Presentations from a Common Representation". Technical Report RR 91-08. Saarbrücken, Germany: DFKI.
Example-Based Optimisation of Surface-Generation Tables

CHRISTER SAMUELSSON
Universität des Saarlandes
Abstract

A method is given that 'inverts' a logic grammar and displays it from the point of view of the logical form, rather than from that of the word string. LR-compiling techniques are used to allow a recursive-descent generation algorithm to perform 'functor merging' much in the same way as an LR parser performs prefix merging. This is an improvement on the semantic-head-driven generator that results in a much smaller search space. The amount of semantic lookahead can be varied, and appropriate tradeoff points between table size and resulting nondeterminism can be found automatically. This can be done by removing all spurious nondeterminism for input sufficiently close to the examples of a training corpus, and large portions of it for other input, while preserving completeness.1
1  Introduction
With the emergence of fast algorithms and optimisation techniques for syntactic analysis, such as the use of explanation-based learning in conjunction with LR parsing, see (Samuelsson & Rayner 1991) and subsequent work, surface generation has become a major bottleneck in NLP systems. Surface generation will here be viewed as the inverse problem of syntactic analysis and subsequent semantic interpretation. The latter consists in constructing some semantic representation of an input word-string based on the syntactic and semantic rules of a formal grammar. In this article, we will limit ourselves to logic grammars that attribute word strings with expressions in some logical formalism represented as terms with a functor-argument structure. The surface generation problem then consists in assigning an output
1 I wish to thank greatly Gregor Erbach, Jussi Karlgren, Manny Rayner, Hans Uszkoreit, Mats Wirén and the anonymous reviewers of ACL, EACL, IJCAI and RANLP for valuable feedback on previous versions of this article. Special credit is due to Kristina Striegnitz, who assisted with the implementation. Parts of this article have previously appeared as (Samuelsson 1995). The presented work was funded by the N3 "Bidirektionale Linguistische Deduktion (BiLD)" project in the Sonderforschungsbereich 314 Künstliche Intelligenz — Wissensbasierte Systeme.
word-string to such a term. This is a common scenario in conjunction with, for example, transfer-based machine-translation systems employing reversible grammars, and it is different from that when a deep generator or a text planner is available to guide the surface generator. In general, both these mappings are many-to-many: a word string that can be mapped to several distinct logical forms is said to be ambiguous. A logical form that can be assigned to several different word strings is said to have multiple paraphrases. We want to create a generation algorithm that generates a word string by recursively descending through a logical form, while delaying the choice of grammar rules to apply as long as possible. This means that we want to process different rules or rule combinations that introduce the same piece of semantics in parallel until they branch apart. This will reduce the amount of spurious search, since we will gain more information about the rest of the logical form before having to commit to a particular grammar rule. In practice, this means that we want to perform 'functor merging' much in the same way as an LR parser performs prefix merging by employing parsing tables compiled from the grammar. One obvious way of doing this is to use LR-compilation techniques to compile generation tables. This will however require that we reformulate the grammar from the point of view of the logical form, rather than from that of the word string from which it is normally displayed. The rest of the paper is structured as follows: We will first review basic LR compilation of parsing tables in Section 2. The grammar-inversion procedure turns out to be most easily explained in terms of the semantic-head-driven generation (SHDG) algorithm. We will therefore proceed to outline the SHDG algorithm in Section 3. The grammar inversion itself is described in Section 4, while LR compilation of generation tables is discussed in Section 5.
The generation algorithm is presented in Section 6. The example-based optimisation technique turns out to be most easily explained as a straightforward extension of a simpler optimisation technique predating it, which is why this simpler technique is given in Section 7. This extension is described in Section 8, and the relation between this example-based optimisation technique and explanation-based learning is discussed in Section 9.
2  LR compilation for parsing
LR compilation in general is well-described in, for example, (Aho et al. 1986:215-247). Here we will only sketch out the main ideas. An LR parser is basically a pushdown automaton, i.e., it has a pushdown stack in addition to a finite set of internal states and a reader head for scanning the input string from left to right one symbol at a time. The stack is used in a characteristic way. The items on the stack consist of alternating grammar symbols and states. The current state is simply the state on top of the stack. The most distinguishing feature of an LR parser is however the form of the transition relation — the action and goto tables. A nondeterministic LR parser can in each step perform one of four basic actions. In state S with lookahead symbol2 Sym it can:

1. accept: Halt and signal success.
2. error: Fail and backtrack.
3. shift S2: Consume the input symbol Sym, push it onto the stack, and transit to state S2 by pushing it onto the stack.
4. reduce R: Pop off two items from the stack for each grammar symbol in the RHS of grammar rule R, inspect the stack for the old state S1 now on top of the stack, push the LHS of rule R onto the stack, and transit to the state S2 determined by goto(S1,LHS,S2) by pushing S2 onto the stack.

Consider the small sample grammar given in Figure 1. To make this simple grammar slightly more interesting, the recursive Rule 1, S → S QM, allows the addition of a question mark (QM) to the end of a sentence (S), as in John sleeps?. The LHS S is then interpreted as a yes-no question version of the RHS S. Each internal state consists of a set of dotted items. Each item in turn corresponds to a grammar rule. The current string position is indicated by a dot. For example, Rule 2, S → NP VP, yields the item S → NP • VP, which corresponds to just having found an NP and now searching for a VP. In the compilation phase, new states are induced from old ones.
For the indicated string position, a possible grammar symbol is selected and the dot is advanced one step in all items where this particular grammar symbol immediately follows the dot, and the resulting new items will constitute the kernel of the new state. Non-kernel items are added to these by selecting 2
The lookahead symbol is the next symbol in the input string, i.e., the symbol under the reader head.
298
CHRISTER SAMUELSSON
1  S  → S QM
2  S  → NP VP
3  VP → VP PP
4  VP → VP AdvP
5  VP → Vi
6  VP → Vt NP
7  PP → P NP

   NP   → John
   NP   → Mary
   NP   → Paris
   Vi   → sleeps
   Vt   → sees
   P    → in
   AdvP → today
   QM   → ?
Fig. 1: Sample grammar

grammar rules whose LHS match grammar symbols at the new string position in the new items. In each non-kernel item, the dot is at the beginning of the rule. If a set of items is constructed that already exists, then this search branch is abandoned and the recursion terminates.
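The closure and goto steps just described can be sketched directly; this is an illustrative implementation under our own representation (dotted items as `(rule_index, dot_position)` pairs over the sample grammar of Fig. 1), not the compiler discussed later in the article.

```python
# Dotted-item closure and goto for the sample grammar of Fig. 1
# (rule 0 is the dummy top rule S' -> S discussed in the text).
GRAMMAR = [
    ("S'", ["S"]),               # rule 0: dummy top rule
    ("S",  ["S", "QM"]),         # rule 1
    ("S",  ["NP", "VP"]),        # rule 2
    ("VP", ["VP", "PP"]),        # rule 3
    ("VP", ["VP", "AdvP"]),      # rule 4
    ("VP", ["Vi"]),              # rule 5
    ("VP", ["Vt", "NP"]),        # rule 6
    ("PP", ["P", "NP"]),         # rule 7
]

def closure(kernel):
    """Add non-kernel items: for every symbol right after a dot, add
    all rules expanding it, with the dot at the beginning."""
    items = set(kernel)
    changed = True
    while changed:
        changed = False
        for rule, dot in list(items):
            _, rhs = GRAMMAR[rule]
            if dot < len(rhs):
                for r, (lhs, _) in enumerate(GRAMMAR):
                    if lhs == rhs[dot] and (r, 0) not in items:
                        items.add((r, 0))
                        changed = True
    return items

def goto(items, symbol):
    """Advance the dot over `symbol` and close the resulting kernel."""
    kernel = {(r, d + 1) for r, d in items
              if d < len(GRAMMAR[r][1]) and GRAMMAR[r][1][d] == symbol}
    return closure(kernel) if kernel else set()
```

Starting from the dummy kernel item `(0, 0)`, `closure` yields State 1 of Fig. 2, and `goto` over NP yields State 3.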
State 1:  S' → • S       S → • S QM      S → • NP VP
State 2:  S' → S •       S → S • QM
State 3:  S → NP • VP    VP → • VP PP    VP → • VP AdvP    VP → • Vi    VP → • Vt NP
State 4:  S → NP VP •    VP → VP • PP    VP → VP • AdvP    PP → • P NP
State 5:  VP → VP PP •
State 6:  VP → VP AdvP •
State 7:  PP → P • NP
State 8:  PP → P NP •
State 9:  VP → Vi •
State 10: VP → Vt • NP
State 11: VP → Vt NP •
State 12: S → S QM •
Fig. 2: LR-parsing states for the sample grammar

The state-construction phase starts off by creating an initial set consisting of a single dummy kernel item and its non-kernel closure. This is State 1 in Figure 2. The dummy item introduces a dummy top grammar symbol as its LHS, while the RHS consists of the old top symbol, and the dot is at the beginning of the rule. In the example, this is the item S' → • S. The rest
of the states are induced from the initial state. The states resulting from the sample grammar of Figure 1 are shown in Figure 2, and these in turn will yield the parsing tables of Figure 3. The entry "s3" in the action table, for example, should be interpreted as "shift the lookahead symbol onto the stack and transit to State 3". The entry "r7" should be interpreted as "reduce by Rule 7". The accept action is denoted "acc". The goto entries, like "g4", simply indicate what state to transit to once a nonterminal of that type has been constructed.

State | NP  | VP | PP | AdvP | Vi | Vt  | P  | S  | QM  | eos
   1  | s3  |    |    |      |    |     |    | g2 |     |
   2  |     |    |    |      |    |     |    |    | s12 | acc
   3  |     | g4 |    |      | s9 | s10 |    |    |     |
   4  |     |    | g5 | s6   |    |     | s7 |    | r2  | r2
   5  |     |    |    | r3   |    |     | r3 |    | r3  | r3
   6  |     |    |    | r4   |    |     | r4 |    | r4  | r4
   7  | s8  |    |    |      |    |     |    |    |     |
   8  |     |    |    | r7   |    |     | r7 |    | r7  | r7
   9  |     |    |    | r5   |    |     | r5 |    | r5  | r5
  10  | s11 |    |    |      |    |     |    |    |     |
  11  |     |    |    | r6   |    |     | r6 |    | r6  | r6
  12  |     |    |    |      |    |     |    |    | r1  | r1
Fig. 3: LR-parsing tables for the sample grammar

In conjunction with grammar formalisms employing complex feature structures, this procedure is associated with a number of interesting problems, many of which are discussed in (Nakazawa 1991) and (Samuelsson 1994c). For example, the termination criterion must be modified: if a new set of items is constructed that is more specific than an existing one, then this search branch is abandoned and the recursion terminates. If, on the other hand, it is more general, then it replaces the old one.
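The nondeterministic driver described in Section 2 can be sketched with an excerpt of the tables above. This is our own illustration: the tables below cover only the entries needed to recognise "John sleeps" and "John sleeps ?", and the stack is simplified to states only (one entry per grammar symbol, instead of the alternating symbols and states described in the text).

```python
# Backtracking LR recognition with an excerpt of the Fig. 3 tables.
ACTION = {  # (state, lookahead) -> possible actions
    (1, "NP"): [("shift", 3)],
    (2, "QM"): [("shift", 12)], (2, "eos"): [("accept", None)],
    (3, "Vi"): [("shift", 9)],
    (4, "QM"): [("reduce", 2)], (4, "eos"): [("reduce", 2)],
    (9, "QM"): [("reduce", 5)], (9, "eos"): [("reduce", 5)],
    (12, "QM"): [("reduce", 1)], (12, "eos"): [("reduce", 1)],
}
GOTO = {(1, "S"): 2, (3, "VP"): 4}
RULES = {1: ("S", 2), 2: ("S", 2), 5: ("VP", 1)}  # rule -> (LHS, |RHS|)

def parses(preterminals):
    """Try every applicable action in turn, backtracking on failure."""
    toks = preterminals + ["eos"]
    def step(stack, i):
        for op, arg in ACTION.get((stack[-1], toks[i]), []):
            if op == "accept":
                yield True
            elif op == "shift":
                yield from step(stack + [arg], i + 1)
            else:                        # reduce by rule `arg`
                lhs, n = RULES[arg]
                rest = stack[:len(stack) - n]
                yield from step(rest + [GOTO[(rest[-1], lhs)]], i)
    return any(step([1], 0))
```

For example, `parses(["NP", "Vi"])` recognises John sleeps via s3, s9, r5, r2 and accept, and `parses(["NP", "Vi", "QM"])` additionally exercises s12 and r1 for John sleeps?.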
3  The semantic-head-driven generation algorithm
Generators found in large-scale systems such as the DFKI DISCO system (Uszkoreit et al. 1994), or the SRI Core Language Engine (Alshawi (ed.) 1992:268-275), tend typically to be based on the semantic-head-driven generation (SHDG) algorithm. The SHDG algorithm is well-described in (Shieber et al. 1990); here we will only outline the main features.
The grammar rules of Figure 1 have been attributed with logical forms as shown in Figure 4. The notation has been changed so that each constituent consists of a quadruple ⟨Cat, Sem, W0, W1⟩, where W0 and W1 form a difference list representing the word string that Cat spans, and Sem is the logical form. For example, the logical form corresponding to the LHS S of the ⟨S, mod(X,Y), W0, W⟩ → ⟨S, X, W0, W1⟩ ⟨QM, Y, W1, W⟩ rule consists of a modifier Y added to the logical form X of the RHS S. As we can see from the last grammar rule, this modifier is in turn realised as ynq.

[Figure 4: rules 1-7 of the sample grammar attributed with logical forms; figure content lost in extraction.]
For the SHDG algorithm, the grammar is divided into chain rules and non-chain rules. Chain rules have a distinguished RHS constituent, the semantic head, that has the same logical form as the LHS constituent, modulo λ-abstractions; non-chain rules lack such a constituent. In particular, lexicon entries are non-chain rules, since they do not have any RHS constituents at all. This distinction is made since the generation algorithm treats the two rule types quite differently. In the example grammar, rules 2 and 5 through 7 are chain rules, while the remaining ones are non-chain rules. A simple semantic-head-driven generator might work as follows: Given a grammar symbol and a piece of logical form, the generator looks for a non-chain rule with the given semantics. The constituents of the RHS of that rule are then generated recursively, after which the LHS is connected
to the given grammar symbol using chain rules. At each application of a chain rule, the rest of the RHS constituents, i.e., the non-head constituents, are generated recursively. The particular combination of connecting chain rules used is often referred to as a chain. The generator starts off with the top symbol of the grammar and the logical form corresponding to the string that is to be generated. The inherent problem with the SHDG algorithm is that each rule combination is tried in turn, while the possibilities of prefiltering are rather limited, leading to a large amount of spurious search. The generation algorithm presented in the current article does not suffer from this problem; what the new algorithm in effect does is to process all chains from a particular set of grammar symbols down to some particular piece of logical form in parallel before any rule is applied, rather than to construct and try each one separately in turn.
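The pivot-and-climb behaviour just described can be sketched for a fragment of the sample grammar. This is a deliberately simplified toy, not the cited algorithm: the `LEXICON` and `CHAIN` tables and the argument-placement convention are our own assumptions; non-chain rules are all lexical pivots, and each chain link realises at most one NP argument.

```python
# Toy semantic-head-driven generation over the Fig. 1 fragment.
LEXICON = {           # functor -> (pivot category, word)
    "sleep": ("Vi", "sleeps"),
    "see":   ("Vt", "sees"),
    "john":  ("NP", "John"),
    "mary":  ("NP", "Mary"),
}
CHAIN = {             # category -> (parent category, argument side)
    "Vi": ("VP", None),        # VP -> Vi consumes no argument
    "Vt": ("VP", "right"),     # VP -> Vt NP: object to the right
    "VP": ("S", "left"),       # S -> NP VP: subject to the left
}

def generate(goal_cat, sem):
    """Pick the lexical pivot for the semantic functor, then climb a
    chain of chain rules up to `goal_cat`, realising arguments."""
    functor, args = sem if isinstance(sem, tuple) else (sem, [])
    cat, word = LEXICON[functor]
    words, args = [word], list(args)
    while cat != goal_cat:
        cat, side = CHAIN[cat]
        if side == "left":                       # subject argument
            words = generate("NP", args.pop(0)) + words
        elif side == "right":                    # object argument
            words = words + generate("NP", args.pop())
    return words
```

For instance, `generate("S", ("see", ["john", "mary"]))` pivots on sees, attaches the object via the VP → Vt NP link and the subject via the S → NP VP link, yielding John sees Mary.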
4  Grammar inversion
Before we can invert the grammar, we must put it in normal form. We will use a variant of chain and non-chain rules, namely functor-introducing rules corresponding to non-chain rules, and argument-filling rules corresponding to chain rules. The inversion step is based on the assumption that there are no other types of rules. Since the generator will work by recursive descent through the logical form, we wish to rearrange the grammar so that arguments are generated together with their functors. To this end we introduce another difference list A0 and A to pass down the arguments introduced by argument-filling rules to the corresponding functor-introducing rules. Here the latter rules are assumed to be lexical, following the tradition in GPSG where the presence of the SUBCAT feature implies a preterminal grammar symbol, see e.g., (Gazdar et al. 1985:33), but this is really immaterial for the algorithm. The grammar of Figure 4 is shown in normal form in Figure 5. The grammar is compiled into this form by inspecting the flow of arguments through the logical forms of the constituents of each rule. In the functor-introducing rules, the RHS is rearranged to mirror the argument order of the LHS logical form. The argument-filling rules have only one RHS constituent — the semantic head — and the rest of the original RHS constituents are added to the argument list of the head constituent. Note, for example, how the NP is added to the argument list of the VP in Rule 2, or to the argument list of the P in Rule 7. This is done automatically, although currently, the exact flow of arguments is specified manually.
302
CHRISTER SAMUELSSON
Functor-introducing rules
1 ⟨S, mod(X,Y), W0, W, ε, ε⟩ → ⟨S, X, W0, W1, ε, ε⟩ ⟨QM, Y, W1, W, ε, ε⟩
3 ⟨VP, X^mod(Y,Z), W0, W, A0, A⟩ → ⟨VP, X^Y, W0, W1, A0, A⟩ ⟨AdvP, Z, W1, W, ε, ε⟩
4 ⟨VP, X^mod(Y,Z), W0, W, A0, A⟩ → ⟨VP, X^Y, W0, W1, A0, A⟩ ⟨PP, Z, W1, W, ε, ε⟩
⟨NP, john, [John|W], W, A, ε⟩ → A
⟨NP, mary, [Mary|W], W, A, ε⟩ → A
⟨NP, paris, [Paris|W], W, A, ε⟩ → A
⟨Vi, X^sleep(X), [sleeps|W], W, A, ε⟩ → A
⟨Vt, X^Y^see(X,Y), [sees|W], W, A, ε⟩ → A
⟨P, X^in(X), [in|W], W, A, ε⟩ → A
⟨AdvP, today, [today|W], W, A, ε⟩ → A
⟨QM, ynq, [?|W], W, A, ε⟩ → A

Argument-filling rules
2 ⟨S, Y, W0, W, ε, ε⟩ → ⟨VP, X^Y, W1, W, [⟨NP, X, W0, W1⟩], ε⟩
5 ⟨VP, X, W0, W, A0, A⟩ → ⟨Vi, X, W0, W, A0, A⟩
6 ⟨VP, Y, W0, W, A0, A⟩ → ⟨Vt, X^Y, W0, W1, [⟨NP, X, W1, W⟩|A0], A⟩
7 ⟨PP, Y, W0, W, A0, A⟩ → ⟨P, X^Y, W0, W1, [⟨NP, X, W1, W⟩|A0], A⟩

Fig. 5: Sample grammar in normal form
We assume that there are no purely argument-filling cycles. For rules that actually fill in arguments, this is obviously impossible, since the number of arguments decreases strictly. For the slightly degenerate case of argument-filling rules which only pass along the logical form, such as the ⟨VP,X⟩ → ⟨Vi,X⟩ rule, this is equivalent to the off-line parsability requirement (Kaplan & Bresnan 1982:264-266).3 We require this in order to avoid an infinite number of chains, since each possible chain will be expanded out in the inversion step. Since subcategorisation lists of verbs are bounded in length, PATR-II-style VP rules do not pose a serious problem, which on the other hand the 'adjunct-as-argument' approach taken in (Bouma & van Noord 1994) may do. However, this problem is common to a number of other generation algorithms, including the SHDG algorithm. Let us return to the scenario for the SHDG algorithm given at the end of Section 3: we have a piece of logical form and a grammar symbol, and

3 If the RHS Vi were a VP, we would have a purely argument-filling cycle of length 1.
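The no-cycle requirement can be checked mechanically. A sketch under an invented rule encoding: treat each argument-filling rule that merely passes the logical form along as an edge from its LHS category to its sole RHS category, and search that graph for a cycle.

```python
def has_passing_cycle(edges):
    # `edges` lists (lhs_category, rhs_category) pairs, one per rule that
    # only passes the logical form along, e.g. ("VP", "Vi") for <VP,X> -> <Vi,X>.
    def visit(cat, seen):
        # Depth-first search; a revisited category means a cycle.
        if cat in seen:
            return True
        return any(visit(nxt, seen | {cat}) for lhs, nxt in edges if lhs == cat)
    return any(visit(cat, frozenset()) for cat, _ in edges)

print(has_passing_cycle([("VP", "Vi")]))   # -> False
print(has_passing_cycle([("VP", "VP")]))   # -> True: the footnote's length-1 cycle
```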
OPTIMISATION OF GENERATION TABLES
303
we wish to connect a non-chain rule with this particular logical form to the given grammar symbol through a chain. We will generalise this scenario just slightly to the case where a set of grammar symbols is given, rather than a single one. Each inverted rule will correspond to a particular chain of argument-filling (chain) rules connecting a functor-introducing (non-chain) rule introducing this logical form to a grammar symbol in the given set. The arguments introduced by this chain will be collected and passed down to the functors that consume them in order to ensure that each of the inverted rules has a RHS matching the structure of the LHS logical form. The normalised sample grammar of Figure 5 will result in the inverted grammar of Figure 6. Note how the right-hand sides reflect the argument structure of the left-hand-side logical forms. As mentioned previously, the collected arguments are currently assumed to correspond to functors introduced by lexical entries, but the procedure can readily be modified to accommodate grammar rules with a non-empty RHS, where some of the arguments are consumed by the LHS logical form. The grammar inversion step is combined with the LR-compilation step. This is convenient for several reasons: Firstly, the termination criteria and the database maintenance issues are the same in both steps. Secondly, since the LR-compilation step employs a top-down rule-invocation scheme, this will ensure that the arguments are passed down to the corresponding functors. In fact, invoking inverted grammar rules merely requires first invoking a chain of argument-filling rules and then terminating it with a functor-introducing rule.

5 LR compilation for generation
Just as when compiling LR-parsing tables, the compiler operates on sets of dotted items. Each item consists of a partially processed inverted grammar rule, with a dot marking the current position. Here the current position is an argument position of the LHS logical form, rather than some position in the input string. New states are induced from old ones: For the indicated argument position, a possible logical form is selected and the dot is advanced one step in all items where this particular logical form can occur in the current argument position, and the resulting new items constitute a new state. All possible grammar symbols that can occur in the old argument position and that can have this logical form are then collected. From these, all rules with
⟨S, mod(X,Y), W0, W, ε, ε⟩ → ⟨S, X, W0, W1, ε, ε⟩ ⟨QM, Y, W1, W, ε, ε⟩
⟨S, mod(Y,Z), W0, W, ε, ε⟩ → ⟨VP, X^Y, W1, W2, [⟨NP, X, W0, W1⟩], ε⟩ ⟨AdvP, Z, W2, W, ε, ε⟩
⟨S, mod(Y,Z), W0, W, ε, ε⟩ → ⟨VP, X^Y, W1, W2, [⟨NP, X, W0, W1⟩], ε⟩ ⟨PP, Z, W2, W, ε, ε⟩
⟨VP, X^mod(Y,Z), W1, W, [⟨NP, X, W0, W1⟩], ε⟩ → ⟨VP, X^Y, W1, W2, [⟨NP, X, W0, W1⟩], ε⟩ ⟨AdvP, Z, W2, W, ε, ε⟩
⟨VP, X^mod(Y,Z), W1, W, [⟨NP, X, W0, W1⟩], ε⟩ → ⟨VP, X^Y, W1, W2, [⟨NP, X, W0, W1⟩], ε⟩ ⟨PP, Z, W2, W, ε, ε⟩
⟨S, sleep(X), W0, W, ε, ε⟩ → ⟨NP, X, W0, [sleeps|W], ε, ε⟩
⟨VP, X^sleep(X), [sleeps|W], W, [⟨NP, X, W0, [sleeps|W]⟩], ε⟩ → ⟨NP, X, W0, [sleeps|W], ε, ε⟩
⟨S, see(X,Y), W0, W, ε, ε⟩ → ⟨NP, X, W1, W, ε, ε⟩ ⟨NP, Y, W0, [sees|W1], ε, ε⟩
⟨VP, Y^see(X,Y), [sees|W0], W, [⟨NP, Y, W1, [sees|W0]⟩], ε⟩ → ⟨NP, X, W0, W, ε, ε⟩ ⟨NP, Y, W1, [sees|W0], ε, ε⟩
⟨PP, in(X), [in|W0], W, ε, ε⟩ → ⟨NP, X, W0, W, ε, ε⟩
⟨NP, john, [John|W], W, ε, ε⟩ → ε
⟨NP, mary, [Mary|W], W, ε, ε⟩ → ε
⟨NP, paris, [Paris|W], W, ε, ε⟩ → ε
⟨AdvP, today, [today|W], W, ε, ε⟩ → ε
⟨QM, ynq, [?|W], W, ε, ε⟩ → ε

Fig. 6: Inverted sample grammar

a matching LHS are invoked from the inverted grammar. Each such rule will give rise to a new item where the dot marks the first argument position, and the set of these new items will constitute another new state. If a new set of items is constructed that is more specific than an existing one, then this search branch is abandoned and the recursion terminates. If it on the other hand is more general, then it replaces the old one. The state-construction phase starts off by creating an initial set consisting of a single dummy item with a dummy top grammar symbol and a dummy top logical form, corresponding to a dummy inverted grammar rule. In the sample grammar, this would be the rule ⟨S', f(X), W0, W, ε, ε⟩ → ⟨S, X, W0, W, ε, ε⟩. The dot is at the beginning of the rule, selecting the first and only argument. The rest of the states are induced from this one.
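In outline, the induction step can be sketched as follows (the data types are invented simplifications that key everything on LF functors and ignore the word strings): an item pairs an inverted rule with a dot over its argument positions, and a new state is obtained by advancing the dot in every item whose current argument position can hold the selected logical form.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Item:
    rule: tuple   # (lhs_functor, rhs_argument_functors)
    dot: int      # index of the current argument position

def induce(state, functor):
    # Advance the dot in every item whose current argument can be `functor`;
    # the resulting items constitute the new state.
    return frozenset(Item(it.rule, it.dot + 1) for it in state
                     if it.dot < len(it.rule[1]) and it.rule[1][it.dot] == functor)

s0 = frozenset({Item(("mod", ("see", "ynq")), 0),
                Item(("mod", ("sleep", "ynq")), 0)})
print(induce(s0, "see"))   # only the see-item survives, with the dot advanced
```

A real compiler would also carry the specificity check (discarding new item sets that are more specific than existing ones) that terminates the recursion.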
The first three states resulting from the inverted grammar of Figure 6 are shown in Figure 7, where the difference lists representing the word strings are omitted.

State 1
⟨S', f(X), ε, ε⟩ ⇒ . ⟨S, X, ε, ε⟩

State 2
⟨S, mod(X,Y), ε, ε⟩ ⇒ . ⟨S, X, ε, ε⟩ ⟨QM, Y, ε, ε⟩
⟨S, mod(Y,Z), ε, ε⟩ ⇒ . ⟨VP, X^Y, [⟨NP, X⟩], ε⟩ ⟨AdvP, Z, ε, ε⟩
⟨S, mod(Y,Z), ε, ε⟩ ⇒ . ⟨VP, X^Y, [⟨NP, X⟩], ε⟩ ⟨PP, Z, ε, ε⟩

State 3
⟨S, mod(X,Y), ε, ε⟩ ⇒ . ⟨S, X, ε, ε⟩ ⟨QM, Y, ε, ε⟩
⟨S, mod(Y,Z), ε, ε⟩ ⇒ . ⟨VP, X^Y, [⟨NP, X⟩], ε⟩ ⟨AdvP, Z, ε, ε⟩
⟨S, mod(Y,Z), ε, ε⟩ ⇒ . ⟨VP, X^Y, [⟨NP, X⟩], ε⟩ ⟨PP, Z, ε, ε⟩
⟨VP, X^mod(Y,Z), [⟨NP, X⟩], ε⟩ ⇒ . ⟨VP, X^Y, [⟨NP, X⟩], ε⟩ ⟨AdvP, Z, ε, ε⟩
⟨VP, X^mod(Y,Z), [⟨NP, X⟩], ε⟩ ⇒ . ⟨VP, X^Y, [⟨NP, X⟩], ε⟩ ⟨PP, Z, ε, ε⟩

Fig. 7: The first three generation states

The sets of items are used to compile the generation tables in the same way as is done for LR parsing. The goto entries correspond to transiting from one argument of a term to the next, and thus advancing the dot one step. The reductions correspond to applying the rules of items that have the dot at the end of the RHS, as is the case when LR parsing. There is no obvious analogy to the shift action — the closest thing would be the descend actions transiting from a functor to one of its arguments. Note that there is no need to include the logical form of each lexicon entry in the generation tables. Instead, a typing of the logical forms can be introduced, and a representative of each type used in the actual tables, rather than the individual logical forms. This decreases the size of the tables drastically. For example, there is no point in distinguishing the states reached by traversing john, mary and paris, apart from ensuring that the correct word is added to the output word-string. This is accomplished much in the same way as preterminals, rather than individual words, figure in LR-parsing tables.
6 The generation algorithm
The generator works by recursive descent through the logical form while transiting between the internal states. It is driven by the descend, goto and reduce tables. A pushdown stack is used to store intermediate constituents. When generating a word string, the current state and logical form determine a transition to a new state, corresponding to the first argument of the logical form, through the descend table. A substring is generated recursively from the argument logical form, and this constituent is pushed onto the stack. The argument logical form, together with the new current state, determines a transition to the next state through the goto table. The next state corresponds to the next argument of the original logical form, and another substring is generated from this argument logical form, etc. When no more arguments remain, an inverted grammar rule is selected nondeterministically by the reduce table and applied to the top portion of the stack, constructing a word string corresponding to the original logical form and completing this generation cycle.4 We now turn to optimising the generation tables.
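As a rough sketch of this control loop (hand-built toy tables with an invented encoding; the nondeterministic choice among reductions is elided to a single rule per state):

```python
def functor(lf):
    # Head symbol of a logical form: 'sleep' for ("sleep", "john").
    return lf[0] if isinstance(lf, tuple) else lf

def generate(lf, state, tables):
    descend, goto, reduce_ = tables
    state = descend[(state, functor(lf))]           # enter the first argument
    stack = []
    for arg in (lf[1:] if isinstance(lf, tuple) else ()):
        stack.append(generate(arg, state, tables))  # substring for this argument
        state = goto[(state, functor(arg))]         # move on to the next argument
    return reduce_[state](stack)                    # apply one inverted rule

# Toy tables for sleep(john); a real system compiles them as in Section 5.
descend = {(1, "sleep"): 2, (2, "john"): 3}
goto = {(2, "john"): 4}
reduce_ = {3: lambda args: "John", 4: lambda args: args[0] + " sleeps"}

print(generate(("sleep", "john"), 1, (descend, goto, reduce_)))  # -> John sleeps
```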
7 Optimising the generation tables
The basic idea underlying the optimisation technique presented in this article is to remove as much nondeterminism from the generation tables as possible. One problem is that it may be impossible to remove all nondeterminism for the simple reason that the current piece of logical form may in fact allow multiple paraphrases. In this case, we say that we have 'real' nondeterminism. On the other hand, it may be the case that although locally, several alternatives are possible, subsequent generation may rule out all but one of them. We will call this 'spurious' nondeterminism. Due to the grammar inversion, and the way the sets of items are constructed, all LHS logical forms of the items in some particular state will be the same, and will thus have equal arity. Thus, there will be nothing analogous to shift-reduce conflicts in the resulting generation tables, only reduce-reduce conflicts. This means that the latter is the sole source of nondeterminism, and that this will arise only in states with more than one possible reduction. By inspecting the number of items left in each 'reductive state', i.e., each state where the dot is at the end of the rules, we can determine whether or not the generation tables will be deterministic.

4 This is a bottom-up rule-invocation scheme. It could easily be modified so that a rule is instead applied before constructing the substrings recursively, resulting in a top-down rule-invocation scheme.

The logical form can be inspected down to an arbitrary depth of recursion when compiling the sets of items, and this parameter can be varied. This is closely related to the use of lookahead symbols in an LR parser; increasing the depth is analogous to increasing the number of lookahead symbols. The amount of semantic lookahead will be reflected in the goto and descend table entries. No semantic lookahead would mean only taking the functor of the logical form into consideration, and in the example above, a typical action table entry would be descend(1, mod(_,_), 2).5 This would mean that the generator would operate on State 2 of Figure 7 when generating from the first argument of the mod/2 term, and both the S alternative and the (merged) VP alternative(s) would be attempted nondeterministically. By taking the arguments of the logical form into account, the degree of nondeterminism can be reduced, and for the grammar given in Figure 1, it is eliminated completely. In the example, if the second argument of the mod/2 term is ynq, then only the S alternative will be considered when generating from the first argument, since the relevant descend entries and states will be those of Figure 8. The optimal depth may vary for each individual table entry, and even within it, and a scheme has been devised to automatically find such an optimum.

descend(1, mod(mod(_,_),ynq), 2A).
descend(1, mod(see(_,_),ynq), 2B).
descend(1, mod(sleep(_),ynq), 2C).

State 2A
⟨S, mod(mod(X,Y),ynq), ε, ε⟩ ⇒ . ⟨S, mod(X,Y), ε, ε⟩ ⟨QM, ynq, ε, ε⟩

State 2B
⟨S, mod(see(X,Y),ynq), ε, ε⟩ ⇒ . ⟨S, see(X,Y), ε, ε⟩ ⟨QM, ynq, ε, ε⟩

State 2C
⟨S, mod(sleep(X),ynq), ε, ε⟩ ⇒ . ⟨S, sleep(X), ε, ε⟩ ⟨QM, ynq, ε, ε⟩

Fig. 8: Alternative generation states
5 Here "_" denotes a don't-care variable.
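How such lookahead-keyed entries discriminate between states can be mimicked with a small matcher (the nested-tuple term encoding is invented for illustration): each descend entry is keyed by a pattern in which "_" is the don't-care, and deeper patterns select more specific target states.

```python
def matches(pat, term):
    if pat == "_":                                   # don't-care variable
        return True
    if isinstance(pat, str) or isinstance(term, str):
        return pat == term                           # atoms must coincide
    return len(pat) == len(term) and all(
        matches(p, t) for p, t in zip(pat, term))

DESCEND = [  # the descend entries of Figure 8, as (pattern, state) pairs
    (("mod", ("mod", "_", "_"), "ynq"), "2A"),
    (("mod", ("see", "_", "_"), "ynq"), "2B"),
    (("mod", ("sleep", "_"), "ynq"), "2C"),
]

def descend_to(lf):
    # First entry whose lookahead pattern matches the logical form.
    return next(state for pat, state in DESCEND if matches(pat, lf))

print(descend_to(("mod", ("sleep", "john"), "ynq")))   # -> 2C
```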
Assuming that it is actually possible to construct fully deterministic generation tables by filtering on a large enough amount of semantic lookahead, the problem reduces to finding, for each table entry, a lookahead depth that will result in only one single remaining item in each reductive state. This is in fact a stronger requirement than that all nondeterminism be spurious: It may be the case that for each possible logical form, it is possible to determine the appropriate reduction by a sufficient amount of semantic lookahead, but due to potentially infinite recursion, no preassigned limit on it will do. This is elaborated in the following section. The scheme employs iterative deepening. It tries to construct fully deterministic tables by first allowing a total amount of semantic lookahead of one, then of two, etc., up to some maximum limit. This is however not done globally, but at each recursive call to the sets-of-items construction step, when a piece of logical form and a set of grammar symbols are used to invoke new inverted grammar rules to construct new sets of items. At this point, the total amount of available lookahead is distributed through the arguments of the functor of the current piece of logical form, and then further down to the arguments of the arguments, etc., until all has been used up. The current sets of items are then tentatively constructed. Increased semantic-lookahead depth will split potential nondeterminism in the resulting reductive states into distinct sets of items, and thus into distinct reductive states with less nondeterminism, or preferably, with no nondeterminism at all. If the resulting reductive states are all deterministic, then this particular semantic-lookahead setting is used to compile the actual generation tables, and the scheme recurses. In more detail, a set of terms mirroring the various ways of assigning semantic lookahead are generated and ordered according to how much lookahead they use up.
The first one to yield fully deterministic reductive states is used when constructing the actual tables and is passed down in the recursion.
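A simplified sketch of such lookahead terms (uniform-depth pruning here, whereas the scheme distributes the budget non-uniformly over argument positions; the encoding is invented): a term is pruned to a pattern, and the amount of lookahead a pattern uses up is the number of concrete nodes it retains.

```python
def prune(term, depth):
    # Keep `term` down to `depth` levels; everything deeper becomes the
    # don't-care '_', i.e., is not inspected.
    if depth == 0:
        return "_"
    if isinstance(term, tuple):
        return (term[0],) + tuple(prune(a, depth - 1) for a in term[1:])
    return term

def lookahead_used(pat):
    # Semantic lookahead a pattern uses up: the number of concrete nodes.
    if pat == "_":
        return 0
    if isinstance(pat, tuple):
        return 1 + sum(lookahead_used(a) for a in pat[1:])
    return 1

lf = ("mod", ("sleep", "john"), "ynq")
print(prune(lf, 1))   # -> ('mod', '_', '_')
print(prune(lf, 2))   # -> ('mod', ('sleep', '_'), 'ynq')
```

Iterative deepening then amounts to trying the candidate patterns in order of increasing `lookahead_used` until the reductive states come out deterministic.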
descend(1, mod(_,ynq), 2).

State 2
⟨S, mod(X,ynq), ε, ε⟩ ⇒ . ⟨S, X, ε, ε⟩ ⟨QM, ynq, ε, ε⟩

Fig. 9: Alternative generation states
In the running example, the first argument of mod/2 contributes no important information when descending from State 1, while the second one does.
The scheme correctly finds the optimal depths when transiting from State 1, resulting in the State 2 and descend entry of Figure 9. Since the scheme employs iterative deepening, this will guarantee that locally, no alternative table entries can inspect a smaller portion of the logical forms and still be deterministic, given the previous choices of semantic lookahead. This is a greedy algorithm, and it could potentially be the case that another choice of semantic lookahead would lead to less required lookahead in total by reducing that of the table entries generated in later recursion steps.
8 An example-based optimisation technique
The optimisation scheme as described so far is limited to grammars without real nondeterminism that only have removable spurious nondeterminism. A simple way of extending this to more general grammars is to introduce a second, outer level of iterative deepening controlling the amount of nondeterminism tolerated in each recursive call to the sets-of-items construction step. First, we try to construct generation tables with only one reduction in each reductive state. If this proves impossible within the maximum amount of total semantic lookahead allowed, we try to construct tables with at most two reductions in each resulting reductive state, etc. Since there is a finite number of inverted grammar rules, and thus a finite number of possible items, this process will terminate. Again, this optimisation is done locally at each recursive call to the sets-of-items construction step. A problem with this approach is that the number of possible ways of assigning semantic lookahead increases drastically with the amount of lookahead allowed, and some heuristics are needed to direct the search. We will shortly describe a method that constructs more fine-tuned generation tables by using training examples to guide the search; to determine how much real nondeterminism there is at each point that cannot be removed; and to find appropriate lookahead depths that will remove all spurious nondeterminism on the training corpus. First, we will however examine spurious nondeterminism a bit closer. Assume that we add the following grammar rules for handling NPs with internal structure:6

6 Again, the difference lists representing the word strings have been omitted.
⟨NP, q(X,Y)⟩ → ⟨Det, X⟩ ⟨N̄, Y⟩
⟨N̄, X⟩ → ⟨N, X⟩
⟨N̄, mod(X,Y)⟩ → ⟨N̄, Y⟩ ⟨N̄, X⟩
⟨N̄, mod(X,Y)⟩ → ⟨AP, Y⟩ ⟨N̄, X⟩
⟨AP, mod(X,Y)⟩ → ⟨N, Y⟩ ⟨AP, X⟩
⟨AP, X⟩ → ⟨A, X⟩
This will allow derivations like that of Figure 10. Here APoNB reads "Adjective phrase or N-bar". This in turn will allow constructing logical forms like
mod(N0, mod(mod(... mod(AoN, Nn) ..., N2), N1)). To determine which of the rules ⟨N̄, mod(X,Y)⟩ → ⟨N̄, Y⟩ ⟨N̄, X⟩ and ⟨N̄, mod(X,Y)⟩ → ⟨AP, Y⟩ ⟨N̄, X⟩ to apply, we must inspect the first argument AoN — adjective or noun — of the innermost mod/2 term, which may be arbitrarily deeply nested. Although this will never introduce multiple paraphrases, it does allow spurious nondeterminism that cannot be handled by a bounded amount of semantic lookahead. A highly respectable objection to the presented example is that, apart from the proposed treatment of noun-noun and noun-adjective compounds being linguistically somewhat dubious, we will in practice never see cases where we need a very large amount of semantic lookahead. Precisely this is one of the two cornerstones on which the example-based optimisation technique presented in this section rests. The other one is the observation that a lower bound on the amount of real nondeterminism can easily be established for each (portion of a) training example, while it is in the general case difficult to do this directly from the grammar.
Thus, the training examples are used for three purposes: Firstly, to limit the search to search branches that are relevant for input data that actually occur in real life. Secondly, to establish the minimum amount of nondeterminism at each point, i.e., the amount of real nondeterminism at this point that cannot be removed by greater lookahead depth. Thirdly, to find appropriate lookahead depths that will remove all spurious nondeterminism at each point in the training example. The generation tables are constructed much in the same way as in the previous section. The main difference is that instead of aiming at full determinism, the target nondeterminism is the real nondeterminism at each point of each training example. In more detail, a set of terms mirroring the various ways of assigning semantic lookahead are generated from the set of training examples, and they are ordered according to how much lookahead they employ. Intuitively, a (sub)term is constructed from each training example by replacing parts of it with free variables, thus removing the information contained in these parts of the training example, and the subterms are merged to form one term. Thus, terms employing more lookahead will contain more detailed information from the set of training examples. The first term to yield as deterministic reductive states as the one corresponding to the set of whole training examples, where no information has been blocked out by variables, is used for constructing the actual tables and is passed down in the recursion. A technical complication is that the training examples interact with the termination criteria of the sets-of-items construction step: Although a new set of items may be more specific than an old one, it may stem from more demanding training examples. In the current version of the scheme, this would result in recompiling the sets of items from the earliest point where too simple examples were used, this time including the more demanding examples.
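The blocking-out-and-merging step resembles anti-unification; a minimal sketch (nested-tuple encoding and all example terms invented): keep whatever structure two training examples share, and replace the rest with a don't-care variable.

```python
def merge(t1, t2):
    # Anti-unification sketch: shared structure survives, differences are
    # blocked out with the don't-care '_'.
    if isinstance(t1, tuple) and isinstance(t2, tuple) \
            and t1[0] == t2[0] and len(t1) == len(t2):
        return (t1[0],) + tuple(merge(a, b) for a, b in zip(t1[1:], t2[1:]))
    return t1 if t1 == t2 else "_"

ex1 = ("mod", ("sleep", "john"), "ynq")    # e.g. "John sleeps?"
ex2 = ("mod", ("sleep", "mary"), "ynq")    # e.g. "Mary sleeps?"
print(merge(ex1, ex2))   # -> ('mod', ('sleep', '_'), 'ynq')
```

The merged term retains exactly the lookahead that the two examples jointly justify, which is what lets a handful of examples generalise to whole families of inputs.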
To handle input outside the training corpus, a default lookahead depth is assigned to the possible continuations that are not encountered among the training examples. This means that the resulting generation tables preserve completeness and are guaranteed to be optimal, modulo the limitations of greedy algorithms, for input sufficiently similar to combinations of examples in the training corpus, but not necessarily for other input. The degree of generalisation is considerable: To return to the running example of the nondeterminism in State 2 discussed above, a single training example like (a logical form corresponding to) John sleeps? or Mary sees a house in Paris will remove all nondeterminism in this state. In general, the table size seems to increase moderately with the number of training
examples due to the good degree of generalisation, although this needs to be more thoroughly investigated. The modified algorithm for including the training examples into the LR-compilation algorithm is guaranteed to terminate if the original LR-compilation algorithm terminates. The worst-case complexity is however not very good. However, for the grammars and training sets tested so far, processing efficiency is not a problem, though we can envision that for considerably larger grammars and training sets, there will be a need for optimising the optimisation procedure further.

9 Discussion
The new generation algorithm constitutes an improvement on the semantic-head-driven generation algorithm that allows 'functor merging', i.e., enables processing various grammar rules, or rule combinations, that introduce the same semantic structure simultaneously, thereby greatly reducing the search space. The algorithm proceeds by recursive descent through the logical form, and using the terminology of the SHDG algorithm, what the new algorithm in effect does is to process all chains from a particular set of grammar symbols down to some particular piece of logical form in parallel until a reduction is attempted, rather than to construct and try each one separately in turn. This requires a grammar-inversion technique that is fundamentally different from techniques such as the essential-argument algorithm, see the following, since it must display the grammar from the point of view of the logical form, rather than from that of the word string. LR-compilation techniques accomplish the functor merging by compiling the inverted grammar into a set of generation tables. The grammar inversion rearranges the grammar as a whole according to the functor-argument structure of the logical forms. Other inversion schemes, such as the essential-argument algorithm (Strzalkowski 1990) or the direct-inversion approach (Minnen et al. 1995), are mainly concerned with locally rearranging the order of the RHS constituents of individual grammar rules by examining the flow of information through these constituents, to ensure termination and increase efficiency. Although this can occasionally change the set of RHS symbols in a rule, it is done to these ends, rather than to reflect the functor-argument structure. Although the sample grammar used throughout the article is essentially context-free, there is nothing in principle that restricts the method to such grammars. In fact, the method could be extended to grammars employing
complex feature structures as easily as the LR-parsing scheme itself, see for example (Nakazawa 1991), and this is currently being done. Some hand editing is necessary when preparing the grammar for the inversion step, but it is limited to specifying the flow of arguments in the grammar rules. Furthermore, this could potentially be fully automated. The set of applicable reductions can be diminished by resorting to deeper semantic lookahead, at the price of a larger number of internal states, and there is in general a tradeoff between the size of the resulting generation tables and the amount of nondeterminism when reducing. The employed amount of semantic lookahead can be varied, and a scheme has been devised and tested that automatically determines appropriate tradeoff points, optionally based on a collection of training examples. The latter version of the scheme turns out to be related to explanation-based learning (EBL), which has proved quite successful for optimising LR-parsing tables for syntactic analysis. There, the basic idea is to learn special grammar rules from the original ones and a set of training examples by chunking together the former based on how they are used to parse the latter. The relevant references are (Samuelsson & Rayner 1991), (Samuelsson 1994a) and (Neumann 1994). Rayner and Samuelsson basically trade coverage for speed and accuracy by using the training examples to compile a new grammar that is used instead of the original one. Their problem is that the underlying NL systems that they work on employ find-all parsing strategies and subsequent selection of the preferred analysis. This makes it very difficult to integrate the learned grammar with the original one without losing all processing speed gained. Neumann strives for a very close integration between the learned and original grammars by falling back to the original grammar when processing with the learned grammar alone has proved insufficient.
He utilises the fact that his original system employs a best-first parsing strategy, which allows intelligent reuse of partial results from the attempt to parse with the learned grammar. Another problem that has not previously been satisfactorily resolved is how to determine the degree of generalisation of the examples, or viewed from another point of view, how to chunk together the original grammar rules. Rayner and Neumann hand-code special meta-rules, so-called operationality criteria, for this based on linguistic intuition. These criteria are then refined manually by experimentation. Samuelsson offers an automatic method for doing this that relates the desired coverage to the way the examples are generalised (Samuelsson 1994b). This quantity is however
only indirectly related to the actual performance of the system using the resulting learned grammar. In contrast to this, the method described in the current article automatically preserves completeness; achieves fully seamless integration, since there is only one processing mode; and automatically determines the degree of generalisation by minimising a quantity that has a profound direct influence on the resulting performance, namely the amount of nondeterminism in each reductive state. It would be very interesting to see if this idea could be carried over to syntactic parsing by manipulating the number of lookahead symbols to minimise the number of shift-reduce and reduce-reduce conflicts in the resulting LR parsing tables. The method has been implemented and applied to more complex grammars than the simple one used as an example in this article, and it works excellently. Although these grammars are still too naive to form the basis of a serious empirical evaluation lending substantial experimental support to the method as a whole, it should be obvious from the algorithm itself that the reduction in search space compared to the SHDG algorithm is most substantial. Nonetheless, such an evaluation is a top-priority item on the future-work agenda.

REFERENCES

Aho, Alfred V., Ravi Sethi & Jeffrey D. Ullman. 1986. Compilers: Principles, Techniques and Tools. Reading, Massachusetts: Addison-Wesley.
Alshawi, Hiyan, ed. 1992. The Core Language Engine. Cambridge, Massachusetts: MIT Press.
Bouma, Gosse & Gertjan van Noord. 1994. "Constraint-based Categorial Grammars". Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, 147-154.
Gazdar, Gerald, Ewan Klein, Geoffrey Pullum & Ivan Sag. 1985. Generalized Phrase Structure Grammar. Cambridge, Massachusetts: Harvard University Press.
Kaplan, Ronald M. & Joan Bresnan. 1982. "Lexical-Functional Grammar: A Formal System for Grammar Representation". The Mental Representation of Grammatical Relations ed. by Joan Bresnan, 173-281. Cambridge, Massachusetts: MIT Press.
Minnen, Guido, Dale Gerdemann & Erhard Hinrichs. 1996. "Direct Automated Inversion of Logic Grammars". To appear in New Generation Computing 14:2.
Nakazawa, Tsuneko. 1991. "An Extended LR Parsing Algorithm for Grammars Using Feature-based Syntactic Categories". Proceedings of the 5th Conference of the European Chapter of the Association for Computational Linguistics, 69-74.
Neumann, Günter. 1994. "Application of Explanation-Based Learning for Efficient Processing of Constraint-based Grammars". Proceedings of the 10th IEEE Conference on Artificial Intelligence for Applications, 208-215. San Antonio, Texas.
Samuelsson, Christer. 1994a. Fast Natural-Language Parsing Using Explanation-Based Learning. Ph.D. dissertation, Royal Institute of Technology. Edsbruk, Sweden: Akademitryck.
Samuelsson, Christer. 1994b. "Grammar Specialisation through Entropy Thresholds". Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, 188-195.
Samuelsson, Christer. 1994c. "Notes on LR Parser Design". Proceedings of the 15th International Conference on Computational Linguistics, 386-390. ICCL.
Samuelsson, Christer. 1995. "An Efficient Algorithm for Surface Generation". Proceedings of the 14th International Joint Conference on Artificial Intelligence, 1414-1419. Morgan Kaufmann.
Samuelsson, Christer & Manny Rayner. 1991. "Quantitative Evaluation of Explanation-Based Learning as an Optimisation Tool for a Large-Scale Natural Language System". Proceedings of the 12th International Joint Conference on Artificial Intelligence (IJCAI'91), 609-615. Morgan Kaufmann.
Shieber, Stuart M., Gertjan van Noord, Fernando C. N. Pereira & Robert C. Moore. 1990. "Semantic-Head-Driven Generation". Computational Linguistics 16:1. 30-42.
Strzalkowski, Tomek. 1990. "How to Invert a Natural Language Parser into an Efficient Generator: An Algorithm for Logic Grammars". Proceedings of the 13th International Conference on Computational Linguistics (COLING-90), 347-352. ICCL.
Uszkoreit, Hans, Rolf Backofen, Stephan Busemann, Abdel Kader Diagne, Elizabeth A. Hinkelman, Walter Kasper, Bernd Kiefer, Hans-Ulrich Krieger, Klaus Netter, Günter Neumann, Stephan Oepen & Stephen P. Spackman. 1994. "DISCO — an HPSG-based NLP System and its Application for Appointment Scheduling". Proceedings of the 15th International Conference on Computational Linguistics (COLING-94), 436-440. Kyoto, Japan.
Sentence Generation by Pattern Matching: The Problem of Syntactic Choice

MICHAEL ZOCK
LIMSI, CNRS
Abstract

This paper tries to account for verbal fluency, that is, the speed with which people compute syntactic structures. As we all know, people produce speech fluently without making too many mistakes. Given the known time constraints, this is a remarkable performance. How is this possible? Verbal fluency, we believe, can be accounted for by the following two facts. First, people essentially use pattern matching and mapping rules as strategy and knowledge source. Rather than being confined to local strategies (strict incremental processing on a concept-to-concept basis) and formal grammars, they operate on larger chunks (global strategy) by using mapping rules. This is more economical, without necessarily being more error prone. Second, proficient speakers have learnt to recognise potential linguistic structures on the basis of the formal characteristics of the conceptual structures; that is, proficient speakers are able to make good guesses concerning the syntactic structures that best express the conceptual input.1
1 Introduction: The speaker's problem
Text or discourse production basically consists in determining, organising and translating content in order to achieve specific communicative goals. We shall be concerned here only with the last component, the translation of a conceptual structure (message) into its corresponding linguistic form. Looking at this problem from a psycholinguistic point of view we will try to
1 This paper is a slightly revised version of a paper presented in 1988 at the 1st International Workshop on "Cognitive Linguistics", held in Bulgaria. It was meant to appear three years later in a book entitled "Explorations in Cognitive Linguistics". Unfortunately, though announced, this book never saw the market. While our views have evolved in the meantime (this was then only preliminary work), we believe that our basic premises concerning the process and the speaker's knowledge still hold: (i) natural language generation is basically pattern matching; (ii) the speaker's expertise resides in knowing a set of patterns and a set of mapping rules for converting an input (message) into an output (linguistic form).
provide evidence for two claims, one dealing with knowledge, the other one dealing with the process. Claim 1: the speaker's expertise resides in knowing a set of patterns (conceptual, syntactic) and a set of rules (mapping rules) for converting a given structure (deep structure) into its corresponding form (surface structure).2 Claim 2: the process of conversion is basically pattern-driven (pattern matching). People typically work on chunks (e.g., noun groups, propositions) rather than on atomic units (single words, concepts). If one accepts the idea that messages are coded in terms of semantic networks or conceptual graphs (Sowa 1984), one will understand the text producer's problem. Having generated a message (what to say), s/he is left with a set of nodes and arcs (concepts and relationships) for which s/he must find words and adequate sentence patterns (how to say it).3 Surprisingly enough, this seems to be no real problem for most speakers, even in completely new settings (spontaneous discourse). Skilled speakers seem to have plenty of time, and they do not make many mistakes. One may well ask how they succeed in performing such a complex task given the known space and time constraints: human short-term memory is limited (Miller 1956), and speech is very fast (3-5 words per second). The secret of skilled speakers, we believe, is that, rather than operating on small isolated units such as words or concepts (local strategy), they operate on larger chunks, that is, conceptual configurations. Put another way, skilled speakers do not proceed strictly word-by-word (concept-by-concept), rather they operate on larger conceptual patterns.4 Having gained a large amount of experience in a given language, they recognise typical structures (patterns), i.e., they know straight away what conceptual structures match what linguistic forms.5

Fig. 1: Mapping rules, the missing link between conceptual structure and linguistic structure

Actually, the idea of patterns or schemata is not new. It has a long tradition in philosophy (Kant 1781), in sociology (Goffmann 1974), in structural and in text linguistics (Harris 1951; Fries 1952; Roberts 1962; van Dijk 1977), in psychology (Bartlett 1932; Koffka 1935; Piaget 1970; Bruner 1973; Rumelhart 1975; Mandler 1979; Ausubel 1980) and in artificial intelligence (Minsky 1975; Wilks 1975; Schank & Abelson 1977).7 Nevertheless, despite its long-standing tradition, schema approaches and formal grammars have two major shortcomings: with the exception of Mel'cuk's work, neither accounts for the correspondences (mappings) between the different structures or levels. Hence, they do not make explicit on what structural cues, i.e., formal characteristics of the message (conceptual structure), the speaker's decisions are based when s/he chooses a specific linguistic form (syntactic structure). Yet, structures are of little help if one does not know what to do with them, that is, what they stand for.8 Another problem with the schema approach lies in the fact that schemata are hard to constrain. Hence, lack of refinement, or lack of proper constraints, may result in ambiguity (analysis, i.e., parsing) or overgeneration (production). If the user is not told where the limits of the schemata lie (explicitation of the schema constraints), s/he will use these patterns even in cases where they do not apply. If one agrees with our point of view that natural language processing is basically schema-driven,9 then the question arises of how people manage to recognise linguistic structures on the basis of conceptual structures. This is what this paper is about.
2 The fact that people use patterns is not in contradiction with the notion of a formal grammar. The latter is actually a device to generate them.
3 While the arcs are conceptual or syntactic relations (agent, subject, etc.), the nodes of the graph can be words or concepts of various levels of abstraction (animal vs. dog vs. four-legged carnivorous wild or domesticated animal). The representation and function of abstract concepts and words is thus very much alike: both are very economical means, a kind of shorthand notation, for larger conceptual chunks. The fact that graphs allow for hybrid knowledge representation, and the fact that they can be manipulated easily (contraction/expansion), makes them excellent tools at the interface level (i.e., for a potential user in the case of applications) and for modelling the cognitive process: expansions of an abstract, underspecified message graph (conceptual level); contraction of this conceptual graph to a lexically specified, but syntactically unspecified graph (lexical level); visualisation of syntactic reflexes resulting from choices made at a higher level (pragmatic, conceptual, linguistic). The syntactic consequences may show up in changes of the names of the links (an agent becoming a subject, a beneficiary becoming an indirect object, etc.) and in the addition of morphosyntactically relevant information (type of auxiliary, type of preposition, etc.). For an example, see Zock (1994).
4 The fact that word-by-word processing may lead into dead ends has been shown by Zock et al. (1986). Actually, clitics in French nicely illustrate the need for lookahead or preplanning. Suppose you were to pronominalize y and z of the following proposition give(x,y,z). In this case it is not possible to determine their relative position unless one knows the roles (person) of both objects. If you compare (a-c) you'll notice that the position of the direct object ("it", that is "le" in French) depends on the person of the indirect object (3rd person or not). Put differently, the positions of the two objects are interdependent, that is, their respective positions cannot be determined unless the value of the attribute PERSON of both objects is known.
(a) il me LE donne (he gives it to ME)
(b) il LE lui donne (he gives it to HER)
(c) il te LE donne (he gives it to YOU)
5 Learning a language is thus learning a set of variably abstract patterns, a set of mapping rules and their respective conditions of use. Sentence generation can go either way, from abstract to specific patterns (refinement), or from specific to more general patterns (generalisation), lower level patterns becoming integrated into higher level patterns: (det + adj + noun) => NP; (verb + noun) => VP; (adv + verb) => AdvP; (NP + VP) => Sentence.
6 The conceptual grammar controls the assembly, i.e., legal combinations of possible contents, that is, it specifies what is meaningful in a given culture, whereas the linguistic grammar specifies the possible forms.
7 While Bartlett, Schank/Abelson, van Dijk and Rumelhart identified patterns on the text or discourse level (schemata, scripts, macro-structures, story grammars), Harris and Fries dealt with sentence patterns. The idea of linguistic patterns has also been extensively used in the classroom, where pattern drills have been a major teaching strategy, especially during the sixties, when behaviorism was at its peak (Lado 1964; Rivers 1972). Things changed radically after Chomsky's devastating critique of Skinner's book "Verbal Behavior" (Skinner 1957; Chomsky 1959).
8 One can't but agree with Bock et al. when they write: "In existing models of language production the first mapping from messages to linguistic relations involves linking nonlinguistic cognitive categories to linguistic categories. However the categories themselves are variably specified, because there is little consensus of what the appropriate ones might be." (Bock et al. 1992:151)
9 While early natural language systems like SIR (Raphael 1968), STUDENT (Bobrow 1968), ELIZA (Weizenbaum 1966) and SHRDLU (Winograd 1972) relied heavily on low-level schemata (syntactic patterns), more recent systems use high-level schemata, i.e., text patterns (McKeown 1985; Rösner 1987). For a criticism of the latter see Hovy (1990). See also Patten et al.'s use of the notion of knowledge compilation, which is somehow akin to our notion of pattern matching (Patten et al. 1992).
2 What kind of evidence can we provide in favour of pattern matching?
There are several good reasons for accepting such an approach, both structural and procedural.

Structural evidence: Human experience and social interactions are structured, regular, hence predictable to some extent. This regularity, of course, is reflected in language. There are definite limits with regard to linguistic creativity: new, original thoughts still have to be cast in old patterns. Languages are schematic to a great extent, that is, every language has a fairly large set of patterns in order to express concepts, relations, events, etc. For example,

  <X> is a <Y> that <Z>.            A computer is a machine that processes information.
  <X> is a sort of <Y>.             A bicycle is a sort of vehicle.

Table 1: Schemata for a definition

  <X> is somehow like <Y>.          A cat is somehow like a tiger.
  <W> is to <X> as <Y> is to <Z>.   Good is to light as evil is to darkness.

Table 2: Schemata for comparison

This is true not only on the higher levels (paragraph, text level), where stories, news, weather forecasts, sport reports, etc. are clearly schematic, but also on the lower levels (phoneme, word, sentence level).

  Actions, events, states, processes              verbs          build, happen, be, sleep
  Entities, names, places                         nouns          car, Paul, Tokyo
  Properties, attributes of entities              adjectives     young, bright
  Manner, attributes of actions                   adverbs        slowly
  Intensifier, location, time                     adverbs        very, here, tomorrow
  Means                                           prepositions   by, with
  Spatial relations: path, position, direction    prepositions   from, in, on, towards

Table 3: Mapping of ontological categories on syntactic categories

When translating a message into discourse, the speaker maps a conceptual structure (deep structure) onto a linguistic form (surface structure). Thus, concepts are mapped on words, each of which has a specific categorial
potential, i.e., part of speech (Table 3),10 deep-case relations are mapped on grammatical functions (Table 4), and conceptual configurations, i.e., larger conceptual structures, are mapped on syntactic structures (Table 5), etc.

  agent, cause               subject
  object, patient            direct object
  beneficiary, recipient     indirect object

Table 4: Mapping of case relations on grammatical functions

  1. [PERSON:#]<-agnt-[PERFORM]-obj->[MUSIC:*]11
     D + N + V + D + N                    The girl plays a song.
  2. [MUSIC:*]<-attr-[QLTY: GOOD]
     D + Adj + N                          a nice song
     D + N + RelPr + Copula + Adj         a song that is nice
  3. [PERSON:#]<-agnt-[PERFORM]-obj->[MUSIC:*]<-attr-[QLTY: GOOD]
     D + N + V + D + Adj + N              The girl plays a nice song.

Table 5: Mapping of conceptual structures onto syntactic structures

This mapping can be done in various ways, directly or indirectly, that is, via various intermediate structures.12 The units on which these processes operate may be single concepts (Table 3), relations (Tables 3 and 4), or larger chunks (Table 5). Subordinate patterns may be integrated into superordinate patterns, etc.13 The following example reveals several interesting formal characteristics concerning patterns:
10 Hence, we assume that there is no one-to-one correspondence between concepts and words, or between words and parts of speech. For example, a given concept (LOVE) may well map on several words (love, like, be fond of, ...), each of which may be realized by several syntactic categories (noun, verb, adjective).
11 We use the "#" and "*" signs to signal the communicative status of the referent (determinate vs. indeterminate). The "#" sign signals the fact that the entity referred to is known, hence requires a definite article. In a similar vein "*" signals indeterminacy.
12 Swartout's (1983) and Mel'cuk's work (Mel'cuk & Zholkovskij 1970) lie at the extremes. The former used no intermediate structure at all, whereas the latter used no less than seven levels to get from a meaning representation to its surface form. ATN approaches (Simmons & Slocum 1972) and semantic grammars (Burton 1976; Hendrix 1977) lie somewhere in between.
13 If you look at Table 5, you will notice that conceptualisation 2, [MUSIC:*]<-attr-[QLTY: GOOD], is integrated into message 1, yielding message 3.
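The default mappings of Tables 3-5 lend themselves to a direct computational reading. The sketch below is ours, not part of any system described in this paper, and all names are illustrative: a conceptual graph is represented as relation triples, Table 4's case-to-function mapping is applied link by link, and Table 5's configuration-to-template mapping is applied to the graph as a whole.

```python
# Conceptual graph in the notation of Table 5, pattern 1:
# [PERSON:#]<-agnt-[PERFORM]-obj->[MUSIC:*]
graph = {("PERFORM", "agnt", "PERSON:#"), ("PERFORM", "obj", "MUSIC:*")}

# Table 4: deep-case relation -> grammatical function
CASE_MAP = {"agnt": "subject", "obj": "direct object", "benf": "indirect object"}

def to_functions(graph):
    """Replace each deep-case link by its default grammatical function."""
    return {CASE_MAP[rel]: node for (_, rel, node) in graph if rel in CASE_MAP}

def match_pattern(graph):
    """If the graph shows the agent-action-object configuration of
    Table 5 (pattern 1), return its default template D + N + V + D + N."""
    relations = {rel for (_, rel, _) in graph}
    if {"agnt", "obj"} <= relations:
        return ["D", "N", "V", "D", "N"]
    return None  # no known configuration matches

print(to_functions(graph))
print(match_pattern(graph))  # ['D', 'N', 'V', 'D', 'N']
```

The point of the sketch is only that the mapping operates on a whole configuration, a chunk of triples, rather than on one concept at a time.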
<TITLE> <PERSON> has (kindly) brought to <DET> attention the (following) <FACT/PROBLEM>.
Prof. Joshi has kindly brought to my attention the following fact.

• Patterns have a fixed and a variable part. The fixed parts are highlighted and written in small letters, while variable parts are written in capital letters and in angle brackets or normal parentheses.
• The variables may be optional or obligatory parts of the pattern. (PERSON), (FACT), (DET) are all obligatory elements, while (TITLE) is, from a linguistic point of view, optional. If we generalised "kindly" into some variable (MANNER), which we would do if we found out that other attributes, or synonyms, can occur in this position, then we would have an optional variable.
• The variables may be of different kinds: semantic or syntactic ((PERSON), (TITLE) vs. (DET)).
• Patterns have optional parts. This is illustrated by the adverb "kindly" here above, which appears in parentheses.
• Certain elements of the patterns may need morphological adjustment: subject-verb agreement, determiner (my vs. her), etc.
• Patterns may contain information of different kinds (conceptual, lexical, syntactic). Put differently, patterns can be hybrid.
• Patterns can be semantically equivalent, that is, they can be synonyms ("he brought to my attention" vs. "he drew to my attention").
• Patterns can be embedded into each other (hierarchy of patterns). A grammar could be seen as a list of patterns composed only of variables, and a hybrid approach like ours is only a cut through this grammar at different levels of abstraction.

All the above-mentioned patterns are productive in the sense that they allow for the creation of a wide range of linguistic forms. There may even be direct connections between situations and conceptual structures on the one hand, and between these conceptual structures and their linguistic counterparts on the other. It should be noted, however, that the relationship between conceptual structures and linguistic structures is not one-to-one (see below). Hence, neither syntactic categories (e.g., part of speech), nor syntactic structures are determined by conceptual structures, that is, the former cannot be predicted solely on the basis of the latter. Nevertheless, there is a strong tendency to translate a given conceptual element or structure by a specific syntactic form, that is, there are default mappings (see Tables 3 and 5). For empirical evidence of how conceptual structures induce syntactic structures see Brown (1958).
As we have said, there are other reasons that plead in favour of schema-driven processing, namely, the speed and economy of learning and processing.

Procedural arguments: Speech is fast, yet it is slow compared to thought. As we all know, thoughts tend to get lost if not expressed in time. Word-by-word processing is thus not a good candidate for processing the data: it is not only slow, but also error prone. If words and syntactic structures were computed strictly incrementally, that is, on a word-to-word, or concept-to-concept basis (local planning), we would never get the job done in time, and we would easily get stuck in dead ends, that is, talk ourselves into a corner. The order in which the concepts come into our minds, and the order in which they have to be expressed in the surface string, are not necessarily the same. Global planning (look-ahead or global view) is thus necessary in order to speed up the process and to cut down on backtracking. It should be noted, though, that while pattern matching seems to be a good strategy at the initial stages of generation, it is not obvious at all that it can be used safely throughout the process. It generally provides only a first sketch (outline) which needs to be checked against the syntactic requirements, i.e., constraints of the lexical material actually used (subcategorisation features). It seems that an approach whereby pattern matching is only seen as a first step allowing for refinements or changes contains all the basic ingredients to go from canned text to full-blown sentence generation. Another argument for pattern matching stems from the observation that people manage to communicate even very complex thoughts by using very few patterns. This is even more so if they speak in a foreign language. Many students use pattern matching as a basic strategy when producing language: their messages are structured the same way, that is, they are cast in the same kind of sentence or discourse pattern.
Actually, verbal skill can be measured in terms of the number of patterns a speaker is able to use adequately in a given communicative situation, and by his aptitude to vary and to make local adjustments to them. From the above it should be clear that spontaneous discourse is only possible if the speaker is operating on larger units (chunks) than words or single concepts.14 If this point of view is accepted, together with the point
14 For an enlightening discussion concerning the use of larger units than words (most notably, idiomatic and fixed expressions) see Becker (1975). This view, just as ours, contrasts with the notion of strict incremental processing (see, for example, Kempen & Hoenkamp 1987; de Smedt 1990).
that proficient speakers are, above all, good pattern matchers, then the question arises how people manage to recognise these structures. In other words, is there a way to identify a good candidate for a relative clause, a that-clause, an infinitive, etc. on purely conceptual grounds, that is, on the basis of the formal characteristics of the underlying conceptual input? Before answering this question, we would like to show why this problem cannot be adequately addressed within the framework of structure-oriented linguistics.
3 Why cognitive linguistics, or, why study natural language in the realm of cognitive science?
Natural languages are both products and processes. Understanding the way they function thus requires the study of both. In other words, language, or language use, cannot be adequately accounted for solely by studying the outputs (sentence structure). In contrast to structure-oriented linguistics, which describes only the physical products (sentences), cognitive linguistics tries to account for the processes, i.e., the operations necessary to transform an input (for example, a visual scene) into an output (text: description of the scene). While the former are concerned with products, the latter are interested in the processes operating on data. Hence the following questions are relevant for cognitive linguistics:
• What are the different knowledge sources (pragmatic, conceptual, linguistic)?
• What are the input-output data?
• What kind of operations are performed on these data (transformations, mapping rules)?
• How do biological and cultural factors constrain the representations and processes?
• What are the functional relations between the components (hierarchical vs. heterarchical architecture)?
• How is the process decomposed (control of information flow)?
• How is the relevant information coded, stored, retrieved and processed?
The goal of cognitive linguistics is to describe and to explain linguistic competency and performance for natural systems (human beings). Obviously, structure and process vary with the restrictions of the information processor (man vs. machine).
Languages are systems for the coding, manipulation and communication of information. They are symbolic means for storing, processing (reasoning) and transmitting information. As with any tool, they are designed with respect to a goal (function) and with respect to user constraints. As these constraints are different for human beings and for machines (memory, attention span), we would expect natural languages to be different from artificial languages (algebra, logic, etc.).15 Natural languages, as opposed to artificial languages, are very flexible. The different components (conceptual, lexical and syntactic) are highly interdependent, each component possibly influencing the others. The advantage of such a heterarchical architecture is that it allows for various orders of data processing. For example, lexical choice may precede the choice of syntactic structure and vice versa. For more details see Zock (1990). One could view the functioning of the mind, hence the functioning of natural language, somehow like the functioning of a complex society (oligarchy). The two systems are organised in a similar way: (i) problem solving is decomposed: the result is produced not by a superexpert, but by a team of specialists; (ii) the different agents (components) contributing to the solution have a certain amount of autonomy; (iii) the agents negotiate, that is, they do not only communicate their results and draw on the results produced by their colleagues, but they can also adapt their behavior to allow for accommodation of the results produced by the other components.
The advantages of such a heterarchical kind of organisation are multiple: (i) freedom of processing: various orders are possible to reach the solution; (ii) time-sharing: each agent can work on its own without having to wait for an order coming from a higher component; (iii) flexibility: information flow is bidirectional; (iv) opportunistic planning: as information becomes available at different moments and in unpredictable ways, and since the different components can accommodate, it is possible to have the different agents compete and to use the first result produced by any of them. The major drawback of this kind of system, where everything is more or less interdependent, is that it becomes extremely difficult to see the dependency relationships, that is, it is hard to see what causes what, or what action has what outcome. This is quite obvious for covert activities like
15 "Natural languages are ambiguous, imprecise and sometimes awkwardly verbose. These are all virtues for general communication, but something of a drawback for communicating concisely as precise a concept as the power of recursion. The language of mathematics is the opposite of natural language: it can express powerful formal ideas with only a few symbols." (Friedman & Felleisen 1987: xi)
language, but it is also true for such complex activities as political decisions. Whichever the case, consequences of a choice may be far-reaching and hard to predict. In this sense, there are many points in common between speaking a language well, hence communicating efficiently, and being a good politician. In both cases one has to make the right choice at the right moment. Man and machine are not subject to the same constraints. Humans have a very limited working memory,16 they are poor serial processors, and they are not very logical. But they are intuitive, creative and, above all, good pattern matchers. That is, they can spontaneously discover the right solution on intuitive grounds, they can conceive of efficient strategies to solve a problem (for example, how to access information stored in long-term memory), and they can recognise complex patterns (configurations, Gestalt). Machines, on the other hand, are logical, they are good serial processors, they have a perfect memory, but they have little capacity for creativity, for intuition, or for perceiving global structures. Linguists who want to work out an ecologically valid theory, that is, provide a description of the data which is not only formally correct, but also computationally sound (processable), need to take these factors into account. Otherwise their theory will remain mere description with limited explanatory power or relevance for practical purposes. Now, if our point concerning pattern matching, that is, the use of mapping rules applied to larger chunks (i.e., conceptual configurations), is sound, one may ask several questions: Where do linguistic structures come from? (Section 4); What is the difference between conceptual and linguistic structures? (Section 5); What do syntactic structures depend upon? (Section 6); What do typical conceptual structures, patterns or configurations look like? (Section 7); How can the speaker recognise a specific syntactic structure? (Section 8).
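The opportunistic, heterarchical control discussed in this section, independent agents whose first available result is used, can be sketched as follows. This is a toy illustration of ours; the experts, their outputs and their timings are invented.

```python
import concurrent.futures
import time

def lexical_expert(message):
    time.sleep(0.01)  # pretend lexical lookup is quick here
    return ("lexical", ["man", "catch", "fish"])

def syntactic_expert(message):
    time.sleep(0.05)  # pretend syntactic planning is slower here
    return ("syntactic", ["NP", "VP"])

message = "[MAN]<-(AGT)-[CATCH]-(OBJ)->[FISH]"
with concurrent.futures.ThreadPoolExecutor() as pool:
    futures = [pool.submit(expert, message)
               for expert in (lexical_expert, syntactic_expert)]
    # Opportunistic planning: take whichever expert delivers first,
    # instead of imposing a fixed order (lexical before syntactic or vice versa).
    first = next(concurrent.futures.as_completed(futures)).result()

print(first[0])  # here the lexical expert wins the race
```

With the invented timings above, lexical choice precedes syntactic choice, but reversing the two sleep values reverses the order of processing, which is exactly the flexibility claimed for the heterarchical architecture.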
4 Where do linguistic structures come from?
One can try to answer this question from a phylogenetic point of view, or from the point of view of the process that takes place when translating thoughts (messages) into language (text). From a phylogenetic point of view it seems that linguistic structures reflect perceptual structures. This is the
16 Due to memory constraints, sentences are built incrementally: planning and execution partially overlap. While uttering a partially planned conceptual structure, the next part of the message is planned: we think while we speak, and, while speaking, we think (Kempen & Hoenkamp 1986).
position held by many psychologists (Paivio 1971; Kempen 1977; Osgood 1980; Anderson 1983; Miller & Johnson-Laird 1985), linguists (Fillmore 1977; Langacker 1983; Habel 1988) and computer scientists (Hill 1984; Sowa 1984; Arbib 1986). As a matter of fact, there are many similarities between language and perception, both from a structural and a procedural point of view. Both are compositional, and both have well-formedness and completeness conditions (though in the sense of Gestalt psychology rather than in the mathematical sense). Furthermore, natural language and images are produced and perceived in a similar way, that is, globally. Both of them are to some extent holistic entities. We start to produce or to recognise a global structure (pattern) which we then fill in with details, that is, we tend to go from the general to the specific. How this might be done at the conceptual level is discussed in Zock (1996). What does this mean for sentence generation? It simply means that, rather than processing word-by-word, or concept-by-concept (local strategy), humans process larger chunks, trying to match entire conceptual structures on linguistic structures (global strategy). In other words, processing is done via pattern matching.17 This is probably true on all levels: conceptual, syntactic, lexical, and even phonological. Before turning to the problem of how people recognise linguistic structures on the basis of conceptual structures, we would like to comment on the relationship between conceptual structures and their linguistic counterparts, that is, words and syntactic structures.
5 Conceptual structures and syntactic structures are to a great extent parallel
Linguistic structures (order of words) and conceptual structures (order in which thoughts become available, i.e., spring into our mind), while not entirely parallel, correlate to a large extent,18 that is, items belonging together conceptually tend to appear side by side in the surface structure.19 This is
17 It is probably for this very same reason that people are able to recognise misspelled words, despite the speed of reading. We certainly don't look for every character, yet we are able to see the mistakes, especially if they occur at specific points. Strangely enough, it is for similar reasons that people overlook mistakes. We don't perceive what is, but we perceive what ought to be (Bruner 1973).
18 The regularities concerning their mappings are discussed in Section 7.
19 See Behagel's first law (cited in Vennemann 1975), Anderson's graph deformation principle (Anderson 1983), or Levelt's pioneering work on linearisation (Levelt 1981, 1982).
reasonable with regard to economy (memory). If conceptual and linguistic structures were not parallel to some extent, we would constantly be faced with a storage problem.20 For, whenever the word expressing a given conceptual fragment cannot be inserted into the surface string, it needs to be held in working memory until it can be attached to the string. If translating a conceptual structure into linguistic form consists above all in finding words for parts of the graph (semantic network)21 and in ordering them,22 it seems reasonable to try to maintain as much as possible the conceptual connectivity in the surface form (syntax). That is, the words should be uttered as much as possible in the same order in which the concepts for which they stand have been generated. The question that arises now is, how do we put it all together, or, how do we compute syntactic structures? In order to answer this question let's take an example. Suppose you wanted to express the following message, which could have been planned and expressed incrementally:

[MAN]<-(AGT)-[CATCH]-(OBJ)->[FISH]
[MAN]<-(AGT)-[MOVE]-(LOC)->[GROUND]
[MOVE]<-(MAN)-[FAST]

One way to start the process is by lexicalising, that is, by trying to find words that cover (express) the message planned.23 This kind of mapping is probably done stepwise, as it is hard to imagine that a speaker is able to find simultaneously all the words of a very big conceptual chunk. This means that, having found a word for the chunk [MAN]<-(AGT)-[CATCH]-(OBJ)->[FISH],24 the speaker tries to find a candidate for the remaining part of the message.25 The result of this stepwise consumption of the message graph is a lexicalised conceptual graph (LCG) whose links will then be replaced by functional information (subject, direct object, etc.) and lexical categories (part of speech, step 3). On the basis of this Preliminary Syntactic Structure (PSS) a tree is built. In order to perform the final operations (inflection, etc.) morphological information is added. For more details see (Nogier 1991; Nogier & Zock 1992; Zock 1996). This entire mapping process is depicted in Figure 2.
20 Of course, there are limits and exceptions (long distance phenomena, split infinitives, etc.) which make strict parallelism impossible. Language is linear, thought is relational. The representation of the latter being a graph, it is in principle possible to add information at any moment. For exceptions to this principle of adjacency, see Stockwell's discussion on the 'heavier element principle' and the 'topicalisation principle' (Stockwell 1977: 68-69 and 75-76).
21 Words hardly ever stand for single concepts, generally they stand for definitions. Yet definitions require a graph. This being so, it makes sense to express (or code) these definitions in terms of conceptual structures (graphs). In that respect there is no fundamental difference between the underlying meaning of words and sentences. According to our view, the process of lexicalisation consists in matching word definitions (conceptual graphs defining the meaning of the word) on an utterance graph, i.e., a structure containing the message to be conveyed. How such a message might be built has been discussed in Zock (1996).
22 Obviously, there is more to determining surface form: computation of part of speech, insertion of function words, morphological operations (inflections, agreement), etc.
23 According to our view, lexicalisation is performed in two steps. During the first step only those words are selected that pertain to a given semantic field (for example, movement verbs). At the next step the lexical expert selects from this pool the term that best expresses the intended meaning, i.e., the most specific term (maximal coverage). For more details see (Nogier & Zock 1992; Zock 1996).

5.1 Discussion
The careful reader has certainly noticed the following facts: (i) the same mechanism was used for lexicalisation and the precomputation of syntactic structure: pattern matching;26 (ii) words were chosen prior to syntactic structure;27 (iii) the initial message graph may be considerably simplified

24 For the sake of simplicity we will ignore here the fact that a "man catching fish" and a "fisherman" are not necessarily the same. While the former may be an amateur, the latter is a professional.
25 Of course, other strategies are possible. Rather than trying to find the next word (breadth first), the speaker could work in depth, trying to finalize the surface form of the first lexical element, that is, determine its syntactic category, i.e., part of speech. In that case he would pursue lexicalisation only once the final form of the preceding element has been computed. Another alternative would be to start the process by determining syntactic structure, then inserting lexical items into the computed syntactic slots. Which strategy is chosen under what condition remains an empirical question.
26 The assignment of part of speech to words (or, more precisely, to word stems) is performed at the PSS level. This could be done via the mapping rules described in Figure 4. However, these rules might be insufficient, in particular if there are several candidates. Ultimately we do need a grammar in order to check at the different points of the chain the possibility of a given category. While the mapping rules specify how a given concept or conceptual chunk may be mapped onto a syntactic category or syntactic structure, formal grammars specify what categories can, or should, occur at a specific point in time, that is, at a specific point in the chain.
27 Please note that at this stage words are not inflected. What we get are the base forms of words. Note also that, unlike Kempen & Hoenkamp (1987), or Nogier (1991), who conflate the last two steps into one, we do not assume that syntactic information (syntactic functions, syntactic categories) is computed simultaneously with the root forms. Part of speech and syntactic functions are determined later. There is one empirical finding, though, which is troublesome for our approach: when people fail to find the right word, they tend to come up with an alternative (synonym) which belongs to the same syntactic category as the one they were looking for.
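Point (i) above, lexicalisation as pattern matching over the message graph, can be sketched in a few lines. Everything below (the triple encoding, the lexicon entries, the greedy strategy) is a hypothetical illustration under the assumptions of footnote 23, not the author's implementation:

```python
# A minimal sketch of lexicalisation as pattern matching (all entries are
# hypothetical): the message is a set of relation triples, each lexicon
# entry "covers" a subgraph, and the message graph is consumed stepwise,
# preferring maximal coverage, until a lexicalised conceptual graph
# (here: just the chosen words) remains.

MESSAGE = {
    ("CATCH", "AGT", "MAN"), ("CATCH", "OBJ", "FISH"),
    ("MOVE", "AGT", "MAN"), ("MOVE", "LOC", "GROUND"),
    ("MOVE", "MANR", "FAST"),
}

# Each entry pairs a word with the conceptual subgraph it expresses.
LEXICON = [
    ("fisherman", {("CATCH", "AGT", "MAN"), ("CATCH", "OBJ", "FISH")}),
    ("run",       {("MOVE", "AGT", "MAN"), ("MOVE", "LOC", "GROUND")}),
    ("fast",      {("MOVE", "MANR", "FAST")}),
]

def lexicalise(message, lexicon):
    """Greedily consume the message graph, most specific entries first."""
    remaining, lcg = set(message), []
    for word, pattern in sorted(lexicon, key=lambda e: -len(e[1])):
        if pattern <= remaining:      # the word's definition matches here
            lcg.append(word)
            remaining -= pattern      # stepwise consumption of the graph
    return lcg, remaining

words, leftover = lexicalise(MESSAGE, LEXICON)
```

Sorting by coverage size implements the "maximal coverage" preference of footnote 23; an empty `leftover` means the whole message has been lexicalised.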
SENTENCE GENERATION BY PATTERN MATCHING
Fig. 2: The lexicon as a mediator between the conceptual and linguistic levels28
CL: conceptual level             LCS: lexicalised conceptual structure
LM1: lexical mapping for word1   PSS: preliminary syntactic structure
LM2: lexical mapping for word2   FSS: final syntactic structure
(contraction) if the syntactic structure is computed via the lexicon (contraction of the message graph to a lexicalised conceptual graph). Actually, this is one of the principal reasons for computing words before syntactic structure: large parts of the message graph become reduced to a relatively small lexical graph (compare the graphs at the conceptual level and the LCS level in Figure 2).29 If the syntactic structure is computed before words are chosen, syntactic constraints may be used to choose among lexical alternatives (synonyms). This may happen in the case of parallel structures, where one part constrains the other. Compare:
(i) We were expecting the worst, but hoping for the best.
(ii) * We were expecting the worst, but hoped for the best.
In such a case one uses the same tense in both clauses. Obviously, one has to justify the fact that lexicalisation precedes the determination of syntactic structure. Basically there are three possible strategies:
1. Syntactic structure is determined prior to word choice. This strategy is implied in traditional generative grammars, where the syntactic tree is built top down. The words are inserted fairly late during the derivational process into syntactically specified slots.
2. Word choice precedes syntax (syntactic trees are built bottom up).
3. Words and their syntactic structure are computed in parallel.30
Of course, either of these strategies could be used, though we don't believe very much in the first option, for the following reasons. First and above all, the speaker wants to convey meanings. Yet, syntax conveys little meaning compared to words. Second, if syntactic structures are to be computed on the basis of conceptual graphs whose nodes are more elementary than words, it is hard, if not impossible, to compute the syntactic structures first: the message graphs are simply too big to make such a strategy very feasible. In addition, it doesn't really make sense to compute the syntactic structure at this point of the process, since large parts of the graph will be reduced at the next step (lexicalisation) anyhow. Last, but not least, syntactic structure depends to a large extent on the subcategorisation features of the words used (see Table 6). Hence, it cannot be computed entirely without knowing the words that are being used.
Obviously, it would be nice to decide on this issue on the basis of empirical work. Unfortunately, for the time being we lack conclusive psycholinguistic evidence. For a good discussion, though, see Aitchison (1983, Chapter 11).
28 For illustrative purposes we assume here serial processing. Yet there seems to be evidence that people compute words (and perhaps even syntax) in parallel. For pointers to the relevant literature, see (Levelt 1989).
29 One may wonder, though, whether this is not an artificially induced problem, resulting from our way of modelling the process.
30 One could also think of a hybrid solution: depending on the situation, priority is given to syntactic or to lexical choice.
6 What do syntactic structures depend upon?
Syntactic structures depend basically on three sets of variables, or choices: conceptual, pragmatic and linguistic.
Conceptual choices. Different conceptual structures generally map onto different syntactic structures:

CONCEPTUAL STRUCTURE                                  LINGUISTIC STRUCTURE   STRING
[PERSON: #] <-attr- [SPEED: HIGH]                     Pron+Copula+Adj        She is fast
[PERSON: #] <-agnt- [MOVEMENT] -attr-> [SPEED: HIGH]  Pron+V+Adv             She runs fast
As we have shown already (Tables 3 and 5), there is no one-to-one mapping between conceptual structures and their linguistic correlates (parts of speech, syntactic structures). Different conceptual structures or relationships may be expressed by the same kind of linguistic structure.31 This is particularly obvious for genitives:
Peter's car ... [possession]
Peter's brother ... [family relationship]
Peter's leg ... [inalienable possession, part of]
Conversely, the same conceptual structure or relationship, for example the notion of possession, may map onto different linguistic structures or forms (paraphrase):
This car belongs to the president. [verb]
This is the car of the president. [preposition]
This is the president's car. [case: genitive]
This is his car. [possessive pronoun]
The conceptual structure is by no means sufficient in order to decide on the linguistic structure. Pragmatic and linguistic information is also necessary. What can be assumed to be known? Is there a word for a given concept? Can this word be used in this particular way? Suppose we were to express the following idea: (PERSON) be (PROFESSION) (PLACE). In that case one can use in English a verb if the person referred to is a teacher, but not if s/he is a professor or doctor:
31 This is probably for reasons of economy. The number of possible conceptual structures is enormous. If there were only one-to-one correspondences, we would have to learn a great many different syntactic structures. Furthermore, we would have to create a new syntactic structure every time we invented a new conceptual structure.
Kathy is a teacher at Columbia.
She teaches at Columbia.
?/*She professes at Columbia.
John is a doctor at the hospital.
*He doctors at the hospital.
Pragmatic choices: Syntactic structures act as cues: different syntactic structures express not only different content (conceptual structures), they also signal different discourse purposes (goals). Furthermore, they can reflect assumptions made by the speaker concerning the hearer's background, knowledge and beliefs. The way something is expressed depends on such factors as prominence (what should be highlighted or subordinated), background knowledge and perspective. The value of these variables is reflected in sentence type (simple vs. complex form, main clause vs. subordinate clause), determiner (definite vs. indefinite article), part of speech (noun vs. pronoun) and voice (active vs. passive).32 The following two sentences make different assumptions about the listeners' knowledge:
(1a) The man who had stolen the car escaped from prison.
     given: steal(man007, car205)
     new:   escape(man007, prison03)
(1b) The man who escaped from prison had stolen a car.
     given: escape(man007, prison03)
     new:   steal(man007, car)
In the first case they are supposed to know about the man's stealing of the car, whereas in the latter they are expected to be familiar with the man's escape from prison (see Clark & Haviland 1977). This means for the speaker that s/he should express the known part in the subordinate clause and the new part in the main clause.
Linguistic choices: Of course, language may impose further constraints on the syntactic structure. For example, the choice of a particular verb (e.g., to tell) may prohibit a specific syntactic structure. Take for example the structure in Figure 3. Depending on the verb chosen, different syntactic structures are necessary:
to say:  John said to Mary that he was tired.
         John said to Mary: "I am tired."
to tell: John told Mary that he was tired.
         * John told Mary: "I am tired."33
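The say/tell contrast can be captured by subcategorisation frames. A minimal sketch with a hypothetical two-verb lexicon (the frame names are illustrative, not a standard inventory):

```python
# A minimal sketch (hypothetical lexicon) of how a verb's subcategorisation
# frame licenses or blocks a syntactic structure, as with "say" vs. "tell".

FRAMES = {
    "say":  {"that-clause", "direct-speech"},
    "tell": {"that-clause"},
}

def licenses(verb, structure):
    """True if the verb's frame admits the given complement structure."""
    return structure in FRAMES.get(verb, set())

ok = licenses("say", "direct-speech")    # John said to Mary: "I am tired."
bad = licenses("tell", "direct-speech")  # * John told Mary: "I am tired."
```

A generator consulting such frames would reject the starred structure before surface realisation, which is exactly the kind of lexical constraint on syntax the text describes.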
The above example illustrates the interaction between lexical choices and syntactic structure. In other words, a syntactic structure cannot be chosen in isolation, or solely on conceptual grounds.34 Another example that nicely shows the far-reaching consequences of lexical choice is the following. Suppose we were to express in French the following idea: HELP past perfect(JOHN, MARY). Suppose furthermore that John and Mary were known. Depending on the chosen verb (aider, venir en aide, rendre service), various aspects of the surface form would change (see Table 6): the pronoun (la vs. lui), the auxiliary (être vs. avoir), the object's agreement marking on the verb (aidée vs. aidé). For more details see Zock (1994).

Table 6: "He has helped her" (consequences of the verb choice on clitics and auxiliaries)

In the next section we will show how one can recognise certain fundamental syntactic structures on the basis of the formal characteristics of the conceptual structure.

32 For empirical evidence see (Clark & Haviland 1977; Olson 1970; Olson & Filby 1972; Osgood 1971, 1980; Tannenbaum & Williams 1968).
33 By convention we shall use a question mark for odd-sounding forms, and an asterisk for ungrammatical sentences. The non-native speaker of English may be puzzled by this subtlety of English, yet it exists and is explainable. The reason why "I am tired", following "John told Mary", sounds odd is due to the fact that the verb tell means something like to report. A report being something that follows the event it describes or reports, it looks strange if the speaker switches from a reported event to a present event, i.e., direct speech.
34 For more details on how linguistic structures may vary as a function of pragmatic, conceptual and linguistic choices, see Zock (1988; 1994).
7 Prototypical patterns
If we look at syntactic structures, we discover after a while strong correlations with their conceptual counterparts, i.e., conceptual structures (see Table 3 and Figure 4). For example, nouns usually represent entities, verbs express actions, adverbs stand for manner, etc.
Once we have discovered that, we can use this knowledge the other way round, that is, for generation. Hence, entities may map onto nouns, actions/events/processes may be expressed by verbs, etc.36 Obviously, what holds for concepts holds also for larger conceptual chunks. Hence, looking at the lexicalised conceptual structure one can predict to some extent not only the part of speech, but also the potential syntactic structure (see Table 5). Figure 4 encodes some typical patterns.35 By convention, we will use circles for the predicates (verbs, adjectives, adverbs)37 and boxes for the arguments (nouns, propositions). It should be noted, however, that the mapping approach, in order to become feasible, imposes a special constraint on the knowledge representation, consistency: elements being morphologically different, yet playing semantically a similar role, should be coded the same way. This is the case with verbs, adjectives and adverbs (see Figure 4). Though syntactically different, they are conceptually alike, hence coded the same way (by a circle). All of them are predicates, the difference hinging on the nature and function of their arguments: entities for verbs or adjectives; actions, events, or processes for adverbs. In addition, information on the links is relevant. In order to decide whether a predicate pointing to an entity should map onto a verb or an adjective, one must check whether the link is a case role (verb) or an attribute (adjective). One might also appreciate the similarity between infinitives and that-clauses, the difference hinging on coreference and on the kind of verb: that-clauses require specific kinds of verbs. Figure 5 shows a simple conceptual structure (5A) and the way one may get progressively (5B, 5C) to its linguistic counterpart (5D). Obviously, there is more than one way to build the corresponding tree (top down, bottom up, bidirectionally).38 Anyhow, the language user, having performed this kind of process a number of times, will gradually become able to go directly from the conceptual structure to the (prefinal) linguistic form.
In the next section we will show how one can recognise one particular syntactic structure, relative clauses.
35 The shaded nodes signal particularly relevant information for a given syntactic structure.
36 As we have pointed out already, this need not always be the case: actions may well map onto nouns (nominalisation), directions may be expressed by prepositions or a verb, etc. Also, our examples hold for English. While the ontological categories (entities, actions, attributes, etc.) are to a large extent universal, the mappings are language dependent.
37 That's why we will not use them for indicating the links. Another deviation with regard to John Sowa's notation is the direction of the links for adverbs.
38 Please note that we have here a mixed strategy of pattern matching and incremental processing. It might also be worthwhile mentioning that 3B is a hybrid form: it contains conceptual and syntactic information.
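The decision criteria just described (kind of argument, case role versus attribute link) can be written down as a mapping rule. The encoding and link names below are hypothetical, a sketch rather than the rules of Figure 4:

```python
# A sketch of the prototypical mapping rules: a predicate's syntactic
# category is read off the kind of its argument and the nature of its
# links (link names such as "attr", "agt" are hypothetical).

def category(links, argument_kind):
    """links: set of link types leaving the predicate;
    argument_kind: 'entity' or 'event'."""
    if argument_kind == "event":
        return "adverb"                    # predicate over an action/event/process
    if "attr" in links:
        return "adjective"                 # attribute link to an entity
    if links & {"agt", "obj", "loc"}:
        return "verb"                      # case roles to entities
    return "unknown"
```

For example, a predicate linked to entities via case roles would come out as a verb, while one attached by an attribute link would come out as an adjective, mirroring the consistency constraint above.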
Fig. 5: From conceptual to linguistic structures

8 Where do relative clauses come from, how can they be recognised, and what do they depend upon?
These are actually three questions, for which, due to space constraints, we will provide only sketchy answers. Relative clauses add information. In that respect they are similar to adjectives. The information given may be crucial for the identification of the referent (restrictive relative clause) or not (non-restrictive relative clause). The latter case is generally marked by a comma or a pause. A typical situation for a relative clause arises when some entity participates in more than one event: little(boy: #5) & fast(run(boy: #5)). This is a conceptual condition which can be captured by a mapping rule: an entity being pointed at by two opposing arcs (see Figure 5). This formal characteristic might be used by a speaker, recognising this structure as a potential relative-clause candidate.39 We use the word "potential" because the condition mentioned above, though necessary, is by no means sufficient. Actually, the conceptual structure could be expressed in any of three ways: (1) by a simple sentence, (2) by two independent clauses, or (3) by a relative clause.40
(1) The little boy runs fast.
(2) The boy is little. He runs fast.
(3) The boy who is little runs fast.
Besides signalling the fact that an entity participates in more than one event, relative clauses signal relative prominence. Put differently, relative clauses factorise and highlight information. In addition, they are devices for increasing processing time and for allowing for spontaneity (the expression of afterthoughts, i.e., thoughts that were not planned at the onset of articulation). The extra time they allow for may be needed for encoding the rest of the main clause. Yet, as we shall see, there is more to the generation of relative clauses than just coreference. Take for example the following propositions (see Figure 6):
(1) rob(man007, bank)
(2) arrest(police, man007)
(3) escape(man007, prison)
(4) admire(woman, man007)
When communicating these events one has to consider several factors. Chunking: shall all these events be expressed in a single sentence, that is, a series of independent clauses, a coordinated sentence, or a relative clause? In the latter case one has to pay attention to the role played by the coreferential element (man) in each clause. What role does the to-be-embedded element play (agent, patient)? Does it play the same role in both events (that is, in the future matrix clause and subordinate clause)? The clauses (1,3) and (2,4) are symmetrical in the sense that in both cases the man plays the same role. In the first case (1,3) he is the agent, whereas in the second case (2,4) he is the patient. This symmetry can, of course, have an effect on realisation.
39 As one can see, mapping rules, or the cues a speaker may be sensitive to, are not only language specific, but also dependent on the knowledge representation formalism. If messages are coded in terms of semantic networks or conceptual graphs, then nodes and links become of focal interest; if one uses first-order logic (propositions), then identity of reference might be a crucial element to be looked for.
40 Please note, again, that there are subtle, though important, differences between these forms at the pragmatic level, differences which we will not deal with here.
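The conceptual condition (an entity pointed at by arcs from several events) and the symmetry check just described can be sketched as follows. The role labels are simplified assumptions; in particular the second argument of escape is treated uniformly as a patient slot for illustration only:

```python
# A sketch of relative-clause candidate detection: collect, for each
# entity, the (event, role) pairs it participates in; entities occurring
# in more than one event are candidates, and a pair of events is
# "symmetrical" when the shared entity plays the same role in both.

PROPS = [
    ("rob",    {"agent": "man007", "patient": "bank"}),
    ("arrest", {"agent": "police", "patient": "man007"}),
    ("escape", {"agent": "man007", "patient": "prison"}),
    ("admire", {"agent": "woman",  "patient": "man007"}),
]

def candidates(props):
    """Map each entity to the (event, role) pairs it participates in,
    keeping only entities that occur in more than one event."""
    occurrences = {}
    for pred, roles in props:
        for role, entity in roles.items():
            occurrences.setdefault(entity, []).append((pred, role))
    return {e: ps for e, ps in occurrences.items() if len(ps) > 1}

def symmetrical(pairs):
    """True if the entity plays one and the same role in all events."""
    return len({role for _, role in pairs}) == 1

cands = candidates(PROPS)
rob_escape = [p for p in cands["man007"] if p[0] in ("rob", "escape")]
```

On this toy input only man007 qualifies; the (rob, escape) pair is symmetrical (agent in both), while the full set of four events is not, matching the symmetry observation in the text.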
Fig. 6: Conceptual input

Yet, there are still quite a few other factors playing a role: order of mention (linear order of clauses, topicalisation), communicative status (definite vs. indefinite), presupposition (known, unknown), tense. Some of these decisions are prior to syntactic processing, and their consequences (communicative status) should be part of the input. Suppose we were to express the following events (see Figure 7):41
Event-1 (E1) rob(man, bank)
Event-2 (E2) escape(man, prison)
By varying systematically certain parameters such as chunk size (independent vs. complex clause, coordination vs. subordination), order of events, topicalisation, communicative status of the participants (definite vs. indefinite),42 tense, etc., we will notice that certain structures are not possible, while others, though grammatically correct, sound simply odd. It is by analyzing this data that we could get an answer to our question "what factors codetermine relative clauses".
41 For the sake of simplicity, these representations will not include information concerning time, tense, aspect, mood, etc. It should be noted, though, that this kind of information may play an important role, constraining the use of a particular structure.
42 For example, the communicative status of the common object has to be identical in the main clause and the subordinate clause, as it would be incorrect to produce: A man who had robbed the bank escaped from prison. Put differently, one can't relativise input structures like rob(man:*, bank) & escape(man:#, prison) or rob(man:#, bank) & escape(man:*, prison) because the man, though being coreferential, does not have the same communicative status.
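The constraint of footnote 42 amounts to a simple check on the shared referent. A sketch, using '#' for definite and '*' for indefinite status as in the footnote's notation:

```python
# A sketch of footnote 42's constraint: two propositions can be merged
# into a relative clause only if the shared referent is coreferential
# AND carries the same communicative status ('#' definite, '*' indefinite).

def relativisable(ref1, ref2):
    """Each ref is a pair (entity, status), e.g. ('man', '#')."""
    same_entity = ref1[0] == ref2[0]
    same_status = ref1[1] == ref2[1]
    return same_entity and same_status

ok = relativisable(("man", "#"), ("man", "#"))
status_clash = relativisable(("man", "*"), ("man", "#"))  # coreferential, wrong status
```

So rob(man:#, bank) & escape(man:#, prison) passes, while the mixed */# inputs of the footnote are blocked before any syntactic decision is taken.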
Fig. 7: A man robbing a bank and escaping from prison
Unfortunately, for reasons of space, we cannot perform such a systematic analysis here, though it would be worth the effort.

8.1 Discussion
Table 7 highlights several interesting problems. The conceptual input encoded in Figure 7 is clearly underspecified. Hence, the precise syntactic form is undecidable. This shows up in Table 7 in the large variety of possible forms, forms which nevertheless express subtle differences. It also shows up in the changes of tense, which are not trivial at all. Let's have a closer look at some of the sentences. Sentence 1a, though grammatically correct, sounds odd as it seems to lack information, namely: having robbed a bank, the man was put into jail. This effect would be less striking in a coordinated structure, where the listener would get the feeling that the speaker expressed a list of actions performed by some man: A man has robbed a bank and escaped from prison. By comparing the sentences (1b, 1d) and (2b, 2c), one can see to what extent 'communicative status' (definite vs. indefinite) and 'tense' (see 1a and 1b-1e) may affect the interpretation, hence the acceptability, of the sentence.
(1) ORDER: E1 > E2; TOPIC: man
    Independent clauses:
    (1a) A man has robbed a bank. He escaped from prison.
    Subordination:
    (1b) A man who had robbed a bank escaped from prison.
    (1c) ? A man who had robbed a bank had escaped from prison.
    (1d) The man who had robbed a bank escaped from prison.
    (1e) ? The man who had robbed a bank had escaped from prison.

(2) ORDER: E1 < E2; TOPIC: man
    Independent clauses:
    (2a) A man (had) escaped from prison. He had robbed the bank.
    Subordination:
    (2b) ? A man who (has) escaped from prison had robbed a bank.
    (2c) The man who (has) escaped from prison had robbed a bank.

(3) ORDER: E1 > E2; TOPIC: bank
    Independent clauses:
    (3a) A bank was robbed by a man. He had escaped from prison.
    Subordination:
    (3b) The bank was robbed by the man who had escaped from prison.

(4) ORDER: E1 < E2; TOPIC: prison
    Independent clauses:
    (4a) From prison escaped a man. He had robbed a bank.
    Subordination:
    (4b) * From prison escaped the man who had robbed the bank.
    (4c) ? The prison from which the man who robbed the bank escaped ...

E1 > E2 means: event-1 precedes event-2.
Table 7: Stylistic effects due to the variation of parameters like chunk size, order of mention, topicalisation, etc.

Topicalisation is another factor. While we can describe the scene from the man's point of view, we can't start linearisation from the prison: (4b) is an incorrect sentence, while (4c) is incomplete. Syntactic constraints like passivisability may play a role too. For example, the verb "to escape" can't be passivised. Presuppositions are another factor. The following sentences presuppose specific information concerning temporal order.
The man who had robbed the bank escaped from prison.
The man who escaped from prison had robbed the bank.
The bank was robbed by the man who had escaped from prison.
Put differently, order of events may impose special constraints on syntactic structure. The choice between coordination or subordination may depend on conceptual information given with the input. Also, implicatures may vary depending on linear order. Compare Levelt's well-known example (Levelt 1989):
She became pregnant. They got married.
They got married. She became pregnant.
Likewise, compare the following two sentences, which express basically the same events, yet with different emphasis.
1. Hitler has often been compared with Napoleon, although there are many differences between the two men.
2. Although there are many differences between Hitler and Napoleon, the two men have often been compared.
While the first version focuses on the differences, the second stresses the similarities between the two men. Figure 8 is similar to the preceding one; again the two clauses are symmetrical. However, this time the entity to be relativised plays the role of the patient.
Fig. 8: A man arrested by the police and admired by a woman

Depending on certain temporal or attentional givens (focus), the structure in Figure 8 can be expressed in the following ways:
(1a) The man (whom) the woman admired was arrested by the police.
(1b) ? The man (whom) the police arrested was admired by the woman.
(1c) ? The man (who was) arrested by the police was admired by the woman.
(2a) The woman admired the man whom the police arrested.
(2b) The woman admired the man (who was) arrested by the police.
(3a) The police arrested the man whom the woman admired.
It should be noted that this time linearisation can be started from any node; that is, from a linguistic point of view there are no focus constraints.
This is probably due to the fact that all the verbs used can be passivised. It may be worth mentioning, however, that the passive voice occurs only in the main clause. Sentences (1b) and (1c), while not incorrect, sound distinctly odd. The next structure (Figure 9) is different from the former in that the "man" plays a different role in each event: the structure is asymmetrical. In one case the "man" plays the role of the agent, whereas in the other he is the patient.
Of course, this fact may be reflected in the surface structure. According to the role played, the "man" will surface as subject or as object of the main verb. Again there are constraints on topicalisation, but for different reasons. The last sentence (3) sounds odd, as it gives the impression that the police had arrested the man before his robbing the bank. In general, it is not a good idea to subordinate a clause that expresses a consequence.
(1) The man who had robbed the bank was arrested by the police.
(2) The police arrested the man who had robbed the bank.
(3) ? The bank was robbed by the man whom the police arrested.

9 Discussion
What can be learnt from looking at these networks? In order to recognise a candidate for a given syntactic structure one must consider several factors: the type of concept (predicate/argument), its communicative status (definite/indefinite), its role with regard to the whole (predicate dominating an argument or dominating another predicate), and the nature (case role) and direction of the arcs (incoming vs. outgoing). Yet, several other points are worth mentioning:
1. It is not enough to look just at one predicate or argument (local strategy); one has to look at larger chunks. Typically, the formal characteristics of the surrounding predicates or arguments also play a role. The relevant information being spread all over, one has to look at entire conceptual configurations.43
2. The formal characteristics (conceptual conditions) of the underlying conceptual structures are by no means sufficient for determining the syntactic form. They only suggest potential candidates. Other factors need to be taken into account, most prominently: the size of the conceptual chunks to be verbalised,44 shared knowledge (definiteness), topicalisation (active vs. passive voice, type of embedding), the subcategorisation features or syntactic requirements of a particular word,45 and, last but not least, the relative prominence of each clause (saliency, focus), i.e., what shall be put into perspective, that is, be expressed by a main or a subordinate clause? Syntactic structures are generally the result of an interaction between conceptual, linguistic and discourse choices.
3. The conceptual structure taken as input needs to contain a lot more information than the graphs shown here. Otherwise it is not possible to decide on a communicatively adequate syntactic structure.46
In conclusion, despite obvious correlations, the parallelism between conceptual structure and linguistic form is relative. There is no one-to-one mapping. While an Agent-Action-Patient structure is likely to be expressed in English by an S-V-O pattern, we cannot tell solely on these grounds that the speaker will render this idea in an active form. Similarly, the choice between a that-clause and an infinitive cannot always be made on purely conceptual grounds.
The structure-building properties (syntactic characteristics) of the verb must also be taken into account, all the more as the choice of a particular verb may turn out to be incompatible with the chosen syntactic structure. In sum, one cannot strictly separate the syntax from the lexicon.
43 We believe that the idea of a central vs. peripheral view (i.e., the idea that there is more to come) has some psycholinguistic reality.
44 According to the amount of information the speaker tries to integrate into a sentence frame, he may end up with several independent clauses, or one complex, heavily embedded clause (subordinate clauses).
45 Not all verbs can be nominalised or passivised. Some words constrain other words (collocations), etc.
46 A question that arises in that context is how complex a pattern may be, that is, how much information it can contain, without losing one of its most fundamental characteristics: recognisability.

10 Conclusion
We have argued in this paper that human performance, that is, verbal fluency as observed in spontaneous discourse production, is only explicable if one hypothesises global strategies combined with pattern matching: the speaker operates on larger chunks. We have also pointed out that, for reasons of economy (storage), conceptual structures and linguistic structures ought to be parallel, at least to some extent (principle of structure preservation). Finally, we have outlined a strategy for fast computation of syntactic structures. We have shown that the same mechanism, pattern matching, could be used simultaneously for choosing words and (pre)computing syntactic structures. We have suggested that this be done in the following way. A given message is reduced to a lexicalised conceptual graph (lexicalisation). This graph serves as input for the (pre)computation of syntactic structure: ontological categories (entity, state, deep case roles, etc.) are replaced by syntactic categories (part of speech, syntactic functions). This preliminary syntactic structure is then checked against lexical subcategorisation features (for example, passivisability of verbs) and the result is handed to the morphological component for final operations (agreement, insertion of function words). We have claimed furthermore that, in order to precompute the syntactic structure, the user looks at the formal characteristics of the conceptual input.47 It is by looking at cues like the type of concept (predicate vs. argument), its relative position with respect to the whole, and the type and direction of the links (incoming vs. outgoing), etc., that s/he decides, i.e., precomputes, whether a given concept or conceptual chunk should map onto, let us say, a noun, a verb, an adjective, an infinitival, relative, or that-clause, etc. We would like to take this opportunity to clarify our position with regard to formal grammars and incremental processing.
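The pipeline summarised in the preceding paragraph can be sketched end to end. Every function body below is a placeholder standing in for the real component, not the author's implementation; the toy message and categories are assumptions:

```python
# Bird's-eye sketch of the proposed pipeline (placeholder components):
# message graph -> lexicalised conceptual graph (LCG) -> preliminary
# syntactic structure (PSS) -> subcategorisation check -> morphology.

def lexicalise(message):
    """Reduce the message graph to an LCG (stub)."""
    return [("man", "entity"), ("catch", "action"), ("fish", "entity")]

def to_pss(lcg):
    """Replace ontological categories by syntactic ones."""
    cat = {"entity": "noun", "action": "verb"}
    return [(stem, cat[kind]) for stem, kind in lcg]

def check_subcat(pss):
    """Check against lexical subcategorisation features (stub: accept)."""
    return pss

def morphology(pss):
    """Final operations: agreement, function words (stub: join the stems)."""
    return " ".join(stem for stem, _ in pss)

sentence = morphology(check_subcat(to_pss(lexicalise(None))))
```

The interesting claim is not in any single stub but in the staging: words are fixed before syntactic categories, and subcategorisation is checked before morphology, exactly the ordering argued for in Sections 5 and 6.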
By suggesting that lexicalised conceptual structures be mapped directly onto (final) linguistic forms, i.e., syntactic structures, we may have given the reader the wrong impression that one could do without a formal grammar. Obviously, this is not so. There is still assembly, and the legal combinations, i.e., the pieces that can go together, are still specified by a formal grammar. For example, if

47 This generally makes sense only once lexicalisation has taken place.
SENTENCE GENERATION BY PATTERN MATCHING
we used unification, then we would simply try to unify larger chunks than in most other approaches. As one can see, we do not mean to bypass the grammar; we simply mean that, depending on the situation and the speaker's proficiency, grammaticalisation is performed on larger chunks. This last statement should also clarify our position with regard to incremental processing. We do not mean to criticise the basic idea behind it, quite the contrary. We simply challenge the view of word-to-word processing. Indeed, there may be a whole spectrum of units, going from very small items (words or even below) to fairly large chunks (fixed expressions, patterns). The size of the units and the way to process them may vary depending on the user and the situation (cognitive states). We do have one major criticism, though, of formal grammars. To our knowledge, they do not make explicit the correspondences, or mappings, between conceptual structures and linguistic forms (see Figure 1). Hence, formal grammars miss a link. Yet this link is fundamental and lies at the heart of our approach. Obviously, many details are still lacking, especially concerning the size of the units, the recovery strategies (what to do in case of failure), and, last but not least, the mapping rules. Actually, so far we have shown only the tip of the iceberg. We need to make a list of the possible conceptual structures and their linguistic counterparts, and we have to specify the mapping rules and their constraints. Despite all these shortcomings, and despite the fact that our approach still lacks formal treatment and the test of implementation, though the PROTECTOR generation system (Nicolov, Mellish & Ritchie 1997) is a serious step in that direction,48 it embodies in principle at least two interesting facts: procedural knowledge can be stated explicitly via the mapping rules; processing can be speeded up by operating on larger chunks rather than atomic units.

Acknowledgements.
The author would like to express his gratitude to all those who were so kind as to comment on the initial draft: Dominique Estival, Aravind Joshi, Guy Lapalme, Yves Lepage, William Levelt, Terry Patten, Alain Polguère, Ehud Reiter and Dan Tufis. Special thanks go to Nicolas Nicolov, who devoted a considerable amount of time to the long discussions through which a lot of important theoretical points were clarified, and who helped with editing the manuscript.

48 The use of D-Tree Grammars in PROTECTOR allows the system to operate on larger conceptual units.
MICHAEL ZOCK

REFERENCES
Aitchison, Jean. 1983. The Articulate Mammal: An Introduction to Psycholinguistics. London: Hutchinson.
Anderson, John. 1983. The Architecture of Cognition. Cambridge, Mass.: Harvard University Press.
Arbib, Michael, E. Conklin & Jane Hill. 1986. From Schema Theory to Language. New York: Oxford University Press.
Ausubel, David. 1980. "Schemata, Cognitive Structure and Advance Organisers: A Reply to Anderson, Spiro and Anderson". American Educational Research Journal 17:3.400-404.
Bartlett, Frederik. 1932. Remembering. Cambridge: Cambridge University Press.
Becker, Joseph. 1975. "The Phrasal Lexicon". BBN Report No. 3081. Cambridge, Mass.: Bolt Beranek & Newman.
Bobrow, Daniel. 1968. "Natural Language Input for a Computer Problem Solving System". Semantic Information Processing ed. by Marvin Minsky, 33-145. Cambridge, Mass.: MIT Press.
& G. Norman. 1975. "Some Principles of Memory Schemata". Representation and Understanding: Studies in Cognitive Science ed. by Daniel G. Bobrow & A. M. Collins, 31-149. New York: Academic Press.
Bock, Kathryn, H. Loebell & R. Morey. 1992. "From Conceptual Roles to Structural Relations: Bridging the Syntactic Cleft". Psychological Review 99:1.150-171.
Brown, Roger. 1958. "Linguistic Determinism and the Part of Speech". Psychological Review 65:1.14-21.
Bruner, Jerome. 1973. Beyond the Information Given: Studies in the Psychology of Knowing ed. by J. Anglin. New York: W. W. Norton.
Burton, Richard. 1976. "Semantic Grammar: An Engineering Technique for Constructing Natural Language Understanding Systems". BBN Report No. 3453. Cambridge, Mass.: Bolt Beranek & Newman.
Chomsky, Noam. 1959. Review of Verbal Behavior by B. F. Skinner (New York, 1957). Language 35:26-58.
Clark, Herbert & E. Clark. 1977. Psychology and Language: An Introduction to Psycholinguistics. New York: Harcourt Brace Jovanovich.
& S. Haviland. 1977. "Comprehension and the Given-New Contract". Discourse Production and Comprehension ed. by R. O. Freedle, 1-40. Norwood, N.J.: Ablex.
de Smedt, Koenraad. 1990. Incremental Sentence Generation: A Computer Model of Grammatical Encoding. Ph.D. dissertation (also NICI TR 90-01), University of Nijmegen, The Netherlands: Nijmegen Institute for Cognition Research and Information Technology.
van Dijk, Teun. 1977. "Semantic Macro-Structures and Knowledge Frames in Discourse Comprehension". Cognitive Processes in Comprehension ed. by M. A. Just & P. A. Carpenter, 3-31. Hillsdale, N.J.: Erlbaum.
Fillmore, Charles. 1977. "Scenes-and-Frames Semantics". Linguistic Structures Processing ed. by Antonio Zampolli, 55-82. Amsterdam: North Holland.
Friedman, Daniel & M. Felleisen. 1987. The Little LISPer. Cambridge, Mass.: MIT Press.
Fries, Charles C. 1952. The Structure of English: An Introduction to the Construction of English Sentences. New York: Harcourt Brace & World.
Goffman, Erving. 1974. Frame Analysis: An Essay on the Organisation of Experience. Cambridge, Mass.: Harper & Row.
Habel, Christopher. 1988. "Cognitive Linguistics: The Processing of Spatial Concepts". LILOG Report 45. Stuttgart, Germany: IBM.
Harris, Zellig S. 1951. Methods in Structural Linguistics. Chicago: University of Chicago Press.
Hendrix, Gary. 1977. "The LIFER Manual: A Guide to Building Practical Natural Language Interfaces". Technical Note 138. Menlo Park: SRI.
Hill, Jane & M. Arbib. 1984. "Schemas, Computation and Language Acquisition". Human Development 27:282-296.
Hovy, Edward. 1990. "Unresolved Issues in Paragraph Planning". Current Research in Natural Language Generation ed. by Robert Dale, Chris Mellish & Michael Zock, 17-41. London: Academic Press.
Kant, Immanuel. 1781. Critique of Pure Reason. Translated by Max Müller. Garden City, N.Y.: Anchor Books.
Kempen, Gerard. 1977. "Conceptualising and Formulating in Sentence Production". Sentence Production: Developments in Research and Theory ed. by S. Rosenberg, 259-274. Hillsdale, N.J.: Erlbaum.
& E. Hoenkamp. 1987. "An Incremental Procedural Grammar for Sentence Formulation". Cognitive Science 11.201-258.
Koffka, Kurt. 1935. The Principles of Gestalt Psychology. New York: Harcourt Brace & World.
Lado, Robert. 1964. Language Teaching: A Scientific Approach. New York: McGraw-Hill.
Langacker, Ron. 1983. Foundations of Cognitive Grammar I & II. Bloomington: Indiana University Linguistics Club.
Levelt, Willem J. M. 1981. "The Speaker's Linearisation Problem". Philosophical Transactions of the Royal Society, London, B295, 305-315.
. 1982. "Linearization in Describing Spatial Networks". Processes, Beliefs and Questions: Essays on Formal Semantics of Natural Language and Natural Language Processing ed. by Stanley Peters & Esa Saarinen, 199-220. Dordrecht, Holland: Reidel.
. 1989. Speaking. Cambridge, Mass.: MIT Press.
Mandler, Jean. 1979. "Categorial and Schematic Organisation in Memory". Memory Organisation and Structure ed. by C. Puff, 259-299. New York: Academic Press.
McKeown, Kathleen. 1984. Text Generation. Cambridge: Cambridge University Press.
Mel'cuk, Igor & A. Zholkovskij. 1970. "Towards a Functioning 'Meaning-Text' Model of Language". Linguistics 57:10-47.
Miller, George. 1956. "The Magical Number Seven, Plus or Minus Two: Limits on our Capacity for Processing Information". Psychological Review 63:2.81-97.
& Philip Johnson-Laird. 1985. Language and Perception. Cambridge: Cambridge University Press.
Minsky, Marvin. 1975. "A Framework for Representing Knowledge". The Psychology of Computer Vision ed. by Patrick Winston, 211-277. New York: McGraw Hill.
. 1985. The Society of Mind. New York: Simon & Schuster.
Nicolov, Nicolas, Chris Mellish & Graeme Ritchie. 1997. "Approximate Chart Generation from Non-Hierarchical Representations". Recent Advances in Natural Language Processing ed. by Ruslan Mitkov & Nicolas Nicolov, 273-294. Amsterdam & Philadelphia: John Benjamins. (This volume.)
Nogier, Jean-François. 1991. Génération automatique de langage et graphes conceptuels. Paris: Hermès.
& Michael Zock. 1992. "Lexical Choice by Pattern Matching". Knowledge Based Systems 5:3.200-212. (Also in Current Directions in Conceptual Structures Research ed. by T. Nagle, J. Nagle, L. Gerholz & P. Eklund, 413-435. Berlin & New York: Springer Verlag, 1992.)
Olson, David. 1970. "Language and Thought: Aspects of a Cognitive Theory of Semantics". Psychological Review 77:257-273.
Osgood, Charles. 1971. "Where Do Sentences Come from?". Semantics: An Interdisciplinary Reader in Philosophy, Linguistics, and Psychology ed. by D. Steinberg & L. Jakobovits, 497-529. Cambridge: Cambridge University Press.
. 1980. Lectures on Language Performance. New York: Springer Verlag.
Paivio, Allan. 1971. Imagery and Verbal Processes. New York: Holt, Rinehart & Winston.
Patten, Terry, Michael Geis & Barbara Becker. 1992. "Toward a Theory of Compilation for Natural Language Generation". Computational Intelligence 8:1.77-101.
Piaget, Jean. 1970. "Piaget's Theory". Carmichael's Manual of Child Psychology ed. by P. Mussen, vol. 1, 318-323. New York: Wiley.
Raphael, Bertram. 1968. "SIR: A Computer Program for Semantic Information Retrieval". Semantic Information Processing ed. by Marvin Minsky, 146-226. Cambridge, Mass.: MIT Press.
Rivers, Wilga. 1972. Speaking in Many Tongues: Essays in Foreign Language Teaching. Rowley: Newbury House.
Roberts, Paul. 1962. English Sentences. New York: Harcourt Brace & World.
Rösner, Dietmar. 1987. "The Automated News Agency: SEMTEX - A Text Generator for German". Natural Language Generation: New Results in Artificial Intelligence, Psychology and Linguistics ed. by Gerard Kempen, 133-148. Dordrecht: Martinus Nijhoff.
Rumelhart, David. 1975. "Notes on a Schema for Stories". Representation and Understanding: Studies in Cognitive Science ed. by Daniel G. Bobrow & A. M. Collins, 211-236. New York: Academic Press.
& A. Ortony. 1976. "The Representation of Knowledge in Memory". Schooling and the Acquisition of Knowledge ed. by R. C. Anderson, R. J. Spiro & W. E. Montague, 99-133. Hillsdale, N.J.: Erlbaum.
Schank, Roger & R. Abelson. 1977. Scripts, Plans, Goals and Understanding: An Inquiry into Human Knowledge Structures. Hillsdale, N.J.: Erlbaum.
Simmons, Robert & J. Slocum. 1972. "Generating English Discourse from Semantic Networks". Communications of the Association for Computing Machinery (CACM) 15:10.891-905.
Skinner, Burrhus. 1957. Verbal Behavior. New York: Appleton-Century-Crofts.
Sowa, John. 1984. Conceptual Structures: Information Processing in Mind and Machine. Reading, Mass.: Addison Wesley.
Stockwell, Robert P. 1977. Foundations of Syntactic Theory. Englewood Cliffs, N.J.: Prentice Hall.
Swartout, William. 1983. "XPLAIN: A System for Creating and Explaining Expert Consulting Systems". Artificial Intelligence 21:3.285-325.
Tannenbaum, Percy & F. Williams. 1968. "Generation of Active and Passive Sentences as a Function of Subject or Object Focus". Journal of Verbal Learning and Verbal Behavior 7:246-250.
Vennemann, Theo. 1975. "An Explanation of Drift". Word Order and Word Order Change ed. by C. Li, 269-305. Austin, Texas: University of Texas Press.
Weizenbaum, Joseph. 1966. "ELIZA. A Computer Program for the Study of Natural Language Communication between Man and Machine". Communications of the Association for Computing Machinery (CACM) 9:36-45.
Wilks, Yorick. 1975. "A Preferential Pattern-Seeking Semantics for Natural Language Inference". Artificial Intelligence 6:1.53-74.
Winograd, Terry. 1972. Understanding Natural Language. New York: Academic Press.
. 1975. "Frame Representation and the Declarative-Procedural Controversy". Representation and Understanding: Studies in Cognitive Science ed. by Daniel G. Bobrow & A. M. Collins, 185-210. New York: Academic Press.
Zock, Michael. 1988. "Natural Languages are Flexible Tools, That's What Makes Them Hard to Explain, to Learn and to Use". Advances in Natural Language Generation: An Interdisciplinary Perspective ed. by Michael Zock & Gerard Sabah, 181-196. London: Pinter.
. 1990. "If You Can't Open the Black Box, Open a Window! A Psycholinguistically-Motivated Architecture of a Natural Language Generation Component". Proceedings of COGNITIVA-90, 143-152. Madrid, Spain.
. 1994. "Language in Action, or, Learning a Language by Watching It Work". Proceedings of the 7th Twente Workshop on Language Technology: Computer-Assisted Language Learning, 101-111. Twente, The Netherlands.
. 1996. "The Power of Words in Message Planning". Proceedings of the 16th International Conference on Computational Linguistics (COLING-96), 990-995. Copenhagen, Denmark.
Zock, Michael, Gerard Sabah & C. Alviset. 1986. "From Structure to Process: Computer-assisted Teaching of Various Strategies for Generating Pronoun Constructions in French". Proceedings of the 11th International Conference on Computational Linguistics (COLING-86), 566-570. Bonn, Germany.
An Empirical Study on the Generation of Descriptions for Nominal Anaphors in Chinese

CHING-LONG YEH* & CHRIS MELLISH**
* Tatung Institute of Technology
** University of Edinburgh

Abstract

In this paper, we propose a preference rule for the generation of descriptions of nominal anaphors in Chinese. The rule emphasises using different forms of descriptions, full and reduced, to reflect the 'discourse structure' in the generated text. We performed experiments using first a simple rule and then the preference rule on a set of descriptive texts. A comparison of the results shows that the latter is more effective at accounting for shifts of intention. We finally show the implementation of the rule in our Chinese natural language generation system.
1 Introduction

In Chinese, anaphors can be classified as zero, pronominal and nominal forms, as exemplified in (1) by φ,1 tai (he) and nage reni (that person), respectively.

(1)
a. Zhangsani jinghuang de wang wai pao,
   Zhangsan frightened NOM to outside run
   Zhangsan was frightened and ran to outside.
b. φi zhuangdau yige renj,
   (he) bump-to a person
   (He) ran into a person.
c. tai kanqing le na renj de zhangxiang,
   he see-clear ASP that person GEN appearance
   He watched clearly that person's appearance.
d. φi renchu na renj shi shei.
   (he) recognise that person is who
   (He) recognised who that person is.

We have established a rule that includes a set of syntactic, semantic and discourse-oriented constraints to decide between the generation of zero,

1 We use φba to denote a 'zero anaphor', where the subscript a is the index of the zero anaphor itself and the superscript b is the index of the referent. A single φ without any script represents an intra-sentential zero anaphor. Also note that a superscript attached to a noun phrase is used to represent the index of the referent.
pronominal and nominal anaphors (Yeh & Mellish 1994; Yeh 1995:42-77). 'Nominal anaphors' do not have unique forms like their zero and pronominal counterparts. The description can be the same as the 'initial reference', parts of the information in the initial reference can be removed, new information can be added to the initial reference, or even a different lexical item can be used for a nominal anaphor. In this paper, we investigate the choice of appropriate descriptions for nominal anaphors in Chinese natural language generation. Previous related research in natural language generation (Dale & Haddock 1991; Dale 1992:187-193; Reiter & Dale 1992; Tutin & Kittredge 1992; Horacek 1995) focused on creating 'referring expressions' for entities to distinguish them from a set of objects that the reader is assumed to be attending to. These algorithms can efficiently create descriptions to identify the 'intended referent' unambiguously. The resulting descriptions, however, only reflect the attentional aspect of discourse (Grosz & Sidner 1986:177). In this paper, we attempt to investigate the role of descriptions for nominal anaphors in another aspect of discourse, namely, intention (Grosz & Sidner 1986:177). We propose a preference rule for choosing different descriptions for nominal anaphors to reflect shifts of intention in a discourse. To investigate the effectiveness of the rule, we performed two experiments on three sets of Chinese text as the test data. The experiments were carried out by comparing the nominal descriptions in the test data with the corresponding ones created by using a simple rule and then the preference rule, assuming the same semantic structures and context. The comparison of the results shows that the preference rule is effective.

2 Analysis of nominal anaphors in the test data
The surface structure of a Chinese nominal anaphor is a noun phrase which consists of a head noun optionally preceded by an associative phrase, articles, relative clauses and adjectives (Li & Thompson 1981:103-126). The nominal descriptions investigated in the remainder of this paper are thought of as noun phrases of the above scheme without articles. A nominal anaphor is referred to as a 'reduced form', or a 'reduction', of the initial reference if its head noun is the same as the initial reference, and its modification part is a strict subset of the optional part in the initial reference; otherwise, if it is identical to the initial reference, then it is a 'full description'. Observing the nominal anaphors occurring in the test data, we can classify nominal descriptions as below, with examples shown in Table 1.
     Initial reference           Nominal anaphor
A    zuchiu (football)           zuchiu (football)
B    tie-tong (iron barrel)      tie-tong (iron barrel)
C    tie-tong (iron barrel)      tong (barrel)
D    shuei (water)               yuan-wan-zhong de shuei (water in the round bowl)
E    qian (money)                neixie chaopiau (those notes)

Table 1: Examples of nominal anaphors

Data      A      B      C      D      E    Total
Set 1   147     38     33      6     10      234
        63%    16%    14%     3%     4%     100%
Set 2   248     35     39     25     13      360
        67%    10%    14%     6%     4%     100%
Set 3    46     12      0      0      0       58
        79%    21%     0%     0%     0%     100%

Table 2: Nominal anaphors in the test data
A. The initial reference is a bare noun, and the subsequent reference is the same as the initial reference.
B. The initial reference is reducible, and the subsequent reference is the same as the initial reference.
C. The initial reference is reducible, and the subsequent reference is a reduced form of the initial reference without new information.
D. The subsequent reference has new information in addition to the initial reference.
E. Otherwise.

The occurrence of the types of nominal anaphors in the test data, in terms of the above classification, is shown in Table 2.
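The full/reduced distinction defined above (same head noun, modifiers a strict subset of the initial reference's optional part) can be sketched directly. This is an illustrative sketch under an assumed representation of noun phrases as a head plus a set of modifiers; it is not the authors' implementation.

```python
# Sketch of the paper's definitions: a nominal anaphor is a 'full
# description' if identical to the initial reference, a 'reduced form'
# if it keeps the head noun and drops some (or all) of the modifiers.

def classify_description(initial, anaphor):
    """Return 'full', 'reduced' or 'other' for a nominal anaphor,
    given (head, modifiers) dictionaries for both noun phrases."""
    same_head = initial["head"] == anaphor["head"]
    init_mods = set(initial["modifiers"])
    ana_mods = set(anaphor["modifiers"])
    if same_head and ana_mods == init_mods:
        return "full"      # identical to the initial reference
    if same_head and ana_mods < init_mods:
        return "reduced"   # strict subset of the optional part
    return "other"         # new information or a different lexical item

# Types B and C from Table 1: tie-tong (iron barrel) vs. tong (barrel).
initial = {"head": "tong", "modifiers": ["tie"]}
full = classify_description(initial, {"head": "tong", "modifiers": ["tie"]})
reduced = classify_description(initial, {"head": "tong", "modifiers": []})
```

Here `full` evaluates to `"full"` and `reduced` to `"reduced"`; Types D and E from the classification above would both fall under `"other"`.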
3 A preference rule for nominal descriptions
The decision about what descriptions to use for initial references is a complicated process (Dale 1992:105-106). In this paper, we only consider 'subsequent reference'. Previous work on the generation of referring expressions focused on producing 'minimal distinguishing descriptions' (Dale & Haddock 1991; Dale 1992:186-195; Reiter & Dale 1992) or descriptions customised for different levels of hearers (Reiter 1990). Since we are not concerned with the generation of descriptions for different levels of users, we only look at the former group of work. The first group of work aims at generating descriptions for a subsequent reference to distinguish it from the set of entities with which it might be confused. The main data structure in these algorithms is a 'context set', which is the set of entities the hearer is currently assumed to be attending to, except the intended referent. Basically, their algorithms can be regarded as ruling out members of the context set. These algorithms pursue efficiency in producing an adequate description which can identify the intended referent unambiguously with a given context set. In his system (Dale 1992:173-175), Dale used the global focus space (Grosz & Sidner 1986:179-182) as the context set in his domain of small discourse. Following this idea, the context set grows as the discourse proceeds. Consider, for example, two nominal anaphors referring to the same entity occurring at different places in a discourse. According to the above algorithms, a single description would be produced for both anaphors if the context sets at both places have the same elements. On the other hand, in general, a description with more distinguishing information is used for the latter if more distractors have entered the context set. Grosz & Sidner (1986:177-178) claim that 'discourse segmentation' is an important factor, though obviously not the only one, governing the use of referring expressions. If the idea of the context set were restricted to the local focus space (Grosz & Sidner 1986:177-178), then the resulting descriptions would to some extent be sensitive to the local aspect of discourse structure. Although the algorithms would be refined by the introduction of discourse structure, they would essentially still serve the distinguishing purpose.
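The "ruling out members of the context set" idea can be rendered schematically. The sketch below is a generic rendering of this family of algorithms, not any single author's exact procedure; the data representation (entities as property sets) is assumed for the example.

```python
# Schematic distinguishing-description construction: add properties of
# the intended referent one at a time, keeping those that rule out at
# least one remaining member of the context set (the distractors).

def distinguishing_description(referent, context_set, properties):
    """Select properties of `referent`, in the given order, until no
    distractor in `context_set` matches all selected properties."""
    description = []
    distractors = list(context_set)
    for prop in properties:
        if not distractors:          # referent already unambiguous
            break
        survivors = [d for d in distractors if prop in d["props"]]
        if len(survivors) < len(distractors):  # prop rules someone out
            description.append(prop)
            distractors = survivors
    return description  # may remain ambiguous if properties run out

# Two bowls as in example (2): 'bowl' distinguishes nothing,
# 'round' rules out the square bowl.
round_bowl = {"name": "round-bowl", "props": {"bowl", "round"}}
square_bowl = {"name": "square-bowl", "props": {"bowl", "square"}}
desc = distinguishing_description(round_bowl, [square_bowl], ["bowl", "round"])
```

With the two-bowl context, `desc` comes out as `["round"]`: the shared property `bowl` is skipped because it rules out no distractor.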
The beginnings of 'discourse segments' in a sense indicate shifts of intentions in a discourse (Grosz & Sidner 1986:178-179). In this situation, subsequent references may be preferred to be full descriptions rather than reduced ones or pronouns, to emphasise the beginning of discourse segments, even if the referents have just been mentioned in the immediately previous clause. Some examples were used to illustrate this idea, for example, in (Grosz & Sidner 1986:180). A similar situation happens in Chinese discourse. First of all, let's have a look at a characteristic of discourse segment structure in Chinese written text. In Chinese written text, a sentential mark, '。', is normally inserted at the end of a 'sentence',2 which is

2 We use a quoted sentence to distinguish it from the usual sense of sentence in English. The sentential mark also has two auxiliaries, question and exclamation marks, which are used to express 'sentences' with certain tones.
a. fengzhengi φ fangdau gaukong shangqu yiho,
b. la fengzhengi de xianj zheme yeh la bu zhi,
c. φj zhongshi xiang xia wan,
d. zhe shi weisheme ne?
e. yuanlai, buguan fang fengzhengi de xianj you duome xi,
f. φj dou shi you zhongliang de,
g. xianj de zhongliang shi youyu diqiu dui xianj you xiyin de liliangl er chansheng de,
h. zheige liliangl hauxiang wuxing de shou,
i. φl ba xianj xiang xia zhuai,
j. xianj jiu la bu zhi le.
k. qishi, fengzhengi yeh you zhongliang,
l. yinwei fengm chui zhe fengzhengi,
m. φm shi fengzhengi xiang shang sheng,
n. shuoyi fengzhengi bingbu xiang xia chen.
o. zheyang, φ zai fang fengzhengi shi,
p. piau zai kongzhong de xianj xingcheng yige wanchu de huxing.
q. piau zai kongzhong de xianj yu chang,
r. xianj wanchu de yu lihai,
s. φj yu la bu zhi.

a. When flying a kitei in the sky,
b. the stringj pulling the kitei can't be pulled straight.
c. Itj is always bent downwards.
d. Why is that?
e. So, however thin the stringj pulling the kitei is,
f. (it)j all has weight.
g. The weight of the stringj is because of the attracting powerl of the earth on the stringj.
h. This powerl is like a transparent hand.
i. (It)l pulls the stringj down.
j. The stringj then can not be pulled straight.
k. However, the kitei also has weight.
l. Since the windm blows the kitei,
m. (it)m makes the kitei rise.
n. Therefore, the kitei does not fall down.
o. So when flying a kitei,
p. the stringj fluttering in the sky forms a curved arc.
q. The longer the stringj fluttering in the sky,
r. the more curved the stringj is,
s. and the more difficult (it)j is to pull straight.

Fig. 1: A sample Chinese text and its translation
Key: j.z: referent j in zero form. j.full: referent j in full noun phrase. j.other: referent j in another noun phrase. j.reduced: referent j in reduced noun phrase. — — : 'sentence' boundary.

Fig. 2: Occurrence of referent 'j' in the discourse in Fig. 1

a meaning-complete unit in a discourse, such as a to d, e to j, k to n, o to p and q to s in Fig. 1;3 on the other hand, commas are inserted between clauses within a 'sentence' as separators (Liu 1984:79-80). In our previous study (Yeh & Mellish 1994; Yeh 1995:50-54), we found that a 'sentence' to a large extent corresponds to a discourse segment. A Chinese discourse, say a paragraph of written text, therefore consists of a sequence of 'sentences', and the corresponding intentions altogether form the intention of the discourse.

3 This is obtained from a scientific question-answer book which is used as a set of test data in (Yeh & Mellish 1994; Yeh 1995:44-45).
Among the groups of initial and subsequent references in Fig. 1, we focus on the one indexed j, la fengzheng de xian (the string pulling the kite). After it is initially introduced in b, it appears in zero and nominal forms alternately in the rest of the discourse, as shown schematically in Fig. 2. At the beginning of the second 'sentence', it appears as a full description and then as four reduced descriptions in the rest of the 'sentence'. It is not mentioned in the third 'sentence'. When it is reintroduced in the fourth 'sentence', it is in another noun phrase, piau zai kongzhong de xian (the string fluttering in the sky), which is not reduced. Then, in the last 'sentence', it repeats the same patterns as in the second 'sentence'. Since there are no distracting elements for the string in the discourse, the use of full descriptions at the beginning of 'sentences', e and q, can be interpreted as emphasising that a new discourse segment, a 'sentence', has begun. The accompanying reduced descriptions can then be explained as being intended to contrast with the emphasis at the beginning of 'sentences'. Note that a full description is used for the subsequent reference in p, which is not at the beginning of a 'sentence', because it is the first mention in the 'sentence'. Thus, we would generalise the above interpretation as follows: a full description is preferred for a subsequent reference if it is at the beginning of a 'sentence' or the first mention in the 'sentence'; otherwise, a reduced description is preferred. In case distracting elements occur in a 'sentence', a sufficiently distinguishable description is required for a subsequent reference within the 'sentence' instead of a reduced one, even if it has been mentioned previously in the 'sentence'; for example, yuanwan (the round bowl) in (2d) and fangwan (the square bowl) in (2e).4

(2)
a. zhaolai tongyang daxiao de liangkuai tiepi,
   get same big-small NOM two iron-piece
   Get two pieces of iron of the same size.
b. zhuocheng yige yuanwani he yige fangwanj.
   make one round-bowl and one square-bowl
   Make a round and a square bowl.
c. ba yuanwani-li zhuangman le shuei,
   BA round-bowl-in fill-full ASP water
   Fill the round bowl full of water.
d. ranhou ba yuanwani-zhong de shuei manman daujin fangwanj-li,
   then BA round-bowl-in GEN water slowly fill-in square-bowl-in
   Then slowly pour the water in the round bowl into the square bowl.
4 This is also obtained from the same set of test data as (1).
e. ni huei faxian fangwanj zhuangbuxia zheixie shui,
   you will find square-bowl fill-not-in these water
   You will find that the square bowl cannot hold all the water.
f. youxie shui hui liu chulai.
   have-some water will flow out-come
   Some water will flow out.
According to the above observations, we propose the following preference rule for the generation of descriptions for nominal anaphors in Chinese:

If a nominal anaphor, n, is the first mention in a 'sentence', then a full description is preferred; otherwise, if n is within a 'sentence' and has been mentioned previously in the same 'sentence' without distracting elements, then a reduced description is preferred; otherwise a full description is preferred.
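The preference rule above can be transcribed almost directly. The sketch assumes a simple discourse representation (each mention records its referent, its 'sentence' number, and whether distracting elements are present); these field names are invented for illustration.

```python
# A transcription of the preference rule: full description at the first
# mention in a 'sentence'; reduced when already mentioned in the same
# 'sentence' with no distractors; full (distinguishing) otherwise.

def preferred_description(anaphor, previous_mentions):
    """Return 'full' or 'reduced' for a nominal anaphor."""
    earlier_in_sentence = [
        m for m in previous_mentions
        if m["referent"] == anaphor["referent"]
        and m["sentence"] == anaphor["sentence"]
    ]
    if not earlier_in_sentence:
        return "full"      # first mention in this 'sentence'
    if not anaphor.get("distractors"):
        return "reduced"   # already mentioned, no distracting elements
    return "full"          # distractors present: stay distinguishable

# The string 'j' of Fig. 1: full when a new 'sentence' begins,
# reduced later within the same 'sentence'.
history = [{"referent": "j", "sentence": 2}]
at_new_sentence = preferred_description({"referent": "j", "sentence": 3}, history)
within_sentence = preferred_description({"referent": "j", "sentence": 2}, history)
```

Here `at_new_sentence` is `"full"` and `within_sentence` is `"reduced"`, mirroring the full description at e and the reduced descriptions later in that 'sentence'.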
4 Experimental results
The experiment is described below.
• For each nominal anaphor we are concerned with in a set of test data written by humans, repeat the following steps.
  — A nominal description is generated by using a rule, assuming the same semantic structure and context.
  — The resulting description is then compared with the corresponding description in the text.
  — The comparison is a match if both sides are either full or reduced descriptions of the initial reference; otherwise it is a mismatch.
At the end of the experiment, the numbers of matches were collected to show the effect of the rule. Following our previous work (Yeh & Mellish 1994; Yeh 1995:44-45), we employed the same sets of text, Sets 1, 2 and 3, as the test data in this paper. Sets 1 and 2 consist of a number of scientific questions and answers for children, and the other is a brief introduction to modern Chinese grammar. Since the aim of this work is to refine its predecessors (Yeh & Mellish 1994; Yeh 1995:42-77), in the following we focus on nominal anaphors which were correctly matched by using the rule established in (Yeh 1995:66-67). The differences between the total nominal anaphors in Table 2 and those below are the unmatched cases under that rule. We started with a simple rule for the generation of nominal anaphor descriptions, as below.
Data    Matched     A     B     C     D     E   Total     %
Set 1   yes       137    35     0     0     0     172    79
        no          0     0    30     6     9      45    21
Set 2   yes       232    32     0     0     0     264    78
        no          0     0    37    25    11      73    22
Set 3   yes        46    12     0     0     0      58   100
        no          0     0     0     0     0       0     0

Table 3: Result of using the simple rule on the test data
Data    Matched     A     B     C     D     E   Total     %
Set 1   yes       137    28    26     0     0     191    88
        no          0     7     4     6     9      26    12
Set 2   yes       232    27    27     0     0     286    85
        no          0     6     9    25    11      51    16
Set 3   yes        46    12     0     0     0      58   100
        no          0     0     0     0     0       0     0
Table 4: Result of using the preference rule on the test data

Leave the description of the initial reference unchanged for nominal anaphors throughout the discourse.

In other words, according to this rule, only full descriptions of the initial references would be produced. The result of the experiment using this rule is shown in Table 3. The types A to E in this table, and in Table 4, are described in Sec. 2. The result summarises the fact that all of the nominal anaphors having full descriptions are correctly matched by using the simple rule, which amounts to 79, 78 and 100% of the nominal anaphors concerned. However, reduced descriptions and the other two types of descriptions, D and E, would not occur in the generated texts. We then repeated the previous experiment using the preference rule described previously and obtained the statistics shown in Table 4. As shown in the table, by using the new rule, in addition to the fact that the majority of the nominal anaphors using full descriptions are correctly matched, a considerable number of reduced descriptions are matched as well, giving overall matches of 88, 85 and 100%. If we only consider Types A, B and C, namely full and reduced descriptions in the test data, the match rates become 94% (190/202), 94% (284/301) and 100% (58/58). Both groups of figures show that the preference rule is promising for the choice of reduced descriptions for nominal anaphors.
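The percentages quoted for the two rules follow from the per-type cell counts; a minimal sketch recomputing them for Set 1 (counts transcribed from Tables 3 and 4):

```python
# Recompute overall match rates from per-type (A-E) counts of matched
# ('yes') and unmatched ('no') nominal anaphors.

def match_rate(yes_row, no_row):
    """Percentage of matched anaphors among all anaphors concerned."""
    matched = sum(yes_row)
    total = matched + sum(no_row)
    return round(100 * matched / total)

# Set 1, simple rule (Table 3): 172 matched of 217 concerned.
simple_set1 = match_rate([137, 35, 0, 0, 0], [0, 0, 30, 6, 9])
# Set 1, preference rule (Table 4): 191 matched of the same 217.
pref_set1 = match_rate([137, 28, 26, 0, 0], [0, 7, 4, 6, 9])
```

This reproduces the 79% and 88% figures for Set 1; the Set 2 and Set 3 columns can be checked the same way.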
CHING-LONG YEH & CHRIS MELLISH

5 Implementation
The preference rule is currently being implemented in the referring expression component of our Chinese natural language generation system (Yeh 1995:96-139), which generates paragraph-sized texts describing plants, animals, etc., growing in a national park. Basically, the main goal of our work is to generate coherent texts by taking advantage of the various forms of anaphors in Chinese. The system, like conventional ones (McKeown 1985:11-13; Maybury 1990:4-6; Dale 1992:12-13), is divided into strategic and tactical components. Since we do not aim at inventing new concepts in content planning, we borrow the idea of text planning in Maybury's TEXPLAN system (Maybury 1990:101-131) as the basis of the former component. As for the tactical component, we have constructed a simple Chinese grammar in the PATR formalism (Shieber 1986:24-35), which is sufficient for our purpose at the current stage. On accepting an input goal from the user, the system invokes the text planner, which uses the operators in the plan library to build up a plan, a hierarchical discourse structure satisfying the input goal. After text planning is finished, the decision on anaphoric forms and descriptions is carried out by traversing the plan tree. During the traversal, when a reference is met, if it is a subsequent one, then the program consults the rule developed in (Yeh & Mellish 1994; Yeh 1995:66-67) to obtain a form: zero, pronominal or nominal. If the nominal form is decided, then the preference rule in this paper is consulted to get a description. In the domain knowledge base, each entity, in addition to the information for the head noun in the surface form, is accompanied by a property list that will be realised in the modification part of the surface noun phrase for the initial reference.
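The two-stage decision made during the plan-tree traversal, first an anaphoric form and then, for nominal forms, a description, can be sketched as follows. This is a hedged Python illustration: the rule functions are stand-ins for the cited rules, whose actual conditions are not reproduced here, and all record fields are invented.

```python
# Sketch of the two-stage anaphor decision; the rule bodies are stand-ins,
# not the actual conditions of the cited rules.

def choose_form(reference, context):
    """Stand-in for the form rule (Yeh & Mellish 1994; Yeh 1995:66-67)."""
    return "nominal" if context["segment_beginning"] else "zero"

def choose_description(reference, context):
    """Stand-in for the preference rule of this paper."""
    return "full" if context["intention_shift"] else "reduced"

def realise(reference, context):
    if not reference["subsequent"]:
        return ("nominal", "full")      # initial references get full NPs
    form = choose_form(reference, context)
    if form == "nominal":
        return (form, choose_description(reference, context))
    return (form, None)                 # zero/pronominal: no description needed

ctx = {"segment_beginning": True, "intention_shift": False}
assert realise({"subsequent": True}, ctx) == ("nominal", "reduced")
assert realise({"subsequent": False}, ctx) == ("nominal", "full")
```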
For an initial reference, we build up its semantic structure by taking all the elements in the property list along with the substance of the entity, corresponding to the head noun in the surface noun phrase. To simplify the work, for the moment only one element is stored in the property list. When a full description is decided for a subsequent reference, its semantic structure contains the same property and substance information as the initial reference. On the other hand, if a reduced description is decided, only the substance is taken into the semantic structure. In the future, we will extend the property list by allowing multiple elements in the list. In the implementation, we found that the segmentation of a text plan tree into 'sentence' units is essential for a successful implementation of the
constraint of segment beginning in the rule for choosing an anaphoric form and the preference rule for nominal descriptions. Currently, we examine the decomposition field of a planning operator by hand to determine 'sentence' boundaries and fix this for all applications of the operator.

6 Conclusion
A rule for the generation of nominal descriptions based on an empirical study has been presented. The rule uses full and reduced descriptions to characterise shifts of intention in the generated discourse. The experimental results show that 88%, 85% and 100% of the nominal anaphors in the test data can be captured by using this rule. We have implemented the rule in our Chinese generation system and obtained promising results. In the future, we will have the rule more widely tested and will evaluate its performance.

REFERENCES

Chao, Yuan Ren. 1968. A Grammar of Spoken Chinese. Berkeley, Calif.: University of California Press.
Dale, Robert & Nicholas Haddock. 1991. "Content Determination in the Generation of Referring Expressions". Computational Intelligence 7:4.252-265.
Dale, Robert. 1992. Generating Referring Expressions: Constructing Descriptions in a Domain of Objects and Processes. Cambridge, Mass.: The MIT Press.
Grosz, Barbara J. & Candace L. Sidner. 1986. "Attention, Intentions, and the Structure of Discourse". Computational Linguistics 12:3.175-204.
Horacek, Helmut. 1995. "More on Generating Referring Expressions". Proceedings of the 5th European Workshop on Natural Language Generation, 43-58. Leiden, The Netherlands.
Li, Charles N. & Sandra A. Thompson. 1979. "Third-Person Pronouns and Zero-anaphora in Chinese Discourse". Syntax and Semantics: Discourse and Syntax ed. by T. Givon, vol. XII, 311-335. New York: Academic Press.
Li, Charles N. & Sandra A. Thompson. 1981. Mandarin Chinese: A Functional Reference Grammar. Berkeley, Calif.: University of California Press.
Liu, Yu-Cen. 1984. Zhuowen de Fang Fa (Approaches to Composition). Taipei, Taiwan: Xuesheng Chubanshe. [In Chinese.]
Maybury, Mark T. 1990. Planning Multisentential English Text Using Communicative Acts. Ph.D. dissertation, Cambridge University, Cambridge, U.K.
McKeown, Kathleen R. 1985. Text Generation. Cambridge: Cambridge University Press.
Reiter, Ehud. 1990. "Generating Descriptions that Exploit a User's Domain Knowledge". Current Research in Natural Language Generation ed. by Robert Dale, Chris Mellish & Michael Zock, 257-285. London: Academic Press.
Reiter, Ehud & Robert Dale. 1992. "A Fast Algorithm for the Generation of Referring Expressions". Proceedings of the 14th International Conference on Computational Linguistics (COLING-92), 232-238. Nantes, France.
Shieber, Stuart M. 1986. An Introduction to Unification-Based Approaches to Grammar. (= CSLI Lecture Notes, 4.) Stanford, Calif.: CSLI.
Tutin, Agnès & Richard Kittredge. 1992. "Lexical Choice in Context: Generating Procedural Texts". Proceedings of the 14th International Conference on Computational Linguistics (COLING-92), vol. II, 763-769. Nantes, France.
Yeh, Ching-Long. 1995. Generation of Anaphors in Chinese. Ph.D. dissertation, University of Edinburgh, Edinburgh, Scotland.
Yeh, Ching-Long & Chris Mellish. 1994. "An Empirical Study on the Generation of Zero Anaphors in Chinese". Proceedings of the 15th International Conference on Computational Linguistics, 732-736. Kyoto, Japan.
Generation of Multilingual Explanations from Conceptual Graphs

KALINA BONTCHEVA
Bulgarian Academy of Sciences

Abstract

This paper presents an approach for the generation of multilingual explanations from conceptual graphs in restricted domains with highly conventional language. The generator is developed within a Knowledge-Based Machine Aided Translation project, DB-MAT. The system's objective is to provide the translator with the necessary domain knowledge. The algorithms handle extended referents, complex type definitions and several explanation levels, and provide coherent multisentential text. Complex graphs are broken into simpler subgraphs, which are ordered according to a selected schema.

1 Introduction
The DB-MAT project (DB-MAT 1995) explores an approach in which the translator is supported by linguistic and domain knowledge, presented to him/her during the translation process (v. Hahn 1992; v. Hahn & Angelova forthcoming). Many translations are needed for technical documents in restricted domains, and the translator's familiarity with the terminology is of crucial importance for the quality of the resulting text. Therefore, an integrated MAT system was designed and implemented. The user accesses domain knowledge by highlighting a string in the source text and specifying the type of query. The system has two distinct layers — a language level and a conceptual level. The representation of domain knowledge is language-independent and is based on conceptual graphs (CGs) (Sowa 1984, 1992). A sample graph from the DB-MAT Knowledge Base (KB) in the domain of oil separation is given in Figure 1. The system identifies the relevant domain information and produces a NL explanation (clarification). In DB-MAT, explanation denotes the answer generated from the internal conceptual representation. The result may vary between a single sentence and a paragraph-long text. Thus the translator can find the appropriate verbal description if a term is missing in the target language, or even introduce a new term. This paper will address aspects of the generation process, including an outline of the strategic component (Query Mapper) and the main parts
of the tactical module (EGEN). Since the approach is mainly conceptually oriented, most of the generator layers are language-independent. All generation algorithms were designed after a careful analysis of technical texts in manuals, textbooks and encyclopaedias.

Fig. 1: Conceptual graph in graphical and linear notation

2 Conceptual graphs: A brief introduction
Conceptual graphs (CGs) are finite, bipartite graphs. The nodes are either concepts or conceptual relations. The two kinds of nodes are connected by directed arcs. Concepts can have arcs only to conceptual relations and vice versa. Each n-ary relation has n − 1 incoming arcs and one outgoing arc. Every concept consists of a type label and a referent field (see Figure 1). All concept types form a type hierarchy, which is a lattice. There are four canonical formation rules which are used for derivation: copy — make an exact copy of a graph; restrict — replace a concept type by a subtype or specialise the referent field; simplify — remove all duplicate relations; join — join two graphs on identical concepts. A new concept type is introduced by a type definition, which is a monadic lambda abstraction λa u, where u is a conceptual graph, called the differentia.

type POSITIVE(x) is [NUMBER: *x]->(>)->[NUMBER: 0].

If we have a graph u containing a concept a and a type definition for a, then we can define the operation minimal type expansion. It consists of joining the graph u with the differentia on the concept a. A new relation is introduced by a relation definition, which is an n-adic lambda abstraction on the relation's arguments.

relation AGNT(x,y) is [ACT: *x]->(LINK)->[AGENT]->(LINK)->[ANIMATE: *y].

The operation relation expansion replaces a conceptual relation and its attached concepts with the graph from the relation definition, by making the
necessary restrictions of the concepts. The projection operation operates on two graphs u and v and extracts a subgraph πv of u, called a projection of v in u. The properties of the projection operation are given in (Sowa 1984).

3 Our internal representation
Conceptual graphs are represented as tuples — a graph identifier and a relation list. The relation list contains triples — relation name, argument list and annotation field. The argument list contains concept/graph identifiers and is ordered according to the arc numbers, the last one being the outgoing arc. All concepts are organised in a concept table and have unique identifiers. The referent field is represented as a feature structure, and a unification algorithm is applied. DB-MAT supports extended referents (Sowa 1993): generic, individual marker, set, generic set, counted set, universal quantifier, singular and plural questions, measures and nested graphs. A full description of the PROLOG implementation of conceptual graphs and all related algorithms is given in (Petermann, Euler & Bontcheva 1995).
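As an illustration, the tuple encoding just described might look as follows. This is a Python sketch rather than the system's actual PROLOG implementation, and all identifiers are invented.

```python
# Hypothetical encoding of the graph tuples described above; the concept
# names and identifiers are invented, not taken from DB-MAT.

# Concept table: unique identifier -> (type label, referent feature structure).
concepts = {
    "c1": ("WASTE_WATER", {"ref": "generic"}),
    "c2": ("DISPERSION", {"ref": "generic"}),
}

# A graph is a tuple: (graph identifier, relation list).  Each relation is a
# triple: (relation name, argument list, annotation field).  The argument
# list is ordered by arc number, the last argument being the outgoing arc.
g1 = ("g1", [("CHAR", ["c1", "c2"], None)])

def outgoing(relation):
    """Concept reached via the relation's single outgoing arc."""
    name, args, annotation = relation
    return args[-1]

graph_id, relations = g1
assert outgoing(relations[0]) == "c2"
```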
4 Relevant system components
The domain knowledge (currently in oil separation) is encoded in a set of CGs called the Knowledge Base (KB). It consists of a type hierarchy (lattice), canonical graphs, and type and relation definitions. The algorithm for knowledge extraction applies (maximal) join, projection and other CG operations. The DB-MAT lexicon contains both general lexica (everyday words, function words, etc.) and terminology. All lexicon entries contain the usual information (part of speech, gender, inflection class, etc.). Every domain-specific entry has a link to the KB. The generator also obtains information about a verb's transitivity, a noun's (un)countability, etc. Additionally, EGEN extracts all terms corresponding to a given concept and gives them as synonyms. However, we keep track of the term which was originally highlighted by the user, and it is always this term which is used throughout the explanation. Separate German and Bulgarian morphological components have been implemented. They are used during the generation and query analysis phases. The overall DB-MAT architecture is discussed in more detail in (v. Hahn & Angelova 1994b).
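The lexicon-to-KB linkage and synonym extraction just described can be sketched as follows. The entries are invented for illustration (only the concept link and the highlighted-term bookkeeping come from the text), and Python stands in for the actual PROLOG lexicon.

```python
# Illustrative lexicon with KB links; entries and lemmas are invented.

lexicon = {
    "id5": {"lemma": "Wellplatte", "pos": "noun", "lang": "g", "kb": "COR_PLATE"},
    "id6": {"lemma": "Riffelplatte", "pos": "noun", "lang": "g", "kb": "COR_PLATE"},
}

def synonyms(concept, lang):
    """All terms of one language linked to a given KB concept type."""
    return [entry["lemma"] for entry in lexicon.values()
            if entry["kb"] == concept and entry["lang"] == lang]

# The originally highlighted term is remembered and used throughout
# the explanation, even though synonyms are available.
highlighted = "Wellplatte"
terms = synonyms("COR_PLATE", "g")
assert highlighted in terms
assert synonyms("COR_PLATE", "b") == []   # no Bulgarian term: lexical gap
```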
5 Multilingual generation of explanations

5.1 The main objectives — subject information, coherence and multilinguality
The main task is to provide the translator with the necessary domain knowledge, presented to him/her as a NL explanation in German/Bulgarian. At present several query (explanation) types are supported — definitions, related concepts, characteristics, examples, similarity and difference (in the last two cases the user is prompted for a second term). New query types can be added by the translator and will appear in the menu. The user can customise the Query Mapper strategy per query type, i.e., he/she can specify the relevant relations (v. Hahn & Angelova 1994b). After some experiments with a KB in oil separation, we found that we must often generate a multisentential text. As a result, algorithms providing a coherent output had to be designed and implemented. Due to the specific domain terminology and the established language conventions, the structure of domain-oriented texts can be captured by a set of predefined schemata. Therefore DB-MAT supports three schemata — one for identification (used for definitions in the corresponding LSP — Language for Special Purposes), one for similarity and one for difference. They are rather similar to those introduced in (McKeown 1985). The extracted graphs are ordered according to the selected schema, thus forming a well-structured explanation. For instance, the definition schema first introduces the supertype(s); then come all functions, attributes and parts. Analogies and examples are given last. Analogies appear only in the case of iterative explanations, when a discourse history is available and the new term can be compared to another one already introduced to the translator.
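The schema-driven ordering of extracted graphs can be sketched as follows. The slot names follow the definition schema described above, while the classification of graphs into slots is assumed to be available; this is an illustrative Python sketch, not the system's PROLOG code.

```python
# Sketch of ordering the knowledge pool by a schema; the slot labels follow
# the definition schema above, the slot assignments are invented examples.

DEFINITION_SCHEMA = ["supertype", "function", "attribute", "part",
                     "analogy", "example"]

def order_by_schema(graphs, schema=DEFINITION_SCHEMA):
    """Order extracted graphs according to the slots of the schema."""
    rank = {slot: i for i, slot in enumerate(schema)}
    return sorted(graphs, key=lambda g: rank[g["slot"]])

pool = [{"slot": "example", "graph": "g3"},
        {"slot": "supertype", "graph": "g1"},
        {"slot": "part", "graph": "g2"}]

ordered = [g["graph"] for g in order_by_schema(pool)]
assert ordered == ["g1", "g2", "g3"]   # supertype first, example last
```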
Most of our efforts were concentrated on the definition schema, for several (mainly user-oriented) reasons:
• if the translator is not familiar with a term, most frequently he/she asks about its definition;
• if there is a terminological gap in the target language, the translator will need a definition of the term in order to make a paraphrase or introduce a new term.
Another serious challenge faced by the generator was the difference between concepts in the KB and their NL utterance. Often concepts are expressed by compound terms, and one-word terms are represented by complex conceptual structures. Additionally, some concepts can be expressed in one language while there is no corresponding term in the other. The mapping between concepts and lexicon entries (i.e., existing language terms) is given by PROLOG terms specifying the German/Bulgarian "names" of the KB concept types. If there is no corresponding term (i.e., no available lexicon entry for that language), then the term is missing. For instance, for COR_PLATE we have lex_kb_g(id5, 'COR_PLATE'), where g stands for German (b for Bulgarian) and id5 is the unique identifier of a German lexicon entry, while the corresponding term with b is missing (since there is no such term in Bulgarian). If we want to find the Bulgarian utterance of a graph containing a concept of type COR_PLATE, we take the type definition of COR_PLATE and perform a minimal type expansion. This step is applied iteratively until all concepts can be mapped to legal lexicon entries. Since this operation may lead to over-generation in the case of very complex type definitions, the generator takes a simple subgraph containing only a few different relations and uses it instead of the complex one. A predefined precedence relation is used in such cases, the CHAR and ATTR relations being the most preferred. The latter is due to our studies, which showed that characteristics and attributes tend to be used when compound terms and phrases are formed.
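The iterative use of minimal type expansion to bridge lexical gaps can be sketched as follows. This is a simplified Python illustration: type definitions are reduced to lists of differentia concepts, and the lexicon contents are invented.

```python
# Simplified sketch of iterative minimal type expansion for lexical gaps;
# the type definition and lexicon below are invented for illustration.

lexicon_b = {"FRAGMENTATION", "ENVIRONMENT"}        # concepts with Bulgarian terms
type_defs = {"COR_PLATE": ["FRAGMENTATION", "ENVIRONMENT"]}  # differentia concepts

def expand_until_lexicalised(concept_list, lexicon, type_defs):
    """Replace concepts lacking a lexicon entry by their type definition,
    repeating until every concept can be mapped to a legal lexicon entry."""
    work = list(concept_list)
    result = []
    while work:
        concept = work.pop()
        if concept in lexicon:
            result.append(concept)
        elif concept in type_defs:
            work.extend(type_defs[concept])   # minimal type expansion
        else:
            result.append(concept)            # no definition: keep as is
    return result

out = expand_until_lexicalised(["COR_PLATE"], lexicon_b, type_defs)
assert set(out) == {"FRAGMENTATION", "ENVIRONMENT"}
```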
5.2 Query Mapper — the strategic component
Every query type (e.g., definition, characteristics) has a corresponding set of query graphs which define all conceptual relations relevant to the given query. The Query Mapper identifies the relevant knowledge pool by extracting the projections of the query graphs in all CGs from the KB. Depending on the detail level and query type, our algorithms also extract knowledge inherited from superconcepts. The first step of the Mapper's algorithm is the interpretation of the highlighted sequence. At that phase some typically multilingual problems should be resolved — the processing of terminological gaps and phrase explanations (Winschiers & Angelova 1993). Another possible case is partial term highlighting — the selected sequence has no independent domain meaning, but is a part of one or more complex terms. Then the Query Mapper displays a list of all relevant complex terms, thus enabling the translator to make a new choice. The two special queries, similarity and difference, also change the Mapper's overall strategy (Winschiers & Angelova 1993; v. Hahn & Angelova 1994a). The algorithms rely on the powerful CG inheritance by using the type hierarchy and the CG formation rules.
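The Query Mapper's identification of the relevant knowledge pool can be sketched as follows. The "projection" test here is a crude stand-in for the real CG projection operation, and the query graphs and KB contents are invented for illustration.

```python
# Crude sketch of knowledge-pool extraction; query graphs are reduced to
# sets of relevant relation names, and "projection" is approximated by a
# relation/concept membership test.  All data below is invented.

query_graphs = {"definition": {"SUPERTYPE", "CHAR", "ATTR"}}

kb = [
    {"id": "g1", "relations": [("CHAR", "WASTE_WATER", "DISPERSION")]},
    {"id": "g2", "relations": [("AGNT", "SEPARATE", "SEPARATOR")]},
]

def knowledge_pool(query_type, concept, kb):
    """Collect, per KB graph, the relations relevant to the query that
    mention the highlighted concept."""
    relevant = query_graphs[query_type]
    pool = []
    for graph in kb:
        hits = [rel for rel in graph["relations"]
                if rel[0] in relevant and concept in rel[1:]]
        if hits:
            pool.append({"id": graph["id"], "relations": hits})
    return pool

pool = knowledge_pool("definition", "DISPERSION", kb)
assert [g["id"] for g in pool] == ["g1"]
```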
5.3 EGEN — the tactical component
A very important asset of the CGs proved to be their non-hierarchical structure, since generation may start from any concept node. Therefore, the generator may select the subject and the main predicate from a linguistic perspective. Additionally, the encoded semantics is almost free from any language commitments. Hence the CGs are rather suitable for building language-independent Knowledge Bases (KBs), constructed for a particular level of domain familiarisation. However, different explanation levels can be maintained with the help of the well-defined operations.

5.3.1 Input
EGEN takes as input a list of CGs (the relevant knowledge pool, passed by the Query Mapper), a language, a focus list, a query type and an iterative-call flag. Language specifies the explanation's language. The focus list contains the highlighted concept(s), which should become the global focus of the generated explanation. Usually this list contains one concept, but in the case of a similarity/difference question there will be more items. Additionally, EGEN maintains some discourse information using the iterative-call flag. This flag is set when the user has highlighted a term in the explanation window, asking for further clarification. In this case EGEN should preserve all concepts introduced in the previous explanation on the discourse stack and use proper referring expressions.

5.3.2 Explanation levels
We have made certain steps towards modelling the user's domain knowledge. So far the translator is provided with the following explanation levels (i.e., levels of domain familiarisation) and is free to select one of them:
• Minimal — all complex terms are fully explained (for each complex term having a type definition in the KB, insert a generated explanation using this definition). For instance, if EGEN encounters the concept type SHELL SEPARATOR, then it also inserts an explanation about it at the place of its first occurrence in the context. The type definition is verbalised completely.
• Average — reduced complex-term explanations. Again the type definitions are used, but the definition graph is not processed completely. Only relations included in a user-defined set are verbalised.
• Expert — no additional explanations.
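The three levels can be sketched as a filter on the relations of a type definition. The relation sets below are invented examples of the user-defined set mentioned above, and Python stands in for the system's PROLOG.

```python
# Sketch of per-level relation filtering; the relation sets are invented
# examples, not DB-MAT's actual configuration.

ALL = None  # sentinel: verbalise every relation of the type definition

LEVELS = {
    "minimal": ALL,                # full explanation of complex terms
    "average": {"CHAR", "ATTR"},   # only a user-defined subset
    "expert": set(),               # no additional explanations
}

def relations_to_verbalise(level, definition_relations):
    allowed = LEVELS[level]
    if allowed is ALL:
        return list(definition_relations)
    return [r for r in definition_relations if r in allowed]

definition = ["SUPERTYPE", "CHAR", "ATTR", "PART"]
assert relations_to_verbalise("minimal", definition) == definition
assert relations_to_verbalise("average", definition) == ["CHAR", "ATTR"]
assert relations_to_verbalise("expert", definition) == []
```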
These ideas can be elaborated further to track all terms familiar to the user and use them in consecutive sessions. As a result, EGEN may introduce comparisons with familiar terms. Additionally, in the case of a terminological gap EGEN may apply the respective algorithm and provide the user with the opportunity to enter his paraphrase or newly created term into the lexicon. That information will be user-specific, and the next time EGEN has to express that term it will use the new entry (if the same user is working with the system). In this way DB-MAT will provide the translator with a convenient way to introduce his own terminology into the lexicon and use it consistently afterwards. This will prevent the user from introducing two or more different phrases denoting one and the same missing term (which would result in translation ambiguity).

5.3.3 Utterance forming
Since the KB was not designed under linguistic objectives, all CGs need some pre-processing before they can be verbalised. The system maintains a basic set of relations (like OBJ, AGNT, INST, etc.), which are used actively by the generation algorithm. All new relations are introduced by relation definitions, and the event concepts (all concepts having ACTION or EVENT as their superconcept) have corresponding case frames (canonical graphs that show the expected configuration of concepts and relations). EGEN distinguishes between type definitions and case frames, since the definitions carry the concept's domain semantics, while the case frame is used for purely NL purposes. The pre-processor checks each relation from the input CGs. If a relation is not a basic one, then relation expansion is performed and the resulting graph is checked against the corresponding case frame. The resulting CG is verbalised instead of the original one. In this way, all input CGs are transformed into CGs to which the generation algorithm can be applied. EGEN's design and implementation follow the guidelines given in section 5.4 of (Sowa 1984). We try mapping concepts onto nouns, verbs, adjectives or adverbs, and relations onto "function words" or syntactic elements. However, we have extended Sowa's algorithm to cover a wider range of conceptual graphs:
• Extended referents are handled consistently — e.g., measures, sets, disjunctive sets, etc.
• Relevant properties are grouped together — e.g., if there are several concepts linked by the ATTR relation to a common concept, then the attributes are ordered according to their "distance" in the type hierarchy. For instance, if there are several attributes specifying physical dimensions and other object characteristics, then all dimensional attributes will be grouped together. If, on the contrary, they remain mixed up, the resulting text will sound unnatural.
• Conjunction is introduced when there are two or more concepts linked by matching relations to a given concept — e.g., if there are two attributes of a concept, then they are verbalised in a conjunction.
• The grammar output is not a word sequence but a sentence tree, with the starting category S as root and the generated sentence as leaves. In this way some post-processing can be applied before the sentence is realised as a word sequence.
• EGEN has a rule for relative clauses — if a concept has more than one adjacent OBJ or AGNT relation, then a relative clause is generated (Zock 1997). The selection of the appropriate connecting word (which, who, where) depends on the concept's place in the type hierarchy. For instance, where is used for PLACEs and who for PERSONs.
The referents are processed by the grammar rules, and their value determines the number and the article, although these might be overridden by information from the lexicon. In principle, generic referents are verbalised as an indefinite article, unless the noun is uncountable; individuals as proper names; definite referents as a definite article; sets as an NP with all elements. Collective and disjunctive sets are distinguished, and stylistic rules decide whether the elements should become adjectives or be enumerated (see the example below); generic sets are verbalised as plural, indefinite; counted sets as a number and plural; and measures as a number and the respective unit (e.g., 5 mg/l).

[PHYSICAL STATE: disj {MEMBRANE, DROPS, COLLOID, EMULSION, SOLUTION}]
physikalische Eigenschaft: membranartig, tropfenförmig, kolloid, emulsionsartig oder gelöst (oder is generated to convey the disjunctive sense).

5.4 Sample output
The given example (see Figure 2) results from a user's request for a definition of Dispersion in German. The respective type definition is extracted together with another relevant graph. After the definition schema is applied, the graphs are ordered as shown. The definite article is used for Dispersion in the second sentence, since the concept is already present in the current context.
typedef DISPERSION(x) is
  [FRAGMENTATION]-
    (IN)->[ENVIRONMENT]
    (OF)->[DISPERSION PHASE]->(ATTR)->[HARD].
[DISPERSION]<-(CHAR)<-[WASTE WATER].

Output:
Eine Dispersion gehört zu dem Zerteilungsgrad (the supertype) einer festen dispersen Phase in einem Dispersionsmittel (type definition of DISPERSION). Die Dispersion ist ein Kennzeichen von Abwasser (characteristics).

Fig. 2: Sample knowledge pool and final German output with comments
6 Implementation
The current demo version of the generator is implemented in LPA PROLOG for Macintosh. There is a running prototype of the system in which explanations are generated for the basic terminology of oil separation.

7 Conclusion
This paper has presented our approach to the generation of multilingual explanations from CGs. The described algorithms can be applied only in restricted domains, where terms and expressions denote existing objects or phenomena, i.e., where all domain knowledge is language-independent. Another serious limitation is the predefined schemata, which are dictated by the highly conventional technical language but are not applicable to other domains. EGEN handles arbitrarily complex CGs, including propositions, statements and situations. In the future EGEN will be extended to cope with coreference links and negation. Our method can also be extended to account better for the user's domain knowledge, text coherence and discourse structures.
Acknowledgements. I am particularly obliged to Dr. Angelova, Prof. v. Hahn and all those people involved in the development of DB-MAT, without whom this work would not have been possible. Information about the DB-MAT system is available on the World Wide Web at http://www.informatik.uni-hamburg.de/Arbeitsbereiche/NATS/projects/db-mat.html.

REFERENCES

v. Hahn, Walther. 1992. "Innovative Concepts for Machine Aided Translation". Proceedings of VAKKI, 13-25. Vaasa, Finland.
v. Hahn, Walther & Galja Angelova. Forthcoming. "Knowledge Based MAT". To appear in Computers and AI, Bratislava.
v. Hahn, Walther & Galja Angelova. 1994a. "Providing Factual Information in MAT". Proceedings of the Int. Conf. Machine Translation: Ten Years On, 11-1 to 11-6. Cranfield, U.K.
v. Hahn, Walther & Galja Angelova. 1994b. "System Architecture and Some System-Specific Components in Knowledge Based MAT". Technical Report, Project DB-MAT, Rep. 1. Hamburg: Hamburg University.
McKeown, Kathleen. 1985. Text Generation: Using Discourse Strategies and Focus Constraints to Generate Natural Language Text. Cambridge: Cambridge University Press.
Petermann, Heike, Lutz Euler & Kalina Bontcheva. 1995. "CGPro — a Prolog Implementation of Conceptual Graphs". Memo, FBI-HH-M-251/95. Hamburg: Hamburg University.
Sowa, John. 1984. Conceptual Structures: Information Processing in Mind and Machine. Reading, Mass.: Addison-Wesley.
Sowa, John. 1992. "Conceptual Graphs Summary". Conceptual Structures: Current Research and Practice ed. by T. Nagle et al., 3-51. New York: Ellis Horwood.
Sowa, John. 1993. "Relating Diagrams to Logic". Conceptual Graphs for Knowledge Representation (Lecture Notes in AI 699) ed. by Guy Mineau, 1-35. Berlin: Springer Verlag.
Winschiers, Heike & Galja Angelova. 1993. "Solving Translation Problems of Terms and Collocations Using a Knowledge Base". Technical Report, Project DB-MAT, Rep. 3. Hamburg: Hamburg University.
Zock, Michael. 1997. "Sentence Generation by Pattern Matching: The Problem of Syntactic Choice". Recent Advances in Natural Language Processing ed. by Ruslan Mitkov & Nicolas Nicolov, 317-352. Amsterdam & Philadelphia: John Benjamins. (This volume.)
V CORPUS PROCESSING AND APPLICATIONS
Machine Translation: Productivity and Conventionality of Language

JUN'ICHI TSUJII
UMIST

Abstract

Linguistics-based machine translation (LBMT) has been the dominant framework in MT research since the beginning of the eighties. However, I argue that several assumptions on which research in LBMT has been based do not hold in the translation of actual texts. In particular, I discuss why the notions of possible translation and compositionality of translation, both of which have their roots in monolingual studies of syntax and semantics, have been wrongly promoted by theoretical linguists, and how these notions have (mis)led researchers in MT in wrong directions. I then discuss how we should proceed in the future and what types of research should be pursued. In conclusion, I illustrate what an ideal architecture for MT systems should look like.

1 Introduction
There had been strong interest and high expectations in Machine Translation throughout the 80s in the research community, in commercial industry and among potential users of MT systems. However, the interest and high expectations seem to be waning somewhat in the 90s. This is partly because quite a few commercial products have been brought onto the market and MT systems have become familiar to the general public as well as to these communities. It is also because people's expectations have become more modest and reasonable. People are beginning to realise now that MT systems are not very special but are simply ordinary information processing tools. It is generally a good thing that people have a clear picture and reasonable expectations of this new technology. However, it is also the case that the current MT technology does not meet the initial expectations that people had, and that the current MT systems do not cover the demands of a potentially very large translation market. There is also acute frustration about the fact that theoretical research in MT has not contributed to the development of MT systems at all. Though it generally takes some time for the results of research to be reflected in commercial products, it is, nonetheless, frustrating.
In this paper, I will discuss what went wrong with theoretical research in MT and what is lacking in the current MT technology to bridge the gap between research and development.

2 Disappointment
While there are many successful applications of MT systems, there is also disappointment about the current state of MT systems among people who have either invested in the field or been involved in research and development. Investors are disappointed because the market for MT systems as they are now is much smaller than they thought. Users are disappointed because the quality of translation produced by systems does not meet their standards. Researchers are disappointed because their methodologies failed to deliver what they thought they could. Part of the disappointment is due to the fact that their expectations at the beginning of the 80s were unrealistically high. However, even those who claim that current MT systems are quite successful (actually, I am one of them) may admit that the success of the current MT is a fairly restricted one. In order to widen the range in which MT systems can be used, and in order not to repeat the same mistakes in the future, we have to learn lessons from past experience, in particular the experiences of the last 10 years. The following are the lessons I think we can draw from past experience.
1. Linguistics in a narrow sense is not as useful as we expected.
2. MT systems as we conceived of them at the beginning of the 80s do not meet actual market demands.
3. There is no such thing as a universal MT system.
These lessons may sound all too familiar to those who were engaged in MT in the 50s and 60s (unfortunately, I was not). The first one, for example, simply says that linguistics alone is not able to solve most of the problems MT systems encounter. As Bar-Hillel claimed a long time ago, translation requires understanding, which in turn requires real-world knowledge. However, I do not claim that "therefore, in order to improve the quality of MT, we have to integrate understanding, or processing based on real-world knowledge, with MT".
This has been claimed since a long time ago and serious attempts have been made during the 80s, which were equally unsatisfactory. We cannot be so naive now.
MACHINE TRANSLATION
379
I start with two myths in theoretical research on MT, which have influenced the way of thinking of the whole research community and which I believe have led the research field in a wrong direction.

3 Myth-1: Compositionality of translation
Language is infinite. The infiniteness of language is the main cause of difficulties in NLP applications, including MT. The linguists who have been involved in MT since the beginning of the 80s have emphasised the importance of coping with the infinite nature of language. The solution they propose is compositionality of translation. Like compositionality of meaning in mono-lingual theories, it associates translation with linguistic structures of some sort. That is, translations (or meanings) of complex expressions are determined by their parts, and the relationships between the complex expressions and their parts are, for example, determined by their phrase structures. Though most MT systems use more abstract levels of representation than phrase structures, the basic scheme remains the same. The strict form of compositionality of translation seems to be based on the following two assumptions.

[ASP 1] Translation equivalence by identity of meaning: Assuming the existence of meanings which are independent of context, translation-equivalent expressions in different languages have the same meanings.

[ASP 2] Independent status of structure equivalence: Assuming that a complex expression in one language can be decomposed into its sub-expressions with a constructor (constructors can be syntactic, semantic, etc.: in syntax, individual phrase structure rules or grammatical functions such as SUBJ and OBJ; in semantics, deep cases or thematic relations such as Agent), the translation can be constructed from the translation equivalences of the sub-expressions, by using the constructor of the target language which is translationally equivalent to the constructor of the source. Like Montague's semantic theory, they assume that translation equivalence of constructors in two languages can be established regardless of the sub-expressions which are combined by them.

Most transfer-based MT systems assume, to varying degrees, that [ASP 2] is the case. The transfer phase descends down the structural
380
JUN'ICHI TSUJII
description of a source sentence from top to bottom; at each level, it decomposes a complex expression into its sub-expressions and constructor. Then it ascends from bottom to top to construct a target sentence; at each level, it composes a complex expression by using a translation-equivalent constructor. If [ASP 2] is really the case, the transfer phase is a simple recursive process as described above. However, developers of MT systems which are being used for actual translation know, through their experience, that the transfer phase cannot be as neat as described above. There are many cases where the independent status of constructor equivalence is challenged and the equivalences of constructors are affected by the sub-expressions to be combined. Such cases are abundantly observed in terminological expressions, lexical gaps, idioms, pseudo-idioms, speech patterns (Alshawi 1991), etc. While we know that the naming or labelling of real-world entities by words is arbitrary and simply the convention of individual languages, conventionality of language use is much more pervasive than we thought. In other words, conventionality permeates the other aspect of language use, i.e., productivity, which compositionality of translation emphasises. These conventionalised expressions cause difficulties for compositional theories in general, but they are more serious in translation, because two languages have their own different conventions. There are basically two alternative ways of treating the problems. One is to admit the empirical facts and demote the status of constructors. MT systems based on lexicon-oriented views, perhaps inadvertently, took this path. They use mono-lingual constructors only as descriptors to define a translation-equivalent pair which contains a specific word or words. In their frameworks, there is no such thing as constructor equivalence (or structure equivalence).
As a result, structure transfer is performed as part of lexical transfer. The other alternative is to ignore the empirical facts and push [ASP 2] to the extreme. If [ASP 2] is the case, one of its logical consequences is the possibility of discovering a set of universal constructors. While establishing a set of universal lexical items is hard simply due to the sheer size of the vocabulary, the number of constructors seems fairly small, whether they are syntactic, semantic or pragmatically motivated ones. EUROTRA seems to have taken this line of reasoning and reached the idea of simple transfer. As in lexicon-oriented MT systems, structure transfer is eradicated, but for a very different reason. In this framework, the status of constructor
equivalence obtains supreme independence and is actually represented as universal constructors. Though lexical transfer can change the structure, this is treated as an exception. ROSETTA also maintained [ASP 2] and tried to co-ordinate the constructors of two languages (Appelo 1987; Landsbergen 1989). The result seems to be a proliferation of constructors which cannot be justified mono-lingually. The results of the two attempts, EUROTRA and ROSETTA, show that [ASP 2] is empirically wrong and that frameworks based on this assumption do not work.

[ASP 1] is explicit in a naive interlingual approach, in the sense that the meaning which guarantees translation equivalence of expressions in various languages is explicitly represented at the level of the interlingua. However, all sentence-based MT systems implicitly share this assumption. As the ROSETTA group rightly claims, the meaning which guarantees the translation equivalence need not be explicitly represented, but when one defines two expressions (structures and/or words) as translation equivalents, one assumes that the meanings of the two are the same. More precisely, when compositionality of translation decomposes translation equivalents of complex expressions into translation equivalents of sub-expressions, it implicitly assumes that the meanings of these expressions (the sub-expressions as well as the complex expressions) are the same as those of their corresponding target expressions, and that the meanings which matter can be established independently of the contexts. Thus, translation equivalence of a larger expression can be reduced to a collection of translation equivalences of smaller expressions, the equivalences of which are established regardless of the larger units of expressions. However, [ASP 1] is very doubtful from an empirical point of view.
It is normal rather than exceptional in human translation that extra phrases or words are added, or that phrases or words in the source disappear in the target. Obviously, in human translation, two translation-equivalent sentences do not have the same meanings established independently of context. More seriously, as we will see in Section 5, the context-independent meaning postulated in [ASP 1] is not so crucial in translation; rather, context-dependent interpretation plays the decisive role. As we discuss later, this context-dependent interpretation, together with the conventionality of language use, affects translation in subtle ways and makes compositionality of translation an irrelevant straitjacket.
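The simple recursive transfer that [ASP 2] licenses, descending a source structure and recombining translated parts with an equivalent target constructor, can be sketched as below. This is not the paper's own implementation, only an illustration of the scheme it criticises; all names and the tiny equivalence tables are hypothetical.

```python
# Sketch of recursive transfer under [ASP 2]: each source constructor maps
# to exactly one target constructor, and each word is translated in
# isolation. The tables and rule names are illustrative only.
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str                       # constructor (rule name) or word
    children: list = field(default_factory=list)

# Constructor equivalences assumed to hold regardless of sub-expressions.
# The target label "VP->NP V" records that generation would order the
# parts verb-finally; the sketch keeps parts in source order.
CONSTRUCTOR_EQUIV = {"S->NP VP": "S->NP VP", "VP->V NP": "VP->NP V"}
LEXICAL_EQUIV = {"I": "watashi", "see": "miru", "GP": "isha"}

def transfer(node: Node) -> Node:
    if not node.children:                         # leaf: lexical transfer
        return Node(LEXICAL_EQUIV[node.label])
    # descend: transfer each sub-expression independently ...
    subparts = [transfer(child) for child in node.children]
    # ... ascend: recombine with the equivalent target constructor
    return Node(CONSTRUCTOR_EQUIV[node.label], subparts)

src = Node("VP->V NP", [Node("see"), Node("GP")])
tgt = transfer(src)
print(tgt.label, [c.label for c in tgt.children])
# VP->NP V ['miru', 'isha']
```

The critique above is precisely that real transfer refuses to stay this clean: the constructor mapping is affected by the words being combined.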
4 Myth-2: Possible translation
Since the early 80s, research in MT has been getting more and more similar to research in theoretical linguistics of a certain type. Both rely heavily on human intuition. As linguists (of a certain type) do, researchers in MT tend to ignore phenomena occurring in real translation by human translators, pick artificial examples they produce themselves, and thus put disproportionate emphasis on certain specific problems. Though it is problematic in some cases, grammatical judgement by intuition works and has played a major role in mono-lingual research on syntax. However, the same methodology has not worked so well for MT research. To judge the correctness of a translation, and thus define translation equivalence, for a single sentence without considering context has turned out to be much more problematic. The same sentence can and should be translated differently, depending on the context in which it appears. (I may sound as if I am emphasising the context dependency of translation, but this is not my intention in this paper; see Section 5.) Because it is generally difficult to circumscribe the context in which a sentence appears, the judgement becomes subjective, i.e., the judgement depends on the context which a reader of that sentence happens to come up with. Researchers in the linguistics-based MT paradigm tried to dissociate translation from context by introducing the distinction between possible translation and good or correct translation, a distinction which reminds us of the distinction between competence and performance. They argue that, given a context, only a subset of the possible translations are correct ones, and that one has to concentrate on possible translation in theoretical research. However, this distinction seems more problematic and fragile than the distinction between competence and performance. Firstly, as S. Nirenburg rightly pointed out, it avoids the problems related to ambiguity, which pose real difficulties in actual MT systems. More seriously, unlike grammatical judgement (in which native speakers have to say only yes or no), one actually has to generate all instances of possible translations of a given expression in every conceivable context. Without serious empirical investigation, it is very difficult, if not impossible, to generate all possible translations for a given sentence in a contextual vacuum. As a result, while ideally a set of possible translations has to be determined independently of a theory or a particular system, those who are engaged in MT development or MT theory, not translators, determine such a set by themselves. The consequence is that the definition of possible
translation becomes a theory-internal concept. In short, a set of possible translations is defined as a set of translations which a given system (or theory) produces but from which the system (or theory) cannot choose the correct ones (due to the lack of context, etc.). A set of possible translations as such has nothing to do with the set of translations which actually appear as translations in real texts (see Section 5). In short, the concept of possible translation provides a convenient excuse for researchers to play with toys, and contributes to cutting theoretical MT research off from its empirical basis. It has led the whole research field in a wrong direction. It is also obvious that, due to this excuse, researchers have been able to ignore the obvious fault of [ASP 1]. On the other hand, [ASP 1] gives the illusion that translation problems can be discussed without referring to context (because translation-equivalent relationships can be defined in terms of context-independent meanings), and reinforces the myth of possible translation.

5 Examples: Metonymic nature of language and translation
Let us look at several examples to illustrate the points I have made.

[Fact 1] Kitamura & Matsumoto (1995) reported that only 236 Japanese-English word pairs are registered in one of the most comprehensive bilingual dictionaries for human use, out of 948 word pairs which their alignment program discovered from real texts (24%).

[Fact 2] We examined the manual of UNIX and found that, among the 15 Japanese equivalents for the English verb to match listed in an English-Japanese dictionary, only two Japanese equivalents (taiousuru and icchisuru) appear as translations for 125 occurrences of the word.

These facts show that even the possible translations given by lexicographers, who are more empirical than linguists or computer engineers, do not reflect the actual translations produced by translators. To see why discrepancies like [Fact 1] arise, let us consider the following simple example.

[Example 1] (by Jiping Sun, UMIST)
English: I will go to see my GP tomorrow.
Japanese: Watashi(-I)-wa Asu(-tomorrow) Isha(-GP)-ni Mite(-check) Morau(-beneficiary causative).
Literal translation of the Japanese: I will ask my GP to check me tomorrow (and I will benefit from the action).

While a compositional translation of the English into Japanese, such as Watashi(-I)-wa Asu(-tomorrow) Isha(-GP)-ni ai(-meet)-ni iku(-go), is possible, this translation implies that the speaker will meet his/her GP to discuss something unusual (like a mis-diagnosis, fees, etc.). What is happening in this example is that, although the two languages, English and Japanese, describe the same situation (an aggregation of actions), they verbalise different actions in the aggregation. Which aspect of a complex reality a language verbalises is somewhat fixed, and when one does not follow the convention, additional meanings are conveyed. The process of human translation of the above example is roughly described, using the KBMT framework and a Schankian type of representation, as follows.

[Step-1: Understanding] The first phase is to understand what situation is described. The result of understanding would be, though naively, represented by something like: AGGR-1 {[the speaker GOES to some place like a hospital], [s/he and her/his GP MEET], [the GP CHECKS him/her], etc.}. This step uses knowledge about conventions in English, namely that the expression "go to see one's GP" is used to describe a situation which can be described by AGGR-1.

[Step-2: Paraphrase] This phase is to choose which part of AGGR-1 is to be verbalised in Japanese, following the conventions of Japanese. That is, a human translator knows that in Japanese a situation like AGGR-1 is described by verbalising one part of it, i.e., [the GP CHECKS the speaker], and using a beneficiary causative to express the speaker's initiative (in English, this part is expressed implicitly by [I go] and [I see my GP]).

The actual process would be more complicated.
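A deliberately naive sketch of the two steps: a lookup from a conventional English expression to the situation it describes (AGGR-1), then a lookup from that situation to the part of it that Japanese conventionally verbalises. All table entries and names here are illustrative, not taken from any actual KBMT system.

```python
# [Step-1: Understanding] conventional expression -> described situation
UNDERSTAND = {
    "go to see one's GP": ("AGGR-1", ["speaker GOES to surgery",
                                      "speaker and GP MEET",
                                      "GP CHECKS speaker"]),
}

# [Step-2: Paraphrase] situation -> the action Japanese verbalises,
# plus the construction that expresses the speaker's initiative
PARAPHRASE = {
    "AGGR-1": ("GP CHECKS speaker",
               "beneficiary causative: isha-ni mite morau"),
}

def translate(expression: str) -> str:
    aggr, actions = UNDERSTAND[expression]      # Step 1: understanding
    verbalised, rendering = PARAPHRASE[aggr]    # Step 2: paraphrase
    assert verbalised in actions                # verbalise part of AGGR-1
    return rendering

print(translate("go to see one's GP"))
# -> beneficiary causative: isha-ni mite morau
```

The static tables are exactly what makes this sketch naive: as the discussion below notes, both steps are in reality dynamic and context-sensitive.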
The understanding step in human translation is a more dynamic and flexible process which associates the compositional meaning of to go to see one's GP with a typical situation (like AGGR-1) of someone visiting his/her GP. This interpretation process certainly uses general knowledge and context, either by inference or by association. Because of the dynamic nature of the interpretation phase,
if the context indicates that an unusual incident happened between the speaker and his/her GP, a human recognises it, and the same compositional meaning of the sentence would be linked with another aggregation. Because the GP may not check him/her medically in such circumstances, the Japanese translation would be different. The paraphrasing phase would be equally dynamic. That is, if the context suggests that the speaker's action of going is crucial, then the second phase has to choose a different construction, in which go is verbalised as the main verb and ask the GP to check him/her is realised as a subordinate clause. Though it may sound trivial, [Example 1] illustrates how human translation is performed through understanding of what is actually described. However, it is not my intention to emphasise the context-dependent (or dynamic) nature of human interpretation or paraphrasing. My point here is the interaction of the metonymic nature of language and the conventionality of language use, which makes compositionality of translation irrelevant to actual translation and which is revealed even in the translation example given in [Example 1]. That is, a sentence in one language describes metonymically part of the complex reality which it intends to describe, and which part of the same reality is explicitly expressed depends on the individual language. It is obvious that understanding results such as AGGR-1 are completely different from the meanings intended in [ASP 1], which are established independently of context and which can be computed from the context-independent meanings of individual words like go, see and GP. As a result, we have rather strange translation pairs, strange from the compositional view of translation, such as the pair [X sees Y] and [Y checks X], even in a normal situation. Such a correspondence can hardly be imagined when one tries to enumerate the possible translations of to see.
(Though this example is somewhat similar to the well-known example of miss and manquer in English and French, and I feel there is some continuum, this correspondence is very specific, unlike (miss, manquer): if doctor is replaced by my lawyer, then we have to use discuss or consult instead of check, and the causative construction is no longer appropriate.)

However, as [Fact 1] indicates, it seems that such ad hoc correspondences are the norm rather than the exception. In reality, it does not make any sense to discuss correspondences in terms of the linguistic structures of two sentences, because the two sentences describe different parts of the reality and
their compositional meanings are consequently very different. The basic assumption of compositional translation does not hold. The example given by M. Kay, validate a ticket and invalidate a ticket in French and German, illustrates a similar point: different conventions of verbalisation shared by two speech communities result in un-conceivable translation pairs like ([X validate Y], [X invalidate Y]). The same state of affairs is expressed metonymically in Japanese by focussing on a particular action, like punch a ticket, and thus we end up with the equally un-conceivable pair ([X validate Y], [X punch Y]), which no one expects to find in a bilingual dictionary.

So far, we have discussed rather general examples. [Fact 2], however, indicates a different aspect of the problems of possible translation. That is, given the fixed context of Unix (I use the term context in a broader sense, which includes the communicative environment where a text is prepared), the set of possible translations which lexicographers enumerate without context is simply too large, and thus makes the problem of disambiguation unnecessarily difficult. Furthermore, if a specific context such as Unix manuals is fixed, we have many more conventions to which texts have to conform. Here again, different languages follow different conventions.

[Example 2] Maruyama (1992) observed that drastic structural changes are often required in a production manual for mechanical devices, such as the following.
Japanese: Buhin(-parts)-no Iro(-colour)-ha Hyo(-Table)-2-ni yoru-mono-to suru.
English: See Table-2 for the colours of the parts.

While the Japanese sentence literally means As for the colours of the parts, one is supposed to follow Table 2, there are no expressions corresponding to follow and be supposed to in the English translation. On the other hand, see appears in English. This is because the manuals in the two languages follow different conventions to express the same information, i.e., that the colours of the parts are listed in Table 2. Maruyama said that there are many such significant structural changes in the manual he examined. Again, it seems ridiculous to claim that the two sentences have the same compositional meanings.
6 Conceptual design of a simple MT system
In the previous section, I claimed that human translation is based on understanding of what is described or what is intended. However, this does not imply, for example, that an MT system, as an engineering system, has to represent understanding results such as AGGR-1 explicitly and simulate the human process of interpretation and paraphrasing. First of all, though I used an informal Schankian script to represent understanding results, it is not at all clear how we can actually represent them in a computationally sound way. As Kay's example illustrates, the symbolism at this level is not trivial at all (how can we represent a ticket is valid or a ticket is not valid without referring to the whole social system associated with tickets?). Secondly, even if one could represent them, how can one relate them to the target-oriented paraphrasing? The target language may have some general principles that determine which part of a complex reality should be expressed explicitly. Unfortunately, we know almost nothing about this process. Unless the process which manipulates objects like AGGR-1, and which can dynamically change the interpretation and the paraphrase, is realised computationally, representing them explicitly does not contribute to MT. Thirdly, there are many cases, like [Example 2] and Kay's example, where the conventions about how to verbalise are really specific to individual events or to the information to be described. In other words, they have characteristics similar to terminological expressions, and we have to treat them in the same way as we treat terminological expressions. Neither the structure of expressions nor the internal structure of understanding results is crucial for translation.

Let us assume here the extreme: that conventionality, not productivity, of language plays the dominant role in translation, and that the role of translation is to transfer the conventions of one language to the corresponding conventions of another language.
And assume that almost all linguistic expressions have characteristics similar to terminological expressions. In other words, complex expressions in one language are related to expressions in another language regardless of their internal linguistic structures. As correspondences between terminological expressions (terms) are often expressed through language-independent concepts, let us use objects like AGGR-1 as links, without analysing their internal structures. Then the correspondences in the two examples would be represented as follows:
ENG: [e(X) go to see e(X's) GP] ↔ AGGR-1[X] ↔ JPN: [j(X)-wa j(X)-no isha-ni mite morau]

ENG: [See e(X) for e(Y)] ↔ AGGR-2[X, Y] ↔ JPN: [j(Y)-ha j(X)-ni yoru mono-to suru]

I use an informal notation in the above, such as e(X) and j(X), which mean the translation of X in English and Japanese, respectively. X is a variable in AGGR-1. Unlike pure terminological terms, the expressions to be related in the above examples contain variables like X and Y. In order to make the first correspondence more general, one can introduce another variable with its own restriction, such as:

ENG: [e(X) go to see e(X's) e(Y)] ↔ AGGR-1[X, Y] ↔ condition[Y is a medical-profession] JPN: [j(X)-wa j(X)-no j(Y)-ni mite morau]

This rule can be compared with another rule, which is concerned with a similar but different situation, like to go to see one's lawyer, and which leads to a different translation in Japanese:

ENG: [e(X) go to see e(X's) e(Y)] ↔ AGGR-3[X, Y] ↔ condition[Y is a legal-profession] JPN: [j(X)-wa j(X)-no j(Y)-ni soudan-ni iku]

Here we have the word soudan (consult in English) in Japanese, and the beneficiary causative is no longer used. As one can easily see, these correspondence rules look like the transfer rules used in actual commercial MT systems. They stipulate correspondences which can hardly be justified on a purely linguistic basis. They are also very specific, in the sense that they contain many individual words, like see, go, miru, soudan and morau. Furthermore, they introduce a rather ad hoc classification of nouns, like medical-profession and legal-profession. While researchers with theoretical orientations considered rules such as these awkward and ad hoc and started to tidy them up, it seems to me that they actually reflect certain essential aspects of translation. What was
essentially wrong with their attitude is that they took these as exceptions to general rules, or simply ignored the empirical facts of translation altogether. The framework is obviously naive in many respects. In particular, it may need structural annotations in ENG and JPN in order to use these correspondences for constructing larger expressions which contain them as parts. It may also be desirable for the structural annotations to be made at a certain abstract level, in order to allow some freedom in the generation phase, etc. However, structural annotations as such play only the role of descriptors for defining correspondences between complex expressions, and do not have the independent status on which translation equivalence is defined.
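The correspondence rules of this section, English and Japanese templates linked through an unanalysed AGGR label, with typed variables, can be sketched as follows. The ontology, lexicon and templates are illustrative assumptions, not the paper's actual rule base, and the English template is carried only as a descriptor (a real system would unify it against the source structure).

```python
# Correspondence rules linked through unanalysed AGGR labels, with a
# condition on the variable Y. All tables here are illustrative.
IS_A = {"GP": "medical-profession", "doctor": "medical-profession",
        "lawyer": "legal-profession"}

# j(X): Japanese translation of X (hypothetical mini-lexicon)
J = {"I": "watashi", "GP": "isha", "lawyer": "bengoshi"}

RULES = [
    # (AGGR label, English template, condition on Y, Japanese template)
    ("AGGR-1", "{X} go to see {X}'s {Y}", "medical-profession",
     "{X}-wa {X}-no {Y}-ni mite morau"),
    ("AGGR-3", "{X} go to see {X}'s {Y}", "legal-profession",
     "{X}-wa {X}-no {Y}-ni soudan-ni iku"),
]

def transfer(x: str, y: str) -> str:
    for aggr, eng, cond, jpn in RULES:
        if IS_A.get(y) == cond:       # disambiguate via the condition on Y
            return jpn.format(X=J[x], Y=J[y])
    raise LookupError(f"no correspondence rule for Y={y}")

print(transfer("I", "GP"))      # watashi-wa watashi-no isha-ni mite morau
print(transfer("I", "lawyer"))  # watashi-wa watashi-no bengoshi-ni soudan-ni iku
```

Note that the condition on Y carries the whole disambiguation burden here; point [3] in the next section questions exactly this assumption.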
7 Other frameworks and future directions
I have illustrated what the rules in the MT system I have in mind look like, and indicated that they have more similarities with the rules in traditional transfer-based systems than with those in linguistically motivated prototype MT systems. The reason for this is that linguistically motivated research has made several wrong assumptions about translation, notably those related to compositionality of translation and possible translation, which miss the true nature of translation. However, several other research frameworks, such as EBMT (Example-based MT) (Nagao 1984; Nagao 1992; Sumita 1991; Furuse 1992; Jones 1992), SBMT (Statistics-based MT) (Brown 1992) and KBMT (Knowledge-based MT) (Nirenburg 1989), have been proposed and have attracted more and more interest from the research community. These frameworks have taken orientations distinctly different from linguistically motivated MT (LBMT) and do not have the defects I discussed in this paper. EBMT and SBMT, for example, which use translations produced by human translators as a major source of knowledge, will never be separated from the empirical facts of translation. In the following, I will summarise my points by referring to these frameworks as well as the traditional transfer paradigm.

[1] If the nature of transfer rules is as I have described, then recursive transfer, which relies heavily on the structure of a source sentence, may not work well. The straitjacket imposed by recursive transfer has to be relaxed. The transfer process is more like the process of solving a jigsaw puzzle.
[2] The framework in Section 6 and EBMT have many things in common. The major difference is whether one introduces variables and their conditions explicitly or not. The framework in Section 6 assumes a phase of Knowledge Preparation, in which individual correspondence rules are identified (e.g., how many distinct AGGRs have to be established) and the parts to be represented by variables are identified. In this phase, one has to examine translations given by human translators. The obvious advantage of EBMT is to avoid this Knowledge Preparation phase and just use examples as transfer rules. However, as several groups in the EBMT camp admit, careful scrutiny of examples is vital for the success of EBMT, which implies that EBMT also has to have a knowledge preparation phase of some sort.

[3] In Section 6, I argued as if the disambiguation between AGGR-1 and AGGR-3 were to be made simply by referring to the properties of Y. However, there is no guarantee that disambiguation is possible only by examining the internal structures of the expressions to be transferred, although most traditional transfer-based systems assume that this is the case. As we saw in Section 5, the same expression go to see one's GP has to be related to different AGGRs, depending on the context in which the expression appears. Whether disambiguation has to be performed in a rationalistic way (through explicit understanding of what a text describes, as in KBMT) or otherwise remains to be seen. The key problem here is how to characterise the context which affects selection. EBMT, for example, provides an empiricist alternative to the problem. It is also plausible to apply methods, statistical or connectionist, which have proven effective for sense disambiguation of lexical items.

[4] I treat AGGRs as unanalysable wholes, which directly connect expressions of two languages.
It might be possible, as the KBMT camp and Dorr (1994) have done to a certain extent, to analyse the internal structures of AGGRs and to discuss (and implement) the processes of interpretation and paraphrase which dynamically associate linguistic forms with AGGRs. However, considering our current understanding of these processes, it would be too ambitious to map a source text to AGGRs and then generate a target text.
[5] Although I have not discussed it in this paper, I share with many people the belief that there is no such thing as a universal MT system. This belief has two consequences. One consequence is that every MT system has to be tuned towards specific subject domains and text types. In my model, for example, a set of correspondence rules has to be prepared for every different sublanguage (Ananiadou 1990). Consequently, the knowledge preparation phase plays a crucial role, and I believe that the technology for automating this phase will be vital for broadening the range of MT applications (Tsujii 1992; Tsujii 1993). The other consequence of this belief is that the architecture of MT systems has to be diversified. Which architecture, transfer-based MT, EBMT, KBMT, LBMT or the framework illustrated here, is most appropriate is highly dependent on the complexity of translation required by a given sublanguage and the functionality of a system required in a specific application.

REFERENCES

Alshawi, Hiyan, D.J. Arnold, R. Backofen, D.M. Carter, J. Lindop, K. Netter, S.G. Pulman, J. Tsujii & H. Uszkoreit. 1991. EUROTRA-6/1 Final Report. Technical Report produced for the Commission of the European Communities, Luxembourg. Cambridge: SRI International, Cambridge Computer Science Research Centre.

Ananiadou, Sofia. 1990. "Sublanguage Studies as the Basis for Computer Support for Multilingual Communication". Proceedings of Terminology and Planning (Termplan-90), 10-13. Kuala Lumpur.

Appelo, Lisette, C. Fellinger & J. Landsbergen. 1987. "Subgrammars, Rule Classes and Control in the ROSETTA Translation System". Proceedings of the 3rd Conference of the European Chapter of the Association for Computational Linguistics (EACL'87), 118-133. Copenhagen, Denmark.

Brown, Peter F., S.A. Della Pietra, V.J. Della Pietra, J.D. Lafferty & R.L. Mercer. 1992. "Analysis, Statistical Transfer, and Synthesis in Machine Translation".
Proceedings of the 4th International Conference on Theoretical and Methodological Issues in Machine Translation (TMI-92), 83-100. Montreal, Canada.

Furuse, Osamu & H. Iida. 1992. "Transfer-Driven Machine Translation". Proceedings of the 2nd International Workshop on Fundamental Research for the Future Generation of Natural Language Processing (FGNLP). Manchester: Centre for Computational Linguistics, UMIST.
Maruyama, Hiroshi & H. Watanabe. 1992. "Tree Cover Search Algorithm for Example-Based Translation". Proceedings of the 4th International Conference on Theoretical and Methodological Issues in Machine Translation (TMI-92), 173-184. Montreal, Canada.

Jones, Daniel. 1992. "Non-Hybrid Example-Based Machine Translation Architectures". Proceedings of the 4th International Conference on Theoretical and Methodological Issues in Machine Translation (TMI-92), 35-43. Montreal, Canada.

Landsbergen, J. 1989. The Power of Compositional MT. Eindhoven, The Netherlands: Philips Research Laboratories.

Nirenburg, Sergei. 1989. "Knowledge-based Machine Translation". Machine Translation 4:1.5-24.
Nirenburg, Sergei, J. Carbonell, M. Tomita & K. Goodman. 1992. Machine Translation: A Knowledge-based Approach. San Mateo, Calif.: Morgan Kaufmann.

Nagao, Makoto. 1984. "A Framework of a Mechanical Translation between Japanese and English by Analogy Principle". Artificial and Human Intelligence ed. by A. Elithorn & R. Banerji, 173-180. Amsterdam: North-Holland Elsevier.

Nagao, Makoto. 1992. "Some Rationales and Methodologies for Example-Based Approach". Proceedings of the 2nd International Workshop on Fundamental Research for the Future Generation of Natural Language Processing (FGNLP), 82-94. Manchester: Centre for Computational Linguistics, UMIST.

Sumita, Eiichiro & H. Iida. 1991. "Experiments and Prospects of Example-Based Machine Translation". Proceedings of the 29th Meeting of the Association for Computational Linguistics (ACL'91), 185-192. Berkeley, California.

Tsujii, Jun'ichi, S. Ananiadou, I. Arad & S. Sekine. 1992. "Linguistic Knowledge Acquisition from Corpora". Proceedings of the 2nd International Workshop on Fundamental Research for the Future Generation of NLP (FGNLP), 61-81. Manchester: Centre for Computational Linguistics, UMIST.

Tsujii, Jun'ichi & S. Ananiadou. 1993. "Knowledge-based Processing in MT". Knowledge Building and Knowledge Sharing ed. by K. Fuchi & T. Yokoi, 68-85. Amsterdam, The Netherlands: IOS Press.
Connectionist F-structure Transfer

YE-YI WANG & ALEX WAIBEL
Carnegie Mellon University

Abstract

A traditional transfer system in machine translation maps between language structures and an intermediate representation. Our connectionist transfer system maps f-structures of one language directly to f-structures of another language. It encodes the intermediate representation implicitly in the activation patterns of neural networks. Because the system is learnable, no effort is needed to hand-craft the representation or the mapping rules. Experiments show that the system has good scalability and generalisability.

1 Introduction
Most current machine translation systems adopt an indirect strategy that maps between languages and an intermediate representation. The interlingua model (Nirenburg et al. 1987) uses a language-independent intermediate representation. Design of the representation requires cross-linguistic expertise. The intermediate representation in a transfer model (White 1987) is language-dependent, and its design is relatively easier; however, multiple such representations are required for a multi-lingual translator. Both models rely upon hand-crafted mapping rules, which demand tremendous human effort.
These difficulties call for automatic learning mechanisms for intermediate representations and mapping rules. Chrisman (1991) proposed a connectionist confluent inference system that acquired a distributed inter-language representation of sentences during learning, in order to achieve a tight coupling between the representations of sentences in two different languages. The approach was hard to scale up to larger tasks or to generalise to unseen inputs, mostly because of its over-simplified representation of sentences.
We present here a connectionist mapper. It can learn the transfer from a source language (English) LFG f-structure (Bresnan 1982) into its corresponding target language (German) f-structure. It needs no explicit intermediate representation or mapping rules. Instead, the connection patterns of the neural networks implicitly encode the rules and the representation.
The domain of our task was the Conference Registration Telephony Conversations. It covered a wide range of topics related to conferences, such as
registration, cancellation, hotel reservation, conference information inquiry, etc. The lexicon for the task contained about 400 English and 400 German words in root form. About 300 pairs of f-structures of English and German sentences were available from symbolic parsers.
A machine translation system for the Conference Registration task consists of three parts: a parser deriving the f-structure from an input source language sentence, a mapper generating a target language f-structure from its source language counterpart, and a text generator producing a target language sentence from its f-structure. In our experience, mapping between f-structures was the most difficult part, requiring the hand-crafting of an intermediate representation and of the rules that map between f-structures and the intermediate representation. An automatic transfer system is thus desirable. Such a system should have the following properties:
Learnability: The system should be able to learn the structure transfer automatically from paired samples. It should not require hand-crafting of any explicit representations or mapping rules.
Scalability: With limited retraining, the system should be able to deal with larger tasks with an expanded lexicon.
Generalisability: The system should have satisfactory performance on unseen inputs.

2 F-structure representations
An f-structure is a structured functional representation of a sentence or a phrase. It is composed of a head, terminal features, and sub-structures. For the f-structure in Figure 1a, *SEND is the head. The contents of the inner brackets are the sub-structures, whose grammatical relations or roles¹ are labeled next to the brackets. The remaining parts of Figure 1a are the terminal features. A sub-structure can be referred to by its grammatical relation or by its phrasal category (NP, VP, ...). Thus the sub-structure [subj *YOU] can be called either a SUBJECT sub-structure or an NP sub-structure. The SUBJECT, RECIP and OBJECT sub-structures are the three immediate sub-structures of the top-level f-structure in Figure 1a, because there is no intervening structure between these sub-structures and the top-level f-structure. The DET sub-structure is an immediate sub-structure of the OBJECT sub-structure. If A is an immediate sub-structure of B, then B is the parent structure of A.
¹ We use "grammatical relation" interchangeably with the term "role".
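These definitions can be made concrete with a small nested-dictionary sketch. The field names ("head", "features", "subs") and the OBJECT head "*FORM" are our own illustrative choices, not the paper's notation:

```python
# An f-structure as a nested dict: a head, terminal feature-value
# pairs, and role-labelled sub-structures ("subs").

def make_fs(head, features=None, subs=None):
    return {"head": head, "features": features or {}, "subs": subs or {}}

# Approximation of the f-structure of Figure 1a (the OBJECT head
# "*FORM" is a guess for illustration):
fs = make_fs("*SEND", features={"TENSE": "*PRESENT"}, subs={
    "subj": make_fs("*YOU"),
    "recip": make_fs("*I"),
    "obj": make_fs("*FORM", subs={"det": make_fs("*A")}),
})

def immediate_subs(fs):
    """Roles of the immediate sub-structures: those with no
    intervening structure between them and fs."""
    return sorted(fs["subs"])

def parent_of(root, target):
    """The parent structure of target, i.e., the structure of which
    target is an immediate sub-structure."""
    for sub in root["subs"].values():
        if sub is target:
            return root
        found = parent_of(sub, target)
        if found is not None:
            return found
    return None

det = fs["subs"]["obj"]["subs"]["det"]
assert immediate_subs(fs) == ["obj", "recip", "subj"]
assert parent_of(fs, det) is fs["subs"]["obj"]   # DET's parent is OBJECT
```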
A symbolic f-structure cannot be presented to a neural network directly. Figure 1c-f illustrates how an f-structure can be coded as a network's input. Below are the terms used for the representation.
Fig. 1: F-structure representation: (a) an f-structure; (b) abbreviation; (c) lexical vector; (d) terminal feature vector; (e) HF-vector; (f) f-structure represented by HF-vectors.
A lexical vector is used to code a lexical item. Assuming that every lexical item is an entry in a two-dimensional space instead of a one-dimensional word list, we need two indices to specify the position of a lexical item in the space. A lexical vector is a 0-1 vector with exactly two elements being 1 (being activated). The positions of the two activated elements in the vector specify the two indices of an item in the 2D lexicon (Figure 1c).²
² Viewing the lexicon as 2D reduces the length of the vector used to represent a lexical item from n to about 2√n.
The terminal feature vector of an f-structure codes the terminal features of the f-structure. Each element of the vector corresponds to a
feature-value pair like (TENSE *PRESENT). The vector, again, is a 0-1 vector, with the activated elements indicating that their corresponding feature-value pairs are terminal features of the f-structure (Figure 1d). Since there are altogether around 60 different values for all the features used in the f-structures, the length of the terminal feature vector is around 60.
The HF-vector of an f-structure is the concatenation of the lexical vector of the head and the terminal feature vector of the f-structure (Figure 1e). Thus an f-structure can be represented by its HF-vector and its sub-structures' HF-vectors (Figure 1f).
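The vector coding just described can be sketched as follows. The toy lexicon and three-entry feature inventory are invented for illustration; only the coding scheme itself (two active bits giving the 2D lexical index, one bit per feature-value pair, concatenation into an HF-vector) follows the text.

```python
import math

LEXICON = ["*SEND", "*YOU", "*I", "*A", "*FORM", "*CONFERENCE"]  # toy lexicon
FEATURES = [("TENSE", "*PRESENT"), ("TENSE", "*PAST"), ("MOOD", "*DECLARATIVE")]

SIDE = math.ceil(math.sqrt(len(LEXICON)))  # side of the square 2D lexicon

def lexical_vector(word):
    """0-1 vector with exactly two activated elements: the row and the
    column index of the word in the 2D lexicon, so an n-word lexicon
    needs about 2*sqrt(n) units instead of n."""
    row, col = divmod(LEXICON.index(word), SIDE)
    vec = [0] * (2 * SIDE)
    vec[row] = 1
    vec[SIDE + col] = 1
    return vec

def terminal_feature_vector(feats):
    """One element per feature-value pair; activated iff the pair is a
    terminal feature of the f-structure."""
    return [1 if fv in feats else 0 for fv in FEATURES]

def hf_vector(head, feats):
    """HF-vector: lexical vector of the head ++ terminal feature vector."""
    return lexical_vector(head) + terminal_feature_vector(feats)

v = hf_vector("*SEND", {("TENSE", "*PRESENT")})
assert sum(lexical_vector("*CONFERENCE")) == 2   # always exactly two bits
assert len(v) == 2 * SIDE + len(FEATURES)
```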
3 The mapper
A mapper is a push-down transducer that consists of:
1. a symbolic controller that assigns an f-structure transfer task to a neural network and interprets the network's output. According to the interpretation, it recursively assigns the sub-structure transfer tasks to the related networks, and assembles these networks' results into the target f-structure;
2. seven neural networks that map phrasal f-structures between the two languages. Each network is constructed for a phrasal category of the target language: IP (sentences), VP, NP, AP, PP, DP (determiners), and MP (miscellaneous, for phrases like "hello", "oh", etc.).
3.1 Phrasal networks
A phrasal network has four layers: input, feature, hidden, and output layers (Figure 2a). The input layer consists of three parts:
Slots for the HF-vectors of an input f-structure and its context (parent) structure. Each slot corresponds to a fixed role. An input f-structure may have sub-structures of arbitrary depth, but the networks must have a fixed number of input slots. Therefore we cannot include all sub-structures' HF-vectors in the networks' input. Instead, we 'peel off the shell' of an f-structure: we include only the HF-vectors of the immediate sub-structures, and of their immediate sub-structures in turn, for the input f-structure, and the HF-vectors of the immediate sub-structures for the context f-structure. Pre-analysis of the samples reveals the possible roles of the sub-structures that can occur at these levels in f-structures of the seven phrasal categories, and slots are then added to the input and
Fig. 2: Phrasal Network Structure: (a) the architecture of a phrasal network. (b) details of the lowest two layers. The unshaded slots represent the input f-structure. The shaded ones represent the context f-structure.
feature layers of the corresponding phrasal networks to take as input the HF-vectors of the sub-structures with those possible roles.
The grammatical relation of the input source structure in its context.³ This input is a 0-1 vector with exactly one activated element, indicating the grammatical relation of the input structure.
The lexical vector of the head of the output f-structure's parent structure (p-head). Sometimes one input f-structure may be responsible for the generation of multiple target f-structures at different levels. For example, [sentence GOODBYE] corresponds to both [sentence AUF [obj WIEDERHÖREN]]
and its sub-structure [obj WIEDERHÖREN] in the training samples. This input serves as a stack pointer, indicating the level at which the output f-structure should be generated.
³ Slot position only indicates the role of sub-structures, not the role of the input structure, since the HF-vector of the input f-structure, whatever its role, always occupies the first slot.
The HF-vectors at the input layer are local representations of the words and features in an f-structure. The activation patterns of the slots at the feature layer can be viewed as an automatically learned distributed representation of the input HF-vectors (Miikkulainen 1989). The input slots have one-to-one connections to the feature slots (Figure 2b). The slot-slot connections share weights in such a way that the connection from the ith unit in slot A at the input layer to the jth unit in slot A at the feature layer has the same strength as the connection from the ith unit in slot B at the input layer to the jth unit in slot B at the feature layer. The weight sharing makes the same HF-vector at different input slots result in the same pattern in the corresponding feature slots.
The output layer of a phrasal network has three parts:
The HF-vector of the f-structure to be generated. From this vector the head and the terminal features of the target f-structure can be recovered.
The sub-structures' input specifiers. These consist of slots of 0-1 vectors. Each slot has at most one element activated, and each slot corresponds to a sub-structure of a specific role of the target f-structure. The role of the sub-structure is implied by the position of the slot in the output layer. Each vector of the sub-structures' input specifiers is of size (number of input layer slots + 1). For an output slot in the sub-structures' input specifiers, if it has one activated element, then the sub-structure with the corresponding role should be included as a part of the desired output f-structure. The position of the activated element in the slot indicates the input sub-structure (as specified by the slot number in the input layer) that is the counterpart of (and therefore is responsible for the generation of) the target sub-structure, or nil when no input sub-structure is a counterpart of the output sub-structure.
If a network does not activate any element in an output slot, then the slot's corresponding sub-structure is not expected as part of the desired target f-structure.
The sub-structures' categories. These consist of slots of 0-1 vectors. At most one element can be activated in each slot, specifying one of the seven phrasal categories for the corresponding target sub-structure.
According to a network's output, the controller builds sub-structures recursively by assigning the subsequent sub-structure mapping tasks to the networks of the categories specified in the sub-structures' categories at the output layer. The input f-structures of those mapping tasks are specified in the sub-structures' input specifiers. By combining the recursively built sub-structures with the head and the terminal features from
the output HF-vector, the desired target f-structure can be produced.
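The controller's recursive assembly loop can be sketched with the phrasal networks stubbed out as plain functions. The stub outputs below are invented for illustration; a real network would emit the HF-vector and the specifier slots described above.

```python
# Stub "networks": each maps (source fs, context, role, p-head) to a
# head, terminal features, and per-role (input_sub, category) specifiers.

NETWORKS = {
    "VP": lambda src, ctx, role, phead: {
        "head": "WERDE",
        "features": {"CAT": "V"},
        "subs": {"subj": (src["subs"]["subj"], "NP")},
    },
    "NP": lambda src, ctx, role, phead: {
        "head": "PRONOUN", "features": {}, "subs": {},
    },
}

def transfer(category, src, ctx=None, role=None, phead=None):
    """Symbolic controller: run the network for `category`, then
    recursively map each specified input sub-structure and assemble
    the results into the target f-structure."""
    out = NETWORKS[category](src, ctx, role, phead)
    target = {"head": out["head"], "features": out["features"], "subs": {}}
    for r, (input_sub, cat) in out["subs"].items():
        target["subs"][r] = transfer(cat, input_sub, ctx=src, role=r,
                                     phead=out["head"])
    return target

src = {"head": "WOULD", "subs": {"subj": {"head": "I", "subs": {}}}}
target = transfer("VP", src)
assert target["head"] == "WERDE"
assert target["subs"]["subj"]["head"] == "PRONOUN"
```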
4 An example
The following example illustrates how the system works.
Source Sentence: I would like to register for the conference
Source F-structure: [sent [subj I] WOULD [xcomp [subj I] LIKE [xcomp [subj I] REGISTER [pp_adj FOR [obj [det THE] CONFERENCE]]]]]
Target Sentence: Ich wuerd mich gerne zur Konferenz anmelde

(0) IP network
Input: source: [sent [subj I] WOULD [xcomp [subj I] LIKE [xcomp REGISTER]]]
Output: head: NIL
subs: sentence <WOULD VP>⁴ (1)⁵
features: (MOOD *DECLARATIVE)
F-structure assembled by the controller:
[sentence [subj PRONOUN] WERDE
[xcomp [subj PRONOUN] [adj GERNE] ANMELDEN [obj PRONOUN] [pp_adj FÜR [obj [det DER] KONFERENZ]]]]

(1) VP network
Input: source: [sent [subj I] WOULD [xcomp [subj I] LIKE [xcomp REGISTER]]]
context: NIL
role: sentence
p-head:⁶ NIL
Output: head: WERDE
subs: subj (2) xcomp (3)
features: ((CAT V) (PERSON 1) (MODAL +) (FORM FIN) ...)
F-structure assembled by the controller:
[ [subj PRONOUN] WERDE [ [subj PRONOUN] [adj GERNE] ANMELDEN [obj PRONOUN] [pp_adj FÜR [obj [det DER] KONFERENZ]]]]
In step (0), the controller first activates the IP network with the source input f-structure. There is no context input for the IP network, since the sentential f-structures are the top-level f-structures in our task. From the network's output, the controller knows that the head of the IP is NIL.⁷ The network also generates the sentential feature (MOOD *DECLARATIVE). The controller interprets the output as saying that the only sub-structure of the sentence is a German VP, whose English counterpart is the (non-proper) sub-structure with the head WOULD.⁸ It therefore builds the target f-structure framework [ NIL (MOOD *DECLARATIVE) [sentence *]], and activates the VP network in step (1). Upon receiving the VP sub-structure returned from step (1), it combines that sub-structure with the f-structure framework, and collapses the NIL-headed f-structure to form the assembled f-structure shown as the output of step (0).
In step (1), the input source was determined in step (0), since the sentence sub-structure's head was WOULD according to the IP network's sub-structure input specifier in step (0). The context input is NIL because the source f-structure does not have a parent f-structure. The input role has the value sentence because the slot position of the output sub-structure in step (0) implies that the grammatical relation of the sub-structure is sentence. The input p-head is NIL because the head of the target f-structure in step (0) was NIL, as specified by the output HF-vector there. The VP network maps the input f-structure to its German counterpart by specifying (a) the head WERDE and the terminal features of the German VP structure in the output HF-vector, and (b) the input specifiers and the categories of the sub-structures of the target German VP f-structure.
⁴ The sub-structure's input specifier and category are combined into a tuple here.
⁵ The number in parentheses indicates the subsequent step of network activation for this sub-structure.
⁶ P-head is the head of the target f-structure's parent structure.
⁷ A NIL-headed f-structure occurs only when there is a single sub-structure or when there is an xcomp sub-structure. It must collapse into the only sub-structure in the first case, or into the xcomp sub-structure in the second; all terminal features and other sub-structures are moved into the collapsed-into sub-structure during collapsing.
⁸ The network actually specifies the slot at the input layer rather than the lexical item WOULD.
To build the detailed sub-structures of this VP f-structure, the controller will activate the NP network with the English sub-structure with the head I, and the VP network with the English sub-structure with the head LIKE, in the subsequent steps, and
combine the sub-structures returned from these subsequent steps into the f-structure framework [[subj *] WERDE [xcomp *]]. The combined structure is then returned to step (0), to be integrated into the top-level f-structure framework there.
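The collapsing step used in step (0), as specified in footnote 7, can be sketched directly:

```python
# Collapse a NIL-headed f-structure into its xcomp sub-structure if
# one exists, otherwise into its only sub-structure; terminal features
# and the remaining sub-structures move into the collapsed-into one.

def collapse(fs):
    if fs["head"] is not None:
        return fs
    subs = fs["subs"]
    into = "xcomp" if "xcomp" in subs else next(iter(subs))
    target = dict(subs[into])
    target["features"] = {**fs["features"], **target["features"]}
    target["subs"] = {
        **{r: s for r, s in subs.items() if r != into},
        **target["subs"],
    }
    return target

frame = {
    "head": None,                      # NIL-headed framework from step (0)
    "features": {"MOOD": "*DECLARATIVE"},
    "subs": {"sentence": {"head": "WERDE", "features": {}, "subs": {}}},
}
assembled = collapse(frame)
assert assembled["head"] == "WERDE"
assert assembled["features"]["MOOD"] == "*DECLARATIVE"   # feature moved in
```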
5 Training, testing and performance
From the 300 sentential f-structure pairs, we extracted all the German NP sub-structures, their grammatical relations and their parent structures' heads, and labeled their English counterparts.⁹ This was all the information required for training the NP network. About 700 samples for the NP network were created this way. The training samples for the other networks were prepared in the same way. The NP network had the most samples, while the MP network had the fewest, 89 samples. Standard back-propagation was used to train the networks. We also tried information-theoretic networks (Gorin et al. 1991) to generate the head of a target structure in the HF-vector, which required less training time and achieved performance comparable to that of the network trained with pure back-propagation (Wang 1994). The training took 500 to 2000 epochs for the different networks, and the training time ranged from one hour to three days on a DEC Station 5000. The mapper achieved 92.4% accuracy on the training data.¹⁰
Learnability: The connectionist f-structure transfer described above did not require any hand-crafted rules or representations. The structure transfer was learned automatically. By clustering the distributed representations of words learned by the networks, i.e., the activation patterns of a feature slot when a lexical item was presented to its connected input slot, we made some interesting findings about what was learned by the networks. One was that the feature patterns for English nouns in the DP network were clustered into three classes, which reflected the three genders of German nouns: the German translations of the words in each class were roughly of the same gender. Another finding concerned the classification of verbs. When we clustered the feature patterns for verbs in the VP network, we found some intransitive verbs like register in the same class as most of the transitive verbs.
⁹ An NP's counterpart is not necessarily an NP.
¹⁰ A source language f-structure is said to be accurately mapped if the generated target language f-structure is exactly the same as desired in the sample.
This seemingly strange classification is not odd at all if we consider the fact that the German translation of register, "anmelden", is a
transitive verb. These two independent findings reveal the networks' ability to discover linguistic features of the target language and use them in the representation of an entity of the source language that does not possess those features. This is exactly what a symbolic transfer system is supposed to do: use an intermediate representation which reflects the linguistic features of the two languages in question (even if one of the languages may have a degenerate form of a specific feature), and thus be able to make a 'transfer' at both the lexical and the structural level into the corresponding structure of the target language. Our system learned the intermediate representation automatically, although the representation was not expressed explicitly in symbolic form but encoded in the networks' activation patterns. Because the development of this representation was integrated into the process of automatically learning the f-structure mapping, the intermediate representation tended to include the important language-specific linguistic features that were directly relevant to the ultimate purpose of structure transfer. In other words, the learning of the intermediate representation was focused on improving transfer performance. This is one of the biggest advantages of this approach over hand-crafted intermediate representations.
Scalability: We did a preliminary scalability experiment. We extended the source and target language lexicons by 2%, and made 30 new f-structures with these new lexical items. To scale up from what was already learned, we froze all but the input-feature connections, trained the networks for about 40 epochs with the new data, then fine-tuned all the connections with old and new data for a few epochs. In doing so, we let the networks first learn the new words to derive their distributed representations, and then learn the structure mapping for the new data.
This approach was based on the observation that a large portion of the new English words were translated to German words already in the lexicon, which in turn were translations of English words in the old training data. These old English words were mostly synonyms of the new English words. By freezing the other connections and training only the input-feature connections, we hoped the networks would develop distributed representations for the new words similar to the already-learned representations of their synonyms. This approach greatly reduced the learning time for new words, since the one-layer back-propagation was much faster than the full-blown learning. The mapper with the phrasal networks retrained this way achieved 83.3% accuracy on the new data, without affecting the performance on the old data.
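The freeze-then-fine-tune schedule can be illustrated with a toy two-layer linear network trained by plain gradient steps. The network, data and learning rates here are invented; only the schedule itself — train the input-feature layer while the rest is frozen, then briefly fine-tune everything — mirrors the text.

```python
import random

random.seed(0)

# Toy two-layer linear net: W1 plays the input->feature connections,
# W2 stands for all the other (initially frozen) connections.
W1 = [[random.uniform(-0.1, 0.1) for _ in range(3)] for _ in range(3)]
W2 = [random.uniform(-0.1, 0.1) for _ in range(3)]

def forward(x):
    h = [sum(W1[j][i] * x[i] for i in range(3)) for j in range(3)]
    return h, sum(W2[j] * h[j] for j in range(3))

def train_step(x, t, lr, freeze_upper):
    h, y = forward(x)
    err = y - t
    if not freeze_upper:
        for j in range(3):
            W2[j] -= lr * err * h[j]
    for j in range(3):              # the input->feature layer always trains
        for i in range(3):
            W1[j][i] -= lr * err * W2[j] * x[i]

x, t = [1.0, 0.0, 1.0], 0.5        # one "new word" training pair
for _ in range(400):               # phase 1: everything but W1 frozen
    train_step(x, t, lr=1.0, freeze_upper=True)
for _ in range(50):                # phase 2: brief fine-tuning of all weights
    train_step(x, t, lr=0.01, freeze_upper=False)
assert abs(forward(x)[1] - t) < 0.01
```

Because phase 1 touches only one layer, each step is far cheaper than full back-propagation, which is the point of the retraining scheme described above.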
Generalisability: A separate set of data was used to test the generalisation performance of the system. The testing data was collected from people not associated with our research. The data was compared with the training corpus, and the sentences that appeared in the training data were removed. An LR parser parsed the sentences into English f-structures. The English sentences were translated into German manually, and the translations were parsed by a German LR parser. We picked the most probable structure when a parsing result was ambiguous. There were 154 f-structure pairs after we eliminated the wrongly-parsed sentences. The mapper achieved 61.7% accuracy on the testing data. Considering the limited number of training samples, this performance is encouraging. Previous research such as (Chrisman 1991) did not generalise to deal with unseen data.
6 Discussion
The application of the connectionist transfer described in this paper has its restrictions. First, it requires well-formed f-structures for both the input and output sentences. This greatly limits the applicable domain of the approach to well-structured 'clean' languages. It is difficult to use this approach for spoken language, where performance data like ungrammatical utterances, noises, and false starts are pervasive.
Another restriction is that this approach can only achieve satisfactory performance when the input and output languages are similar, in the sense that the translation equivalents in the two languages mostly have similar recursive f-structures. Although the system can deal with structurally different input/output sentences, like the aforementioned example of [sentence GOODBYE] and [sentence AUF [obj WIEDERHOEREN]], we believe that the performance would drop significantly if drastic structural differences between translation equivalents were very common for the two languages in question. Fortunately, as shown by our data, the structural difference between English and German is not so drastic as to ruin our system's performance.
Although we have done some scalability experiments, it is unclear how the system would perform if we increased the lexicon significantly instead of by 2%. Because of the limited available data, we found it very difficult to conduct scalability experiments with a much larger lexicon. We hope that, with stable incremental performance, the system can be gradually and easily retrained to deal with more complicated problems.
7 Conclusion
Motivated by the difficulties of symbolic transfer, we have proposed a connectionist transfer system that maps between f-structures of two languages. It can discover meaningful linguistic features by learning. Its performance is promising with respect to learnability, scalability and generalisability.

REFERENCES

Bresnan, Joan. 1982. The Mental Representation of Grammatical Relations. Cambridge, Mass.: MIT Press.
Chrisman, Lonnie. 1991. "Learning Recursive Distributed Representations for Holistic Computation". Connection Science 3:4.345-366.
Gorin, Allen L., Steve E. Levinson, A. N. Gertner & E. Goldman. 1991. "Adaptive Acquisition of Language". Computer Speech and Language 5:101-132.
Miikkulainen, Risto & Michael G. Dyer. 1989. "A Modular Neural Network Architecture for Sequential Paraphrasing of Script-Based Stories". Proceedings of the International Joint Conference on Neural Networks. IEEE.
Nirenburg, Sergei, Victor Raskin & Allen B. Tucker. 1987. "The Structure of Interlingua in TRANSLATOR". Machine Translation: Theoretical and Methodological Issues ed. by Sergei Nirenburg, 90-113. Cambridge: Cambridge University Press.
Wang, Ye-Yi. 1994. "Dual-Coding Theory and Connectionist Lexical Selection". Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics (ACL'94), student session, 325-327.
White, John S. 1987. "The Research Environment in the METAL Project". Machine Translation: Theoretical and Methodological Issues ed. by Sergei Nirenburg, 225-246. Cambridge: Cambridge University Press.
Acquisition of Translation Rules from Parallel Corpora

Yuji Matsumoto and Mihoko Kitamura
Graduate School of Information Science
Nara Institute of Science and Technology
Abstract

This article presents a method for the automatic acquisition of translation rules from a bilingual corpus. Translation rules are extracted from the results of structural matching of parallel sentences. The structural matching process is controlled by a word similarity dictionary, which is also obtained from the parallel corpus. The system acquires translation equivalences at the word level as well as at the multiple-word or phrase level.
1 Introduction
The major issues in Machine Translation are how to acquire translation knowledge and how to apply that knowledge in real systems without causing unexpected side-effects. Hand-coding of transfer rules suffers from the problems of enormous manual labour and the difficulty of maintaining consistency. Example-based translation (Sumita 90; Sato 90) is supposed to be a method that copes with this problem. Unlike transfer-based approaches, the idea is to carry out translation by referring to the translation examples that are most similar to the given sentence. The key technique is to define the similarity between the given sentence and the examples, and to identify the examples with the best similarity. Robustness and scalability are the claimed strengths of this approach. However, there are at least two important problems that have not been answered. One is the "knowledge access bottleneck," which concerns the selection of the most similar example. Similarities are usually defined only for fixed and local structures, such as predicate-argument structures and compound nominals. The units of translation cannot always be such fixed structures and may vary according to the language pair. Similarity should be defined in a more flexible way. The other is the "knowledge acquisition bottleneck." In example-based translation, the parallel examples have to be aligned not only at sentence level but at word or
phrase level. Although sentence-level alignment can be done automatically using statistics, e.g., (Utsuro et al. 94), word-level alignment is not an easy task, especially when the system tries to cover wide syntactic phenomena.
This paper presents a method for the automatic acquisition of translation rules from a parallel corpus of English and Japanese. Translation rules here refers to word selection rules and translation templates that represent word-level and phrase-level translation rules. A translation template is regarded as a phrasal translation rule. Since translation rules may change according to the target domain, this method sheds light on an easy and effective way of developing domain-dependent translation rules by accumulating a parallel corpus.

2 Acquisition of Translation Rules

Figure 1 shows the flow of the acquisition of translation rules. The following three types of resources are assumed:
1. A parallel corpus of the source and target languages.
2. Grammars and dictionaries of the source and target languages.
3. A machine-readable bilingual dictionary.
The automatic acquisition of translation rules is composed of the following three processes:
Calculation of word similarities: Calculation of the similarities of word pairs of the source and target languages based on their co-occurrence frequencies in the parallel corpus.
Structural matching: Structural matching of the dependency structures obtained by parsing parallel sentences.
Acquisition of translation rules: Acquisition of translation rules based on the structural matching results.
We focus on a bilingual corpus of Japanese and English and assume that sentence-level alignment has been done on the corpus. If the sentences are not aligned, we can align them using an existing alignment algorithm such as (Kay & Röscheisen 93) or (Utsuro et al. 94).

2.1 Calculation of Word Similarities
We define the similarity of a pair of Japanese and English words as a numerical value between 0 and 1. We use the following two resources for
Figure 1: The flow of translation rules acquisition
obtaining the similarity:
• a machine-readable bilingual dictionary
• a bilingual corpus of Japanese and English
As for the former, we assign the value 1 to the translation pairs appearing in the bilingual dictionary. As for the latter, we use the basic calculation method of similarity proposed by (Kay & Röscheisen 93). Unlike their method, we preprocess the corpus by analyzing it morphologically to obtain the base forms of the words. The similarity of a pair of Japanese and English words is defined by the numbers of their total occurrences and co-occurrences in the corpus. The similarity of a Japanese and English
Figure 2: A result of structural matching (English: "Companies compensate agents."; the best score = 1.55)
word-pair is defined by sim(wJ, wE) = 2 fJE / (fJ + fE), where fJ and fE are the total numbers of occurrences of the Japanese word wJ and the English word wE, and fJE is the total number of co-occurrences of wJ and wE, that is, the number of times they appear in corresponding sentences.

ACQUISITION OF TRANSLATION RULES

2.2 Structural matching of parallel sentences

Corresponding Japanese and English sentences in the parallel corpus are parsed with LFG-like grammars, resulting in feature structures. We do not use any semantic information in the current implementation. When a sentence includes syntactic ambiguity, the result is represented as a disjunctive feature structure. A feature structure is regarded as a directed acyclic graph (DAG). In the subsequent process of structural matching, we use the part of the DAG that relates to content words (such as nouns, verbs, adjectives and adverbs). The resulting DAG represents a (disjunctive) dependency structure of the content words in the sentence.

We start with a pair of dependency graphs of Japanese and English sentences and find the most plausible graph matching between them. We use the word similarities described in the previous section in the matching process. The similarity of word pairs is extended to the similarity of subgraphs in the dependency structures. A sample result of structural matching is shown in Figure 2. The basic definition and algorithm follow (Matsumoto et al. 93), though the similarity measures of words and subgraphs are refined.

When the corresponding subgraphs (nodes in circles pointed to by a bidirectional arrow in Figure 2) consist of single words, the word similarity is used for their similarity. When any of the subgraphs contains more than one content word, we imposed the following criterion: the higher the similarity of a word pair, the finer their corresponding subgraphs should be. This means that mutually very similar words should have an exact match, whereas mutually dissimilar words, when they are matched against each other by the structural constraint, are better included in coarse subgraphs. To achieve this criterion, we defined the following formula for calculating mutual similarity between subgraphs. Let s and t be subgraphs matched against each other, and let Vs and Vt be the sets of content words in s and t. We can assume, without loss of generality, that |Vs| is not greater than |Vt| (Vs and Vt can be switched if this is not the case). Let Dp be the set of pairs of elements from Vs and Vt defined by an injection (one-to-one mapping) p : Vs → Vt:

Dp = {(a, p(a)) | a ∈ Vs}

Then the average similarity of words between Vs and Vt is defined as

AverageSim(s, t) = (1 / |Vs|) Σ (a,b) ∈ Dp sim(a, b)
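The word-pair similarity sim(wJ, wE) = 2 fJE / (fJ + fE) can be checked numerically against Table 1; a minimal sketch (the function name is ours):

```python
# Word-pair similarity from raw occurrence counts: twice the number of
# co-occurrences divided by the summed total occurrences (a Dice-style
# coefficient).  The counts in the check come from the "agent" row of
# Table 1.
def word_sim(f_j, f_e, f_je):
    # f_j / f_e: total occurrences of the Japanese / English word;
    # f_je: number of times they occur in corresponding sentences.
    return 2.0 * f_je / (f_j + f_e)

print(round(word_sim(952, 1004, 915), 6))  # → 0.935583, as in Table 1
```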
To achieve the above criterion, we put a threshold value Th (0 < Th < 1), where a similarity value higher than Th is supposed to indicate that two words are mutually similar. The following formula of similarity between two subgraphs realizes the criterion in that the total similarity is bent toward the threshold value according to the size of the subgraphs: dividing the difference between AverageSim and Th by the size of the subgraphs works as a penalty for graphs that are mutually similar and as a reward for graphs that are mutually dissimilar.
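The subgraph-similarity formula itself did not survive reproduction in this copy. A form consistent with the surrounding description (similarity bent toward Th, the difference divided by a measure of subgraph size; the choice of max(|Vs|, |Vt|) as the size term is our assumption) would be:

```latex
Sim(s,t) \;=\; Th \;+\; \frac{AverageSim(s,t) - Th}{\max(|V_s|,\,|V_t|)}
```

For single-word subgraphs the denominator is 1 and Sim reduces to the word similarity; as the subgraphs grow, Sim is pulled toward Th, which penalizes coarse groupings of mutually similar words and rewards them for mutually dissimilar words, exactly as the criterion requires.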
The branch-and-bound algorithm is employed to search for the graph matching that gives the highest similarity value. Figure 2 shows an example of dependency structures and the result of the structural matching, in which the corresponding pairs are linked by arrows. Here the best score is the total similarity of the most similar graph matching. The threshold is set at 0.15.
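The paper does not spell out the branch-and-bound details; the objective it optimizes can be illustrated with a plain exhaustive search over injections (the function name and the toy similarity table are our own sketch):

```python
from itertools import permutations

def best_matching(vs, vt, sim):
    """Find the injection from vs into vt (|vs| <= |vt|) that maximizes
    the average word similarity, i.e. the AverageSim quantity above.
    The real system prunes this search with branch-and-bound; plain
    enumeration is enough to show the objective."""
    best_pairs, best_score = None, float("-inf")
    for image in permutations(vt, len(vs)):
        pairs = list(zip(vs, image))
        score = sum(sim.get(p, 0.0) for p in pairs) / len(vs)
        if score > best_score:
            best_pairs, best_score = pairs, score
    return best_pairs, best_score

# Toy similarity table for two Japanese and two English content words.
sim = {("j1", "compensate"): 0.9, ("j1", "agent"): 0.1,
       ("j2", "compensate"): 0.2, ("j2", "agent"): 0.8}
pairs, score = best_matching(["j1", "j2"], ["compensate", "agent"], sim)
# pairs = [("j1", "compensate"), ("j2", "agent")], score = 0.85
```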
2.3 Acquisition of translation rules

After accumulating structurally matched translation examples, the acquisition of translation rules is performed in the following steps. We assume a thesaurus for describing the constraints on the applicability of the acquired rules. Suppose we concentrate on a particular word or a particular phrase in the source language graphs that appears as a subgraph in matching graphs. We refer to this subgraph as t.

1. Collect all the matched graphs that contain the same subgraph as t.
2. Extract the graph t and its children together with the corresponding part of the target language tree. Some heuristics are applied in this process: corresponding pairs of pronouns are deleted, and zero personal pronouns in Japanese sentences are recovered.
3. The child elements are generalized using the classes in the thesaurus, which are identified as the condition on the applicability of the rule.

The system acquires two types of translation rules, representing word-level and phrase-level translations. When the top subgraph consists of a single content word, we regard the corresponding subgraphs as giving a word selection rule. On the other hand, when the top subgraph consists of more than one content word, we regard it as a phrasal expression and call it a translation template. Figure 2 shows an example of a phrase-level correspondence, "compensate : ". Since we assume the translation is influenced by the adjacent elements, i.e., the words that directly modify the word in the subgraph, we generalize the information in the collected matches so as to identify the exact contexts in which the translation rule is applicable. From the set of partial graphs that share the same parent nodes, translation rules in the form of feature structures are obtained.

In the experiment described below, we focus on acquiring Japanese-English and English-Japanese translation rules related to verbs, nouns and adjectives.
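The collection-and-generalization steps can be sketched as follows; the toy thesaurus, the example data and the function names are our own illustrative assumptions, not the paper's resources:

```python
from collections import defaultdict

# Toy one-level thesaurus: word -> semantic class.
THESAURUS = {"teacher": "human", "student": "human",
             "book": "object", "letter": "object"}

def acquire_rules(examples):
    """examples: (source_word, target_word, {slot: child_word}) triples
    taken from structurally matched sentence pairs.  Matches sharing the
    same source word are pooled and grouped by target word, their child
    elements collected, and each argument slot generalized to the set of
    thesaurus classes observed there; the classes become the rule's
    applicability condition."""
    grouped = defaultdict(list)
    for source, target, children in examples:
        grouped[(source, target)].append(children)
    rules = {}
    for key, fillers in grouped.items():
        slots = {slot for f in fillers for slot in f}
        rules[key] = {slot: sorted({THESAURUS[f[slot]]
                                    for f in fillers if slot in f})
                      for slot in slots}
    return rules

rules = acquire_rules([
    ("okuru", "send",    {"obj": "letter"}),
    ("okuru", "send",    {"obj": "book"}),
    ("okuru", "see off", {"obj": "teacher"}),
])
```

Two target words for the same source word yield two word selection rules, each conditioned on the generalized class of its object slot.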
3 Experiments of translation rule acquisition

We used Torihiki Jouken Hyougenhou Jiten (a collection of Japanese-English expressions for business contracts, 9,804 sentences) (Ishigami 92)
and EDICT 1994¹ and the Kodansha Japanese-English Dictionary (Shimizu 79) (93,106 words) as the base resources. We also used an electronic version of a Japanese thesaurus (called Bunrui-Goi-Hyo, BGH) (NLRI 94) and Roget's Thesaurus (Roget 11) for specifying the semantic classes. The current system works only with simple declarative sentences.

wE              Similarity   fe     fj     fje
abnormal        1            2      2      2
accessory       0.923077     14     12     12
accountant      0.941176     9      8      8
accumulative    1            2      2      2
accurate        0.769231     5      8      5
address         0.764977     111    106    83
adjudge         1            2      2      2
administrative  0.8          3      2      2
adopt           1            2      2      2
advancement     1            4      4      4
advancement     0.8          4      6      4
afterward       0.8          2      3      2
agent           0.935583     1004   952    915

Table 1: Examples of word similarity

word   sentence   parsing       matching      word-level     phrase-level
…      184        183 (99.5%)   180 (97.8%)   115 (63.9%)    65 (36.1%)
…      254        245 (96.5%)   242 (95.3%)   144 (59.5%)    97 (40.1%)
…      114        103 (90.4%)   99 (86.8%)    68 (68.7%)     31 (31.3%)
…      309        309 (100%)    298 (96.4%)   184 (61.7%)    113 (37.9%)
…      191        191 (100%)    179 (93.7%)   92 (51.4%)     87 (48.6%)
…      127        127 (100%)    116 (91.3%)   27 (23.3%)     88 (75.9%)

Table 2: Statistics of parsing and matching results. (The word column is only partially legible in this copy; the legible entries are make, business and exclusive.)

3.1 Acquisition of translation rules

A total of 948 word pairs of Japanese and English were obtained by the method for calculating word-word similarity between the two languages described in Section 2.1. Some examples of the similarity obtained in the
¹ EDICT 1994 is obtainable through ftp via monu6.cc.monash.edu.au:pub/nihongo
experiment are shown in Table 1. We obtained a number of domain-specific terms about business contracts, such as "agent" and "accountant" with their Japanese counterparts, which are not found in ordinary bilingual dictionaries. Out of the 948 word pairs we obtained, only 236 appear in EDICT or the Kodansha Japanese-English Dictionary. Acquisition of word pairs from domain-specific parallel corpora is very important, since many domain-specific word pairs do not appear in ordinary bilingual dictionaries. However, it should also be noted that repetitive occurrences of the same expression cause a slight error in the similarity of the pairs.

We selected several Japanese and English words of frequent occurrence and collected the structurally matched results. Some of the results for those words are shown in Table 2. For example, out of 184 occurrences of one Japanese verb, 183 sentences were successfully parsed (meaning that the correct parse was included in the possible parses), and 180 sentences succeeded in structural matching; of these, 115 sentences had a top subgraph with a single content word, and 65 sentences had a top subgraph with more than one content word.

To acquire word selection rules, the results are classified into groups according to the translated target words. A word selection rule is acquired for each target word by generalizing the child nouns to the classes in the thesaurus. The word selection rules for this verb are summarized in the upper part of Table 3. For instance, the table specifies that the verb is translated into "give" when its subject is in one of the semantic classes substance, school, store and difference, and its object is in one of the classes difference, unit and so on. Phrasal translation rules are treated in the same way. Such examples are shown in the lower part of Table 3. For instance, the Japanese phrase is translated into "X compensate Y" if X and Y satisfy the semantic constraints described in the table.
3.2 The translation rules

The translation rules described above are converted into the following data structure in our machine translation system:

tr_dict(index, source feature structure, target feature structure, condition).
[Table 3 is only partially legible in this copy. Its upper part gives word selection rules for the English verbs give (58), affect (8), confer (6), furnish (3), render (1), afford (1) and provide (1), listing the admissible thesaurus classes for the nominative (ga), objective (wo) and dative (ni) slots, e.g., [substance], [school], [store], [difference], [unit], [chance], [feeling], [number], [range seat track], [cause], [change], [trade], [propriety], [care] and [harmony]. Its lower part pairs Japanese patterns with English translation templates such as "[1] affect [2]", "[1] compensate [2]", "[1] assent to [2]", "[1] authorize [2]" and "[1] furnish [2] with [3]", where the numbered slots carry semantic-class constraints such as [store], [school] and [substance]. The number of word occurrences is in parentheses; the names of semantic classes in the thesaurus are in square brackets.]
Table 3: Acquired translation rules

index: the index word of the translation rule.
source feature structure: a feature structure of the source language.
target feature structure: a feature structure of the target language.
condition: the semantic condition for the rule, described by a set of semantic classes for the variables appearing in the source feature structure.

In the condition, checksem/2 is a Prolog predicate for checking the semantic classes of the variables (semantic classes are expressed by their class numbers in the thesaurus). Identifying the most suitable semantic classes in the thesaurus is by no means an easy task. In the current implementation, we use the semantic classes at the lowest level of the Japanese thesaurus BGH, which has six layers. This makes the description of the semantic condition a list of lowest-level semantic classes. Therefore, in our current implementation the translation rules compiled from few translation examples are far from
complete. Some of the translation rules in their final form are represented as follows:
[ pred:assent(verb), subj:X, to:Z ],
true ).

[ pred:give(verb), subj:X, obj1:Y, obj2:Z ],
( checksem(X, [11000, 11040, 11600, ...]),
  checksem(Y, [11642, 11910, 13004, ...]),
  checksem(Z, [11000, 11040, 12630, ...]) ) ).

[ pred:reference(noun) ],
true ).
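A Python stand-in for how such a rule's condition could be checked at transfer time (the class numbers are the ones shown above; the harness itself is our assumption, not the system's Prolog code):

```python
def checksem(word_class, allowed):
    # checksem/2: succeed when the variable's semantic class is among
    # the class numbers listed in the rule's condition.
    return word_class in allowed

def rule_applicable(bindings, condition):
    """bindings: variable -> BGH class number of the instantiating word;
    condition: variable -> admissible class numbers.  One checksem call
    per variable, conjoined as in the Prolog condition."""
    return all(checksem(bindings[var], allowed)
               for var, allowed in condition.items())

# Condition of the "give" rule above (class lists truncated as printed).
give_condition = {"X": [11000, 11040, 11600],
                  "Y": [11642, 11910, 13004],
                  "Z": [11000, 11040, 12630]}

ok = rule_applicable({"X": 11040, "Y": 11910, "Z": 12630}, give_condition)
```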
4 Discussion and Related Works

Our machine translation system based on the acquired translation rules has the following characteristics. The system uniformly deals with word selection rules (such as those for "confer") and phrasal translation rules (such as "X compensate Y"). Even if there is no translation rule to apply, the system uses the bilingual dictionary as the default: translation pairs in the dictionary are regarded as word selection rules with no condition.

Since all the translation rules are acquired from translation examples, manual compilation of translation rules is kept minimal. Also, since the structural matching results used to obtain the translation rules are symmetric, both English-Japanese and Japanese-English translation rules are acquired, making two-way translation possible.

Another important characteristic is that ambiguities (ambiguous translations caused by multiple applicable translation rules, and ambiguous structural analyses) are resolved by giving priority to the translation rules with more specific information. The frequency information of translation pairs is also used for deciding the priority among the translation options.

The parsing and generation phases share the grammars and dictionaries that are used in the acquisition phase of the translation rules. This ensures that there is no contradiction among the parsing, generation and translation rules.
On the other hand, the following issues should be considered. The quality of the translation rules depends on the quality of the thesaurus. Some inadmissible word selection and phrasal rules were acquired in the experiment. For example, a word selection rule pairing a Japanese pattern "X[human] Y[problem] …" (where the verb means advocate) with "make Y[problem] to X[human]" was acquired, which is not a good translation rule. Rather, "make an objection to X[human]" should be considered an appropriate idiomatic expression. Idiomatic expressions like this should be distinguished from normal word selection rules.

The proposed method is suited to formal domains. An experiment with colloquial expressions revealed many more difficulties in acquiring "good" translation rules. Moreover, the current method cannot cope with expressions that necessitate contextual information.

The method should also be augmented to deal with complex sentences. We do not think that a direct extension of the structural matching algorithm is applicable to complex sentences. Some two-level technique should be developed: the first level finds an appropriate decomposition of the complex sentences, and the proposed structural matching is applied at the second level.

A similar work for acquiring translation rules from parallel corpora is discussed in (Kaji 92), in which a bottom-up method is used for finding corresponding phrases (i.e., partial parse trees). We use dependency structures, which, we think, is a critical point, since word order is not normally preserved between Japanese and English sentences, while dependency between content words is preserved in most cases. (Watanabe 93) proposed a method of using matched pairs of dependency structures of Japanese and English sentences for improving translation rules. The algorithm for finding the structural correspondence is different from ours. Our method uses a finer similarity measure that is learned from the parallel corpus.
As for translation rule acquisition, their objective is to improve existing transfer rules, whereas our objective is to compile the whole set of translation rules altogether.

5 Conclusions
The translation rules obtained by the proposed method can be integrated into an existing machine translation system. Generally, translation may differ depending on the domain. Our system is easily adapted to any domain provided that sizable parallel corpora of that domain are accumulated.
To improve the acquired translation rules both in quality and quantity, we need to enlarge the scale of the parallel corpora. Another possible way to improve the translation rules is to feed the post-edited translation results back to the acquisition phase. By doing this, missing translation rules are gradually acquired.

REFERENCES

Ishigami, Susumu. 1992. Torihiki Jouken Hyougenhou Jiten. Tokyo: International Enterprise Development Co.

Kaji, Hiroyuki, Y. Kida & Y. Morimoto. 1992. "Learning Translation Templates from Bilingual Text". Proceedings of the 14th International Conference on Computational Linguistics (COLING-92), vol. II, 672-678. Nantes, France.

Kay, Martin & M. Röscheisen. 1993. "Text-Translation Alignment". Computational Linguistics 19:1.121-142.

Matsumoto, Yuji, H. Ishimoto & T. Utsuro. 1993. "Structural Matching of Parallel Texts". Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics (ACL'93), 23-30. Columbus, Ohio.

National Language Research Institute. 1994. Bunrui-Goi-Hyo [Word List by Semantic Principles]. Tokyo: Syuei Syuppan.

Roget, Peter M. 1911. Roget's Thesaurus. New York: Crowell.

Sato, Satoshi & M. Nagao. 1990. "Toward Memory-Based Translation". Proceedings of the 13th International Conference on Computational Linguistics (COLING-90), vol. III, 247-252. Helsinki, Finland.

Shieber, Stuart M., G. van Noord, R.C. Moore & F.C.N. Pereira. 1990. "A Semantic Head-Driven Generation Algorithm for Unification-Based Formalisms". Computational Linguistics 16:1.30-42.

Shimizu, Mamoru & N. Narita. 1979. Japanese-English Dictionary. Tokyo: Kodansha Co.

Sumita, Eiichiro & H. Iida. 1991. "Experiments and Prospects of Example-Based Machine Translation". Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics (ACL'91), 185-192. Berkeley, California.

Utsuro, Takehito, H. Ikeda, M. Yamane, Y. Matsumoto & M. Nagao. 1994. "Bilingual Text Matching Using Bilingual Dictionary and Statistics". Proceedings of the 15th International Conference on Computational Linguistics (COLING-94), vol. II, 1076-1082. Kyoto, Japan.

Watanabe, Hideo. 1993. "A Method for Extracting Translation Patterns from Translation Examples". Proceedings of the 5th International Conference on Theoretical and Methodological Issues in Machine Translation (TMI-93), 292-301. Kyoto, Japan.
Clause Recognition in the Framework of Alignment

HARRIS V. PAPAGEORGIOU
Institute for Language and Speech Processing (ILSP)
National Technical University of Athens (NTUA)

Abstract

In this paper we explore the possibility of achieving reliable clause identification of unrestricted text by using POS information (the output of a unification rule-based part-of-speech tagger), a CMTAG module trying mainly to fix errors from earlier processing, and a linguistic rule-based parser. Identification of simple and complex clauses is considered here as a basic component in the framework of bilingual alignment of parallel texts. One of the important points of this work is the ability to process very long sentences. The parser is capable of analysing and labelling clause structure. The system is applied to an experimental corpus. The results we have obtained are very promising.

1 Introduction
Recent research in bilingual alignment has explored mainly statistical methods. While these aligners achieve surprisingly high accuracy when performing at the sentence level (Brown 1991; Kay 1988; Gale 1991; Chen 1993), it remains an open issue how to generalise these techniques for alignment of phrases at the subsentence level because of the inherent assumptions of these methods. A number of recent proposals for the identification of subsentential translations have been developed that tackle the problem at different levels (Utsuro 1994; Kupiec 1993; Kaji 1992; Grishman 1994; Dagan 1993; Church 1993; Matsumoto 1993; Smadja 1992). However, the detection of clausal, embedded translations in bilingual parallel corpora remains a difficult problem, due to the fact that there is considerable divergence from the desirable one-to-one clause correspondence (Santos 1994). Papageorgiou (1994) describes a generic alignment scheme which is based on the principle that semantic content and discourse function are preserved by translation. At the sentence level, the aligner obtained performance comparable to that of statistical aligners.

In this paper we are mainly concerned with the first basic step in clause alignment, that is, clause recognition. Even though identification of simple and complex clauses is considered here as a basic component in the framework
of bilingual alignment of parallel texts, some points of contact with partial parsing will also be made.

2 Previous work
Automatic detection of clause boundaries is a prerequisite for clause alignment. It is also a major issue in parsing. According to Koskenniemi (1990): "Clause boundaries are easier to determine if we have the correct readings of words available". And conversely, it is more convenient to write constraint rules for disambiguation and head-modifier relations if one can assume that the clause boundaries are already there (Koskenniemi 1992).

Two different approaches have been extensively recorded in the literature: regular expression methods and stochastic methods. The former use regular expression grammars and constraints expanded into a deterministic FSA for clause recognition. Ejerhed (1988) uses a regular expression method, and her system:

• looked for noun phrases in a preliminary stage;
• concentrated on certain characteristics present in the beginnings of clauses; and
• assumed that the recognition of any beginning of a clause automatically leads to the syntactic closure of the previous clause.

Another assumption often used by researchers involved the construction of the grammar: the text was expected to be fully and correctly disambiguated (Ejerhed 1988; Coniam 1991). Several errors confusing VBD (past tense) and VBN (past participle) as well as IN (preposition) and CS (subordinating conjunction), which were made during the tagging process, led to incorrect recognition of clauses in many cases.

A systematic failure of the system described by Ejerhed (1988) was its incapability to capture clauses beginning with a CC (coordinating conjunction) followed by a tensed verb, as in the example:

[The Purchasing Departments are well operated and follow generally accepted practices].

The same also applies to cases where a preposition is followed by a wh-word, i.e., before (WDT WPO WP$ WRB WQL), as in the example:

[The City Executive Committee deserves praise for the manner in] [which the election was conducted].
In (Koskenniemi 1992), constraint rules are hand-coded, specifying what kinds of clues must be present in order to insert a clause boundary. In the case of
non-finite constructions, a non-finite verbal skeleton is constructed, starting with certain kinds of non-finite verbs and ending with the first main verb to the right. A distinction is made between finite and non-finite clause constructions by using different tags. This approach increases the amount of ambiguity and burdens the disambiguation process. Another feature of the system is that a first level of centre embedding has been taken into consideration, as in the example:

@@The man ... @( who came first @ ) got the job @@.
Technical constraints on feasible clause bracketing have not allowed a second or third level of embedded clauses.

As for the stochastic approach, training material is needed in order to fine-tune the parameters of the model (Ejerhed 1988; Ramshaw 1995). In (Ejerhed 1988), the training material included markers for the beginning and end of clauses. The system was also trained to recognise tensed verbs. The results recorded were surprisingly good. However, a comparison of the nature of the errors in a sample of a regular expression approach and a sample of the stochastic recogniser revealed that while the finitary approach errors are systematically due to under-recognising clause boundaries, the stochastic program errors are due both to over-recognising and under-recognising clause boundaries. These qualitative results give preference to finitary methods, since under-recognising clauses is not actually a problem, given that they can be easily recovered using simple Dynamic Programming techniques. On the other hand, overgeneration coupled with errors by the stochastic module (due to wrong predictions of clause openings and/or closings) makes it more difficult for the alignment algorithm to reconstruct the clause structure of the sentence.

3 The model
The methodology adopted here is surface-oriented and stepwise. An inspiration has been the so-called CASS (Cascaded Analysis of Syntactic Structure) described in (Abney 1990). The goal is to recover syntactic information efficiently and reliably, by sacrificing completeness and depth of analysis. The full framework of automatic clause recognition is depicted in Figure 1.

The preprocessing module is a single deterministic FSA which partitions input streams into tokens. In the context of alignment, sentence and word boundaries as well as numbers, dates, abbreviations, paragraph boundaries and various sorts of punctuation are extracted. The rules of the text grammar were designed to capture the introduction of text sentences and also to
Fig. 1: The model architecture.
define text adjunct formulations (Nunberg 1990). The tagging analysis is done by the well-known transformation-based tagger (Brill 1993a; Brill 1993b; Brill 1994). The initial state was a trigram tagger tuned and trained on a small portion of a (different but of the same type) pre-tagged corpus. Training the contextual-rule tagger was done on a small amount of text, about 70,000 words, from the CELEX database (the computerised documentation system on Community Law).

CMTAG (Clause Marker TAGging) is an essential step supplementing the tagging by reducing ambiguities concerning possible clause markers and by enriching the annotation of the text with information about certain types of clauses. Its role is:

• to extend parts of speech over more than one orthographic word (quite similar to the IDIOMTAG module of the CLAWS tagger). This is done only for compound subordinators such as so as/that, in order to/for/that, as if, and for complex prepositions such as according to, by means of, due to.

• to discriminate a non-finite verbal skeleton from a finite construction: for this purpose we insert tags for cases like: to/TO the/DT Treaty/NNP establishing/VBG-F the/DT European/JJ ...
Examples of finite verbal constructions are:
— They [ are going to adopt ] ...
— They [ would not have been investigating ] this ...
— I [ would like to make ] some questions ...
In the last example, we include "to make" in the finite verb chain, an interpretation that will be validated by the clause alignment algorithm. Examples of non-finite verbal constructions are:
— the distillation [ indicated ] in Article 39 thereof is decided on ...
— [ Given ] the improvements in market conditions ...

• to reduce ambiguity by imposing constraints on the tagged corpus. Here we try to fix errors of mis-analysed complementisers by inserting clause markers before possible conjunctions: for example, if there is a verbal construction before the next candidate conjunction or punctuation and after a possible subordinator, as in the following case:
... voluntary distillation as/RB provided for in Articles 38, ...
"as", which was tagged RB (adverb), is converted to as/CS.

• to label certain types of clauses depending on clause openers. It is a two-level module distinguishing adverbial, relative, non-finite and coordinate clauses as in (Quirk 1985). At a second stage, we predict a subcategorisation of adverbial clauses into eight types, following (Collins 1992). This information will be exploited, if it is worthwhile, by the clause alignment algorithm; it does not affect the subsequent syntactic processing.

Finally, the proposed grammar constructs the clause analysis for input sentences. First, we introduce a few definitions: subord for the set of complementisers and punct for the set of punctuation marks:

• subord = (CS | WDT | WRB | WP | WP$)
• punct = (- | , | : | . | ; | ")

The syntactic analysis consists of a set of rules trying to match the input against the rule patterns. The first pattern defines complete subordinate clauses as consisting of an optional coordinating conjunction (CC), followed by an obligatory subordinating conjunction (subord), followed by optional nominal elements, followed by an obligatory verbal construction (as defined in the CMTAG description), followed by optional nominal elements excluding anything listed in the subord or punct sets above or a new verbal construction.
This rule pattern is expressed as:

(CC)? {subord} ({nomelements} | {punct})* (fvskeleton) {nomelements}*

The second pattern defines non-finite clauses, starting with a non-finite verbal skeleton followed by optional nominal elements, again excluding anything listed in the subord/punct sets above or a new verbal construction. The expression representing this pattern is:

(nfvskeleton) {nomelements}*

The third pattern defines coordinate clauses introduced by an obligatory coordinating conjunction (CC), followed by an optional adverb, followed by a verbal skeleton, followed by the same ending as in the previous cases.
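These clause patterns are ordinary regular expressions over token classes, so they can be sketched directly. The single-letter encoding below is our own illustration (C = CC, S = subordinator, F = finite and N = non-finite verbal skeleton, n = nominal element, P = punctuation mark, a = adverb), and the coordinate pattern is written from the description above:

```python
import re

SUBORDINATE = re.compile(r"C?S[nP]*Fn*")  # (CC)? subord (nom|punct)* fvskeleton nom*
NONFINITE = re.compile(r"Nn*")            # nfvskeleton nom*
COORDINATE = re.compile(r"Ca*[FN]n*")     # CC adv* (fv|nfv)skeleton nom*

def clause_type(tags):
    """Classify a clause candidate given as a string of class letters."""
    for name, pattern in (("subordinate", SUBORDINATE),
                          ("non-finite", NONFINITE),
                          ("coordinate", COORDINATE)):
        if pattern.fullmatch(tags):
            return name
    return None

# "for-which each producer may submit one or more delivery contract
# declarations" is roughly S n F n n in this encoding:
print(clause_type("SnFnn"))  # → subordinate
```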
(CC) (adv)* ((fvskeleton) | (nfvskeleton)) {nomelements}*

There are six other rule patterns capturing basic clause fragments (verb phrase fragments, noun phrase fragments and adjuncts) and trying to identify the role of noun phrase fragments (SUBJ, OBJ, ...). Finally, the third and last part consists of grammar rules and actions trying to construct clauses (main clauses and embedded clauses) from the non-terminals identified by the second part. For example, a simple rule is that a main clause can be a noun phrase fragment followed by a sequence of clause(s) (identified by the second part) and a verb phrase fragment, as in:

[ the total quantity of table wine ] [ for-which each producer may submit one or more delivery contract declarations for approval by the intervention agency ] [ should be limited ] [ to an appropriate percentage of the quantity of table wine ].

where the result is two clauses:

[ the total quantity of table wine should be limited to an appropriate percentage of the quantity of table wine ] [ for-which each producer may submit one or more delivery contract declarations for approval by the intervention agency ].

Regulation   Sentences   Clauses   Identified Clauses   Wrong-place
R0086        22          46        44                   1
R0104        11          23        23                   0
R0746        68          134       127                  8
R1111        30          40        39                   1
R1117        24          36        35                   0
R1120        33          54        54                   2
R1369        25          41        39                   1
R1425        39          64        63                   1
R1486        25          45        43                   2
R1516        38          81        79                   9
TOTAL        315         562       548                  25

Table 1: Test Samples after syntactic analysis

4 Results
The proposed model was applied to a test suite of ten regulations of the CELEX database. (Table 1 shows the results obtained only for the English
corpus, though the experiment was done for the English-Greek language pair set of sentences.) The total success rate of the current system is about 93%: out of 562 clauses the system identified 548, and in 25 of these cases a clause marker was placed in the wrong position, giving (548 - 25)/562 ≈ 93%.

Under-recognition errors made by the parser were due to inherited errors made by the tagger (confusing NN and VB) that propagated to the subsequent modules. The proper way to deal with these cases is probably to establish an NP filter to correct the most common errors (similar to that in (Abney 1996)). Wrong-place errors are mainly due to the incapability of the system to identify correctly the clause openings in coordinate constructions where a CC is followed by a noun phrase, followed by a subordinate clause, followed by a verb phrase fragment. Differences between success rates in different regulations are partly explained by the structural complexity of the very long sentences that are characteristic of the samples.

5 Conclusions
We have introduced a method for recognising clauses for subsentence alignment purposes. The method is robust with respect to wrong-place errors and over-recognising errors (about 4% in total) if we ignore under-recognition errors, which might be handled by the alignment algorithm. Some improvements might be achieved by inserting 'repair' filters before parsing. Work on the alignment algorithm is currently being carried out.

Acknowledgements. I want to greatly thank Stelios Piperidis for his extensive work on the Greek compound conjunctions. Special credit is due to Penny Lambropoulou for commenting on the work.

REFERENCES

Abney, Steven. 1990. "Rapid Incremental Parsing with Repair". Proceedings of the 6th New OED Conference, 1-9. University of Waterloo.

Abney, Steven. 1996. "Part-of-Speech Tagging and Partial Parsing". To appear in Corpus-Based Methods in Language and Speech ed. by Ken Church, S. Young & G. Bloothooft. Dordrecht: Kluwer.

Brown, Peter F., J.C. Lai & R. Mercer. 1991. "Aligning Sentences in Parallel Corpora". Proceedings of the 29th Annual Conference of the Association for Computational Linguistics, 169-176. Berkeley, Calif.
Brill, Eric. 1993a. "Automatic Grammar Induction and Parsing Free Text: A Transformation-based Approach". Proceedings of the DARPA Speech and Natural Language Workshop, 237-242.

Brill, Eric. 1993b. "A Corpus-Based Approach to Language Learning". PhD thesis. Philadelphia: University of Pennsylvania.

Brill, Eric. 1994. "Some Advances in Transformation-Based Part-of-Speech Tagging". Proceedings of the 12th National Conference on Artificial Intelligence (AAAI-94), 722-727.

Church, Ken Ward. 1993. "Char_align: A Program for Aligning Parallel Texts at the Character Level". Proceedings of the 31st Annual Conference of the Association for Computational Linguistics, 1-8. Columbus, Ohio.

Coniam, David. 1991. "Boundary Marker: System Description". NERC report.

Chen, Stanley. 1993. "Aligning Sentences in Bilingual Corpora Using Lexical Information". Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, 9-16. Columbus, Ohio.

Collins Cobuild. 1992. Collins Cobuild English Grammar. London: HarperCollins.

Dagan, Ido, K. Church & W. Gale. 1993. "Robust Bilingual Word Alignment for Machine-Aided Translation". Proceedings of the Workshop on Very Large Corpora, 1-8. Columbus, Ohio.

Ejerhed, Eva. 1988. "Finding Clauses in Unrestricted Text by Finitary and Stochastic Methods". Proceedings of the 2nd Conference on Applied Natural Language Processing, 219-227. Austin, Texas.

Gale, William A. & Ken Church. 1991. "A Program for Aligning Sentences in Bilingual Corpora". Proceedings of the 29th Annual Conference of the Association for Computational Linguistics, vol. 2, 177-184. Berkeley, Calif.

Grishman, Ralph. 1994. "Iterative Alignment of Syntactic Structures for a Bilingual Corpus". Proceedings of the Second Annual Workshop on Very Large Corpora, 57-68. Kyoto, Japan.

Kaji, H., Y. Kida & Yasutsugu Morimoto. 1992. "Learning Translation Templates from Bilingual Text". Proceedings of the 14th International Conference on Computational Linguistics (COLING-92), 672-678. Nantes, France.
Kay, Martin & M. Röscheisen. 1993. "Text Translation Alignment". Computational Linguistics 19:1.121-142.
Kupiec, Julian. 1993. "An Algorithm for Finding Noun Phrase Correspondences in Bilingual Corpora". Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, 17-22. Columbus, Ohio.
Koskenniemi, Kimmo. 1990. "Finite State Parsing and Disambiguation". Proceedings of the 13th International Conference on Computational Linguistics (COLING-90), vol. 2, 229-232. Helsinki, Finland.
CLAUSE RECOGNITION
Koskenniemi, Kimmo, P. Tapanainen & A. Voutilainen. 1992. "Compiling and Using Finite State Syntactic Rules". Proceedings of the 14th International Conference on Computational Linguistics (COLING-92), 156-162. Nantes, France.
Matsumoto, Yuji, H. Ishimoto, T. Utsuro & Makoto Nagao. 1993. "Structural Matching of Parallel Texts". Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, 23-30. Columbus, Ohio.
Nunberg, Geoffrey. 1990. The Linguistics of Punctuation. CSLI Lecture Notes no. 18. Stanford: Center for the Study of Language and Information.
Papageorgiou, Harris, L. Cranias & S. Piperidis. 1994. "Automatic Alignment in Parallel Corpora". Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, 334-336. Las Cruces, New Mexico.
Quirk, Randolph, S. Greenbaum, G. Leech & J. Svartvik. 1985. A Comprehensive Grammar of the English Language. London: Longman.
Ramshaw, Lance A. & M. P. Marcus. 1995. "Text Chunking using Transformation-Based Learning". Third Workshop on Very Large Corpora. Cambridge, Mass.
Smadja, Frank. 1992. "How to Compile a Bilingual Collocational Lexicon Automatically". AAAI-92 Workshop on Statistically-Based NLP Techniques, 65-71. San Jose, Calif.
Santos, Diana. 1994. "Bilingual Alignment and Tense". Proceedings of the Second Annual Workshop on Very Large Corpora, 129-143. Kyoto, Japan.
Utsuro, Takehito, H. Ikeda, M. Yamane, Y. Matsumoto & Makoto Nagao. 1994. "Bilingual Text Matching using Bilingual Dictionary and Statistics". Proceedings of the 15th International Conference on Computational Linguistics (COLING-94), 1076-1082. Kyoto, Japan.
Bilingual Vocabulary Estimation from Noisy Parallel Corpora Using Variable Bag Estimation

DANIEL B. JONES & HAROLD SOMERS
UMIST, Manchester

Abstract

This paper describes a fully automatic bilingual lexicon extraction process which can be used for estimating the bilingual vocabulary of two languages from the analysis of raw and noisy parallel corpora. No sentence alignment or other pre-processing is required. The result is a set of possible translations for each source language word. Each translation is assigned a probability based on the distributional nature of the source word in relation to the target word across the parallel corpus.
1 Introduction
Extracting useful information about language behaviour from corpora is of great interest in theoretical terms as it calls into question the exact role of linguistics in Language Engineering. If the process of information extraction is also a fully automatic one, i.e., it requires no human intervention either at an initial stage (for example, tagging words with their grammatical parts of speech), or during the process's execution time, then the information extraction mechanism is of additional practical importance as it can be applied to texts under a wide variety of circumstances and for a wide variety of needs. This paper describes a fully automatic information extraction process which can be used for estimating the bilingual vocabulary of two languages from the analysis of raw and noisy parallel corpora. The vocabulary is estimated in the sense that a word in the target language text is said to be a translation of a word in the source language text given an estimation of their distribution within the parallel corpus. The corpus itself has to be presented in parallel, i.e., the source language corpus has been translated into the target language and both versions are available to the process. The approach described here does not require the parallel corpus to be 'clean': no pre-editing of the text as a whole or pre-alignment of sentences is necessary.
2 Related work
Similar work in this area has been carried out by a variety of researchers (Catizone et al. 1989; Kay & Röscheisen 1993; Gale & Church 1991; Fung & Church 1994; Fung & McKeown 1994; Jones & Alexa 1994). Most (if not all) of these approaches have been developed in order to bootstrap NLP (and in particular Machine Translation) systems with lexical and phrasal alignment information from parallel corpus material. The main characteristic of these approaches is that, to some degree or other, they require no linguistic description or pre-processing, as they rely on distributional-statistical data of word occurrence. When pre-processing is used, it is itself an automatic process. Gale and Church, for example, require sentences in parallel corpora to be aligned before vocabulary estimation can be achieved to a reasonable degree of accuracy; the sentence alignment is done automatically. It is appealing, though, when pre-processing is not used at all. One advantage is that the overall run-time of an alignment process will be much faster. Given the computationally expensive nature of this type of statistical processing, any saving in processing time can be very advantageous. Fung & Church propose a method called K-vec which can be used for estimating bilingual vocabulary from 'noisy' corpora, i.e., parallel bilingual corpora which have been neither pre-edited nor sententially aligned. We find the approach appealing in particular for the reason stated above and also for its inherent simplicity. The following section briefly describes K-vec and an extension of it which we call Variable Bag Estimation (VBE).

3 Methodology
3.1 General estimation of distribution
The method of alignment used by VBE is quite simple in principle. Firstly, positional estimation of possible translation alignments is carried out using the same process proposed by Fung & Church (1994). Briefly, this involves dividing the source and target language corpora into portions.1 Once this is done, the presence or absence of a word in each portion is noted. For example, if we are considering the possible alignment of a source word SWi

1 Fung & Church suggest the square root of the length of the corpus (counted in words) as a suitable value for the portion size; the number of portions, K, must be the same for both corpora.
with a target word TWj, the distribution of SWi and TWj over the corresponding portions of the source and target corpora is calculated. The result is a binary vector of length K for each of SWi and TWj, e.g.,

Vi = [1, 0, 1, 1, ...]   (1)
Vj = [1, 1, 0, 1, ...]   (2)
The likelihood of SWi and TWj being in a translation relation is then based on a comparison of the two vectors Vi and Vj. Once the vectors have been computed, 2 × 2 contingency matrices are calculated for the pair of vectors showing the number of portions which contain (a) both SWi and TWj, (b) SWi but not TWj, (c) TWj but not SWi and (d) neither SWi nor TWj. The word pairing is then assigned a mutual information score and a significance score from the values in the contingency matrix. The mutual information score I is based on co-occurrence probabilities, and is given by:

I = log2 ( P(SWi, TWj) / (P(SWi) P(TWj)) )   (3)

where

P(SWi, TWj) = a / K   (4)

P(SWi) = (a + b) / K   (5)

and

P(TWj) = (a + c) / K   (6)

The significance score is given by:

t = (P(SWi, TWj) − P(SWi) P(TWj)) / √(P(SWi, TWj) / K)   (7)

We tested Fung & Church's algorithm on an English-French parallel corpus.2 By way of example, estimations for the translation of the English word years are given in Table 1. The table is numerically sorted on the I column

2 The corpora contained 23136 and 24377 words respectively, and were taken from the ACL-European Corpus Initiative CD-ROM. The material was that of the announcement text of the European Community's Esprit programme.
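As a rough illustration, the contingency-based scoring just described can be sketched as follows. This is our own reading of the Fung & Church formulation, not their implementation; the function name and conventions are ours.

```python
import math

def kvec_scores(src_vec, tgt_vec):
    """Score a candidate word pair from binary occurrence vectors.

    src_vec[k] / tgt_vec[k] are 1 if the source / target word occurs in
    portion k, else 0.  Returns (I, t) following equations (3)-(7).
    """
    K = len(src_vec)
    assert K == len(tgt_vec), "both corpora must use the same number of portions"
    # 2 x 2 contingency counts over the K portions
    a = sum(1 for s, t in zip(src_vec, tgt_vec) if s and t)       # both
    b = sum(1 for s, t in zip(src_vec, tgt_vec) if s and not t)   # SW only
    c = sum(1 for s, t in zip(src_vec, tgt_vec) if not s and t)   # TW only
    p_joint = a / K             # P(SWi, TWj)
    p_src = (a + b) / K         # P(SWi)
    p_tgt = (a + c) / K         # P(TWj)
    if p_joint == 0 or p_src == 0 or p_tgt == 0:
        return float("-inf"), float("-inf")
    I = math.log2(p_joint / (p_src * p_tgt))                # mutual information
    t = (p_joint - p_src * p_tgt) / math.sqrt(p_joint / K)  # significance
    return I, t
```

Applied to the example vectors (1) and (2) above, a = 2, b = c = 1, and both scores come out slightly negative, since the words co-occur a little less often than chance would predict.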
TW              I            t
système         0.596585     0.957939
d'un            0.585938     1.05552
secteurs        0.573865     0.868297
tout            0.573865     0.868297
terme           0.521397     0.90991
marché          0.477003     0.689608
etats           0.458388     0.720176
entre           0.425473     0.807663
mais            0.40394      0.646115
long            0.351472     0.529619
comme           0.251937     0.423933
données         0.235995     0.36963
nécessaire      0.213969     0.308215
on              0.204631     0.349873
activités       0.142019     0.265165
doit            0.0295442    0.0453256
années          -0.0270394   -0.0423042
travail         -0.040845    -0.0574324
gestion         -0.0814871   -0.129934
ressources      -0.107959    -0.155405
ainsi           -0.184581    -0.305193
communautaire   -0.192848    -0.350321
conseil         -0.23349     -0.304279
niveau          -0.23349     -0.392823
secteur         -0.23349     -0.351351
%               -0.280796    -0.480452
sera            -0.292384    -0.449324
cours           -0.311493    -0.417409
projet          -0.311493    -0.417409
esprit          -0.370994    -0.655712
techniques      -0.385493    -0.750294
tous            -0.414062    -0.743342
leurs           -0.455883    -0.643668
peut            -0.455883    -0.743243
produits        -0.455883    -0.743243
ont             -0.555418    -0.939189
2               -0.602724    -1.03716
base            -0.648528    -0.983056
elle            -0.921546    -1.5487
phase           -1.04085     -1.49544
seront          -1.23349     -1.9111
traitement      -1.29238     -2.04965
objectifs       -1.69292     -3.15809

Table 1: Estimates for possible translations of the English word years
with the largest values first, as large I scores are 'better' than low scores. It can be seen that the best estimate of a translation relation lies with système, whereas the correct translation (années) is ranked only 17th. This is evidently not a good result, and certainly not as positive as the results hinted at by Fung & Church. Yet results like these are typical of what we have obtained in several experiments with this algorithm, with a variety of corpora and language pairs including German and Japanese. Why could this be? One possible explanation is that our corpora are too small. Although the algorithm itself is independent of corpus length, it is possible that with too small a corpus the distributions of the words are too similar. It should be noted too that Fung & McKeown (1994:82) also report the poor performance of the K-vec algorithm with Japanese-English and Chinese-English parallel corpora:

K-vec segments two texts into equal parts and only compares the words which happen to fall in the same segments. This assumes a linearity in the two texts [...]. The occurrence of inserted or deleted paragraphs is another problem which leads to nonlinearity of parallel corpora.

In fact, it does not need a whole paragraph to skew the corpus: just an extra sentence near the beginning of the text can mean that many of the words you would expect to occur in portion i actually occur in portion i + 1, as we discovered with a manual inspection of the corpus. Fung & McKeown (1994) tried to overcome this weakness by proposing a new algorithm, DK-vec, which compares 'recency vectors' for word pairs, comparing the amount of text between each occurrence of the word, the idea being that each such vector will have a distinctive trace, rather like a speech signal, so that techniques developed for matching such signals can be used.

3.2 Variable bag estimation
Our own approach similarly attempts to capture the generalisation that words which are translations of each other will appear at roughly the same equivalent places in the text. Figure 1 depicts this in a graphic, though simplistic, way. In the case shown in Figure 1, there are three instances of bilingual and three instances of bilingue. As they also appear in the same portions of the text (and nowhere else), it is logical to regard them as probable translations of one another. In order to simplify matters we can imagine the system assuming that its highest-scoring estimation of a translation of SW is the TW which
This paper describes a fully automatic [bilingual] lexicon extraction process which can be used for estimating the [bilingual] vocabulary of two languages from the analysis of raw and noisy parallel corpora. No sentence alignment or other pre-processing is required. The result is a set of possible translations for each source language word. Each translation is assigned a probability based on the distributional nature of the source word in relation to the target word across the [bilingual] corpus.

Cet article décrit un processus d'extraction de lexique [bilingue] entièrement automatique qui peut servir à deviner le vocabulaire [bilingue] d'après l'analyse des corpus parallèles bruts et bruyants sans appariement des phrases ou autre pré-édition. Le processus donne un ensemble de traductions possibles pour chaque mot de la langue source. Avec chaque traduction est associée une probabilité laquelle est basée sur la nature distributionnelle du mot source par rapport au mot cible et ce, à travers le corpus [bilingue].

Fig. 1: Graphic depiction of source and target word alignment

has either the highest I and/or t score. This assumption is based on the fact that those portions which contain TW are physically in or near the same position as SW is found in the source language version of the corpus. However, even in 'well behaved' corpora this is very often not the case, and in noisy texts which have not been pre-processed to remove or canonicalise white space, punctuation, markup, etc., the position is made more difficult and less accurate than it might otherwise be. The results in Table 1 show that although the correct translation is often given a relatively high score, it may well not be sufficient to distinguish it as the alignment to select over any of the others. Given this state of affairs, is there anything that can be done to improve the performance of the process? The problem is the generality of the portions into which the corpus is divided. In particular, the size of the portion is arbitrary, and, in addition, as mentioned above, the quality of word alignment depends on the degree of 'skewing' exhibited by a parallel corpus, i.e., the degree to which the word positions of the texts are offset by intermediate material. However, it is logical that these factors can be alleviated if the portions are not fixed but variable in size and therefore content.
A direct method of incorporating this approach into Fung & Church's work would be simply to start with a portion size based on splitting the corpus into K chunks as before, iteratively increasing or decreasing the size of the portions, and examining the relative performance of each value of K. It would be interesting to see, for example, whether or not some values of K produced better results than others with respect to certain language pairs and/or sublanguages. However, it is not clear how one could use such a system in a fully automatic bootstrapping process, as the value of K would not be self-determining but a matter for ongoing empirical study. A more practical implementation of a variable portion size process involves creating a minimal-sized portion of one word and gradually increasing its size until the significant matching target word appears. Intuitively, what happens is this: assuming the 'bags' are centred at roughly the right places, but assuming also that the texts are not perfectly aligned word-by-word, as the bags grow in size, more and more words are 'sucked' into them. Remember that there are several bags spread throughout the text. At first, the bags will apparently contain random words, some of them occurring in several of the bags, but, significantly, occurring just as often outside the bags. At some point, amongst a lot of rubbish, crucially the word we are looking for — the translation of the source word — will be found in nearly all the bags, and hardly anywhere else. This is when the process stops. The advantage of this is that the termination of the process is not arbitrary but is based on determining which words in the local context of the initiating points are unique to that context and not the rest of the corpus. The crucial question is: Where are the initiating points? The simplest (and most naïve) method of finding these points would be to use the same locations in the target text as the source words.
In other words, if the source word under consideration occurred as the 11th, 50th, and 200th word, the initiating points in the target corpus would be the same (or rather, the equivalent taking into account the relative lengths of the two corpora). However, experiments have shown that this approach is too crude and does not provide very good results. A much better approach is to estimate the initiating points in the target text from the information used in K-vec's general estimate of distribution. If we take each entry in Table 1 in turn, the corresponding portions in which each word occurs can be used as anchor points from which the initiating points can be estimated more accurately. An initiating point IP can be determined from:

IP = ((kj − 1) × K') + (K' × IPe)   (8)
where kj represents the portion which contains the target word, K' is the number of words per portion (roughly equal to K, since K is the square root of the length of the corpus), and IPe is an offset factor for estimating the initiating point for that portion. A neutral value for this would be 0.5, which would place IP at the mid-point of the portion containing the target word.3 For example, if techniques in Table 1 occurred in (amongst others) portion number 23 and there were 150 words per portion, the IP (assuming IPe = 0.5) for the target text with respect to an alignment for the source word years would be:

3375 = ((23 − 1) × 150) + (150 × 0.5)   (9)
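Equation (8) is straightforward to compute; a minimal sketch follows (the function name is ours, not part of the original system):

```python
def initiating_point(k_j, k_prime, ip_e=0.5):
    """Equation (8): estimate an initiating point in the target corpus.

    k_j     -- index of the portion containing the target word (1-based)
    k_prime -- number of words per portion (K')
    ip_e    -- offset within the portion; 0.5 places IP at the mid-point
    """
    return (k_j - 1) * k_prime + k_prime * ip_e

# Worked example from equation (9): portion 23, 150 words per portion
# initiating_point(23, 150) == 3375.0
```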
Thus, the IP in this case is located at the 3375th word in the target language corpus.

4 Experiments
Because VBE makes use of Fung & Church's corpus portioning information, it acts as a filter on the I and t values produced by that process. VBE can therefore be thought of as a post-filtering process which seeks to support (or otherwise) the estimations made by the K-vec process. Experiments were carried out to compare results obtained from the K-vec process with those of VBE using the English-French parallel corpus mentioned above. The VBE process used the portioning information produced by K-vec in order to determine its IP values by applying equation (8). Table 2 shows the results of running VBE with IP values derived from the K-vec alignment portion information for the source language word years. The table lists the target words (TWs) aligned with years; associated with each target word is the bag size at which it emerged as a likely translation. As outlined in section 3.2, a bag4 is created which contains all the words at all the IPs. When the VBE process begins the bag is very small and will only contain the words which fall exactly at the IPs. However, at every iteration of the process, more words are added which are neighbours of the IP words. Neighbour words are regarded as

3 Any other value for IPe (between 0 and 1) would suggest a starting point nearer the start or end of the portion, and could be used in connection with a different rate of leftward or rightward expansion of the bag. For example, one might want to experiment by starting the search at the front of the portion, and expanding only rightwards.
4 Intuitively there are several bags, one for each occurrence of the candidate target word; but for computational simplicity, all the bags can be combined as a single 'super-bag'.
words immediately to the right and left of the IP. Thus, the bag increases in size as these neighbour words are added at each iteration until there emerges a word contained in the bag which does not occur outside it. This is the proposed translation.

BAG SIZE   LIST OF TWS
220        années, niveau, on, projet, secteur, système
230        ainsi, d'un, entre, mais, nécessaire, peut
240        activités, comme, communautaire, cours, doit, gestion, leurs, long, ressources, secteurs, sera, techniques, tous, tout, travail

Table 2: VBE estimates for possible translations of the English word years using IP values derived from a previous K-vec process using the same source language word

The VBE algorithm can be briefly described as follows:
1. Determine all IP values from the target language corpus partitioning information created by the K-vec process. There will be one IP for each instance of the target word under consideration.
2. At each IP create a bag from the word located at the IP plus n words to the left and n words to the right of the IP. The value of n is based on the degree of granularity required.
3. Check if the words in the bag(s) only occur in the bag(s) and nowhere else in the target language corpus. If this is true, stop. If it is false, increase the size of n by an increment factor, e.g., 10 or 20 words,5 and go to step 2.
This algorithm is applied to each target language word estimated to be a translation of the source by K-vec. Once all these words have been processed by VBE, the word or words with the smallest bag sizes are considered to be the most likely translations of the source word. Small bags score more highly than large bags as they indicate a greater degree of positional equivalence of the source and target word(s). If the source and target texts were identical, any given word in the source text at position x would find its translation at position x in the target. This is the essential principle used by VBE, except that the variable bag size accommodates the practical consideration of source and target corpora being far from identical when dealing with real languages.

5 In practice, the increment factor cannot be too small as this slows down the process. On the other hand, a large increment factor results in a coarser grain-size in the results.
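The three steps above can be sketched as follows. This is our own illustrative reading of the algorithm, not the authors' code: token lists, the combined 'super-bag', and the half-width n are our assumptions.

```python
from collections import Counter

def vbe_bag_size(target_corpus, ips, step=10):
    """Sketch of VBE steps 1-3: grow bags centred on the initiating
    points until some word occurs only inside the bags.

    target_corpus -- list of target-language tokens
    ips           -- initiating points (word indices into target_corpus)
    step          -- increment for the bag half-width n at each iteration
    Returns (n, words) where n is the final half-width and words are the
    tokens unique to the bags, or (None, set()) if none is ever found.
    """
    total = Counter(target_corpus)
    n = step
    while n <= len(target_corpus):
        # one combined 'super-bag' over all IPs (cf. footnote 4)
        in_bag = set()
        for ip in ips:
            in_bag.update(range(max(0, ip - n),
                                min(len(target_corpus), ip + n + 1)))
        bag_counts = Counter(target_corpus[i] for i in in_bag)
        # words whose every corpus occurrence falls inside the bag(s)
        unique = {w for w, c in bag_counts.items() if c == total[w]}
        if unique:
            return n, unique
        n += step
    return None, set()
```

Running this once per candidate TW and keeping the candidates with the smallest bag sizes reproduces the ranking idea behind Tables 2 and 3.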
The results shown in Table 2 were obtained by taking each TW from Table 1 and applying the VBE procedure as outlined in Section 3.2. The comparison between the two tables is quite striking, as VBE has scored the correct translation (années) as the most likely translation, along with five other candidates (niveau, on, projet, secteur, and système).
5 Results and observations
Table 3 shows another example of a VBE alignment. The English word under consideration is software. The correct translation, logiciel, is ranked joint 3rd. In this particular case, K-vec ranked logiciel joint 17th as a probable translation of software, so again VBE has preferred to give the correct translation a higher ranking.

BAG SIZE   LIST OF TWS
210        résultats
220        communauté
230        commission, développement, domaine, leur, logiciel, programme, projets, se, technologies
240        aux, ce, ces, cette, d'une, est, il, l'industrie, ou, pas, plus, recherche, sont, systèmes, technologie, travaux

Table 3: VBE estimates for possible translations of the English word software

At the time of writing an exhaustive comparison has not been made between K-vec and VBE. However, it can be said that VBE does not always confirm K-vec's results and will promote certain target words, making them more probable translations. As Table 2 demonstrates, even when VBE does correctly rank the proper alignment in first place, it is not the only candidate. In fact it is often the case that there are multiple alignments for any given score. However, this is largely due to the level of granularity used by the program when increasing bag size. As already mentioned, finer degrees of granularity are more computationally expensive: it is a lot quicker to increase bag sizes by 10 or 20 words at each iteration instead of, say, just two words at a time. However, this is a practical consideration and further experiments are required to establish to what extent candidate alignments are indeed distinguished from each other due to increased granularity.
6 Conclusions
Although the VBE approach is fully automatic, it is not necessarily foreseen as forming part of a completely automatic MT system. We believe that a bottleneck in the creation of MT capabilities for differing language pairs is the creation of information to allow such systems to perform at a useful level of translation quality. Approaches like that of VBE can facilitate the bootstrapping of MT products more quickly than training linguists to produce from scratch the information required by systems having to translate between perhaps very different language groups. Hybrid systems, for instance, would seem to be a sensible way to incorporate information which can be determined from both automatic and human sources.

REFERENCES

Catizone, Roberta, Graham Russell & Susan Warwick. 1989. "Deriving Translation Data from Bilingual Texts". Proceedings of the 1st International Acquisition Workshop. Detroit, Michigan.
Fung, Pascale & Kathleen McKeown. 1994. "Aligning Noisy Parallel Corpora Across Language Groups: Word Pair Feature Matching by Dynamic Time Warping". Technology Partnerships for Crossing the Language Barrier: Proceedings of the 1st Conference of the Association for Machine Translation in the Americas, 81-88. Columbia, Maryland.
Fung, Pascale & Kenneth Ward Church. 1994. "K-vec: A New Approach for Aligning Parallel Texts". Proceedings of the 15th International Conference on Computational Linguistics (COLING-94), vol. II, 1096-1101. Kyoto, Japan.
Gale, William A. & Kenneth W. Church. 1991. "A Program for Aligning Sentences in Bilingual Corpora". Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics (ACL'91), 177-184. Berkeley, Calif.
Jones, Daniel & Melina Alexa. 1994. "Towards Automatically Aligning German Compounds with English Word Groups in an Example-based Translation System". International Conference on New Methods in Language Processing, 66-71. Manchester, U.K. (To appear in New Methods in Language Processing ed. by Daniel Jones & Harold Somers. London: University College Press.)
Kay, Martin & Martin Röscheisen. 1993. "Text Translation Alignment". Computational Linguistics 19:1.121-142.
An HMM Part-of-Speech Tagger for Korean with Wordphrasal Relations

JUNG H. SHIN,* YOUNG S. HAN** & KEY-SUN CHOI**
*Korea R&D Information Center, KIST
**Korea Advanced Institute of Science and Technology

Abstract

This paper describes a Korean tagger that takes into account the type of wordphrases for more accurate tagging. Because Korean sentences consist of wordphrases that contain one or more morphemes, Korean tagging must be posed differently from English tagging. We introduce a hidden Markov model that closely reflects the natural structure of Korean. Wordphrases contain more syntactic information, such as case role, than words in English; consequently the wordphrasal information makes better predictions, leading to higher tagging accuracy. The suggested tagging model was trained on 476,090 wordphrases and tested on 10,702 wordphrases. The experiments show that the new model can tag Korean text with 96.18% accuracy, which is 0.38% higher than the English tagging method.

1 Introduction
The problem of determining part-of-speech categories for words can be transformed into the problem of deciding which states a Markov process went through during its generation of a sentence. A category of a word usually corresponds to a state (Charniak et al. 1993). In the last few years tagging systems based on the hidden Markov model (HMM) have produced reasonably accurate results in English and other Indo-European languages (Charniak et al. 1993; Jelinek & Mercer 1980; Kupiec 1992; Merialdo 1994). A sentence in Indo-European languages is composed of words. A Korean sentence consists of wordphrases, and a wordphrase is a combination of morphemes. The patterns of morphemes that form a wordphrase are diverse, and the relationship between two morphemes can differ in usage patterns according to the wordphrase they belong to. Given these differences between Korean and English, it is not best to apply the HMM to Korean exactly as it is used to model English. In this paper, we propose an HMM-based tagging method that captures the wordphrasal
information of Korean sentences. The proposed model makes use of both wordphrasal relations and morpheme relations of each wordphrase. The proposed method requires that the types of wordphrases be known before constructing an HMM. To classify the patterns of wordphrases, we first devised an algorithm to extract the categories of the wordphrase from tagged text. Once a network skeleton for each wordphrase is drawn, a complete HMM is derived by combining the wordphrasal networks. Section 2.1 outlines the characteristics of Korean. In Section 2.2, we describe the proposed model and the method to construct it. In Section 3, experimental results on the improvement of accuracy by the proposed methods are described. In Section 4, we discuss the significance and the limitations of our work.

2 Wordphrase-based Hidden Markov Model
Numerous variations and extensions of hidden Markov models are reported in the literature, but few works are known on designing HMMs for agglutinative languages such as Korean. In the following, we first discuss the characteristics of Korean that motivate the proposed design method of the HMM. After the proposed method is introduced, it is shown by means of experiments that our method outperforms other approaches.

2.1 Characteristics of Korean
Unlike English and other Indo-European languages, a Korean sentence is not just a sequence of words, but a sequence of wordphrases. A wordphrase is composed of content morphemes and one or more function morphemes, though the function morphemes are often omitted. Function morphemes are usually placed after content morphemes. The function morphemes play a richer role in the sentence than the function words that indicate, for example, number or person in English sentences. The notable difference is that function morphemes make explicit the role of their content morphemes in the sentence (Nam 1985). The roles deliver information on deep cases as well as syntactic cases. Because there can be more than one segmentation of a wordphrase, the number of morphemes and their corresponding parts of speech can also differ across the ambiguous segmentations. Namely, the same wordphrase can be subdivided into different forms and categories with different numbers of morphemes. For example, 'kamkinun' is analysed into two patterns,
'kamki' + 'nun' and 'kam' + 'ki' + 'nun'. This makes morphological analysis and automatic tagging particularly difficult in Korean (Lee et al. 1994). The wordphrases are complex enough to require a fairly lengthy grammar to generate them, but it turned out that notable patterns of wordphrases could be identified. The dependency among wordphrases may be summarised into patterns. Contrary to English phrases, which are hard to separate from the sentence, Korean wordphrases are easily identified because blanks are the delimiters. The information of phrasal dependency, which is also more transparent in Korean than in English, should contribute to the accuracy of the hidden Markov model. In the following, a method to design a hidden Markov model that makes use of the phrasal dependency is introduced.

2.2 Construction of Hidden Markov Model
Automatic tagging for Korean using hidden Markov models has been pursued in two directions. In one approach, the Markov network is applied in the same manner as is done for English: the network states represent tags, but no phrasal dependency is taken into account. Typical works in this direction are found in Lim et al. (1994) and Lee et al. (1994). In the other approach, the network is a graph of phrasal dependencies, but morpheme level dependencies are not considered (Lee et al. 1993). These two methods each lack the information that the other has. Our proposal combines the two methods to achieve better precision using both morpheme and phrasal dependencies. Figure 1 shows the steps to design a hidden Markov model in the proposed method. The first step is to extract wordphrase patterns from a sample of texts. For each wordphrase pattern a morpheme level Markov network is constructed from observation of the sample texts, and the co-occurrence dependencies between wordphrase patterns can be obtained at the same time. The co-occurrence dependencies are made into a graph such as the one in Figure 4. A wordphrase pattern is denoted by two symbols, of which the first indicates the type of content morphemes and the second represents the type of function morphemes. Table 1 shows the symbols used in our testing. The tags of content and function morphemes are highly simplified, giving a minimal set of symbols so that the number of wordphrase patterns may be minimised. More wordphrase patterns mean a larger network, which requires a larger corpus to train and a longer time to run. From the tag sets in Table 1, 48 different wordphrase patterns can be
442
JUNG H. SHIN, YOUNG S. HAN & KEY-SUN CHOI
Fig. 1: Designing a hidden Markov model using morpheme and wordphrase relations composed. Based on the Korean standard grammar and examination of sample texts, we defined 32 wordphrase patterns. Figure 2 shows an analysis of an example sentence in morpheme and wordphrase tags. Figure 3 illustrates a typical Markov model based on morpheme level de pendencies. The composition of wordphrase networks and inter-wordphrase network such as in Figure 4 is shown in Figure 5. Comparing Figures 3, 4, and 5, we can easily conclude that the network in Figure 5 will have the largest discriminating power of the three methods illustrated in the figures. One shortcoming of the combined method is that the size of network is also the largest, and this implies that bigger training corpus is needed to achieve the same level of estimation accuracy. The deterioration of tagging speed and the increased network, however, should not be critical since Viterbi algorithm runs relatively fast to find optimal word sequence. In the final network as in the Figure 5, each state is assigned with a composite tag (wordphrase tag, morpheme tag). Let t denote a composite tag. If lexical table is defined at each state, the following defines an auto matic tagging algorithm where T(w) is an optimal tag sequence for given sentence w.
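The composition of candidate patterns can be sketched as follows (an illustrative snippet of ours, not the authors' code; the tag symbols are read from Table 1, where the reproduction is partly unclear, so the exact letters are assumptions, while the count 6 x 8 = 48 follows the text):

```python
from itertools import product

# Content-word and function-word tag symbols as read from Table 1
# (some function-tag letters are assumptions due to unclear reproduction).
CONTENT_TAGS = ["N", "P", "A", "M", "I", "S"]   # Nominals ... Symbols
FUNCTION_TAGS = ["S", "A", "O", "M", "Y", "X", "C", "F"]

# A wordphrase pattern pairs one content symbol with one function symbol,
# e.g. "NS" = nominal + subjective particle, "PF" = predicate + sentence ending.
patterns = [c + f for c, f in product(CONTENT_TAGS, FUNCTION_TAGS)]

print(len(patterns))  # 6 x 8 = 48 candidate patterns; the authors kept 32
```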
AN HMM POS TAGGER FOR KOREAN
Example sentence:
    pyenhwauy soktoka maywu ppalumul alkey twiesta.
    (One came to know the speed of change was very rapid.)

Morpheme tagging:
    pyenhwa/Noun  uy/Adnominal-Particle
    sokto/Noun  ka/Subjective-Particle
    maywu/Adverb
    ppalu/Verb  m/Nominalising-ending  ul/Objective-Particle
    al/Verb  key/Auxiliary-connective-ending
    twi/Auxiliary-verb  ess/Post-final-ending  ta/Sentence-final-ending

Wordphrase tagging:
    pyenhwauy/NM  soktoka/NS  maywu/A  ppalumul/PO  alkey/PC  twiesta/PF

Fig. 2: Sentence analysis in morpheme and wordphrase tags
CONTENT WORD TAG    DESCRIPTION
N                   Nominals
P                   Predicates
A                   Adverbials
M                   Adnominals
I                   Interjections
S                   Symbols

FUNCTION WORD TAG   DESCRIPTION
S                   Subjective Particle
A                   Adverbial Particle
O                   Objective Particle
M                   Adnominal Particle
Y                   Connective Particle
X                   Auxiliary Particle
C                   Connective Ending
F                   Sentence Ending

Table 1: Tag symbols of content and function words
Fig. 3: Morpheme level Markov model
By integrating wordphrase structure into the HMM network, we use not only the category of a word but also the case of its wordphrase to select the POS tag. The additional information results in increased accuracy. For example, 'casin' has two interpretations: 'casin' (common noun), which means 'self-confidence', and 'casin' (pronoun), meaning 'oneself'. Because these two categories have similar usage distributions conditioned by other categories, a conventional bigram model will not discriminate them. With wordphrase types in consideration, we find that 'casin' (common noun) is more often used in NO, while 'casin' (pronoun) is more likely used in NM and NS. Furthermore, some categories have different usage patterns according to the wordphrases they are attached to. Indeed, such discriminations are reflected in the trained probabilities, where we find that P(casin | NM, pronoun) is 0.096180, while P(casin | NM, common noun) is 0.009506.

Fig. 4: Wordphrase level Markov model

Fig. 5: Markov model with morpheme and wordphrasal relations

The particle 'wa', which has the duplicate role of conjunctive and auxiliary case, is another example which can be resolved by considering the wordphrase case. Let us consider the case 'kunye' (pronoun, "her") + 'wa' (auxiliary, "with") 'kyelhon' (action common noun, "marriage") + 'ha' (verb-derived suffix, "do") + 'ta' (final ending). Because 'wa' (conjunctive) has a higher likelihood with a noun compared to that of 'wa' (auxiliary particle), 'wa' (conjunctive) is selected incorrectly when only POS tag relations are used. However, when we consider the wordphrase case of 'kyelhon' (action common noun, "marriage") as PF, we find that 'wa' (auxiliary, "with") has a higher likelihood with PF compared to 'wa' (conjunctive).

As another measure for more accurate tagging, we extended the lexical probability. By defining lexical tables at each edge, we can extend the depth of dependency of the lexical probability such that the occurrence of a word is conditioned by the previous tag as well as the current one. The extension from Equation 1 gives the following algorithm:
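The equations themselves did not survive reproduction here. On a standard bigram HMM reading of the surrounding text (a reconstruction, not the original typesetting), Equation 1 and its extension would take roughly the form:

```latex
% Equation 1 (bigram tagging, reconstructed): t_i are the composite
% (wordphrase tag, morpheme tag) states, w_i the words of the sentence w.
T(w) = \operatorname*{argmax}_{t_1 \ldots t_n} \prod_{i=1}^{n} P(t_i \mid t_{i-1}) \, P(w_i \mid t_i)

% Extended lexical probability (reconstructed): the word is conditioned
% on the previous tag as well as the current one.
T(w) = \operatorname*{argmax}_{t_1 \ldots t_n} \prod_{i=1}^{n} P(t_i \mid t_{i-1}) \, P(w_i \mid t_{i-1}, t_i)
```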
Much larger training texts will be needed to estimate P(w | t_{i-1}). To deal with the data sparseness, we used the well-known interpolation method of Equation 3. The interpolation coefficient λ is computed using the deleted interpolation algorithm (Rabiner 1989; Jelinek & Mercer 1980).
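The deleted-interpolation idea can be sketched as follows (an illustrative reconstruction; the function names and toy counts are ours, not the authors' code): on held-out data, the mixing weight credits whichever distribution, the context-conditioned estimate or its back-off, better predicts each observed word.

```python
from collections import Counter

def estimate_lambda(train_pairs, heldout_pairs):
    """Deleted-interpolation-style estimate of the weight lambda that mixes
    a tag-conditioned lexical estimate P(w | t) with its back-off P(w).
    A sketch in the spirit of the Jelinek & Mercer (1980) scheme cited."""
    pair_counts = Counter(train_pairs)
    tag_counts = Counter(t for t, _ in train_pairs)
    word_counts = Counter(w for _, w in train_pairs)
    total = sum(word_counts.values())

    specific_votes = backoff_votes = 0
    for t, w in heldout_pairs:
        p_specific = pair_counts[(t, w)] / tag_counts[t] if tag_counts[t] else 0.0
        p_backoff = word_counts[w] / total
        # credit the better-predicting distribution with the observed count
        if p_specific >= p_backoff:
            specific_votes += pair_counts[(t, w)]
        else:
            backoff_votes += word_counts[w]
    return specific_votes / (specific_votes + backoff_votes)

def interpolated(t, w, lam, train_pairs):
    """Smoothed lexical probability in the spirit of Equation 3."""
    pair_counts = Counter(train_pairs)
    tag_counts = Counter(tt for tt, _ in train_pairs)
    word_counts = Counter(ww for _, ww in train_pairs)
    total = sum(word_counts.values())
    p_specific = pair_counts[(t, w)] / tag_counts[t] if tag_counts[t] else 0.0
    return lam * p_specific + (1 - lam) * word_counts[w] / total
```

On real data the coefficient settles between the extremes; Table 2 reports values from 0.29 to 0.87 depending on model and training size.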
Many ambiguous morphemes can be further handled by using the category of the preceding morphemes. In particular, when a morpheme is analysed into two forms that have the same category, they depend only on the lexical probability. 'cwun' is such a case: 'cwun' is analysed into 'cwuta' (verb, "to give") + 'n' (adnominal ending, TENSE) and 'cwulta' (verb, "to reduce") + 'n' (adnominal ending, TENSE). As both 'cwuta' and 'cwulta' are verbs, they have the same transition probability. However, when we consider the preceding particles, the discrimination power can be enhanced. For example, adverbial and objective particles are more likely to take place before 'cwuta', while 'cwulta' often follows a subjective particle. From the trained model, we find that P('cwuta' | jca, pv) is 0.003882 and P('cwulta' | jca, pv) is 0.000162. Consequently, considering the relation with the category of the preceding morpheme is useful to discriminate categories.

3 Experiments
The goal of the experiment is to find how much improvement in tagging accuracy is achieved by the proposed method compared to morpheme-level hidden Markov models. We discuss possible extensions of the HMM and compare their experimental results for different sizes of training data.

3.1 Test data
For training and testing the models, we have used the KIBS¹ tagged corpus (1995). We divided the tagged corpus into three parts.

¹ The KIBS (Korea Information Base System) project aims at constructing resources for Korean language processing, including a treebank, a tagged corpus and a raw corpus, and at developing analysis tools for Korean.
Training Data    Model 2    Model 3    Model 4
475,090          0.82       0.86       0.35
237,548          0.75       0.87       0.32
119,772          0.71       0.78       0.31
59,889           0.64       0.74       0.29

Model 1: Bigram model of morphemes
Model 2: Trigram model of morphemes
Model 3: Model of morphemes and wordphrase relations
Model 4: Model of morphemes and wordphrase relations with extended lexical probability

Table 2: Interpolation coefficients
• a set of 476,090 tagged wordphrases, the training data, which is used to build our models.
• a set of 10,698 tagged wordphrases, which is used to estimate the interpolation coefficient.
• a set of 10,702 tagged wordphrases, the test data, which is used to test the models.

Tagging Korean texts is necessarily preceded by some depth of morphological analysis for the simplification of the dictionaries of the hidden Markov model. Many words are used in deeply inflected forms, and non-trivial morphological rules are often needed to recover them. This makes it unreliable to evaluate tagging models on completely new texts, since the tagging critically depends on the quality of the morphological analysis. To avoid noise caused by faulty morphological analysis, we excluded from the test data the sentences that do not contain valid analysis candidates. In other words, the recall of morphological analysis in the test is set to 100%. The trained hidden Markov network reflecting both morpheme and wordphrase relations contains 712 nodes and 28,553 edges. The number of part-of-speech tags is 52, and the average ambiguity of each wordphrase is 5.06.

3.2 Results
The experiments consist of comparisons of five tagging models. We adopt a bigram model of morphemes as an initial model. It is extended to a trigram model, which is generally used in practical English taggers (Merialdo 1994). To minimise the effect of data sparseness, we interpolate trigram distributions with bigram distributions as shown in Equation 4.
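Equation 4 itself did not survive reproduction; a plausible reconstruction of the trigram-bigram interpolation described here is:

```latex
% Reconstructed sketch of Equation 4: \hat{P} are relative-frequency
% estimates and \lambda is the coefficient reported in Table 2.
P(t_i \mid t_{i-2}, t_{i-1}) \approx
    \lambda \, \hat{P}(t_i \mid t_{i-2}, t_{i-1})
    + (1 - \lambda) \, \hat{P}(t_i \mid t_{i-1})
```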
Fig. 6: Comparison of tagging accuracy with increase of training corpus
The interpolation coefficients are summarised in Table 2. The coefficient of Model 2 (trigram) is defined in Equation 4 and the other models use the coefficient defined in Equation 3. As the size of the training data increases, the coefficients tend to give stronger support to the original model parameters. If we assume that the degree of training of each model is proportional to the size of its interpolation coefficient, Model 4 has more room for improvement with the increase of the training corpus. In the case of a small training corpus, below 200,000 wordphrases, Model 3 (with morpheme and wordphrase relations) achieved the highest accuracy. As the training data increases, our proposed method, Model 4 in Table 2, outperformed the other methods. As shown in Figure 6, the proposed model excels the popular morpheme-based model by more than 0.53%. The wordphrase model, that is, our method short of the extension of the lexical probability, gave more accurate results than the bigram and trigram models over all data sizes. This implies that the model is insensitive to training data size despite the increased network size. Thus, the extension of the lexical tables must be the source of the data sparseness of Model 4 with smaller training data.
4 Conclusions
We proposed a Korean tagging model that takes wordphrasal relations as a backbone and extends the lexical probability. As a result, our model gives rise to an increase in network size, but with higher accuracy. Another merit of the proposed method is that the whole process, including extracting wordphrases and constructing a network, is executed automatically without human intervention. With a larger corpus our model is expected to perform even better as the network saturates. This paper introduced an important issue that may be fundamental to more elaborate taggers for Korean or other similar languages.

REFERENCES

Charniak, Eugene, Curtis Hendrickson, Neil Jacobson & Mike Perkowits. 1993. "Equations for Part-of-Speech Tagging". Proceedings of the National Conference on Artificial Intelligence, 784-789. Menlo Park, Calif.: MIT Press.

Jelinek, Frederick & Robert L. Mercer. 1980. "Interpolated Estimation of Markov Source Parameters from Sparse Data". Proceedings of the Workshop on Pattern Recognition in Practice, 381-397.

Kupiec, Julian. 1992. "Robust Part-of-Speech Tagging Using a Hidden Markov Model". Computer Speech and Language 6:1.225-242.

Lee, Sang H., Jae H. Kim, Jung M. Cho & Jung Y. Seo. 1995. "Korean Morphological Analysis Sharing Partial Analyses". Proceedings of the International Conference on Computer Processing of Oriental Language, 164-173. Hawaii, U.S.A.

Lee, Wun J., Key-Sun Choi & Gil C. Kim. 1993. "Design and Implementation of an Automatic Tagging System for Korean Texts". Proceedings of the 20th Spring Conference of the Korean Information Science Society, 805-808. Seoul, Korea. [In Korean.]

Lim, Chul S. 1994. A Korean Part-of-Speech Tagging System Using Hidden Markov Model. M.Sc. thesis. KAIST, Taejon, Korea. [In Korean.]

Merialdo, Bernard. 1994. "Tagging English Text with a Probabilistic Model". Computational Linguistics 20:2.155-168.

Nam, Key S. 1986. Grammar for Standard Korean. Seoul, Korea: Top Press. [In Korean.]

Rabiner, Lawrence R. 1990. "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition". Readings in Speech Recognition ed. by Alex Waibel & K. Lee, 267-296. San Mateo, Calif.: Morgan Kaufmann.
A Multimodal Environment for Telecommunication Specifications

IVAN BRETAN*, MÅNS ENGSTEDT** & BJÖRN GAMBÄCK*

* Telia Research AB, **Ericsson Telecommunication Systems Lab., * Computerlinguistik, Universität des Saarlandes

Abstract

This is a description of the rationale and basic technical framework underlying VINST, a Visual and Natural language Specification Tool. The system is intended for interactive specification of the functional behaviour of telecommunication services by users not possessing in-depth technical knowledge of telecommunication systems. In order to obtain the desired level of abstraction needed to accomplish this functionality, a natural language component has been integrated with a visual language interface based on a finite-state automata metaphor. The multimodal specification produced by VINST is translated to an underlying formal specification language that is further refined or transformed in a number of steps in the design process. Furthermore, the specification is validated by means of simulation and paraphrasing. Both the specification phase and the validation phase are carried out in the same multimodal environment. This integration of modalities provides for synergy effects including complementary expressiveness and cross-modal paraphrasing.

1 Introduction
One of the initial phases in the specification of a new telecommunication system is that of requirements engineering. The required system is normally described informally by the customer in a requirement specification consisting of informal text and figures, which is handed over to a design department. The major and obvious drawback of this process is that it involves a step of manual 'compilation' to telecommunication-oriented implementation or specification languages. This compilation or interpretation will have to be undertaken by a technical specialist who will have to take great care to realise the intentions of the customer through several cycles of implementation, validation and verification while tackling ambiguity, inconsistency and incompleteness in the informal specification. The VINST research project, as carried out by Ericsson Telecom and partners, aimed at involving the customer in an actual formal specification
process, replacing the informal requirement specification with a computerised tool which provides support for describing telecommunication systems in a constrained and rigorous manner. The result of using this tool is in principle directly compilable into a formal specification for a telephone service, although the customer will generally only describe parts of the functionality of the entire system. However, specification of these parts may be critical and very time-consuming, and may vary extensively from customer to customer. After user studies of prototype specification systems, it was decided that a multimodal environment would be most appropriate for this task. As shown in Figure 1, the two main modalities of the user interface would be natural language (NL — initially only keyboard input) and a visual language (VL) using icons and finite-state automata metaphors to describe the dynamic behaviour of the system. Although, to our knowledge, no other system with a similar integrated multimodal architecture for specification of telecom services exists, other tools have been designed with similar goals. VISIONNAIRE (Henjum & Clarisse 1991) supports formalised requirements engineering for telecommunication applications using natural language, visual programming and animation. WATSON (Kelly & Nonnenman 1987) is also used for formal specification of telecom systems from natural language scenarios, while PLANDoc (McKeown et al. 1994) is used for generation of natural language paraphrasal text from telephone route planning descriptions.

2 System overview
VINST is a tool which operates in at least three distinct modes:
1. Specification of static properties in a conceptual schema
2. Specification of dynamic properties in rules
3. Validation of specifications
Here we will be mostly concerned with (2), which in some sense is the central mode, since (1) is a task which provides the general setting or constraints for dynamic specifications (where one specification of static properties can be common to many different dynamic specifications) and (3) is intended for validating (1) and (2). Validation of a specification is carried out by simulation of the rules using manually triggered events, the conceptual schema and a description of the initial state of the world. A complementary way of validating a specification is by cross-modal paraphrasing (see below). The tasks are normally performed in the above order.
Fig. 1: Components and information flow in VINST

The goal of interacting with VINST is to produce a specification in Delphi (Höök 1993), a language dedicated to the formal description of the functional behaviour of telecommunication systems. It is a declarative language based on first-order predicate logic and Entity-Relationship theory with a discrete model of time, where the dynamic specification is made up of a set of rules consisting of an event, pre- and post-conditions, and where the static part consists of a set of axioms and a conceptual schema. As can be seen in Figure 1, in order to produce Delphi rules, the VINST user can work with either the Natural Language (NL) modality or the Visual Language (VL) modality, which will both give rise to the same kind of internal representations in a language which we will simply refer to as IL (Internal Language), which is modelled relatively closely on Delphi. IL is basically an abstract syntax tree, either represented in Prolog (ILP) or in Smalltalk (ILS).
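As a sketch of this rule shape (hypothetical code of ours, not part of VINST; the DTL surface syntax is modelled on the translation example in Section 7), an IL-like abstract syntax tree for one dynamic rule and its rendering to concrete syntax might look like:

```python
from dataclasses import dataclass

@dataclass
class Rule:
    """An IL-style abstract syntax tree for one Delphi rule: an event,
    pre-conditions (valid at time T) and post-conditions (valid at T + 1).
    Field names are illustrative inventions."""
    event: str
    preconditions: list
    postconditions: list

    def to_dtl(self) -> str:
        # Render the abstract tree to a DTL-like concrete syntax
        # (the back-end mapping described in Section 6).
        return (
            f"WHEN {self.event} IS DETECTED\n"
            f"IF {' AND '.join(self.preconditions)}\n"
            f"CONCLUDE {' AND '.join(self.postconditions)};\n"
            "END;"
        )

# The 'idle subscriber goes offhook' rule from the Basic Call example:
rule = Rule(
    event="offhook(A)",
    preconditions=["subscriber(A)", "idle(A)", "onhook(A)"],
    postconditions=["dialtone(A)", "offhook(A)"],
)
print(rule.to_dtl())
```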
how to present the representation visually as icons and automata. VIL and NIL can thus be seen as supersets of IL. These translations meet in the common internal representation language IL, not containing any modality-specific information, which is finally translated to a Delphi representation called Delphi Textual Language (DTL). The information contained in NIL and VIL expressions that is not part of IL has to be added by the generation processes. In Figure 1, all arrows showing translations are bidirectional, indicating the possibility of translating in any direction between the representation languages VL, NL and Delphi. The user creates the specification in either modality, upon which translation via IL to Delphi takes place, as well as translation into the alternate modality, i.e., paraphrasing. The guiding principle for paraphrasing is to give control to the user. Thus, when a visual expression is constructed, it can be translated into natural language upon request. Likewise, when associating natural language fragments with different parts of the specification, they can optionally be paraphrased or even replaced by their visual counterparts. The visual representation is canonical in some sense, reflecting both the system's view of the world (in terms of automata) and relevant discourse objects. It might be argued that the natural language part of the system does not need to be very sophisticated, since the VL part will have relatively high expressivity with respect to the task. However, in order for NL paraphrasing of VL expressions to be generally applicable, and also for providing a complete and predictable (the term habitable is sometimes used) linguistic coverage, a large-scale NLP component is highly desirable also for a system such as VINST.
Although the existing initial VINST prototype makes use of a less extensive NLP component, an experiment to integrate VINST and a natural language processor for English based on the SRI Core Language Engine (CLE — Alshawi ed., 1992 — described further in Section 6 below) was carried out together with SRI.1 The envisaged complete VINST system (as outlined in this paper) would of course allow the user to make more extensive use of the NL modality.
1 Most of the adaptation of the CLE to the VINST domain, including conversion from its internal logical format to ALF, was carried out by Manny Rayner and Richard Crouch, SRI International, Cambridge, England.
Fig. 2: A VINST automaton and NL rule
3 Automata and rules
In the NL part of the system, rules can be formulated using conditionals which mirror the structure of the target Delphi rule, consisting of an event (for instance triggered by an action performed by the user of the service), a number of conditions (pre-conditions) and conclusions (post-conditions). In the visual language part, the same type of rule can be created using icons and visual counterparts of states and transitions in a finite-state automaton. In Figure 2, a two-state automaton representing two Delphi rules for a part of the service Basic Call is shown together with a rule formulated in NL which corresponds to the upper transition from the state idle to the state ready for digits. As can be seen from the figure, an automaton state can correspond both to pre- and post-conditional parts of Delphi rules, whereas the automaton transitions correspond to events of Delphi rules. Since each transition in the automaton corresponds to a Delphi rule, the automaton in Figure 2 can be translated into two corresponding Delphi rules. The main principle behind the automata metaphor is that it conveys the
temporal flow of the service, where states correspond to moments in time and transitions are triggered by events (such as picking up the receiver) which increment the temporal counter of the system. Also, the automata metaphor provides for cyclical specifications (a state can have transitions directed both to it and from it), which normally will be less explicit in a natural language-only specification.

4 The visual language
The basic building blocks of the visual language of VINST are icons that are used together with the graphical rendering of states and transitions. The icons, or the visual vocabulary, can be seen as part of the conceptual schema, and visualise different primitive or derived concepts of the domain. Some icons are parameterised, where the parameters normally correspond to the predicate-argument structure of a lexical entry. For instance, the icon corresponding to the entry for the verb "dial" will normally have two initially empty slots, which can be filled with other symbols representing a subscriber and a telephone number, for the subject and direct object respectively. Other icons can be seen as representing properties, and can consequently be added as visual annotations to a main icon representing a principal object. A one-to-one correspondence between visual symbols and lexical entries is not a necessary property of the system. In fact, due to the specificity of pictures, one icon (such as a telephone) can express arbitrarily complex states-of-affairs (for instance that there is an idle subscriber who is 'onhook'). Conversely, since one of the major points of a multimodal system is complementary expressiveness, certain VL expressions will be significantly less direct to formulate than their NL counterparts, or in some cases even impossible. The visual language of VINST does not have general mechanisms to support the formulation of quantification, disjunction and negation, for which the user is deferred to the natural language component. The visual language could of course be enhanced to handle a more complex logic, probably at the expense of intuitiveness.

5 Modality synergy
As observed by, for instance, Cohen et al. (1989), direct manipulation based languages provide a controlled and guided interaction style complementary to that of natural language interfaces, while support for the type of complex compositional semantics that NL interfaces normally exhibit generally stretches the limits of what a visual language can express without attaining the same degree of difficulty as a high-level programming language. In addition to the cases where natural language simply is more convenient, such as when expressing complex quantification, the need for modality integration showed up in user studies when the VL symbol library did not contain icons which matched exactly what the user wanted to express, which could be due to problems of specificity or just suboptimal icon design. We can envision users switching from VL to NL when the former way of expressing services is judged to be too blunt (due to the specificity problem) or to involve too many interactive steps.

Another significant observation from these studies is the increased understanding of the specification created when it was paraphrased in the alternate, non-input modality. When paraphrasing the VL specification in natural language, a 'search-light' effect is obtained, where the opaque linguistic coverage of the NL component is illuminated and both domain-specific vocabulary and grammatical preferences are revealed. There is ample evidence (as reported by Karlgren 1992, among others) that such system-generated language will be picked up by the user and recycled in the continued dialogue, increasing the efficiency of the interaction.

The notion of synergy, where access to several modalities gives a functionality which cannot be obtained through using only one of them, can be taken one step further if more integrated mixing of modalities is considered. This is realised in VINST with the support for placing NL fragments in different parts of a visually specified automaton in order to 'tag' these fragments with respect to a point in time in the execution of the service.

A: "A doesn't have a hotnumber but he has a redirection number."

Fig. 3: NL fragment within an automaton state
For instance, a user specifying something in NL which is cumbersome to visualise, and who places this text within a particular state, as in Figure 3, does not need to say anything about the point in time at which this happens, or about what event brought about this fact, since all this is given by the visual context.

Fig. 4: The NL architecture of VINST

6 Architecture of the NL component
The NL component of VINST is divided into two parts, a front-end and a back-end. The front-end translates NL to IL, the modality-independent Internal Language. IL can in turn be translated either to VL or to Delphi, or be used as the starting point for NL generation. IL to VL translation typically requires generation of layout information for the resulting visual description, a difficult problem whose discussion is out of the scope of the present paper. The main functionality of the NL front-end is to translate, in both directions, between natural language and IL via the intermediate representation ALF (Application-specific Logical Form). The back-end translates IL to Delphi Textual Language, DTL. This separation of the system into a front-end and a back-end minimises changes in the system caused by changes in VL or in Delphi. The front-end and the back-end consist of a number of sub-processing steps, as shown in Figure 4. The first step involves different types of word-level processes, such as tokenisation, lexical analysis, and inflectional morphology. The output from the morphological component is a lattice, containing all possible sequences of inflected words.

This is used by the syntactic parser, an LR-parser which works with unification-based grammar rules to produce an implicit parse tree (derivation tree with rule annotations). Semantic rules are applied to this tree, resulting in one or several pseudo-logical representations in a format known as Quasi Logical Forms (QLF — Alshawi & van Eijck 1989). A QLF carries different amounts of informational content in different stages of processing. In the interpretation step, the QLFs undergo scoping and reference resolution. Scoping is here taken to mean the mapping from determiners, modals and certain adverbs to quantifiers or operators and determining their scope using linguistically motivated scope preferences. The resolution stage deals with referring expressions, elliptic phrases, and semantically vague relations (such as the ones derived from "have" and "is"). The final step in the linguistic analysis chain involves mapping the scoped, resolved QLF into the application-specific logical format, ALF, a fairly standard extension of first-order predicate logic with some higher-order operators. This is done by means of a machinery operating with declarative rewrite rules in a process such as Abductive Equivalential Translation (AET — Rayner 1993). This process translates a resolved QLF into an ALF according to a domain theory which describes equivalences between logical formulae containing linguistic predicates and formulae containing Delphi-related predications.
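The staged front-end can be pictured as a composition of translation steps. The stubs below are a toy sketch of ours, not VINST code; the real stages are the CLE modules just described:

```python
from functools import reduce

# Stub stages standing in for the CLE front-end modules; each one consumes
# the previous stage's output. Real implementations are far richer.
def tokenise(text):     return text.lower().split()
def morphology(tokens): return [(tok, "infl?") for tok in tokens]  # word lattice
def parse(lattice):     return ("tree", lattice)                   # implicit parse tree
def semantics(tree):    return ("qlf", tree)                       # Quasi Logical Form
def interpret(qlf):     return ("resolved_qlf", qlf)               # scoping + resolution
def to_alf(resolved):   return ("alf", resolved)                   # rewrite to ALF

PIPELINE = [tokenise, morphology, parse, semantics, interpret, to_alf]

def front_end(text):
    """Run the NL -> ALF direction of the front-end as a left fold."""
    return reduce(lambda data, stage: stage(data), PIPELINE, text)

result = front_end("The subscriber makes offhook")
```

The reverse (generation) direction would, as the text notes, reuse most of these stages in reverse where the indeterminism allows.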
This can be seen as a mapping from abstract to concrete syntax; however, as noted above (see Section 2), the internal language is at present mainly a notational variant of Delphi, with rather uninteresting variations depending on the actual pro gramming language (Prolog or Smalltalk, for NL or VL, respectively) it is represented in. Going in the other direction of processing, most modules are reversible, although some cannot be used as such for practical purposes due to the high amount of indeterminism involved; for example, the first step could probably make use of rules derived from machine-learning techniques to map ALFs into 'standard' QLF fragments. Since the grammar formalism of the CLE is fully reversible, the final
460
IVAN BRETAN, MÅNS ENGSTEDT & BJÖRN GAMBÄCK
step would on the other hand just involve specifying which of the (analysis) grammar rules that should be allowed to be invoked by the generator. These would then by compiled in a way distinct from the format used by the parser. 7
A simple translation example
As mentioned, Delphi rules consist of an event, a number of conditions and conclusions. The informal semantics of a rule is that if an event occurs at a point of time T and the conditions are valid at the same point of time T, then the conclusions will be valid at the next point of time T + 1. A Delphi rule can be described as one conditional NL sentence with an if-part and a then-part as shown in Figure 2, although there are of course many possible alternative ways of conveying the same information, e.g., as a small discourse:

An idle subscriber is onhook. The subscriber makes offhook. He becomes offhook and gets dialtone.

The first sentence contains some pre-conditions and the second the event of the rule. The third sentence contains the post-conditions, or the conclusions, of the corresponding Delphi rule. This discourse refers to a number of concepts such as subscriber, on-hook and idle. These concepts, their lexical realisations, their relations to other concepts, to visual symbols and to Delphi expressions must already have been defined in the domain model (which includes both a conceptual schema and a domain theory). From the three sentences above, the following ALF formula is generated (with the uppercase letters being variables of the standard Prolog type, while the t annotations indicate the time points of the different events, states and conditions):

exists([A,B,C,D,E,F],
    cond(subscriber,A,[t=T]),
    state(be_idle,B,A,[t=T]),
    state(be_onhook,C,A,[t=T]),
    event(make_offhook,D,A,[t=T]),
    cond(get,E,A,dialtone,[t=T+1]),
    state(be_offhook,F,A,[t=T+1])
)
This formula is obtained by using (conditional) AET equivalences of the following type relating linguistic predicates and target Delphi predicates:
A MULTIMODAL TELECOM SPECIFICATION ENVIRONMENT
461
    make_offhook(Event, Person) <->
        event(make_offhook, Event, Person) <- subscriber(Person).
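Informally, such a conditional equivalence licenses a rewrite of a linguistic literal into its Delphi counterpart whenever the condition holds among the known facts. The rewriting step can be sketched roughly as follows — a toy illustration with invented names, not Rayner's abductive machinery nor the CLE implementation:

```python
# Toy sketch of conditional-equivalence rewriting. Literals are tuples;
# by Prolog convention, capitalised atoms ("Event", "Person") are variables.

def is_var(term):
    return isinstance(term, str) and term[:1].isupper()

def match(pattern, literal):
    """Match a pattern containing variables against a ground literal;
    return a bindings dict, or None on failure."""
    if len(pattern) != len(literal):
        return None
    bindings = {}
    for p, g in zip(pattern, literal):
        if is_var(p):
            if bindings.setdefault(p, g) != g:
                return None
        elif p != g:
            return None
    return bindings

def substitute(pattern, bindings):
    return tuple(bindings.get(t, t) for t in pattern)

# (Delphi side, linguistic side, condition) -- the equivalence above.
EQUIVALENCES = [
    (("make_offhook", "Event", "Person"),
     ("event", "make_offhook", "Event", "Person"),
     ("subscriber", "Person")),
]

def to_delphi(literal, facts):
    """Rewrite a linguistic literal when an equivalence applies and its
    instantiated condition is a known fact; otherwise leave it unchanged."""
    for delphi, linguistic, condition in EQUIVALENCES:
        bindings = match(linguistic, literal)
        if bindings is not None and substitute(condition, bindings) in facts:
            return substitute(delphi, bindings)
    return literal

# event(make_offhook, d, a) rewrites because subscriber(a) is known:
to_delphi(("event", "make_offhook", "d", "a"), {("subscriber", "a")})
```

Note that real AET also handles the other direction and reasons abductively about which conditions may be assumed; the sketch only checks conditions already known to hold.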
The ALF formula is then translated, via an IL expression not shown here, either to a visual counterpart similar to the upper transition of the automaton in Figure 2, or to the final DTL formula:

    WHEN offhook(A) IS DETECTED
    IF subscriber(A) AND idle(A) AND onhook(A)
    CONCLUDE dialtone(A) AND offhook(A);
    END;

As a test of the CLE-based VINST prototype, a small corpus for a representative telecommunication service, "Call Forward on Busy," has been collected. This service can be expressed as 15 Delphi rules, each of which was specified by four subjects by means of one or several sentences. The domain theory created contained 96 equivalences, of which 84 were lexical (translating specific word-senses). However, most of these equivalences were not specific to the service "Call Forward on Busy."

The results were quite encouraging, as the coverage obtained was sufficient to specify and validate the service in question; however, whether this indicates that the amount of work needed to actually turn VINST into a useful system is feasible is of course still an open question.
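The temporal reading of a Delphi rule used throughout this example — conditions and event valid at time T, conclusions valid at T + 1 — can be sketched in a few lines of Python. This is a toy illustration with invented names (Rule, apply_rule, the fact strings); it is not the VINST or Delphi implementation:

```python
# Toy sketch of the informal Delphi rule semantics: if the event occurs at
# time T and the conditions are valid at T, the conclusions hold at T + 1.
from dataclasses import dataclass

@dataclass
class Rule:
    event: str              # triggering event, e.g. "offhook(a)"
    conditions: frozenset   # pre-conditions that must be valid at time T
    conclusions: frozenset  # post-conditions valid at time T + 1

# "An idle subscriber is onhook. The subscriber makes offhook.
#  He becomes offhook and gets dialtone."
dialtone_rule = Rule(
    event="offhook(a)",
    conditions=frozenset({"subscriber(a)", "idle(a)", "onhook(a)"}),
    conclusions=frozenset({"dialtone(a)", "offhook(a)"}),
)

def apply_rule(rule, state_t, event):
    """Return the state at time T + 1: unchanged if the rule does not
    fire, otherwise extended with the rule's conclusions."""
    if event != rule.event or not rule.conditions <= state_t:
        return state_t
    return state_t | rule.conclusions

state = {"subscriber(a)", "idle(a)", "onhook(a)"}
state = apply_rule(dialtone_rule, state, "offhook(a)")  # gains dialtone(a), offhook(a)
```

A real simulator would also retract facts that cease to hold (e.g. onhook(a)); the sketch keeps the state monotonic for brevity.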
Conclusions
The paper has described a multimodal tool for interactive specification of telecommunication services. The tool allows the user to integrate natural and visual languages in order to produce descriptions in an underlying formal specification language. Furthermore, the specification is validated by means of simulation and paraphrasing. Both the specification phase and the validation phase are carried out in the same multimodal environment. This integration of modalities provides for synergy effects including complementary expressiveness and cross-modal paraphrasing.
Acknowledgements. We would like to thank the other members of the VINST project team, including Stefan Preifelt, Niklas Björnerstedt, Roy Clarke, Hercules Dalianis, Anders Holm, Jussi Karlgren, Erik Knudsen, Eva Lapins, Christer Samuelsson, Jonas Walles, and several others; Manny Rayner and Dick Crouch, SRI; and an anonymous referee for very substantial comments.

REFERENCES

Alshawi, Hiyan, ed. 1992. The Core Language Engine. Cambridge, Mass.: MIT Press.
Alshawi, Hiyan & Jan van Eijck. 1989. "Logical Forms in the Core Language Engine". Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics (ACL'89), 25-32. Vancouver, British Columbia, Canada.
Cohen, Philip R., M. Dalrymple, D.B. Moran, F.C.N. Pereira, J.W. Sullivan, R.A. Gargan Jr., J.L. Schlossberg & S.W. Tyler. 1989. "Synergistic Use of Direct Manipulation and Natural Language". Proceedings of Human Factors in Computing Systems (CHI'89), 227-233. Austin, Texas.
Henjum, Olaf I. & Oliver B.H. Clarisse. 1991. "Confirming Customer Expectations". Proceedings of the National Communications Forum, 657-664. Rosemont, Illinois.
Höök, Hans. 1993. Delphi — A General Description of the Language. Älvsjö, Sweden: Ellemtel Utvecklings AB.
Karlgren, Jussi. 1992. The Interaction of Discourse Modality and User Expectations in Human-Computer Dialog. Licentiate Thesis. Dept. of Computer and Systems Sciences, Univ. of Stockholm, Sweden.
Kelly, Van E. & Uwe Nonnenmann. 1987. "Inferring Formal Software Specifications from Episodic Descriptions". Proceedings of the 6th National Conference on Artificial Intelligence (AAAI'87), 127-132. Seattle, Washington.
McKeown, Kathleen, Karen Kukich & James Shaw. 1994. "Practical Issues in Automatic Documentation Generation". Proceedings of the 4th Conference on Applied Natural Language Processing (ANLP'94), 7-14. Stuttgart, Germany.
Rayner, Manny. 1993. Abductive Equivalential Translation and its Application to Database Interfacing. Ph.D. dissertation, Dept. of Computer and Systems Sciences, Univ. of Stockholm, Sweden.
List and Addresses of Contributors

Eneko Agirre, Euskal Herriko Unibertsitatea, P.K. 649, 20080 Donostia, Basque Country (Spain). [email protected]
Iñaki Alegria, Informatika Fakultatea, Euskal Herriko Unibertsitatea, P.K. 649, 20080 Donostia, Basque Country (Spain). [email protected]
Sofia Ananiadou, Centre for Computational Linguistics, UMIST, P.O. Box 88, Manchester M60 1QD, U.K. [email protected]
M. Victoria Arranz, Centre for Computational Linguistics, UMIST, P.O. Box 88, Manchester M60 1QD, U.K. [email protected]
Xabier Artola, Informatika Fakultatea, Euskal Herriko Unibertsitatea, P.K. 649, 20080 Donostia, Basque Country (Spain). [email protected]
Roberto Basili, Università di Tor Vergata, Via della Ricerca Scientifica, Roma, Italy. [email protected]
Ismail Biskri, ISHA - LALIC, 96 bld Raspail, F-75006 Paris, France. [email protected]
Christian Boitet, Université Joseph Fourier, GETA, IMAG-campus, BP 53, 150, rue de la Chimie, F-38041 Grenoble Cedex 9, France. [email protected]
Kalina Bontcheva, Linguistic Modelling Laboratory, Centre for Informatics & Computer Technology, Bulgarian Academy of Sciences, Acad. G. Bonchev Str. 25A, BG-1113 Sofia, Bulgaria. [email protected]
Ivan Bretan, Telia Research AB, S-136 80 Haninge, Sweden. [email protected]
Key-Sun Choi, Dept of Computer Science, Korea Advanced Institute of Science & Technology, Ku-Sung Dong 373-1, Yu-Sung Ku, Taejon, Korea. [email protected]
Marcel Cori, Université Paris 7, Case 7003, 2 place Jussieu, F-75251 Paris Cedex 05, France. [email protected]
Jean-Pierre Desclés, ISHA - LALIC, 96 bld Raspail, F-75006 Paris, France. [email protected]
Måns Engstedt, Ericsson Telecommunication Systems Lab., Box 1505, S-126 25 Stockholm, Sweden. [email protected]
Olivier Ferret, LIMSI-CNRS, B.P. 133, F-91403 Orsay Cedex, France. [email protected]
Michel de Fornel, EHESS, CELITH, 54 boulevard Raspail, F-75006 Paris, France.
Björn Gambäck, Computerlinguistik, Universität des Saarlandes, Postfach 151150, D-66041 Saarbrücken, Germany. [email protected]
Brigitte Grau, LIMSI-CNRS, B.P. 133, F-91403 Orsay Cedex, France. [email protected]
Udo Hahn, CLIF - Computational Linguistics Research Group, Freiburg University, Europaplatz 1, D-79085 Freiburg, Germany. [email protected]
Young S. Han, Dept of Computer Science, Korea Advanced Institute of Science and Technology, Ku-Sung Dong 373-1, Yu-Sung Ku, Taejon, Korea. [email protected]
Matthew F. Hurst, Human Communication Research Centre, University of Edinburgh, 2 Buccleuch Place, Edinburgh EH8 9LW, U.K. [email protected]
Yasushi Ishikawa, Human Media Technology Dept, Interface Technology Lab, Information Technology R&D Center, MITSUBISHI Electric Corporation, 5-1-1 Ofuna, Kamakura, Kanagawa 247, Japan. [email protected]
Akira Ito, Kansai Advanced Research Center, Communications Research Laboratory, 588-2 Iwaoka, Nishi-ku, Kobe 651-24, Japan. [email protected]
Daniel B. Jones, Centre for Computational Linguistics, UMIST, P.O. Box 88, Manchester M60 1QD, U.K. [email protected]
Aravind K. Joshi, Dept of Computer & Information Science, University of Pennsylvania, 200 S. 33rd Street, Philadelphia, PA 19104-6389, U.S.A. [email protected]
Mihoko Kitamura, Kansai Laboratory, Oki Electric Industry Co., Ltd., Cristal Tower, 1-2-27 Shiromi, Chuo, Osaka 540, Japan. [email protected]
Hideki Kozima, Kansai Advanced Research Center, Communications Research Laboratory, 588-2 Iwaoka, Nishi-ku, Kobe 651-24, Japan. [email protected]
Geert-Jan M. Kruijff, University of Twente, P.O. Box 217, NL-7500 AE Enschede, The Netherlands. [email protected]
Jean-Marie Marandin, CNRS URA 1028, Case 7003, 2 place Jussieu, F-75251 Paris Cedex 05, France. [email protected]
Yuji Matsumoto, Graduate School of Information Science, Nara Institute of Science & Technology, 8916-5 Takayama, Ikoma, Nara 630-01, Japan. [email protected]
Chris Mellish, Dept of Artificial Intelligence, University of Edinburgh, 80 South Bridge, Edinburgh EH1 1HN, U.K. [email protected]
Ruslan Mitkov, School of Languages & European Studies, University of Wolverhampton, Stafford Street, Wolverhampton WV1 1SB, U.K. [email protected]
Akito Nagai, Media Technology Dept, Interface Technology Lab, Information Technology R&D Center, MITSUBISHI Electric Corporation, 5-1-1 Ofuna, Kamakura, Kanagawa 247, Japan. [email protected]
Kunio Nakajima, Human Media Technology Dept, Interface Technology Lab, Information Technology R&D Center, MITSUBISHI Electric Corporation, 5-1-1 Ofuna, Kamakura, Kanagawa 247, Japan. [email protected]
Nicolas Nicolov, Dept of Artificial Intelligence, University of Edinburgh, 80 South Bridge, Edinburgh EH1 1HN, U.K. [email protected]
Tadashi Nomoto, Advanced Research Laboratory, Hitachi Ltd., 2520 Hatoyama, Saitama 350-03, Japan. [email protected]
Harris V. Papageorgiou, Institute for Language & Speech Processing (ILSP), Margari 22, 115 25 Athens, Greece. [email protected], [email protected]
Maria Teresa Pazienza, Università di Tor Vergata, Via della Ricerca Scientifica, I-00133 Roma, Italy. [email protected]
Ian Radford, Centre for Computational Linguistics, UMIST, P.O. Box 88, Manchester M60 1QD, U.K. [email protected]
Wiebke Ramm, FR 8.6 Angewandte Sprachwissenschaft sowie Übersetzen und Dolmetschen, Universität des Saarlandes, D-66041 Saarbrücken, Germany. [email protected]
Allan Ramsay, Centre for Computational Linguistics, UMIST, P.O. Box 88, Manchester M60 1QD, U.K. [email protected]
German Rigau, Universitat Politècnica de Catalunya, Pau Gargallo 5, E-08028 Barcelona, Spain. [email protected]
Graeme Ritchie, Dept of Artificial Intelligence, University of Edinburgh, 80 South Bridge, Edinburgh EH1 1HN, U.K. [email protected]
Michelangelo Della Rocca, Università di Tor Vergata, Via della Ricerca Scientifica, I-00133 Roma, Italy. [email protected]
Christer Samuelsson, FR 8.7 Computerlinguistik, Universität des Saarlandes, D-66041 Saarbrücken, Germany. [email protected]
Kepa Sarasola, Informatika Fakultatea, Euskal Herriko Unibertsitatea, P.K. 649, E-20080 Donostia, Basque Country (Spain). [email protected]
Jan Schaake, University of Twente, P.O. Box 217, NL-7500 AE Enschede, The Netherlands. [email protected]
Reinhard Schäler, Dept of Computer Science, University College Dublin, Belfield, Dublin 4, Ireland. [email protected]
Jung H. Shin, Korea R&D Information Center, KIST, Yu-Sung Post Office, P.O. Box 122, Taejon, South Korea. [email protected]
Khalil Sima'an, Utrecht University, Trans 10, NL-3512 Utrecht, The Netherlands. [email protected]
Harold Somers, Centre for Computational Linguistics, UMIST, P.O. Box 88, Manchester M60 1QD, U.K. [email protected]
Michael Strube, CLIF - Computational Linguistics Research Group, Freiburg University, Europaplatz 1, D-79085 Freiburg, Germany. [email protected]
Małgorzata E. Styś, Computer Laboratory, University of Cambridge, New Museums Site, Pembroke Street, Cambridge CB2 3QG, U.K. [email protected]
Mutsuko Tomokiyo, ATR Interpreting Telecommunications, 2-2 Hikari-dai, Seika-cho, Kyoto 619-02, Japan. [email protected]
Jun-ichi Tsujii, Centre for Computational Linguistics, UMIST, P.O. Box 88, Manchester M60 1QD, U.K. [email protected]
Paola Velardi, Dipartimento di Scienza dell'Informazione, Università "La Sapienza", via Salaria 113, I-00198 Roma, Italy. [email protected]
Alex Waibel, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, U.S.A. [email protected]
Ye-Yi Wang, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, U.S.A. [email protected]
Ching-Long Yeh, Dept of Computer Science & Engineering, Tatung Institute of Technology, 40 Chungshan North Road, Section 3, Taipei 104, Taiwan. [email protected]
Stefan S. Zemke, Dept of Computer & Information Science, Linköping University, S-58183 Linköping, Sweden. [email protected]
Michael Zock, Langage & Cognition, LIMSI-CNRS, B.P. 133, F-91403 Orsay, France. [email protected]
Index of Subjects and Terms

A.
abductive equivalential translation (AET) 459
ACL-European Corpus Initiative 429
activation propagation 179
actor computation model 92
adaptive scaling 117
adjunction 4, 278
agenda-based control 282, 288
aggregation 176, 182
alignment 417, 428, 432
ambiguity 185, 187
  kernel ~ 187, 193, 196, 197, 199, 200, 202
  labelling ~ 187, 197, 201, 202
  ~ occurrence 193, 196, 202
  ~ of segmentation 199
  ~ pattern 197
  ~ scope 194, 196, 199
  ~ type 186, 187, 193, 194, 197, 202
ambiguous representation 187, 190, 191
anaphor 85
  functional ~ 85
  nominal ~ 85, 353
  resolution ~ 225
anchor points 433
antecedent 225-228, 230
applicability semantics 280, 285
application-specific logical form (ALF) 458
approximate generation 273, 291
association 112
ATIS 8, 43
attention 123
augmented transition network (ATN) 275
automatic learning 393

B.
Basque morphology 98
bilingual vocabulary 427
Brown corpus 163

C.
c-command 226
canonical orders 264
case library 177
case marking 17
categorial grammar (CG) 11, 72, 94
center 227, 230
  forward-looking ~ 89, 90
  ~ tracking 225, 229, 231
centering extensions 214
centering model 89, 213
certainty factor 228
chart generation 273, 286, 288
chart parsing 59, 63, 288
Chinese language 353, 431
classification 291
clause recognition 420
clustering 134
clustering of words 114, 116
coherence, local 86, 93, 94
combinatory categorial grammar (CCG) 5, 71
complementary expressiveness 454
complex sentence 159
conceptual density 164
conceptual distance 163
conceptual graph 175
conceptual graphs (CGs) 274, 366
conceptual proximity 87
connectionist transfer 393
content determination 273
context sensitivity 112
context set 356
contingency matrix 429
controlled language 59
coordinate transformation 117
Core Language Engine (CLE) 299, 454
coreferring 262
corpora 125
correspondences 320
  many-to-many ~ 333
  one-to-one ~ 333
cross-modal paraphrasing 454
CYK 35

D.
d-tree 277, 280
d-tree grammar (DTG) 277, 347
data oriented parsing 35
data oriented parsing (DOP) 35
DB-MAT 365
definite articles 214
Delphi 453
dependency tree 4
derivation 283, 290
derivation tree 4
derived tree 4
disambiguation 35
discourse representation theory (DRT) 94
discourse segmentation 356
discourse segments 356
discourse structure 353
distributed representation 398
domain model 460
domain of locality 277
domain theory 459
dynamic alignment 133

E.
Earley algorithm 54
ellipsis 85
entity-relationship theory 453
episodic memory 176
error detection 60
EUROTRA 380, 381
example-based methods 295, 309
explanation-based learning 295
explanation-based learning (EBL) 9, 313
extraposition 21

F.
f-structure 393
false alarm 154
finite state automata (FSA) 63, 452
finite state transducer (FST) 11
fixed word order 15
focus 260
focus constraints 343
foot feature principle 22
free word order 15
French language 48, 57, 72, 83, 175, 201, 202, 319, 386, 429, 434
full description 354
functional sentence perspective (FSP) 217

G.
generalisability 394
generalised phrase structure grammar (GPSG) 18, 22, 301
generation strategies 288, 306, 330
generic structure potential (GSP) 254
generic tasks 267
genre 254
German language 15, 89, 229, 248, 367, 386, 393, 431
global strategy 328
grammar inversion 301
grammar transformation 301

H.
habitable 454
head grammars 5
head-corner parsing 16, 29
head-driven phrase structure grammar (HPSG) 12, 94
hidden Markov model (HMM) 153, 439
human memory 123
hybrid knowledge representation 318
hypertheme progression 261

I.
icons 456
incremental consumption 275
incremental learning 174
information retrieval 235
initial reference 354
initiating points 433-435
intended referent 354
inter-stratal constraints 254
interactive disambiguation 186, 202
interlingua 381
interrupted tree 53
island-driven search 152

J.
Japanese language 149, 188, 200, 202, 235, 239, 240, 244, 383-386, 388, 431

K.
K-vec 428, 429, 431, 433, 434, 436
knowledge acquisition (KA) 125
KOMET-PENMAN 254
Korean tagger 439

L.
language checking technology 60
LDOCE 113
LDV 113
learnability 394
lexical ambiguity resolution 161
lexical functional grammar (LFG) 94, 393
lexical gap 287
lexical transducers 99
lexical tuning 138
lexicalised tree-adjoining grammar (LTAG) 3, 4
LIFE 289
linear indexed grammar 5
linear precedence rules 19
linearisation 342
linguistic variants 102
linguistics
  cognitive ~ 325
  structural ~ 319, 325
long distance dependencies 100
LP-rules 19
LR parsing 297

M.
machine translation (MT) 15, 377
machine-aided translation (MAT) 365
mapping rules 280, 281, 317, 319, 323, 325, 338
maximal join 275, 276
meaning-text theory (MTT) 276
memoing 273, 288
memorisation 182
message passing 92
metafunction 249
method of development 252
minimal distinguishing descriptions 355
morphological generation 102
morphology 97
morphotactics 100
multimodal synergy 457
multimodality 452
MUMBLE 277
mutual information 129, 429

N.
natural language generation (NLG) 273, 295, 317, 353, 365
noisy 428, 432
nominal anaphors 354
non-hierarchical representations 274
nucleus 263

O.
ontology engineering 87
optimisation 306

P.
P-vector 113
parallel corpus 427-429, 434
parallel progression 261
parallelism-correlations 328, 329, 345
parse forest 41
PARSETALK 90
parsing 35, 92, 95, 191
  partial ~ 149
parsing self-repairs 52
part-of-speech (POS) 128
partial parsing 418
PATR 62, 302, 362
pattern matching 276, 317, 318, 324, 328
patterns
  prototypical ~ 319, 336
phrase lattice 150
phrase spotting 152
PLANDOC 452
Polish word order 217
possible translation 382
pragmatic knowledge 175
Prague school 217, 259
preference-based generation 286
presupposition 345
principal component analysis 114
probabilistic parsing 35
PROLOG 283, 289, 367, 453, 459, 460
PROTECTOR 277, 289, 347

Q.
Q-vector 114
Quasi Logical Form (QLF) 459

R.
reduced form 354
reduction 354
referential distance 214
referring expressions 354
register 254
relaxed unification 68
relevance 266
requirements engineering 451
reversibility 459
reversible grammars 295
rheme 217
rhetorical relations 263
rhetorical structure theory (RST) 263
right edge principle 50, 53
robust parsing 60, 291
ROSETTA 381

S.
satellite 263
scalability 394
scaling factor 118
SEATS 59
segments 263
self-repair 48, 53
semantic concordance 163
semantic distance 111, 289
semantic head-driven generation 290, 299
semantic interpretation 150
semantic network 113, 275, 276, 285
semantic space 112
semantic subspace 116
semantic vector 112
SemCor 163
sentence generation 281, 295, 299, 328
sentence planning 273
sequential progression 261
similarity measure 179
sister-adjunction 278, 281, 283
skewing 432
Slavic languages 217
SMALLTALK 95, 453, 459
SNEPS 276
specification validation 452
specificity 456
speech recognition 149
speech understanding 150
spelling correction 98, 102
spoken language system 149
SPOKESMAN 277
spontaneous speech 149
spreading activation 113
SPUD 277
statistical techniques 127
stochastic tree-substitution grammar 35
strata, language stratification 254
structures
  conceptual ~ 274, 280, 319, 320, 322, 328, 345
  linguistic ~ 273, 319, 320, 327, 328
  syntactic ~ 280, 322, 333, 345
sublanguage 125
subsequent reference 355
subsertion 278, 281, 283
substitution 4, 37, 278
supertags 7
surface realisation 273, 299, 306
syntactic patterns 277, 280
  ~ recognition of 317, 320, 325, 328, 335, 337
system network 254

T.
tactical generation 273
tagger 420, 439
targeted detection 61
targeted errors 65
taxonomy 138
technical documentation 59, 276
text categorisation 235
text representation 174
textual ellipsis 85-95
  resolution of ~ 87, 92-94
thematic progression 261
thematic roles 17, 227, 276, 322
theme 217, 248
theme/rheme 89
topic 260
topic-focus articulation 259
topic/comment 89, 90
topicalisation constraints 342, 345
tree descriptions 277
tree-adjoining grammar (TAG) 3, 35, 94, 277
tree-substitution grammar (TSG) 3, 35
turn of a dialogue 259
two-level morphology 97

U.
uncertainty reasoning 225
unknown words 159
user lexicon 105
utterance path traversal 275

V.
variable bag estimation (VBE) 428, 431, 433-436
verb second German verbs 25
VERBMOBIL 277
VINST 451
VISIONNAIRE 452
visual language 452, 456
vocabulary estimation 427, 428

W.
WATSON 452
WIP 277
word classification 138
word distance 111
word order 15
word similarity 111
WordNet 162
wordphrase 440

X.
XTAG 5, 289

Z.
zero anaphor 353